Occasionally, about 1 time in 50, the task runs on the letter-worker instead, causing it to run out of memory; that restarts the application and kills any task that was running on either worker.
I don't know how this is possible or how to fix it.
Any help or suggestions on how to fix or debug it would be appreciated.
To ensure that your tasks are being routed correctly to the intended queues and to prevent your letter-worker from processing tasks meant for the celery queue, you need to adjust a few configurations in your setup.
Fixing the Procfile
Your current Procfile has a small issue with the way the queues are specified. You should not need the -X (exclude queues) flag in the letter-worker definition: since you have already specified -Q letters, that worker only consumes from the letters queue, so excluding other queues on top of that is redundant.
Here’s an updated version of your Procfile:
worker: REMAP_SIGTERM=SIGQUIT celery -A shareforce.taskapp worker --loglevel=info --concurrency=3 --prefetch-multiplier=2 -Q letters,celery
letter-worker: REMAP_SIGTERM=SIGQUIT celery -A shareforce.taskapp worker --loglevel=info -Q letters
Configuring Task Queues Properly
Make sure that each task is explicitly assigned to the correct queue. You already have the generate_export task routed to the celery queue, which is good; make sure any other tasks that belong on the letters queue are routed just as explicitly, for example as sketched below.
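One way to make the routing explicit, as a sketch: it assumes your Celery app is importable from shareforce.taskapp.celery and uses a hypothetical generate_letters task, so the dotted task paths must be adjusted to your actual project layout.

from shareforce.taskapp.celery import app  # import path assumed -- adjust to where your app is defined

# Route tasks to queues by task name; anything not listed falls back to the default "celery" queue.
app.conf.task_routes = {
    "shareforce.taskapp.tasks.generate_export": {"queue": "celery"},   # task mentioned in your setup; dotted path assumed
    "shareforce.taskapp.tasks.generate_letters": {"queue": "letters"}, # hypothetical letters task
}

You can also pin the queue at call time with generate_letters.apply_async(queue="letters"), which takes precedence over the routing table for that call.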
Debugging Task Routing
Check Task Routing: Make sure that there are no other configurations or tasks inadvertently sending messages to the letters queue. You can log or print the task routing in your application to ensure that tasks are going to the right queue.
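If you want to see exactly where every task is being published, one option is a small handler on Celery's before_task_publish signal (the handler name is mine, and it should live in a module imported by both your web process and your workers):

import logging

from celery.signals import before_task_publish

logger = logging.getLogger(__name__)

@before_task_publish.connect
def log_task_routing(sender=None, routing_key=None, **kwargs):
    # sender is the task name; with the default direct exchange the routing key
    # is the same as the queue the message will land on.
    logger.info("publishing task %s with routing key %s", sender, routing_key)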
Enable Worker Logging: Increase the verbosity of your worker logs to capture more information about the tasks being processed. You can change the log level to debug for more detailed logs:
worker: REMAP_SIGTERM=SIGQUIT celery -A shareforce.taskapp worker --loglevel=debug --concurrency=3 --prefetch-multiplier=2 -Q letters,celery
letter-worker: REMAP_SIGTERM=SIGQUIT celery -A shareforce.taskapp worker --loglevel=debug -Q letters
Inspect Redis: Use a Redis client to inspect the queues directly. You can check if tasks are being queued in the wrong place. Commands like LRANGE can help you inspect the contents of the queues.
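With the default Redis broker transport, each Celery queue is simply a Redis list named after the queue, so a quick inspection can look like the sketch below (connection details and queue names assume the setup above):

import redis

r = redis.Redis()  # adjust host/port/db to match your broker URL

print("celery queue length:", r.llen("celery"))
print("letters queue length:", r.llen("letters"))

# Peek at the first few pending messages; the JSON headers include the task name,
# which tells you immediately whether something is sitting in the wrong queue.
for raw in r.lrange("letters", 0, 4):
    print(raw)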
Task Acknowledgment: Ensure that tasks are being acknowledged properly. If a task fails or runs out of memory, it might be retried on another worker. Make sure you have proper error handling and logging to catch any issues.
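As a sketch of what that can look like on the task itself (the signature and body here are illustrative, not your actual implementation): with acks_late=True a message is only acknowledged after the task finishes, so if a worker dies mid-task the message is redelivered to another worker consuming the same queue, which is one way a task can show up somewhere you did not expect.

import logging

from shareforce.taskapp.celery import app  # import path assumed

logger = logging.getLogger(__name__)

@app.task(bind=True, acks_late=True)
def generate_export(self, export_id):
    try:
        ...  # build the export (placeholder body)
    except MemoryError:
        # Log loudly with the worker hostname instead of letting the process die silently.
        logger.exception("generate_export %s ran out of memory on %s", export_id, self.request.hostname)
        raise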
Inspect Celery Version: Ensure you’re using a compatible version of Celery with Redis, as sometimes task routing issues can stem from version incompatibility.
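A quick way to confirm which versions are actually installed in the running environment (all three packages matter when Redis is the broker):

import celery
import kombu
import redis

print("celery:", celery.__version__)
print("kombu:", kombu.__version__)   # kombu is the messaging library Celery uses under the hood
print("redis:", redis.__version__)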
Final Recommendations
If after these adjustments you still experience tasks being processed by the wrong worker, consider:
Isolation: If feasible, run the letter-worker in a completely isolated environment (different Redis instance or even a different app) to ensure no cross-contamination between the two worker types.
Monitoring: Use a monitoring tool for Redis and Celery (Flower is a common choice on the Celery side) to gain insight into the task flow and spot routing problems early.
Retry Policies: Implement a retry policy for tasks that are prone to failure, making sure you can handle them gracefully rather than crashing the worker.
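As a sketch of a retry policy using Celery's built-in options (the task name, exception type, and limits here are placeholders to adapt):

from shareforce.taskapp.celery import app  # import path assumed

@app.task(
    bind=True,
    autoretry_for=(ConnectionError,),   # placeholder: list the exceptions you consider transient
    retry_backoff=True,                 # exponential backoff between attempts
    retry_kwargs={"max_retries": 3},
)
def generate_letters(self, batch_id):
    ...  # hypothetical task body; raising one of the exceptions above triggers an automatic retry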