Occasionally, about 1 time in 50, the task runs on the letter-worker instead, causing it to run out of memory; that restarts the application and kills any task that was running on either worker.
I don't know how this is possible or how to fix it.
Any help or suggestions on how to fix or debug it would be appreciated.
To ensure that your tasks are being routed correctly to the intended queues and to prevent your letter-worker from processing tasks meant for the celery queue, you need to adjust a few configurations in your setup.
Fixing the Procfile
Your current Procfile has a small issue with the way the queues are specified. You should not need the -X (exclude queues) flag in the letter-worker definition: since you have already specified -Q letters, that worker only consumes from the letters queue, so excluding other queues on top of that is redundant.
Here’s an updated version of your Procfile:
worker: REMAP_SIGTERM=SIGQUIT celery -A shareforce.taskapp worker --loglevel=info --concurrency=3 --prefetch-multiplier=2 -Q letters,celery
letter-worker: REMAP_SIGTERM=SIGQUIT celery -A shareforce.taskapp worker --loglevel=info -Q letters
Configuring Task Queues Properly
Make sure that each task is explicitly assigned to the correct queue. You already have the generate_export task routed to the celery queue, which is good; make sure any other tasks that belong on the letters queue are routed just as explicitly, for example as sketched below.
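One way to make the routing explicit, as a sketch: it assumes your Celery app is importable from shareforce.taskapp.celery and uses a hypothetical generate_letters task, so the dotted task paths must be adjusted to your actual project layout.

from shareforce.taskapp.celery import app  # import path assumed -- adjust to where your app is defined

# Route tasks to queues by task name; anything not listed falls back to the default "celery" queue.
app.conf.task_routes = {
    "shareforce.taskapp.tasks.generate_export": {"queue": "celery"},   # task mentioned in your setup; dotted path assumed
    "shareforce.taskapp.tasks.generate_letters": {"queue": "letters"}, # hypothetical letters task
}

You can also pin the queue at call time with generate_letters.apply_async(queue="letters"), which takes precedence over the routing table for that call.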
Debugging Task Routing
Check Task Routing: Make sure that there are no other configurations or tasks inadvertently sending messages to the letters queue. You can log or print the task routing in your application to ensure that tasks are going to the right queue.
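If you want to see exactly where every task is being published, one option is a small handler on Celery's before_task_publish signal (the handler name is mine, and it should live in a module imported by both your web process and your workers):

import logging

from celery.signals import before_task_publish

logger = logging.getLogger(__name__)

@before_task_publish.connect
def log_task_routing(sender=None, routing_key=None, **kwargs):
    # sender is the task name; with the default direct exchange the routing key
    # is the same as the queue the message will land on.
    logger.info("publishing task %s with routing key %s", sender, routing_key)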
Enable Worker Logging: Increase the verbosity of your worker logs to capture more information about the tasks being processed. You can change the log level to debug for more detailed logs:
worker: REMAP_SIGTERM=SIGQUIT celery -A shareforce.taskapp worker --loglevel=debug --concurrency=3 --prefetch-multiplier=2 -Q letters,celery
letter-worker: REMAP_SIGTERM=SIGQUIT celery -A shareforce.taskapp worker --loglevel=debug -Q letters
Inspect Redis: Use a Redis client to inspect the queues directly. You can check if tasks are being queued in the wrong place. Commands like LRANGE can help you inspect the contents of the queues.
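With the default Redis broker transport, each Celery queue is simply a Redis list named after the queue, so a quick inspection can look like the sketch below (connection details and queue names assume the setup above):

import redis

r = redis.Redis()  # adjust host/port/db to match your broker URL

print("celery queue length:", r.llen("celery"))
print("letters queue length:", r.llen("letters"))

# Peek at the first few pending messages; the JSON headers include the task name,
# which tells you immediately whether something is sitting in the wrong queue.
for raw in r.lrange("letters", 0, 4):
    print(raw)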
Task Acknowledgment: Ensure that tasks are being acknowledged properly. If a task fails or runs out of memory, it might be retried on another worker. Make sure you have proper error handling and logging to catch any issues.
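As a sketch of what that can look like on the task itself (the signature and body here are illustrative, not your actual implementation): with acks_late=True a message is only acknowledged after the task finishes, so if a worker dies mid-task the message is redelivered to another worker consuming the same queue, which is one way a task can show up somewhere you did not expect.

import logging

from shareforce.taskapp.celery import app  # import path assumed

logger = logging.getLogger(__name__)

@app.task(bind=True, acks_late=True)
def generate_export(self, export_id):
    try:
        ...  # build the export (placeholder body)
    except MemoryError:
        # Log loudly with the worker hostname instead of letting the process die silently.
        logger.exception("generate_export %s ran out of memory on %s", export_id, self.request.hostname)
        raise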
Inspect Celery Version: Ensure you’re using a compatible version of Celery with Redis, as sometimes task routing issues can stem from version incompatibility.
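A quick way to confirm which versions are actually installed in the running environment (all three packages matter when Redis is the broker):

import celery
import kombu
import redis

print("celery:", celery.__version__)
print("kombu:", kombu.__version__)   # kombu is the messaging library Celery uses under the hood
print("redis:", redis.__version__)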
Final Recommendations
If after these adjustments you still experience tasks being processed by the wrong worker, consider:
Isolation: If feasible, run the letter-worker in a completely isolated environment (different Redis instance or even a different app) to ensure no cross-contamination between the two worker types.
Monitoring: Use a monitoring tool for Redis and Celery (Flower is a common choice on the Celery side) to gain insight into the task flow and spot routing problems early.
Retry Policies: Implement a retry policy for tasks that are prone to failure, making sure you can handle them gracefully rather than crashing the worker.
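As a sketch of a retry policy using Celery's built-in options (the task name, exception type, and limits here are placeholders to adapt):

from shareforce.taskapp.celery import app  # import path assumed

@app.task(
    bind=True,
    autoretry_for=(ConnectionError,),   # placeholder: list the exceptions you consider transient
    retry_backoff=True,                 # exponential backoff between attempts
    retry_kwargs={"max_retries": 3},
)
def generate_letters(self, batch_id):
    ...  # hypothetical task body; raising one of the exceptions above triggers an automatic retry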