More Workers, Slower Progress? Why More Isn’t Always Better for CPU-Bound Celery Tasks

Recently, I was wrestling with some performance issues on my Django website. I’m using Celery to handle background tasks, specifically processing uploaded images using OCR via Tesseract. Initially, things seemed to run smoothly with a few concurrent uploads. However, as the number of simultaneous image uploads increased, the processing time for each file ballooned dramatically, and the server’s CPU would spike to 100%. It was a classic case of “more should be better” turning into “more is definitely worse.”

After some digging, a lot of log analysis (thanks, journalctl!), and of course testing, I found the culprit: CPU-bound tasks combined with too many Celery worker processes.

The Nature of CPU-Bound Tasks

The image processing task, which relies heavily on Tesseract for OCR, is what's known as a CPU-bound task. This means that the task's execution time is limited primarily by the processing power of the server's CPU. The task spends most of its time performing computations rather than waiting for I/O operations like network requests or disk reads.
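One rough way to check which kind of task you have is to compare CPU time against wall-clock time for a single call. The sketch below is only an illustration: cpu_fraction, busy_math, and wait_on_io are hypothetical names, and the two sample functions are stand-ins rather than real OCR or real I/O.

```python
import time


def cpu_fraction(func, *args, **kwargs):
    """Return CPU time divided by wall time for one call of func.

    A value near 1.0 suggests CPU-bound work; a value near 0.0 suggests
    the call mostly waited (I/O, sleep, locks).
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args, **kwargs)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0


def busy_math(n=2_000_000):
    # Stand-in for CPU-heavy work: a plain arithmetic loop.
    total = 0
    for i in range(n):
        total += i * i
    return total


def wait_on_io(seconds=0.2):
    # Stand-in for I/O wait: sleeping uses almost no CPU.
    time.sleep(seconds)


if __name__ == "__main__":
    print(f"busy_math:  {cpu_fraction(busy_math):.2f}")
    print(f"wait_on_io: {cpu_fraction(wait_on_io):.2f}")
```

One caveat: time.process_time() only counts CPU time used by the current process. If the OCR engine runs as a separate child process, its CPU time will not show up here, and the task can look I/O-bound even though the machine is busy.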

Think of it like trying to assemble flat-pack furniture. If you have one set of instructions and one person, it takes a certain amount of time. If you have two people and two sets of instructions, you can likely finish in roughly half the time (assuming no major coordination issues).

The Pitfall of Too Many Workers

Now, imagine having five people trying to work on the same set of instructions simultaneously. They’ll constantly be bumping into each other, arguing about who gets to use the screwdriver next, and generally wasting a lot of effort on coordination rather than actual assembly.

This is analogous to what was happening with the Celery workers. Unless told otherwise, Celery's default prefork pool starts as many worker processes as the machine has CPU cores (or the number may come from some initial configuration). On my 4-core Ubuntu server, I was seeing five Celery processes, possibly the main process plus four pool children, trying to tackle the CPU-intensive OCR tasks concurrently.

While the operating system tries to share the CPU time among these processes, the constant switching between them (context switching) introduces overhead. Each process has to load its state into the CPU, perform a small amount of work, and then get swapped out for another process. When you have more processes vying for CPU time than you have actual CPU cores, every task gets only a slice of a core, and the switching cost comes on top of that. The CPU ends up spending a noticeable share of its time juggling the workers rather than executing the OCR.

This explained why processing a single image might take just a few seconds when the system was relatively idle, but when multiple uploads hit simultaneously, each task would take minutes, and the CPU would be maxed out just trying to juggle the overloaded workers.
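A quick way to see whether the machine is in this state is to compare the load average with the number of cores. The following is a minimal sketch, assuming a Unix-like system where os.getloadavg() is available; load_per_core is a hypothetical helper name. On Linux the load average also counts tasks in uninterruptible sleep, so treat it as a rough signal rather than an exact measure.

```python
import os


def load_per_core(load_average, cores):
    """Return runnable tasks per core.

    Values well above 1.0 suggest more work is competing for the CPU
    than there are cores to run it.
    """
    if not cores or cores < 1:
        raise ValueError("cores must be a positive integer")
    return load_average / cores


if __name__ == "__main__":
    try:
        one_minute, _, _ = os.getloadavg()  # Unix only
    except (AttributeError, OSError):
        print("load average is not available on this platform")
    else:
        ratio = load_per_core(one_minute, os.cpu_count() or 1)
        print(f"1-minute load per core: {ratio:.2f}")
```

For example, a 1-minute load average of 5.0 on a 4-core machine gives 1.25 tasks per core, meaning there is always at least one task waiting for a core.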

The Solution: Limiting Concurrency

The fix was surprisingly simple: cap the number of Celery worker processes at or below the number of available CPU cores. I did this by adding the -c (short for --concurrency) option to the worker command in the Celery systemd service file.

Here’s the snippet from the updated service file:

[Service]
# ... other configurations ...
ExecStart=/home/public_html/django/folder/bin/celery -A project worker --loglevel=info -c 1
# ...

By setting -c 1, I instructed Celery to run only one worker process, so OCR tasks are handled one at a time instead of competing with each other for the CPU. On the 4-core machine, this avoids the excessive context switching. I found that it significantly improved the processing time for concurrent uploads and kept the CPU utilization at a much more manageable level.
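The concurrency limit can also be set in configuration through Celery's worker_concurrency setting. The fragment below is a hedged sketch for a Django settings module. It assumes the project loads Celery settings with app.config_from_object("django.conf:settings", namespace="CELERY"), which is why the key carries the CELERY_ prefix. pick_concurrency and the choice to reserve one core are hypothetical starting points, not measured recommendations.

```python
import os


def pick_concurrency(cores, reserved=1):
    """Suggest a worker count: all cores minus `reserved`, never below 1.

    `reserved` is an assumption: cores left for the web server and the OS.
    """
    if cores is None:  # os.cpu_count() can return None
        return 1
    return max(1, cores - reserved)


# Read by Celery as `worker_concurrency` when the CELERY namespace is used.
CELERY_WORKER_CONCURRENCY = pick_concurrency(os.cpu_count())
```

Whatever number this produces is only a first guess. The value that actually works best has to be found by measuring task times under realistic load, and it may well be lower.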

Key Takeaways

  • Identify your task type: Understand whether your Celery tasks are CPU-bound or I/O-bound.
  • CPU-bound tasks benefit from concurrency close to the number of CPU cores: More workers than cores can lead to performance degradation due to context switching overhead.
  • Use the -c or --concurrency option: Control the number of Celery worker processes to optimize for your specific workload and server resources.
  • Monitor your system: Keep an eye on CPU utilization and task processing times to fine-tune your Celery worker configuration.

In conclusion, while the idea of more workers might seem intuitively better for handling more concurrent tasks, it’s crucial to consider the nature of those tasks. For CPU-bound operations like OCR, aligning the number of workers with your CPU core count is often the key to achieving optimal performance and preventing your server from being overwhelmed.