Better performance by optimizing Gunicorn config

Original link: https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7

Practical advice on how to configure Gunicorn.

TL;DR: For CPU-bound apps, increase workers and/or cores. For I/O-bound apps, use “pseudo-threads”.

Gunicorn is a Python WSGI HTTP Server that usually lives between a reverse proxy (e.g., Nginx) or load balancer (e.g., AWS ELB) and a web application such as Django or Flask.

Gunicorn architecture

Gunicorn implements a UNIX pre-fork web server.

Great, what does that mean?

Gunicorn starts a single master process that gets forked, and the resulting child processes are the workers.
The role of the master process is to make sure that the number of running workers matches the number defined in the settings. If any worker dies, the master process starts another one by forking itself again.
The role of the workers is to handle HTTP requests.
The pre in pre-forked means that the master process creates the workers before handling any HTTP request.
The OS kernel handles load balancing between worker processes.
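
To make the pre-fork model concrete, here is a minimal sketch in plain Python. It is not Gunicorn’s implementation, only an illustration under simplifying assumptions: the names worker_loop, spawn_worker and prefork_demo are hypothetical, workers reply with their PID instead of speaking HTTP, and the demo plays both master and client in one process. It relies on os.fork, so it assumes a UNIX-like system.

```python
import os
import signal
import socket


def worker_loop(listener):
    """Worker: accept connections on the inherited socket and reply with our PID."""
    while True:
        # All workers block in accept() on the same inherited socket;
        # the kernel picks which one gets each connection.
        conn, _ = listener.accept()
        with conn:
            conn.sendall(str(os.getpid()).encode())


def spawn_worker(listener):
    """Fork one worker. The child never returns from this function."""
    pid = os.fork()
    if pid == 0:  # child process: become a worker
        try:
            worker_loop(listener)
        finally:
            os._exit(0)
    return pid  # parent (master) gets the worker's PID


def prefork_demo(num_workers=3, num_requests=20):
    # The master opens the listening socket once, before forking.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(64)
    address = listener.getsockname()

    # "Pre"-fork: every worker exists before the first request arrives.
    workers = {spawn_worker(listener) for _ in range(num_workers)}

    # Simulate a worker dying: the master reaps it and forks a replacement.
    victim = next(iter(workers))
    os.kill(victim, signal.SIGKILL)
    os.waitpid(victim, 0)
    workers.discard(victim)
    workers.add(spawn_worker(listener))

    # Play the client and record which worker answered each request.
    served_by = set()
    for _ in range(num_requests):
        with socket.create_connection(address) as client:
            served_by.add(int(client.recv(64).decode()))

    # Stop the workers. SIGKILL keeps the sketch short;
    # a real server would shut its workers down gracefully.
    for pid in workers:
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, 0)
    listener.close()
    return workers, served_by
```

Calling prefork_demo() returns the set of worker PIDs alive at the end and the set of PIDs that actually answered a request. Which workers get picked is up to the kernel, so the second set can vary between runs.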

To improve performance when using Gunicorn we have to keep in mind 3 means of concurrency.

1st means of concurrency (workers, aka UNIX processes)

Each of the workers is a UNIX process that loads the Python application. There is no shared memory between the workers.

The suggested number of workers is (2*CPU)+1.

For a dual-core (2 CPU) machine, 5 is the suggested workers value.

gunicorn --workers=5 main:app

Gunicorn with the default worker class (sync). Note the line in the startup log: “Using worker: sync”.
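
The same rule of thumb can be written in a Python config file instead of on the command line. This is a minimal sketch: the file name gunicorn.conf.py and the helper suggested_workers are assumptions made for illustration, and the result is only a starting point for benchmarking.

```python
# gunicorn.conf.py (hypothetical file name)
import multiprocessing


def suggested_workers(cpu_count):
    """Return the (2 * CPU) + 1 rule of thumb for the number of workers."""
    return 2 * cpu_count + 1


# Gunicorn reads the module-level name `workers` as the workers setting.
workers = suggested_workers(multiprocessing.cpu_count())
```

It could then be loaded with gunicorn --config gunicorn.conf.py main:app. On a dual-core machine this yields the same 5 workers as the command above.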

2nd means of concurrency (threads)

Gunicorn also allows for each of the workers to have multiple threads. In this case, the Python application is loaded once per worker, and each of the threads spawned by the same worker shares the same memory space.

To use threads with Gunicorn, we use the threads setting. Whenever threads is set to more than 1 with the default sync worker class, Gunicorn uses the gthread worker class instead:

gunicorn --workers=5 --threads=2 main:app

Gunicorn with the threads setting, which uses the gthread worker class. Note the line in the startup log: “Using worker: threads”.

The previous command is the same as:

gunicorn --workers=5 --threads=2 --worker-class=gthread main:app

The maximum number of concurrent requests is workers * threads, which is 10 in our case.

The suggested maximum number of concurrent requests when using workers and threads is still (2*CPU)+1.

So if we are using a quad-core (4 CPU) machine and we want to use a mix of workers and threads, we could use 3 workers and 3 threads, for a maximum of 9 concurrent requests.

gunicorn --workers=3 --threads=3 main:app
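
The arithmetic behind that choice can be sketched as a config file. The helper threads_for_budget and the name max_concurrent_requests are hypothetical, added only to show the calculation; worker_class, workers and threads are the actual Gunicorn setting names.

```python
# gunicorn.conf.py (hypothetical file name)


def threads_for_budget(cpu_count, workers):
    """Threads per worker that keep workers * threads within (2 * CPU) + 1.

    Always returns at least 1 thread, even if workers alone exceed the budget.
    """
    budget = 2 * cpu_count + 1
    return max(1, budget // workers)


worker_class = "gthread"
workers = 3
threads = threads_for_budget(cpu_count=4, workers=workers)  # 9 // 3 = 3

# Upper bound on requests handled at once: workers * threads.
max_concurrent_requests = workers * threads
```

With 4 CPUs and 3 workers this gives 3 threads per worker, matching the command above.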

3rd means of concurrency (“pseudo-threads”)

There are some Python libraries, such as gevent and asyncio, that enable concurrency in Python by using “pseudo-threads” implemented with coroutines.

Gunicorn allows for the usage of these asynchronous Python libraries by setting their corresponding worker class.

Here are the settings that would work for a single-core machine running gevent:

gunicorn --worker-class=gevent --worker-connections=1000 --workers=3 main:app

worker-connections is a setting for worker classes such as gevent (the default sync worker ignores it). It caps the number of simultaneous clients per worker.

(2*CPU)+1 is still the suggested number of workers. Since we only have 1 core, we’ll be using 3 workers.

In this case, the maximum number of concurrent requests is 3000 (3 workers * 1000 connections per worker).
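
To see why a single worker can hold so many connections, here is a small sketch of “pseudo-threads” using asyncio from the standard library. Note that it uses asyncio rather than gevent, and that fake_request, serve_many and run_demo are hypothetical names. The sleep stands in for waiting on I/O such as a database call.

```python
import asyncio
import threading
import time


async def fake_request(delay):
    """Stand-in for an I/O-bound request handler."""
    # While this coroutine waits, the event loop runs other coroutines.
    await asyncio.sleep(delay)
    return threading.get_ident()  # which OS thread handled this "request"


async def serve_many(count, delay):
    """Run `count` fake requests concurrently on one event loop."""
    return await asyncio.gather(*(fake_request(delay) for _ in range(count)))


def run_demo(count=1000, delay=0.1):
    start = time.perf_counter()
    thread_ids = asyncio.run(serve_many(count, delay))
    elapsed = time.perf_counter() - start
    # Returns: requests handled, distinct OS threads used, wall-clock seconds.
    return len(thread_ids), len(set(thread_ids)), elapsed
```

All 1000 fake requests run on a single OS thread, and the total time stays close to one delay instead of 1000 delays, because the waiting overlaps.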

Concurrency vs. Parallelism

Concurrency is when 2 or more tasks are in progress over the same period of time, which might mean that only 1 of them is being worked on at any instant while the others are paused.
Parallelism is when 2 or more tasks are executing at the same instant.

In Python, threads and pseudo-threads are a means of concurrency, but not parallelism; while workers are a means of both concurrency and parallelism.
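
A small experiment can show the difference for threads. The sketch below runs the same CPU-bound function sequentially and then in a thread pool; cpu_task, run_sequential and run_threaded are hypothetical names. On a standard CPython build with the GIL, we would expect the two timings to be similar, since the threads take turns rather than run in parallel, but the exact numbers depend on the machine and the Python build.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_task(n):
    """Pure-Python busy loop: CPU-bound work with no I/O waits."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_sequential(jobs, n):
    """Run the task `jobs` times, one after another."""
    start = time.perf_counter()
    results = [cpu_task(n) for _ in range(jobs)]
    return results, time.perf_counter() - start


def run_threaded(jobs, n):
    """Run the task `jobs` times, each in its own thread."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = list(pool.map(cpu_task, [n] * jobs))
    return results, time.perf_counter() - start
```

Comparing the elapsed times from run_sequential(4, 2_000_000) and run_threaded(4, 2_000_000) is one way to check this on your own hardware.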

That’s all good theory, but what should I use in my program?

Practical use cases

By tuning Gunicorn settings we want to optimize the application performance.

If the application is I/O-bound, the best performance usually comes from using “pseudo-threads” (gevent or asyncio). As we have seen, Gunicorn supports this programming paradigm by setting the appropriate worker class and adjusting the value of workers to (2*CPU)+1.
If the application is CPU-bound, it doesn’t matter how many concurrent requests are handled by the application. The only thing that matters is the number of parallel requests. Due to Python’s GIL, threads and “pseudo-threads” cannot run in parallel. The only way to achieve parallelism is to increase workers to the suggested (2*CPU)+1, understanding that the maximum number of parallel requests is the number of cores.
If there is a concern about the application’s memory footprint, using threads (and the corresponding gthread worker class) instead of additional workers yields better performance, because the application is loaded once per worker and every thread running on that worker shares its memory. This comes at the expense of some additional CPU consumption.
If you don’t know what you are doing, start with the simplest configuration, which is only setting workers to (2*CPU)+1, and don’t worry about threads. From that point, it’s all trial and error with benchmarking. If the bottleneck is memory, start introducing threads. If the bottleneck is I/O, consider a different Python programming paradigm. If the bottleneck is CPU, consider using more cores and adjusting the workers value.
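
The decision process above can be summarized as a small helper. This is only a sketch of the rules of thumb, not a substitute for benchmarking: suggest_settings is a hypothetical function, the 1000 connections simply mirror the earlier gevent example, and splitting the budget at its integer square root for the memory case is an arbitrary assumption.

```python
from math import isqrt


def suggest_settings(bottleneck, cpu_count):
    """Map a measured bottleneck ("cpu", "io", "memory" or None) to starting settings."""
    budget = 2 * cpu_count + 1  # the (2*CPU)+1 rule of thumb

    if bottleneck == "io":
        # Pseudo-threads: many cheap connections per worker.
        return {"worker_class": "gevent", "workers": budget, "worker_connections": 1000}

    if bottleneck == "memory":
        # Trade processes for threads while keeping workers * threads <= budget.
        workers = isqrt(budget)
        return {"worker_class": "gthread", "workers": workers, "threads": budget // workers}

    # CPU-bound or unknown: the simplest configuration, sync workers only.
    return {"worker_class": "sync", "workers": budget}
```

For example, suggest_settings("memory", 4) gives 3 workers with 3 threads each, the same mix used earlier for a quad-core machine. The dictionary keys match Gunicorn’s setting names, so the values can be copied into a config file.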

Building the system

We software developers commonly think that every performance bottleneck can be fixed by optimizing the application code, but this is not always true.

There are times in which tuning the settings of the HTTP server, using more resources or re-architecting the application to use a different programming paradigm are the solutions that we need to improve the overall application performance.

In this case, building the system means understanding the types of computing resources (processes, threads and “pseudo-threads”) that we have available to deploy a performant application.

By understanding, architecting and implementing the right technical solution with the right resources we avoid falling into the trap of trying to improve performance by optimizing application code.

References

Gunicorn is ported from Ruby’s Unicorn project. Its design outline helped clarify some of the most fundamental concepts. Gunicorn architecture cemented some of those concepts.

Opinionated blog post about why it is good that Unicorn defers some of its most critical features to Unix.

Stack Overflow answer about the pre-fork web server model.

Some more references to understand how to fine tune Gunicorn.