Practical advice on how to configure Gunicorn.
TL;DR: For CPU-bound apps, increase workers and/or cores. For I/O-bound apps, use “pseudo-threads”.
Gunicorn implements a UNIX pre-fork web server.
Great, what does that mean?
- Gunicorn starts a single master process that gets forked, and the resulting child processes are the workers.
- The role of the master process is to make sure that the number of running workers matches the number defined in the settings. If a worker dies, the master process starts another one by forking itself again.
- The role of the workers is to handle HTTP requests.
- The pre in pre-forked means that the master process creates the workers before handling any HTTP request.
- The OS kernel handles load balancing between worker processes.
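The pre-fork model described above can be sketched in a few lines of Python. This is a minimal illustration, not Gunicorn's actual implementation: the function names `start_prefork_workers` and `send_request` are made up for this example, each worker handles a single request and then exits, and the master does not respawn dead workers. It relies on `os.fork`, so it only runs on UNIX-like systems.

```python
import os
import socket


def start_prefork_workers(num_workers):
    """Bind a listening socket, then fork workers that all accept on it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(16)

    worker_pids = []
    for _ in range(num_workers):
        pid = os.fork()
        if pid == 0:
            # Worker process: it inherited the listening socket from the master.
            try:
                # The kernel decides which waiting worker gets each connection.
                conn, _addr = listener.accept()
                with conn:
                    conn.recv(1024)
                    body = f"handled by worker {os.getpid()}".encode()
                    conn.sendall(b"HTTP/1.0 200 OK\r\n\r\n" + body)
            finally:
                os._exit(0)  # never fall back into the master's code path
        worker_pids.append(pid)
    return listener, worker_pids


def send_request(port):
    """Tiny client used only to exercise the workers."""
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(b"GET / HTTP/1.0\r\n\r\n")
        chunks = []
        while True:
            chunk = client.recv(1024)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks).decode()


if __name__ == "__main__":
    listener, pids = start_prefork_workers(2)
    port = listener.getsockname()[1]
    for _ in pids:
        print(send_request(port))
    for pid in pids:
        os.waitpid(pid, 0)  # reap the workers once they have exited
    listener.close()
```

Note that the workers exist before the first request is sent, which is the “pre” in pre-fork, and that the master never touches a request itself.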
To improve performance when using Gunicorn we have to keep in mind 3 means of concurrency: workers, threads, and “pseudo-threads”.
Each of the workers is a UNIX process that loads the Python application. There is no shared memory between the workers.
The suggested number of workers is (2*CPU)+1.
For a dual-core (2 CPU) machine, 5 is the suggested workers value.
gunicorn --workers=5 main:app
Gunicorn with default worker class (sync). Note the 4th line in the image: “Using worker: sync”.
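The same value can be computed in a Gunicorn configuration file instead of being hard-coded on the command line. The sketch below assumes a file named gunicorn.conf.py passed with `gunicorn -c gunicorn.conf.py main:app`; the helper `suggested_workers` is a name made up for this example.

```python
# gunicorn.conf.py (assumed file name)
import multiprocessing


def suggested_workers(cpu_count):
    """Return the (2*CPU)+1 worker count suggested above."""
    return (2 * cpu_count) + 1


# Gunicorn reads the module-level name `workers` from the config file.
workers = suggested_workers(multiprocessing.cpu_count())
```

On a dual-core machine this evaluates to 5, matching the command above.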
Gunicorn also allows for each of the workers to have multiple threads. In this case, the Python application is loaded once per worker, and each of the threads spawned by the same worker shares the same memory space.
To use threads with Gunicorn, we use the threads setting. Every time that we use threads, the worker class is set to gthread:
gunicorn --workers=5 --threads=2 main:app
Gunicorn with threads setting, which uses the gthread worker class. Note the 4th line in the image: “Using worker: threads”.
The previous command is the same as:
gunicorn --workers=5 --threads=2 --worker-class=gthread main:app
The maximum number of concurrent requests is workers * threads, which is 10 in our case.
The suggested maximum number of concurrent requests when using workers and threads is still (2*CPU)+1.
So if we are using a quad-core (4 CPU) machine and we want to use a mix of workers and threads, we could use 3 workers and 3 threads, to get 9 maximum concurrent requests.
gunicorn --workers=3 --threads=3 main:app
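As a rough sketch, the same quad-core setup could be written as a Gunicorn config file, with the arithmetic spelled out. `workers`, `threads` and `worker_class` are Gunicorn settings; `CPU_CORES`, `max_concurrent_requests` and `suggested_ceiling` are helper names made up for this example.

```python
# Config sketch for the quad-core example (pass with: gunicorn -c <file> main:app)
CPU_CORES = 4  # assumption: the quad-core machine from the example

workers = 3
threads = 3
worker_class = "gthread"

# Each worker can serve `threads` requests at once.
max_concurrent_requests = workers * threads  # 3 * 3 = 9
suggested_ceiling = (2 * CPU_CORES) + 1      # (2 * 4) + 1 = 9
```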
There are some Python libraries, such as gevent and asyncio, that enable concurrency in Python by using “pseudo-threads” implemented with coroutines.
Gunicorn allows for the usage of these asynchronous Python libraries by setting their corresponding worker class.
Here are the settings that would work for a single-core machine that we want to run using gevent:
gunicorn --worker-class=gevent --worker-connections=1000 --workers=3 main:app
worker-connections is a specific setting for the gevent worker class.
(2*CPU)+1 is still the suggested number of workers. Since we only have 1 core, we’ll be using 3 workers.
In this case, the maximum number of concurrent requests is 3000 (3 workers * 1000 connections per worker)
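For reference, a config-file sketch of the gevent command above could look like this. It assumes the gevent package is installed alongside Gunicorn; `CPU_CORES` and `max_concurrent_requests` are helper names made up for this example, while the other three names are Gunicorn settings.

```python
# Config sketch for the single-core gevent example
CPU_CORES = 1  # assumption: the single-core machine from the example

workers = (2 * CPU_CORES) + 1   # 3
worker_class = "gevent"
worker_connections = 1000       # simultaneous connections per gevent worker

max_concurrent_requests = workers * worker_connections  # 3 * 1000 = 3000
```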
- Concurrency is when 2 or more tasks are in progress during the same period of time, which might mean that only 1 of them is being worked on at any given instant while the other ones are paused.
- Parallelism is when 2 or more tasks are executing at the same instant.
In Python, threads and pseudo-threads are a means of concurrency, but not parallelism; while workers are a means of both concurrency and parallelism.
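A small experiment can make the concurrency side of this concrete. The sketch below uses `time.sleep` as a stand-in for an I/O-bound request (an assumption made for illustration; real requests would wait on sockets or disks) and compares running 4 such requests one after another against running them on 4 threads. All function names are made up for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_request(delay):
    """Stand-in for an I/O-bound request: the thread just waits."""
    time.sleep(delay)
    return delay


def timed(fn):
    """Return how many seconds fn() took to run."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def run_sequential(n, delay):
    for _ in range(n):
        fake_io_request(delay)


def run_threaded(n, delay):
    # While one thread waits, the others are free to proceed.
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(fake_io_request, [delay] * n))


sequential_seconds = timed(lambda: run_sequential(4, 0.2))
threaded_seconds = timed(lambda: run_threaded(4, 0.2))
print(f"sequential: {sequential_seconds:.2f}s, threaded: {threaded_seconds:.2f}s")
```

The threaded run finishes in roughly the time of a single wait because the waits overlap. If `fake_io_request` did pure-Python computation instead of sleeping, the threads would take turns holding the GIL and the speedup would largely disappear, which is why CPU-bound work needs more workers rather than more threads.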
That’s all good theory, but what should I use in my program?
By tuning Gunicorn settings we want to optimize the application performance.
- If the application is I/O bound, the best performance usually comes from using “pseudo-threads” (gevent or asyncio). As we have seen, Gunicorn supports this programming paradigm by setting the appropriate worker class and adjusting the value of workers to (2*CPU)+1.
- If the application is CPU bound, it doesn’t matter how many concurrent requests are handled by the application. The only thing that matters is the number of parallel requests. Due to Python’s GIL, threads and “pseudo-threads” cannot run in parallel. The only way to achieve parallelism is to increase workers to the suggested (2*CPU)+1, understanding that the maximum number of parallel requests is the number of cores.
- If there is a concern about the application’s memory footprint, using threads (and the corresponding gthread worker class) in favor of additional workers yields better performance, because the application is loaded once per worker and every thread running in that worker shares the worker’s memory. This comes at the expense of some additional CPU consumption.
- If you don’t know what you are doing, start with the simplest configuration, which is only setting workers to (2*CPU)+1, and don’t worry about threads. From that point, it’s all trial and error with benchmarking. If the bottleneck is memory, start introducing threads. If the bottleneck is I/O, consider a different Python programming paradigm. If the bottleneck is CPU, consider using more cores and adjusting the workers value.
We software developers commonly think that every performance bottleneck can be fixed by optimizing the application code, but this is not always true.
There are times in which tuning the settings of the HTTP server, using more resources or re-architecting the application to use a different programming paradigm are the solutions that we need to improve the overall application performance.
In this case, building the system means understanding the types of computing resources (processes, threads and “pseudo-threads”) that we have available to deploy a performant application.
By understanding, architecting and implementing the right technical solution with the right resources we avoid falling into the trap of trying to improve performance by optimizing application code.
Opinionated blog post about why it is good that Unicorn defers some of its most critical features to Unix.
Stack Overflow answer about the pre-fork web server model.
Some more references to understand how to fine tune Gunicorn.