Serving a Model with Flask, uWSGI, and Nginx
This tutorial will show how to serve a model using Flask, uWSGI, and Nginx. Additionally, the model will use PyTorch, but it is simple to modify this to work with any ML framework (or for serving something different from an ML model altogether).
Overview
This tutorial will cover:
- Creating a simple Flask app that will perform inference
- Creating a uWSGI server to serve the Flask app
- Using Nginx as a reverse proxy to forward requests to the uWSGI server
- Repeating the above in a containerized (Docker) environment
In this tutorial we'll be using an image segmentation model, so we'll set up a simple client to send requests with image data. The basic architecture is shown below in Figure 1.
The client sends a request containing image data. Nginx forwards this to the uWSGI server. The uWSGI server then processes the request on one of the instances of the Flask app that is running. To increase throughput, uWSGI will run multiple instances of the Flask app in order to handle requests in parallel (assuming the host machine has multiple cores).
To further increase throughput, uWSGI servers could be running on multiple machines and Nginx would forward requests to them (in a round-robin manner by default). We will spin up multiple uWSGI servers on a single machine to illustrate this.
1. Creating a simple inference Flask app
1a. Setting up a simple server and client
We will set up an endpoint (localhost/infer) which will take a post request containing our data payload, do something with that payload, and then return the transformed data. To start with, let's take in a simple message, modify the message, and return the modified message to make sure the endpoint and client are interacting correctly.
inference.py
```python
from flask import Flask, request

app = Flask(__name__)

@app.route('/infer', methods=['POST'])
def infer():
    data_bytes = request.data
    data_string = data_bytes.decode('utf-8')
    data = ' '.join([data_string, "Bye!"])
    return data

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```
request.data gets the data from the POST request in bytes. We then convert it to a string, add some additional content to the string, and then return it.
The client, which simply sends a message and then prints the modified response, is shown below:
client.py
```python
import requests

def infer(payload):
    url = 'http://localhost:5000/infer'
    response = requests.post(url, data=payload)
    response.raise_for_status()
    return response

data = b'Hi there!'
print(infer(data).text)
```
First run inference.py. Then run client.py, which will print: Hi there! Bye!
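If you want to sanity-check the endpoint logic without starting a server at all, Flask's built-in test client can exercise the route directly. A minimal sketch (assuming Flask is installed; this duplicates the app from inference.py purely for illustration):

```python
from flask import Flask, request

app = Flask(__name__)

@app.route('/infer', methods=['POST'])
def infer():
    # Same transformation as inference.py: decode the bytes and append "Bye!"
    data_string = request.data.decode('utf-8')
    return ' '.join([data_string, "Bye!"])

# The test client exercises the route without binding a port
client = app.test_client()
response = client.post('/infer', data=b'Hi there!')
print(response.get_data(as_text=True))  # Hi there! Bye!
```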
1b. Modifying the client and server for inference (optional)
In the rest of this section we will modify the server and client above to perform inference on a pretrained PyTorch model when an image is sent in the payload. However, this is not necessary to understanding the rest of this tutorial. In both cases (model inference vs the simple message modification above) we send data, do something with the data, and return a response which is a function of the data. Feel free to skip ahead to part 2.
Inference will be performed using a pretrained pixel segmentation model. The following modifications are needed in inference.py:
- Load the model
- Convert the bytes to a PIL image
- Perform the following transforms: Resize to 512x512, convert to a tensor, and normalize the values based on what the model was trained on
- Pass the tensor through the model (i.e. the inference step)
- Get the max class for each pixel, convert the tensor to a list and send it to the client
inference.py
```python
from flask import Flask, request, jsonify
import torch
from torchvision import transforms
from PIL import Image
import io
import torchvision.models.segmentation as seg

app = Flask(__name__)

@app.route('/infer', methods=['POST'])
def infer():
    data_bytes = request.data
    img = Image.open(io.BytesIO(data_bytes))
    # resize, transform the image to a tensor, and normalize it. The mean and standard
    # deviation are those of the data the model was trained on.
    transform = transforms.Compose([
        transforms.Resize((512, 512)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    # set device based on gpu availability
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # get pretrained model weights
    weights = 'LRASPP_MobileNet_V3_Large_Weights.DEFAULT'
    # load model with weights and put it on device
    model = seg.lraspp_mobilenet_v3_large(weights=weights).to(device)
    # put model in evaluation mode since we're doing inference
    model.eval()
    # The model expects a 4D input (first dimension is batch size), so we unsqueeze
    img_tensor = transform(img).unsqueeze(0).to(device)
    # output is an ordered dictionary with one key, 'out', which maps to a tensor of
    # shape (batch size, number of classes, height, width). Squeeze to remove the
    # batch dimension. no_grad skips gradient tracking since we're only inferring.
    with torch.no_grad():
        predictions = model(img_tensor)['out'].squeeze()
    # Get the highest predicted class for each pixel
    classes_tensor = torch.argmax(predictions, dim=0)
    # Convert it to a list, which is serializable, then return the serialized object
    classes_list = classes_tensor.tolist()
    return jsonify(classes_list)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```
Note that we didn't actually need to call jsonify(classes_list), as Flask will automatically jsonify a returned list. But calling it makes what's going on more explicit.
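The reason we convert the tensor to a nested list is that plain lists of ints are JSON-serializable, while tensors aren't. A quick sketch with a toy 2x3 "mask" of class ids standing in for the real 512x512 output:

```python
import json

# Toy stand-in for the per-pixel class output
classes_list = [[0, 1, 1], [0, 0, 15]]

# Roughly what jsonify does: serialize the list to a JSON string
payload = json.dumps(classes_list)

# The client recovers the identical nested list with response.json()
assert json.loads(payload) == classes_list
```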
Now we will modify the client with the following steps:
- Open an image and send the POST request
- Overlay the returned segmentation mask on the original image and display it
client.py
```python
import requests
import numpy as np
from PIL import Image
import io

def infer(payload):
    url = 'http://localhost:5000/infer'
    response = requests.post(url, data=payload)
    response.raise_for_status()
    return response

with open('input_img.jpg', 'rb') as f:
    data = f.read()

response = infer(data)
# The returned list has the height and width of the (resized) input image, and each
# value in this list is an integer corresponding to the class the pixel was segmented into
segmentation_list = response.json()
segmentation_array = np.array(segmentation_list)
# Make an empty array to colorize based on segmentation class
color_array = np.zeros((*segmentation_array.shape, 3), dtype=np.uint8)
# Get the pixel segmentation classes present in the mask
class_values = np.unique(segmentation_array)
# Create random colors that will correspond to each segmentation class
colors = np.random.randint(0, 256, (len(class_values), 3), dtype=np.uint8)
# Fill the empty array with the color corresponding to its class
for i, c in enumerate(class_values):
    color_array[segmentation_array == c] = colors[i]
# Overlay the segmentation visual on the original image and show it
seg_img = Image.fromarray(color_array).convert('RGBA')
seg_img.putalpha(180)
# Match the original image to the mask size so the two can be composited
orig_img = Image.open(io.BytesIO(data)).convert('RGBA').resize(seg_img.size)
blended_img = Image.alpha_composite(orig_img, seg_img)
blended_img.show()
```
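As a sanity check of the compositing step in isolation, alpha_composite blends the overlay onto the base according to the overlay's alpha channel. A toy 1x1-pixel sketch (assuming Pillow is installed):

```python
from PIL import Image

# A red base pixel and a fully opaque blue overlay pixel
base = Image.new('RGBA', (1, 1), (255, 0, 0, 255))
overlay = Image.new('RGBA', (1, 1), (0, 0, 255, 255))

# With a fully opaque overlay, the result is exactly the overlay color;
# putalpha(180) in client.py makes the mask semi-transparent instead,
# so some of the original image shows through
blended = Image.alpha_composite(base, overlay)
print(blended.getpixel((0, 0)))  # (0, 0, 255, 255)
```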
Running inference.py and then sending a request from client.py with an image of a dog yields the following result:
2. Creating a uWSGI server
Now, rather than using the Flask development server, which is not suitable for production for a variety of reasons, we will run the inference.py app on a uWSGI server, which is suitable for production.
2a. Confirming uWSGI installation
The first step is to install uWSGI:
pip install uwsgi
Confirm installation with one of the following:
uwsgi --version
If uwsgi is not recognized, you can try reinstalling with sudo:
sudo pip install uwsgi
2b. Running the server
Make sure you are in the directory where inference.py is located and run the following:
uwsgi --http 0.0.0.0:5000 --wsgi-file inference.py --callable app
This runs the uwsgi command with the http, wsgi-file, and callable flags (aka options), providing the socket, filename, and app entry point respectively.
Now that the server has started, run the client script (i.e. client.py) to send a request and have it handled.
2c. Configuring the uWSGI .ini file
It's typically easier to specify the options in a configuration file instead of entering them on the command line. uWSGI supports many different configuration file formats (JSON, XML, YAML, and INI); .ini files are the most common.
The .ini configuration file is used to configure (shocking!) all aspects of the uWSGI server, such as which app to run, which socket to use, the number of processes, etc. The following provides a configuration identical to how we just ran the server above:
inferenceapp.ini
```ini
[uwsgi]
http = 0.0.0.0:5000
wsgi-file = inference.py
callable = app
```
To run using this configuration file (which we've named inferenceapp.ini):
uwsgi inferenceapp.ini
This assumes inferenceapp.ini is in the same directory as inference.py. If not, we can add an additional option to the .ini file to change the directory:
```ini
[uwsgi]
http = 0.0.0.0:5000
wsgi-file = inference.py
callable = app
chdir = /home/foobar/myproject/
```
2d. Common uWSGI flags/options
We'll quickly cover a few of the many options that uWSGI provides. You can view the full list in the uWSGI documentation.
- processes - Spawns the specified number of workers/processes. This is particularly useful for taking advantage of multicore processors to handle requests in parallel.

  processes = 4

- master - uWSGI uses a master process that manages all the worker processes. It's useful for things like load balancing between worker processes, respawning worker processes that die, gracefully reloading or restarting the application, and monitoring/logging.

  master = true

- lazy-apps - Controls when the application is loaded into memory. By default, uWSGI uses a pre-forking model, loading the app before creating worker processes, which can cause issues with apps that establish resources like database connections at startup. When lazy-apps is set to true, uWSGI uses a post-forking model, loading a separate copy of the app in each worker process after the workers are created, which avoids resource sharing issues but increases memory usage.
  As an example, if we loaded the model in the main Flask app rather than inside the endpoint, the model would be a shared resource, and unless we set lazy-apps to true, contention for that resource could cause issues. However, if there's no global sharing, it's recommended to leave lazy-apps set to false.

  lazy-apps = true

- threads - Enables multithreading by specifying the number of threads per worker process. This is particularly useful for I/O-bound applications, as it allows for concurrency within each worker process. It's not useful for CPU-bound applications, since Python's Global Interpreter Lock (GIL) prevents native threads from executing Python code simultaneously in the same process. For CPU-intensive code on multiple cores you must use multiple processes rather than threads.

  threads = 2

- enable-threads - Enables Python's native threading support. By default, uWSGI runs each worker in single-threaded mode and disables Python's internal threading capabilities for performance reasons. When you set enable-threads = true, uWSGI leaves Python's threading capabilities enabled, which means you can use Python's threading or concurrent.futures modules in your application. enable-threads is different than threads = n: the former enables Python's built-in threading capabilities, while the latter has uWSGI itself handle requests concurrently with multiple threads per worker.

  enable-threads = true

- http, socket, and http-socket - Control how uWSGI communicates with other services, such as a web server or web browser. In all three cases, the argument to the flag is the address and port where uWSGI should listen for connections (e.g. 127.0.0.1:8000 or /tmp/myapp.sock).
  - http - Makes uWSGI act as a full HTTP server. When you use the http option, uWSGI can directly accept HTTP requests from a client (like a web browser) and return HTTP responses. This is because http spawns an additional process that forwards requests to a series of workers. Use this if you plan to expose uWSGI directly to the public.

    http = 127.0.0.1:8000

  - socket - Useful for deployments where uWSGI sits behind a reverse proxy like Nginx. This option makes uWSGI use the uwsgi protocol, which is more efficient than HTTP. uWSGI can communicate with other services that understand the uwsgi protocol, such as Nginx (or Apache with the appropriate module).
    For a unix (or local) socket:

    socket = server1.sock

    or for TCP sockets:

    socket = 127.0.0.1:8000

  - http-socket - Similar to socket, except it uses the HTTP protocol instead of the uwsgi protocol. That is, it sets the workers themselves to natively speak HTTP. Like the socket option, this would be used behind a fully-capable webserver like Nginx or Apache. This is different than the http flag, which spawns a proxy by itself.

    http-socket = 127.0.0.1:8000
- harakiri - Destroys processes that are stuck for more than the specified number of seconds. This monitor is managed by the master process. It's useful to keep workers from getting stuck handling long requests and hence being unable to accept new ones.

  harakiri = 20
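Putting several of these options together, a hypothetical configuration for our app might look like the following (the specific values here are illustrative, not recommendations):

```ini
[uwsgi]
http = 0.0.0.0:5000
wsgi-file = inference.py
callable = app
master = true
processes = 4
threads = 2
enable-threads = true
lazy-apps = true
harakiri = 20
```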
3. Setting up and running Nginx
Now we'll set up Nginx to work as a reverse proxy which forwards all incoming requests to the uWSGI server to be handled.
3a. Installing Nginx
To install Nginx for Debian/Ubuntu run the following (for other installation options view the official documentation):
sudo apt update
sudo apt install nginx
To confirm that it was successfully installed check the version:
nginx -v
3b. Configuring Nginx
When configuring a site with Nginx, you usually create a configuration file for each site in the /etc/nginx/sites-available directory. The filename itself doesn't technically matter and can be anything, but it's common practice to name it after the domain name of the site for clarity. In the /etc/nginx/sites-available directory create a file called inference.conf with the following contents:
inference.conf
```nginx
server {
    listen 5000;
    server_name 0.0.0.0;

    location / {
        include uwsgi_params;
        uwsgi_pass unix:/home/foobar/myproject/inference.sock;
    }
}
```
After creating this file in the /etc/nginx/sites-available directory, you then create a symbolic link to it from the /etc/nginx/sites-enabled directory. The following command creates the symbolic link:
sudo ln -s /etc/nginx/sites-available/inference.conf /etc/nginx/sites-enabled
Nginx is configured (usually through its main configuration file at /etc/nginx/nginx.conf) to load all configuration files in the /etc/nginx/sites-enabled directory. That's why it knows to use the inference.conf configuration file - because there's a symbolic link to it in the sites-enabled directory.
This setup allows you to easily enable and disable sites by creating and removing symbolic links in the sites-enabled directory, without having to touch the actual configuration files in the sites-available directory. To disable a site, you would simply remove the symbolic link from sites-enabled, and to re-enable it, you would recreate the link.
Since we are connecting Nginx to the uWSGI server using the uwsgi protocol, we need to slightly change the uWSGI .ini configuration file. Replace http = 0.0.0.0:5000 with socket = inference.sock.
inference.ini, located in the same directory as inference.py, should now look like this:
inference.ini
```ini
[uwsgi]
socket = inference.sock
wsgi-file = inference.py
callable = app
```
Note that the socket names and directories need to match in the .ini and .conf files. In our example the socket is named inference.sock and is located in the same directory as the inference.py app.
3c. Running Nginx
Now that the configuration file is set up, we can start the Nginx server. But first, run the following handy command, which checks the Nginx config files for syntax errors:
sudo nginx -t
Now that we have confirmed the config file is free of errors, we can start the Nginx server:
sudo service nginx start
Nginx is listening at 0.0.0.0:5000, as specified in our inference.conf file. The following are some useful nginx commands:
service nginx status # displays the status of Nginx server
sudo service nginx stop # stops the Nginx server
sudo service nginx restart # restarts the Nginx server
Next, spin up the uWSGI server as before:
uwsgi inference.ini
Now we can access our endpoint by running the exact same client we created previously. Since the client is already sending requests to 0.0.0.0:5000, no changes are needed. A request first goes to Nginx, which forwards it to the uWSGI server; the uWSGI server performs the task and sends the response back to Nginx, which forwards it back to the client.
3d. Working with multiple uWSGI servers
Although it's possible to handle multiple requests in parallel by running multiple processes on a machine with multiple cores, this still limits the throughput to the compute capabilities of a single machine. Fortunately, Nginx makes it easy to forward requests to multiple application servers.
The following example sets up Nginx to forward requests to two different uWSGI app servers: a local one using unix sockets and another (which could be local or on a different machine) using TCP sockets. Here is the Nginx config:
inference.conf
```nginx
upstream uwsgi_cluster {
    server unix:/home/foobar/myproject/inference.sock;
    server 0.0.0.0:8080;
}

server {
    listen 5000;
    server_name 0.0.0.0;

    location / {
        include uwsgi_params;
        uwsgi_pass uwsgi_cluster;
    }
}
```
We now need two separate uWSGI .ini config files, one for each server:
inference1.ini
```ini
[uwsgi]
socket = inference.sock
wsgi-file = inference.py
callable = app
```
inference2.ini
```ini
[uwsgi]
socket = 0.0.0.0:8080
wsgi-file = inference.py
callable = app
```
Nginx will now distribute requests between these two servers. By default, a round-robin algorithm is used to distribute requests among servers.
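Nginx's default balancing can be pictured as cycling through the upstream list. This toy sketch mimics how four successive requests would be assigned (it's an illustration of round-robin, not Nginx's actual implementation):

```python
from itertools import cycle

# The two upstreams from inference.conf above
servers = ['unix:/home/foobar/myproject/inference.sock', '0.0.0.0:8080']

# Round-robin: each incoming request is assigned to the next server in the cycle
backend = cycle(servers)
assignments = [next(backend) for _ in range(4)]
print(assignments)  # alternates between the two servers
```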
4. Modifying the above steps for Docker containers
For this part we will run Nginx in its own container, and uWSGI with our app in another container. We will then be able to easily spin up multiple additional uWSGI/app containers. The steps will be as follows:
- Set up a simple Nginx container. Then check that it will run and that we can access the endpoint
- Set up a uWSGI container with our python app. Then check that it will run and that we can access the endpoint
- Change the Nginx and uWSGI configurations so they communicate via TCP and unix domain sockets
4a. Set up a simple Nginx container
First we need to download the Nginx image from Docker Hub:
docker pull nginx
Now we can run a container using this image:
docker run --name nginx-server -p 5000:80 -d nginx
This runs the docker run command with three options: --name, which tells Docker to give the container a specific name rather than a random ID; -p, which tells Docker to forward traffic from port 5000 on the host machine to the container's port 80; and -d, which tells Docker to start the container in the background, running like a Linux daemon or Windows service (d for detached). Lastly, we supply the name of the image to run (nginx).
We can now go to localhost:5000 in our browser and see the default response from the Nginx server ("Welcome to nginx!").
4b. Set up a uWSGI container with python app
First we need to download a python image. We'll specify a specific version, 3.9:
docker pull python:3.9
Now we need to create a new image based on this python base image. We want to:
- Start from the base python 3.9 image
- Install the libraries our python app depends on
- Install uWSGI
To accomplish these tasks we'll use a dockerfile with the following instructions:
Dockerfile-python
```dockerfile
# Use an official Python runtime as a parent image
FROM python:3.9

# Set the working directory in the container to /app
WORKDIR /app

# Copy the app code and its requirements into the container at /app
COPY requirements.txt /app
COPY inference.py /app

# Install the packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
```
requirements.txt is a file which specifies our python libraries. For example:
requirements.txt
```
uWSGI==2.0.21
Pillow==9.5.0
flask==2.2.3
torch==2.0.0
torchvision==0.15.1
```
We can now build the image:
docker build -f Dockerfile-python -t inference-python .
The -f option specifies the dockerfile to build from, the -t option specifies what we want to tag the built image as, and the last argument is the build context, which is essentially the set of files and directories sent to the Docker daemon to build the image. In this case it's simply the current directory.
Now we can spin up the container and test it:
docker run -p 8080:5000 inference-python uwsgi --http 0.0.0.0:5000 --wsgi-file inference.py --callable app
This runs the docker image, binding local port 8080 to container port 5000. It then spins up uWSGI as a full HTTP server listening on port 5000. We can now test it by running our client, sending requests to port 8080 on our local machine.
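The only client change needed is the port: requests now go to 8080 on the host instead of 5000. A small sketch of a parameterized client (make_url and its defaults are my own naming for illustration, not part of the tutorial code):

```python
import requests

def make_url(host='localhost', port=8080, path='infer'):
    # Port 8080 matches the -p 8080:5000 mapping in the docker run command
    return f'http://{host}:{port}/{path}'

def infer(payload, url=None):
    response = requests.post(url or make_url(), data=payload)
    response.raise_for_status()
    return response

if __name__ == '__main__':
    with open('input_img.jpg', 'rb') as f:
        print(infer(f.read()).status_code)
```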
4c. Link Nginx and uWSGI via TCP and unix domain sockets
We'll use our Nginx image as the base to create a new image that includes an additional config file. This config file will include the hostname of the container that will use TCP sockets, as well as the location of the .sock file for the other container, which will use unix domain sockets:
inference.conf
```nginx
upstream uwsgi_cluster {
    server unix:/socks/inference1.sock;
    server uwsgi-tcp:8001;
}

server {
    listen 5000;
    server_name 0.0.0.0;

    location / {
        include uwsgi_params;
        uwsgi_pass uwsgi_cluster;
    }
}
```
The only new things here are the directory of the unix domain socket (/socks/) and the hostname of the TCP socket (uwsgi-tcp). These are simply naming choices I made, and they will be explained in more detail shortly. But first we need to build the updated Nginx image. Create the following dockerfile, named Dockerfile-nginx:
Dockerfile-nginx
```dockerfile
FROM nginx
COPY inference.conf /etc/nginx/conf.d/
```
This creates a new image identical to the Nginx base image, but with our inference.conf file copied to /etc/nginx/conf.d/. Now build this new image, naming it inference-nginx:
docker build -f Dockerfile-nginx -t inference-nginx .
We now have our two necessary containers built. We just need to set up their communication. This can be handled through docker compose. Create the following docker compose file, docker-compose.yaml
docker-compose.yaml
```yaml
version: '3'

volumes:
  socket_volume:

services:
  nginx:
    image: inference-nginx
    volumes:
      - socket_volume:/socks
    ports:
      - "8080:5000"
    depends_on:
      - uwsgi-tcp
      - uwsgi-uds
  uwsgi-tcp:
    image: inference-python
    command: uwsgi --socket 0.0.0.0:8001 --wsgi-file inference.py --callable app
  uwsgi-uds:
    image: inference-python
    volumes:
      - socket_volume:/socks
    command: uwsgi --socket /socks/inference1.sock --chmod-socket=666 --wsgi-file inference.py --callable app
```
Let's go through this item by item:
- version - Specifies the version of the Docker Compose file format being used; we're using version 3.
- volumes - A volume is a docker-controlled directory for storing data. We are creating a volume named socket_volume, which will be a directory to hold the .sock file for the unix domain socket connection.
- services - This is where we list our containers. We're creating one called nginx, using the inference-nginx image we just created. And we're giving it access to the socket_volume we created earlier by mounting this volume in the /socks directory inside this nginx container. We're also mapping 8080 on the local host to 5000 on this container, and saying this container depends on the two uwsgi servers.
- uwsgi-tcp - This is the container that will host the TCP uWSGI server. We made the choice to call it uwsgi-tcp, and this is where this name came from in the nginx config file above. We specify a command to run at the start of the container, which is the standard uWSGI startup invocation we've already discussed.
- uwsgi-uds - This is the container for the unix domain socket. We mount the same volume, socket_volume, to this container, so that the .sock file (which we've chosen to call inference1.sock) that will reside in this volume can be used by both this uWSGI server and Nginx.
Now we can spin everything up and start serving our app, with Nginx choosing a round-robin approach alternating between each uWSGI server when serving client requests:
docker-compose up -d
We can access our app by sending requests to localhost:8080, since that's what we mapped to Nginx's port 5000. Note that you may need to send requests quickly in order to observe the round-robin behavior; if requests come in slowly enough, Nginx may appear to keep forwarding to the same uWSGI server.
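One way to generate a quick burst of traffic is to fire requests in parallel from the client. This sketch separates the concurrency plumbing from the actual HTTP call; send_many and send_one are hypothetical helpers of my own, not part of the tutorial code:

```python
import concurrent.futures

def send_many(send_fn, payload, n=8, workers=4):
    # Fan the same payload out across a thread pool and collect results in order
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send_fn, [payload] * n))

def send_one(payload):
    # Imported lazily so send_many can be exercised without a running server
    import requests
    return requests.post('http://localhost:8080/infer', data=payload).status_code

if __name__ == '__main__':
    with open('input_img.jpg', 'rb') as f:
        print(send_many(send_one, f.read()))
```

Because send_many takes the sender function as an argument, you can swap in any callable, which also makes the fan-out logic easy to test in isolation.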
And that's it! Phew.