[Translation] Linux network application performance. Introduction

[Translation] Linux network application performance. Introduction

Web applications are now widely used, and among all transport protocols, the lion's share is HTTP. Studying the nuances of developing web applications, the majority devotes very little attention to the operating system, where these applications actually run. Separation of development (Dev) and operation (Ops) only worsened the situation. But with the spread of DevOps culture, developers are beginning to take responsibility for running their applications in the cloud, so it’s very useful for them to thoroughly get acquainted with the operating system backend. This is especially useful if you are trying to deploy a system for thousands or tens of thousands of simultaneous connections.

The limitations in web services are very similar to those in other applications. Whether they are load balancers or database servers, all of these applications have similar problems in a high-performance environment. Understanding these fundamental limitations and ways to overcome them in general will allow you to evaluate the performance and scalability of your web applications.

I am writing this series of articles in response to questions from young developers who want to become well-informed system architects. It is impossible to clearly understand the methods of optimizing Linux applications, not immersed in the basics, how they work at the operating system level. Although there are many types of applications, in this cycle I want to explore network applications, rather than desktop ones, such as a browser or text editor. This material is intended for developers and architects who want to understand how Linux or Unix programs work and how to structure them for high performance.

Linux is the server operating system, and most often your applications run on this OS. Although I say “Linux”, most of the time you can safely assume that all Unix-like operating systems are meant as a whole. However, I have not tested the accompanying code on other systems. So, if you are interested in FreeBSD or OpenBSD, the result may differ. When I try something Linux-specific, I point it out.

Although you can use this knowledge to create an application from scratch, and it will be perfectly optimized, but it is better not to do so. If you write a new C or C ++ web server for your organization’s business application, it may be your last day at work. However, knowledge of the structure of these applications will help in the selection of existing programs. You will be able to compare systems based on processes with systems based on threads as well as based on events. You will understand and appreciate why Nginx works better than Apache httpd, why a Tornado-based Python application can serve more users than a Django-based Python application.

ZeroHTTPd: training tool

ZeroHTTPd is a web server that I wrote from scratch in C as an educational tool. He has no external dependencies, including access to Redis. We run our own Redis routines. See below for details.

Although we could discuss the theory for a long time, there is nothing better than writing code, running it and comparing all the server architectures with each other. This is the most visual method. Therefore, we will write a simple ZeroHTTPd web server, applying each model: based on processes, threads and events. Let's check each of these servers and see how they work compared to each other. ZeroHTTPd is implemented in a single C file. The event-based server includes uthash , an excellent implementation of the hash table that comes in one header file. In other cases, there are no dependencies, so as not to complicate the project.

There are a lot of comments in the code to help you figure it out.Being a simple web server in a few lines of code, ZeroHTTPd is also a minimal framework for web development. It has limited functionality, but it is capable of generating static files and very simple “dynamic” pages. I must say that ZeroHTTPd is well suited for learning how to create high-performance Linux applications. By and large, most web services are waiting for requests, check them and process them. This is exactly what ZeroHTTPd will do. This is a tool for learning, not for production. He is not good at error handling and is unlikely to boast the best security practices (oh yeah, I used strcpy ) or clever tricks of C. But I hope he does his job well.

ZeroHTTPd home page. It can produce different types of files, including images

Guestbook application

Modern web applications are usually not limited to static files. They have complex interactions with various databases, caches, etc. Therefore, we will create a simple web application called Guestbook, where visitors leave entries under their own names. In the guest book saved entries left earlier. There is also a visitor counter at the bottom of the page.

Guest Book Web Application ZeroHTTPd

Visitor counters and guestbook entries are stored in Redis. Own procedures are implemented for communications with Redis, they do not depend on an external library. I'm not a big fan of rolling out homebrew code when there are generally available and well-tested solutions. But the goal of ZeroHTTPd is to study Linux performance and access to external services, while serving HTTP requests has a serious impact on performance. We must fully control the communication with Redis in each of our server architectures. In one architecture, we use blocking calls, in others, event-based procedures. Using the Redis external client library will not give such control. In addition, our little Redis client performs only a few functions (getting, setting and increasing the key; getting and adding to the array). In addition, the Redis protocol is extremely elegant and simple. He doesn’t even need to teach him. The fact that the protocol does all the work in about a hundred lines of code indicates how well-thought it is.

The following figure shows the application's actions when a client (browser) requests /guestbookURL .

The mechanism of the guest book application

When you need to issue a guestbook page, there is one call to the file system to read the template into memory and three network calls to Redis. The template file contains most of the HTML content for the page in the screenshot above. There are also special placeholders for the dynamic part of the content: records and visitor counters. We get them from Redis, paste them into the page and give the client the fully formed content. The third Redis call can be avoided because Redis returns the new key value when incremented. However, for our server with an asynchronous, event-based architecture, numerous network calls are a good test for training purposes. Thus, we discard the return value of Redis about the number of visitors and request it with a separate call.

ZeroHTTPd Server Architectures

We are building seven versions of ZeroHTTPd with the same functionality, but different architectures:

  • Iterative
  • Fork server (one child per request)
  • Pre-fork server (pre-forking process)
  • Server with execution threads (one thread per request)
  • Server with pre-threading
  • Architecture based on poll ()
  • Architecture based on epoll

We measure the performance of each architecture by loading the server with HTTP requests. But when comparing architectures with a high degree of parallelism, the number of queries increases. We test three times and consider the average.

Testing Methodology

ZeroHTTPd load test setup

It is important that when performing tests all components do not work on the same machine. In this case, the OS incurs additional planning overhead, as the components compete for the CPU. Measuring the operating system overhead for each of the selected server architectures is one of the most important goals of this exercise. Adding more variables will be detrimental to the process. Therefore, the setting in the picture above works best.

What each of these servers do

  • load.unixism.net: here we run ab , the Apache Benchmark utility. It generates the workload required to test our server architectures.
  • nginx.unixism.net: sometimes we want to run more than one instance of the server program. For this, the Nginx server with the appropriate settings works as a load balancer from ab to our server processes.
  • zerohttpd.unixism.net: here we run our server programs on seven different architectures, one at a time.
  • redis.unixism.net: the Redis daemon is running on this server, where guestbook entries and a visitor counter are stored.

All servers run on the same processor core. The idea is to evaluate the maximum performance of each of the architectures. Since all server programs are tested on the same hardware, this is the basic level for comparing them. My test setup consists of virtual servers rented from Digital Ocean.

What do we measure?

You can measure different indicators. We evaluate the performance of each architecture in this configuration by loading servers with requests at different levels of parallelism: the load grows from 20 to 15,000 simultaneous users.

Test Results

The following diagram shows the performance of servers on different architectures with different levels of parallelism. On the y axis - the number of requests per second, on the x axis - parallel connections.

Below is a table with the results.

requests per second
parallelism iterative fork pre-fork streaming Pre-streaming poll epoll
20 7 112 2100 1800 2250 1900 2050
50 7 190 2200 1700 2200 2000 2000
100 7 245 2200 1700 2200 2150 2100
200 7 330 2300 1750 2300 2200 2100
300 - 380 2200 1800 2400 2250 2150
400 - 410 2200 1750 2600 2000 2000
500 - 440 2300 1850 2700 1900 2212
600 - 460 2400 1800 2500 1700 2519
700 - 460 2400 1600 2490 1550 2607
800 - 460 2400 1600 2540 1400 2553
900 - 460 2300 1600 2472 1200 2567
1000 - 475 2300 1700 2485 1150 2439
1500 - 490 2400 1550 2620 900 2479
2000 - 350 2400 1400 2396 550 2200
2500 - 280 2100 1300 2453 490 2262
3000 - 280 1900 1250 2502 large scatter 2138
5000 - large scatter 1600 1100 2519 - 2235
8000 - - 1200 large scatter 2451 - 2100
10,000 - - large scatter - 2200 - 2200
11,000 - - - - 2200 - 2122
12,000 - - - - 970 - 1958
13,000 - - - - 730 - 1897
14,000 - - - - 590 - 1466
15,000 - - - - 532 - 1281

From the graph and the table it is clear that above 8000 simultaneous requests, we have only two players left: pre-fork and epoll. As the load grows, a server based on poll works worse than a streaming one. The pre-threading architecture is worthy of an epoll: this is evidence of how well the Linux kernel plans a large number of threads.

ZeroHTTPd Source Code

The ZeroHTTPd source code is here . There is a separate directory for each architecture.

 01── 01_iterative
 │ ├── main.c
 02── 02_forking
 │ ├── main.c
 03── 03_preforking
 │ ├── main.c
 04── 04_threading
 │ ├── main.c
 05── 05_prethreading
 │ ├── main.c
 06── 06_poll
 │ ├── main.c
 07── 07_epoll
 │ └── main.c
 Make── Makefile
 │ ├── index.html
 │ └── tux.png
 Templates── templates
  Guest── guestbook
  Index── index.html 

In addition to the seven directories for all architectures, there are two more in the top-level directory: public and templates. The first one contains the index.html file and the image from the first screenshot. You can put other files and folders there, and ZeroHTTPd should easily issue these static files. If the path in the browser corresponds to the path in the public folder, then ZeroHTTPd searches the index.html file in this directory. Content for the guest book is generated dynamically. It has only the main page, and its content is based on the 'templates/guestbook/index.html' file. Dynamic pages for extensions are easily added to ZeroHTTPd. The idea is that users can add templates to this directory and extend ZeroHTTPd as needed.

To build all seven servers, run make all from the top level directory - and all builds will appear in this directory. Executable files look for public and templates directories in the directory from which they are launched.

Linux API

To understand the information in this series of articles, it is not necessary to understand the Linux API well. However, I recommend reading more on this topic, there are many reference resources on the Web. Although we will cover several categories of the Linux API, our focus will be mainly on processes, threads, events, and the network stack. In addition to books and articles about the Linux API, I also recommend reading mana for system calls and library functions used.

Performance and Scalability

One note about performance and scalability. Theoretically, there is no connection between them. You may have a web service that works very well, with a response time of a few milliseconds, but it does not scale at all. Similarly, there may be a poorly functioning web application that takes a few seconds to respond, but it scales to tens to handle tens of thousands of simultaneous users. However, the combination of high performance and scalability is a very powerful combination. High-performance applications generally save resources and, therefore, effectively serve more concurrent users on the server, reducing costs.

CPU and I/O tasks

Finally, in calculations there are always two possible types of tasks: for I/O and CPU. Getting requests via the Internet (network I/O), file serving (network and disk I/O), communication with the database (network and disk I/O) are all I/O actions. Some queries to the database may slightly load the CPU (sorting, calculating the average value of a million results, etc.). Most web applications are limited to the maximum possible I/O, and the processor is rarely used at full capacity. When you see that a lot of CPUs are used in some I/O task, this is most likely a sign of a poor application architecture. This may mean that CPU resources are spent on managing processes and context switching - and this is not entirely useful. If you do something like image processing, audio file conversion or machine learning, then the application requires powerful CPU resources. But for most applications this is not the case.

More about server architectures

  1. Part I. Iterative architecture
  2. Part II. Fork servers
  3. Part III. Pre-fork servers
  4. Part IV. Runtime Servers
  5. Part V. Servers with preliminary thread creation
  6. Part VI. Poll-based architecture
  7. Part VII. Epoll-based architecture

Source text: [Translation] Linux network application performance. Introduction