Hello! My name is Sergey Kostanbaev; at the Exchange I work on the core of the trading system.
When Hollywood films show the New York Stock Exchange, it always looks the same: crowds of people, everyone yelling, waving pieces of paper, total chaos. We have never had any of this at the Moscow Exchange, because trading has been electronic from the very beginning, based on two main platforms: Spectra (the derivatives market) and ASTS (the currency, stock, and money markets). Today I want to talk about the evolution of the architecture of the ASTS trading and clearing system, and about the various solutions and findings along the way. The story is long, so I have split it into two parts.
We are one of the few exchanges in the world that trades all types of assets and provides a full range of exchange services. Last year, for example, we ranked second in the world by bond trading volume, 25th among all stock exchanges, and 13th by capitalization among public exchanges.
For professional market participants, parameters such as response time, the stability of its distribution (jitter), and the reliability of the whole complex are critical. We currently handle tens of millions of transactions per day, and processing a single transaction in the system core takes tens of microseconds. Mobile operators on New Year's Eve or search engines certainly see higher raw load than we do, but few, it seems to me, can match us on load combined with the characteristics above. At the same time, it is important to us that the system never slows down even for a second, works absolutely stably, and treats all users equally.
A bit of history
In 1994, the Australian ASTS system was launched at the Moscow Interbank Currency Exchange (MICEX), and the Russian history of electronic trading can be counted from that moment. In 1998, the exchange architecture was upgraded to introduce online trading. Since then, the pace of new solutions and architectural changes across all systems and subsystems has only accelerated.
In those years, the exchange system ran on hi-end hardware: ultra-reliable HP Superdome 9000 servers (built on the PA-RISC architecture), in which absolutely everything was duplicated: the I/O subsystems, the network, the RAM (there was, in effect, a RAID array of RAM), and the processors (hot swapping was supported). Any server component could be replaced without stopping the machine. We relied on these machines and considered them virtually failure-proof. The operating system was HP-UX, a Unix-like system.
But around 2010, the phenomenon of high-frequency trading (HFT) emerged: put simply, exchange robots. In just 2.5 years, the load on our servers increased 140-fold.
Withstanding such load on the old architecture and hardware was impossible; we had to adapt.
Requests to the exchange system can be divided into two types:
- Transactions. If you want to buy dollars, stocks, or anything else, you send a transaction to the trading system and receive a response about whether it succeeded.
- Information requests. If you want to find out the current price, see the order book or the indices, you send information requests.
Schematically, the core of the system can be divided into three levels:
- The client level, where brokers and their clients work. They all interact with the access servers.
- The access servers (Gateways): caching servers that process all information requests locally. Want to know the price at which Sberbank shares are trading? The request goes to an access server.
- But if you want to buy shares, the request goes to the central server (Trade Engine). There is one such server per market; these servers play the key role, and it was for them that the whole system was created.
The core of the trading system is a clever in-memory database in which all transactions are exchange transactions. The database is written in C; its only external dependency is the libc library, and there is no dynamic memory allocation at all. To reduce processing time, the system starts with a static set of arrays and static data placement: all data for the current day is first loaded into memory, and no further disk access is performed; all work happens in memory only. When the system starts, all reference data is already sorted, so lookups are very efficient and take little time at runtime. All tables use intrusive lists and trees for their dynamic data structures, so they require no memory allocation at runtime.
Let's take a quick look at the history of our trading and clearing system.
The first version of the architecture of the trading and clearing system was built on classic Unix interprocess communication: shared memory, semaphores, and queues, with each process consisting of a single thread. This approach was widespread in the early 1990s.
The first version of the system contained two levels: the Gateways and the central server of the trading system. It worked as follows:
- A client sends a request, which arrives at a Gateway. The Gateway checks the validity of the format (but not the data itself) and discards malformed transactions.
- If an information request was sent, then it is executed locally; if it is a transaction, it is redirected to the central server.
- The trade engine then processes the transaction, modifies its local memory, and sends out both the response to the transaction and the transaction itself via a separate replication mechanism.
- The Gateway receives a response from the central site and forwards it to the client.
- After a while, the Gateway receives the transaction via the replication mechanism and this time executes it locally, modifying its own data structures so that subsequent information requests reflect up-to-date data.
This is, in effect, a replication model in which the Gateway fully repeats the actions performed in the trading system. A separate replication channel guaranteed the same transaction execution order across all access nodes.
Since the code was single-threaded, a classic fork-per-client scheme was used to serve many clients. However, forking the entire database was very costly, so lightweight service processes were used instead: they collected packets from TCP sessions and placed them into a single queue (a SystemV Message Queue). The Gateway and the Trade Engine worked only with this queue, taking transactions from it for execution. The response could not be sent back through the same queue, because it would be unclear which service process should read it. So we resorted to a trick: each forked process created a response queue of its own, and when a request arrived in the incoming queue, a tag identifying the response queue was immediately added to it.
Constantly copying large amounts of data from queue to queue created problems, especially for information requests. So we used another trick: in addition to its response queue, each process also created shared memory (SystemV Shared Memory). The packets themselves were placed there, and only a tag that allowed the source packet to be found was kept in the queue. This helped keep the data in the processor cache.
SystemV IPC includes utilities for inspecting the state of queues, memory, and semaphores. We used them actively to understand what was happening in the system at any given moment: where packets were accumulating, what was blocked, and so on.
First of all, we got rid of the single-process Gateway. Its significant drawback was that it could process either one replication transaction or one information request from a client at a time. As load grows, such a Gateway takes longer and longer to process requests and cannot keep up with the replication stream. Besides, if a client sends a transaction, only its validity needs to be checked before forwarding it on. So we replaced the single Gateway process with a set of components that can work in parallel: multithreaded information and transaction processes running independently of one another over a shared memory area protected by RW-locks. At the same time, we introduced dedicated dispatch and replication processes.
The effect of high frequency trading
The version of the architecture described above lasted until 2010. By then, the performance of the HP Superdome servers no longer satisfied us. In addition, the PA-RISC architecture was effectively dead; the vendor was not offering any significant updates. As a result, we began migrating from HP-UX/PA-RISC to Linux/x86, starting with the adaptation of the access servers.
Why did we have to change architecture again? The fact is that high-frequency trading has significantly changed the load profile on the system core.
Suppose a single transaction causes a significant price change: someone has bought half a billion dollars. Within a couple of milliseconds, all market participants notice this and start issuing corrections. Naturally, the requests line up in a huge queue, which the system takes a long time to clear.
Over an interval of 50 ms, the average rate is about 16 thousand transactions per second. If we shrink the window to 20 ms, the average rises to 90 thousand transactions per second, with peaks of 200 thousand. In other words, the load is bursty, with sharp spikes, and the queue of requests must always be processed quickly.
But why is there a queue at all? In our example, many users notice the price change and send corresponding transactions. These arrive at a Gateway, which serializes them, establishes a certain order, and sends them out over the network. Routers shuffle the packets and forward them; whichever packet arrives first, that transaction "wins". Exchange clients soon noticed that if the same transaction was sent through several Gateways, the chances of fast processing increased. Before long, exchange robots were bombarding the Gateways with requests, and an avalanche of transactions arose.
New evolution round
After extensive testing and research, we switched to a real-time operating-system kernel, choosing Red Hat Enterprise MRG Linux, where MRG stands for Messaging Real-time Grid. The advantage of the real-time patches is that they optimize the system for the fastest possible execution: processes can be lined up in a FIFO queue, CPU cores can be isolated, there are no unexpected preemptions, and all transactions are processed in strict order.
Red: queue handling on the regular kernel; green: on the real-time kernel.
But it’s not so easy to achieve low latency on regular servers:
- The SMI mode, which in the x86 architecture underpins work with important peripherals, interferes strongly. The handling of various hardware events and the management of components and devices is performed by the firmware in so-called transparent SMI mode, in which the operating system cannot see what the firmware is doing. As a rule, all large vendors offer special server firmware extensions that reduce the amount of SMI processing.
- There must be no dynamic CPU frequency scaling, as it leads to additional stalls.
- When the file system journal is flushed, certain processes occur in the kernel that cause unpredictable delays.
- Pay attention to things like CPU affinity, interrupt affinity, and NUMA.
It should be noted that the topic of setting up hardware and the Linux kernel for realtime processing deserves a separate article. We spent a lot of time on experiments and research before we achieved a good result.
When migrating from the PA-RISC servers to x86, we hardly had to change the system code; we mostly adapted and reconfigured it, fixing several bugs along the way. For example, the consequences of PA-RISC being big-endian and x86 little-endian surfaced quickly: in places, data was read incorrectly. A subtler bug was that PA-RISC uses sequentially consistent memory access, while x86 can reorder memory operations, so code that was perfectly valid on one platform stopped working on the other.
After the switch to x86, performance increased almost threefold, and the average transaction processing time dropped to 60 μs.
Let's now take a closer look at what key changes have been made to the system architecture.
Hot Standby Epic
Moving to commodity servers, we understood that they were less reliable. So, when designing the new architecture, we assumed a priori that one or more nodes could fail, and we needed a hot-standby system that could switch to backup machines very quickly.
In addition, there were other requirements:
- Processed transactions must never be lost.
- The system must be completely transparent to our infrastructure.
- Clients must not see dropped connections.
- The standby mechanism must not introduce significant delay, because latency is a critical factor for the exchange.
When building the hot-standby system, we deliberately left out scenarios such as double failures (for example, the network failing on one server while the main server hangs); we did not consider software errors, because those are caught during testing; and we did not consider hardware misbehaving, as opposed to failing outright.
As a result, we arrived at the following scheme:
- The main server directly communicated with the Gateway servers.
- All transactions arriving at the main server were instantly replicated to the backup server via a separate channel. An arbitrator (the Governor) coordinated the switchover in case of any problems.
- The main server processed each transaction and waited for confirmation from the backup server. To keep latency minimal, we chose not to wait for the transaction to finish executing on the backup. Since the time a transaction spent traveling over the network was comparable to its execution time, no extra latency was added.
- We could only check the processing status of the previous transaction on the main and backup servers; the status of the current transaction was unknown. Since single-threaded processes were still in use, waiting for the backup's response would have stalled the entire processing flow, so we made a reasonable compromise: we checked the result of the previous transaction.
The scheme worked as follows.
Suppose the main server stops responding while the Gateways can still communicate. A timeout fires on the backup server, which turns to the Governor; the Governor assigns it the role of main server, and all Gateways switch over to it.
If the old main server comes back online, it triggers its own internal timeout, because for some time there have been no calls from the Gateways. It then also turns to the Governor, which excludes it from the scheme. As a result, the exchange runs on a single server until the end of the trading session. Since the probability of server failure is quite low, this scheme was considered quite acceptable: it involved no complex logic and was easy to test.
To be continued.