This is a continuation of the long story of our thorny path to creating a powerful, high-load system that runs the Exchange. The first part is here.
After numerous tests, the updated trading and clearing system was put into operation, and we ran into a bug worthy of a detective-mystery story.
Shortly after the launch, one of the transactions on the main server was processed with an error, while on the backup server everything was in order. It turned out that a simple mathematical operation, computing the exponential function, was returning a negative result on the main server for a valid argument! We continued our investigation and found a one-bit difference in an SSE2 register: the bit responsible for rounding in floating-point operations.
We wrote a simple test utility that computes the exponential with the rounding bit set. It turned out that the version of Red Hat Linux we were using had a bug in the math function when that ill-fated bit was set. We reported it to Red Hat, received a patch after a while, and rolled it out. The error no longer occurred, but it remained unclear where the bit had come from. It was being set by a function from the C standard library. We carefully analyzed our code in search of the supposed culprit: we checked every possible situation, reviewed all the functions that used rounding, tried to reproduce the failed session, and used different compilers with different options as well as static and dynamic analysis.
The cause of the error could not be found.
Then we started checking the hardware: we load-tested the processors, checked the RAM, and even ran tests for the very unlikely scenario of a multi-bit error in a single memory cell. To no avail.
In the end, we settled on a theory from the world of high-energy physics: a high-energy particle flew into our data center, penetrated the building, hit the processor, and caused a latch to stick in that bit. We called this absurd theory "neutrino". If you are far from particle physics: neutrinos barely interact with matter at all, and they are certainly unable to affect the operation of a processor.
Since the cause of the failure could not be found, the "guilty" server was taken out of service, just in case.
After some time, we began improving the hot-backup system: we introduced so-called warm reserves, asynchronous replicas. They receive the flow of transactions and may be located in different data centers, but the warms do not actively interact with other servers.
What is this for? If the backup server fails, a warm becomes the new backup attached to the main server. That way, after a failure the system does not finish the trading session with only a single main server.
And when the new version of the system was tested and put into operation, the rounding-bit error appeared again. Moreover, the more warm servers we added, the more often the error occurred. Yet we had nothing concrete to present to the vendor as evidence.
During the next analysis of the situation, a theory emerged that the problem might be related to the OS. We wrote a simple program that calls the function in an infinite loop, remembers the current state, and checks it again after a sleep, doing this in many competing threads. By tuning the sleep parameters and the number of threads, we could reliably reproduce the bit failure after about five minutes of running the utility. Red Hat support, however, was unable to reproduce it. Testing our other servers showed that only those with certain processors were affected. At the same time, switching to a new kernel solved the problem. In the end, we simply replaced the OS, and the true cause of the bug remained unexplained.
Then last year, an article appeared on Habr: "How I found a bug in Intel Skylake processors". The situation described there was very similar to ours, but the author went further in his investigation and put forward the theory that the error was in the microcode, and when Linux kernels are updated, manufacturers update the microcode as well.
Further development of the system
Although we got rid of the error, this story forced us to reconsider the system architecture. After all, we were not protected from the repetition of similar bugs.
The subsequent redesigns of the backup system were based on the following principles:
- Trust no one. Servers may not work properly.
- Majority-based redundancy.
- Ensure consensus, as a logical complement to majority redundancy.
- Double failures are possible.
- Liveness. The new hot-backup scheme must be no worse than the previous one: trading should continue uninterrupted down to the last surviving server.
- Only a slight increase in latency. Any downtime means huge financial losses.
- Minimal network interaction, so that latency stays as low as possible.
- Selection of a new master server within seconds.
None of the solutions available on the market were suitable for us, and the Raft protocol was still in its infancy, so we created our own solution.
In addition to the backup system, we also worked on modernizing network interaction. The I/O subsystem consisted of a multitude of processes, which affected jitter and latency in the worst way. With hundreds of processes handling TCP connections, we had to constantly switch between them, and on a microsecond scale that is a rather lengthy operation. Worse still, when a process received a packet for processing, it sent it to one SystemV queue and then waited for events from another SystemV queue. With a large number of nodes, the arrival of a new TCP packet in one process and the arrival of data in the queue of another are two competing events for the OS. If there are no free physical processors for both tasks, one is processed and the other waits in the run queue. The consequences are impossible to predict.
In such situations you can apply dynamic control of process priorities, but that requires resource-intensive system calls. As a result, we switched to a single thread using classic epoll; this greatly increased throughput and reduced transaction processing time. We also got rid of the separate network-interaction processes and of communication through SystemV, significantly reduced the number of system calls, and began to control the priorities of operations. On the I/O subsystem alone we saved about 8-17 microseconds, depending on the scenario. This single-threaded scheme has been used unchanged ever since: one epoll thread has enough headroom to serve all connections.
The growth of the load on our system required modernizing almost all of its components. Unfortunately, the stagnation of processor clock frequencies in recent years no longer allows processes to be scaled "head on". We therefore decided to split the Engine process into three levels, the most loaded of which is the risk-verification system, which assesses the availability of funds in accounts and creates the transactions themselves. But money can be held in different currencies, and we had to work out the principle by which to divide request processing.
The logical solution is to divide by currency: one server trades in dollars, another in pounds, a third in euros. But if, under such a scheme, two transactions buying different currencies are sent, the wallets go out of sync, and synchronizing them is difficult and costly. The correct approach is to shard separately by wallets and separately by instruments. Incidentally, most Western exchanges do not face the risk-verification task as acutely as we do, which is why they usually perform it offline. We had to implement an online check.
Let us explain with an example. A trader wants to buy 30 dollars, and the request goes to transaction validation: we check whether this trader is admitted to this trading mode and whether he has the necessary rights. If everything is in order, the request goes to the risk-verification system, that is, to a check of the adequacy of funds for concluding the transaction. There the required amount is marked as blocked. The request is then forwarded to the trading system, which approves or rejects the transaction. Suppose the transaction is approved; the risk-verification system then marks the money as unblocked, and the rubles are converted into dollars.
In general, the risk-verification system contains complex algorithms and performs a large volume of very resource-intensive calculations; it does not simply check the "account balance", as it may seem at first glance.
When we began splitting the Engine process into levels, we ran into a problem: the code at the validation and verification stages actively used the same data file, which would have required rewriting the entire code base. In the end, we borrowed the instruction-processing technique from modern processors: each instruction is divided into small stages, and several actions are performed in parallel within one cycle.
After a small adaptation of the code, we created a parallel transaction-processing pipeline in which each transaction passes through four pipeline stages: network interaction, validation, execution, and publication of the result.
Consider an example. We have two processing systems, sequential and parallel. The first transaction arrives and is sent for validation in both systems. A second transaction arrives immediately afterwards: the parallel system takes it into work at once, while the sequential system puts it in a queue until the first transaction has passed through the current processing stage. The main advantage of pipelining, then, is that we drain the transaction queue faster.
So we got the ASTS + system.
The pipeline, though, was not all smooth sailing. Suppose we have a transaction that affects data arrays used by a neighboring transaction; this is a typical situation on an exchange. Such a transaction cannot be executed in the pipeline, because it could affect the others. This situation is called a data hazard, and such transactions are simply processed separately: when the "fast" transactions already in the queue finish, the pipeline stops, the system processes the "slow" transaction, and then the pipeline starts again. Fortunately, the share of such transactions in the total flow is very small, so the pipeline stops so rarely that overall performance is unaffected.
Then we set about solving the problem of synchronizing the three execution threads. The result was a system based on a ring buffer with fixed-size cells, in which everything is subordinated to processing speed and data is never copied.
- All incoming network packets go to the allocation stage.
- We place them in an array and mark them as available for stage 1.
- A second transaction arrives and is likewise marked available for stage 1.
- The first processing thread sees the available transactions, processes them, and moves them on to the next stage, owned by the second processing thread.
- The second thread then processes the first transaction and marks the corresponding cell with a deleted flag: it is now available for reuse.
This is how the whole queue is processed.
Processing each stage takes units or tens of microseconds, and if we used the standard OS synchronization primitives, we would lose more time on the synchronization itself. So we started using spinlocks. However, this is very bad form in a real-time system, and Red Hat strongly recommends against it, so we spin for 100 ms and then switch to semaphore mode to eliminate the possibility of deadlock.
As a result, we achieved a performance of about 8 million transactions per second. And literally two months later, in an article about LMAX Disruptor, we saw a description of a scheme with the same functionality.
Now a single stage could have several execution threads, while all transactions were still processed in order of arrival. As a result, peak performance increased from 18 thousand to 50 thousand transactions per second.
Exchange risk management system
There is no limit to perfection, and soon we began modernizing again: as part of ASTS+, we started carving the risk-management and settlement subsystems out into autonomous components. We developed a flexible, modern architecture and a new hierarchical risk model, and wherever possible tried to use the class instead of
But a problem immediately arose: how do we synchronize all the business logic that had been running for many years and transfer it to the new system? As a result, the first prototype version of the system had to be abandoned. The second version, which currently works in production, is based on the same code working in both the trading part and the risk part. During development, the hardest thing was the git merge between the two versions. Our colleague Yevgeny Mazurenok performed this operation every week, and every time he cursed at length.
When carving out the new system, we immediately had to solve the interaction problem. In choosing a data bus, it was necessary to ensure stable jitter and minimal latency. An InfiniBand RDMA network suited this best: its average processing time is four times shorter than in 10 GbE networks. But what really won us over was the difference in the 99th and 99.9th percentiles.
Of course, InfiniBand has its difficulties. First, a different API: ibverbs instead of sockets. Second, there are almost no widely available open-source messaging solutions for it. We tried building our own prototype, but it turned out to be very difficult, so we chose a commercial solution, Confinity Low Latency Messaging (formerly IBM MQ LLM).
Then came the task of properly separating the risk system. If you simply carve out the Risk Engine without introducing an intermediate node, transactions from two sources can get mixed.
So-called Ultra Low Latency solutions have a reordering mode: transactions from two sources can be arranged in the required order upon receipt, implemented via a separate channel for exchanging sequencing information. But we still do not use this mode: it complicates the whole process, and in a number of solutions it is not supported at all. In addition, each transaction would have to be assigned an appropriate timestamp, and in our scheme this mechanism is very difficult to implement correctly. Therefore, we used the classic message-broker scheme: a dispatcher that distributes messages among the Risk Engines.
The second problem concerned client access: with several Risk Gateways, a client would need to connect to each of them, which would require changes to the client layer. We wanted to avoid that at this stage, so in the current scheme a single Risk Gateway processes the entire data stream. This greatly limits the maximum bandwidth, but greatly simplifies system integration.
Our system must not have a single point of failure, so all components, including the message broker, are duplicated. We solved this task with the help of the CLLM system: it contains an RCMS cluster in which two controllers can operate in master-slave mode, and when one fails, the system automatically switches to the other.
Working with the backup data center
InfiniBand is optimized to work as a local-area network, that is, to connect equipment within a rack; an InfiniBand network cannot be stretched between two geographically distributed data centers. So we implemented a bridge/dispatcher that connects to the message store over ordinary Ethernet and relays all transactions to the second IB network. When a migration from one data center is needed, we can choose which data center to work with at any moment.
None of the above was done all at once; it took several iterations to develop the new architecture. We created the prototype in a month, but finishing the work took more than two years. We tried to reach the best compromise between increased transaction-processing time and increased system reliability.
Since the system was heavily updated, we implemented data recovery from two independent sources. If for some reason the message store malfunctions, the transaction log can be taken from a second source, the Risk Engine. This principle is observed throughout the system.
Among other things, we managed to keep the client API, so neither brokers nor anyone else needed significant rework for the new architecture. Some interfaces had to change, but no significant changes to the working model were required.
We called the current version of our platform Rebus, an abbreviation of the two most notable innovations in the architecture: the Risk Engine and the BUS.
Initially, we wanted to carve out only the clearing part, but the result was a huge distributed system. Clients can now interact with the trading Gateway, with clearing, or with both at once.
What we finally achieved:
Reduced latency. At a small transaction volume, the system works the same as the previous version, but it withstands a much higher load.
Peak performance increased from 50 thousand to 180 thousand transactions per second. Further growth is hindered by the single order-matching thread.
There are two paths to further improvement: parallelizing matching and changing the scheme of working with the Gateways. At the moment all Gateways operate under a replication scheme, which at such a load ceases to function normally.
Finally, I can give some advice to those who are developing enterprise systems:
- Be prepared for the worst at all times. Problems always arise unexpectedly.
- Quickly redoing an architecture is usually impossible, especially if you need to achieve maximum reliability across a variety of metrics. The more nodes, the more resources support requires.
- All custom and proprietary solutions will demand additional resources for research, support, and maintenance.
- Do not put off questions of reliability and post-failure recovery; consider them at the initial design stage.