RBKmoney Payments under the hood - the infrastructure of the payment platform



Hi, Habr! It would be logical to continue the description of the internals of a large payment platform with an account of how all these components work in the real world, on physical hardware. In this post I'll tell you how and where the platform's applications are placed, how traffic from the outside world reaches them, and describe the layout of a standard equipment rack in any of our data centers.


Approaches and limitations



One of the first requirements we formulated before developing the platform was "close-to-linear scaling of computing resources to handle any number of transactions."


The classical approaches used by payment market participants imply a ceiling, albeit a rather high one by their own statements. It usually sounds like this: "our processing can handle 1000 transactions per second."


This approach does not fit our business objectives or architecture. We do not want any limit at all. In fact, it would be strange to hear from Yandex or Google a statement like "we can process 1 million search queries per second." The platform should process as many requests as the business needs at any given moment, thanks to an architecture that allows, to put it simply, sending an engineer to the DC with a cart of servers, which they install in racks, connect to the switch, and leave. The platform orchestrator then rolls out copies of business applications onto the new capacity, and we get the RPS increase we need.


The second important requirement is high availability of the provided services. It would be fun, but not very useful, to create a payment platform that can accept an infinite number of payments into /dev/null.


Perhaps the most effective way to achieve high availability is to duplicate entities that serve a service so that the failure of any reasonable number of applications, equipment or data centers does not affect the overall availability of the platform.


Repeated duplication of applications requires a large number of physical servers and the associated network equipment. This hardware costs money, and our budget is finite, so we cannot afford to buy a lot of expensive hardware. Instead, the platform is designed to be easy to deploy, and to run comfortably, on a large number of inexpensive, not particularly powerful machines, or even in a public cloud.


Using servers that are not the strongest in terms of computing power has its advantages: the failure of one does not critically affect the overall state of the system. Imagine which is better: an expensive, large, super-reliable branded server running a master-slave DBMS (which, according to Murphy's law, will inevitably burn out, and on the evening of December 31 at that), or a couple of servers failing in a 30-node cluster running a masterless scheme?


Based on this logic, we decided not to create yet another massive point of failure in the form of a centralized disk array. Shared block devices are provided by a Ceph cluster deployed hyper-convergedly on the same servers, but with a separate network infrastructure.


Thus, we logically arrived at a general scheme of a universal rack of computing resources: inexpensive, not very powerful servers in a data center. If we need more resources, we either fill up any free rack with servers or add another rack, preferably somewhere nearby.



And in the end, it is simply beautiful. When a definite quantity of identical hardware is installed in the racks, we can lay cables neatly, get rid of swallows' nests of wiring, and avoid the danger of taking down processing by getting tangled in cables. A system that is good from an engineering point of view should be beautiful everywhere: on the inside, in the form of code, and on the outside, in the form of servers and network hardware. A beautiful system works better and more reliably; I have seen enough examples of this in my own experience.


Please do not think that we are penny-pinchers or a business starved of financing. Developing and maintaining a distributed platform is actually very expensive. In fact, it is even more expensive than owning a classic system built, say, on powerful branded hardware with Oracle/MSSQL, application servers, and other supporting infrastructure.


Our approach pays off with high reliability, very flexible horizontal scaling options, no ceiling on the number of payments per second, and strange as it may sound, a lot of fun for the IT team. For me, the level of enjoyment of developers and devops from the system they create is no less important than the predictable development timeframe, the quantity and quality of the features being rolled out.


Server Infrastructure



Logically, our server capacity can be divided into two main classes: hypervisor servers, where the density of CPU cores and RAM per unit matters most, and storage servers, where the main emphasis is on disk space per unit, with CPU and RAM matched to the number of disks.


Currently, our classic server for computing power looks like this:


  • 2xXeon E5-2630 CPU;
  • 128G RAM;
  • 3xSATA SSD (Ceph SSD pool);
  • 1xNVMe SSD (dm-cache).

And a typical server for storing state:


  • 1xXeon E5-2630 CPU;
  • 12-16 HDD;
  • 2 SSD for block.db;
  • 32G RAM.

Network Infrastructure


Our approach to choosing network hardware is somewhat different. For switching and routing between VLANs we still use branded switches: currently Cisco SG500X-48, and Cisco Nexus C5020 in the SAN.


Physically, each server is connected to the network by 4 physical ports:


  • 2x1GbE - management network and RPC between applications;
  • 2x10GbE — network for storage.

The interfaces inside the machines are combined with bonding, and the tagged traffic is then split out onto the required VLANs.
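One way to sketch such a bond-plus-VLAN layout is a Gentoo netifrc config fragment; all interface names, VLAN IDs, the bonding mode, and the addresses below are hypothetical illustrations, not our actual configuration:

```
# /etc/conf.d/net (Gentoo netifrc), illustrative sketch only
slaves_bond0="eth0 eth1"            # 2x1GbE ports aggregated into bond0
mode_bond0="802.3ad"                # LACP aggregation (assumed mode)
config_bond0="null"                 # no address on the bare bond itself
vlans_bond0="10 20"                 # tagged VLANs on top of the bond
config_bond0_10="192.0.2.10/24"     # hypothetical management VLAN
config_bond0_20="198.51.100.10/24"  # hypothetical RPC VLAN
```

The same idea applies to the 10GbE storage pair; only the physical ports and VLANs differ.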


This is perhaps the only place in our infrastructure where you can see a well-known vendor's label, because for routing, network filtering, and traffic inspection we use Linux hosts. We do not buy specialized routers: everything we need is configured on servers running Gentoo (iptables for filtering, BIRD for dynamic routing, Suricata as IDS/IPS, Wallarm as WAF).


A typical rack in a DC


Schematically, the racks in our DCs are practically identical when we scale; they differ only in the uplink routers, which are installed in one of them.


The exact proportions of the different server classes can vary, but the general logic holds: there are more compute servers than storage servers.



Block devices and resource sharing


Let's try to put it all together. Imagine that we need to place several of our microservices in this infrastructure. For clarity, these will be microservices that need to communicate with each other over RPC, one of them being Machinegun, which stores its state in a Riak cluster, plus some auxiliary services such as Elasticsearch and Consul nodes.


A typical layout will look like this:



For VMs with applications that require maximum block device speed, like Riak and Elasticsearch hot nodes, partitions on local NVMe disks are used. Such VMs are tightly bound to their hypervisor, and applications themselves are responsible for the availability and integrity of their data.


For shared block devices we use Ceph RBD, usually with a write-through dm-cache on a local NVMe disk. The OSDs backing a device can be either full-flash or HDD with an SSD journal, depending on the desired response time.


Delivery of traffic to applications



To balance requests coming from outside, we use a standard OSPFv3 ECMP scheme. Small virtual machines running nginx, bird, and consul announce shared anycast addresses from their lo interface into the OSPF cloud. On the routers, bird creates multi-hop routes to these addresses that provide per-flow balancing, where a flow is "src-ip src-port dst-ip dst-port". To quickly withdraw a failed balancer, the BFD protocol is used.
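On the balancer side, the announcement can be sketched in a BIRD 1.x-style config; the interface names, the area number, and the protocol instance name are illustrative assumptions:

```
# bird.conf on a balancer VM, illustrative sketch only.
# The anycast service address is configured on lo; the direct protocol
# picks it up and OSPF exports it to the upstream routers.
protocol direct {
    interface "lo";
}

protocol ospf {
    export where proto = "direct1";   # announce only the lo addresses
    area 0 {
        interface "eth0" {
            bfd yes;                  # fast failure detection and route withdrawal
        };
    };
}
```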


When a balancer is added or fails, the upstream routers add or delete the corresponding route, and network traffic is delivered according to the Equal-Cost Multi-Path approach: unless we specifically intervene, all traffic is evenly distributed across the available balancers, one IP flow at a time.


By the way, ECMP balancing has non-obvious pitfalls that can lead to an equally non-obvious loss of part of the traffic, especially if there are other routers or strangely configured firewalls on the route between the systems. For example, the ICMP "fragmentation needed" packets required for Path MTU Discovery can be hashed to a different balancer than the one serving the TCP flow.


To solve the problem, we use the PMTUD daemon in this part of the infrastructure.


Next, the traffic goes inside the platform to specific microservices according to the configuration of nginx on balancers.


And while balancing external traffic is more or less simple and clear, it would be difficult to extend this scheme further inward: we need more than a network-level check that a microservice's container is up.


For a microservice to start receiving and processing requests, it must register with Service Discovery (we use Consul), pass health checks every second, and have a reasonable RTT.


If a microservice is alive and behaves well, Consul starts resolving its container's address in response to DNS queries for the service name. We use the internal zone service.consul; for example, version 2 of the Common API microservice is named capi-v2.service.consul.
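Registering such an instance might look roughly like this Consul service definition; the health endpoint path and the timeout are assumptions for the sketch, while the port matches the nginx config below:

```json
{
  "service": {
    "name": "capi-v2",
    "port": 8022,
    "check": {
      "http": "http://localhost:8022/health",
      "interval": "1s",
      "timeout": "500ms"
    }
  }
}
```

As soon as the check starts failing, Consul stops returning the instance in DNS answers, and the balancers stop sending traffic to it.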


The nginx config file for balancing in the end looks like this:


  location = /v2/ {
    set $service_hostname "${staging_pass}capi-v2.service.consul";
    proxy_pass http://$service_hostname:8022;
  }

Thus, if we again do not intervene on purpose, traffic from the balancers is evenly distributed among all microservice instances registered in Service Discovery, and adding or removing instances of the necessary microservices is fully automated.


If a request from the balancer went upstream and the upstream died on the way, we return a 502 to the outside: at its level the balancer cannot determine whether the request was idempotent, so we hand the processing of such errors to a higher level of logic.


Idempotency and deadlines


In general, we are not afraid or ashamed to return 5xx errors from the API; they are a normal part of the system provided such errors are handled correctly at the RPC business-logic level. We describe the principles of this handling in a small manual called the Errors Retry Policy, which we distribute to our merchant clients and implement inside our own services.
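A client-side retry in the spirit of such a policy might be sketched like this; the attempt count and backoff values are our assumptions, not the actual policy, and `do_request` stands for any callable returning an HTTP status code:

```python
import time

def call_with_retries(do_request, attempts=3, backoff=0.1):
    """Retry a request on 5xx responses with exponential backoff.
    This is only safe if the repeated call carries the same
    idempotency key, so the server replays rather than re-executes."""
    last = None
    for attempt in range(attempts):
        last = do_request()
        if last < 500:                        # anything below 5xx is final
            return last
        time.sleep(backoff * (2 ** attempt))  # back off before retrying
    return last                               # give up, surface the error
```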


To simplify this processing, we have implemented several approaches.


First, for any status-changing request to our API, you can specify an idempotency key, unique within your account, that lives forever and lets you be sure that a repeated call with the same data set will return the same answer.
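The semantics can be sketched as follows; this is a simplification for illustration, not the platform's actual implementation (in particular, it ignores the case of reusing a key with different request data):

```python
class IdempotencyCache:
    """The first call with a given (account, key) pair stores its answer;
    any repeat with the same pair replays that stored answer."""

    def __init__(self):
        self._responses = {}

    def execute(self, account, key, operation):
        stored = self._responses.get((account, key))
        if stored is not None:
            return stored                        # replay the original answer
        result = operation()                     # perform the real operation once
        self._responses[(account, key)] = result
        return result
```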


Second, we implemented an additional mechanism in the form of a unique payment session identifier, which guarantees the idempotency of debit requests and protects against erroneous repeated charges even if you do not generate and pass a separate idempotency key.


Third, we decided to make the response time of any external call to our API predictable and externally controlled, in the form of a time cutoff parameter that sets the maximum time to wait for the operation to complete. It is enough to send, for example, the X-Request-Deadline: 10s HTTP header to be sure that your request will either be executed within 10 seconds or killed by the platform somewhere inside, after which you can contact us again, guided by the retry policy.
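Server-side, enforcing such a deadline starts with parsing the header value; here is a minimal sketch, where the supported units and parsing rules are our assumptions rather than the platform's actual code:

```python
import re

def parse_deadline(value: str) -> float:
    """Parse a deadline header value like '10s', '500ms' or '1m'
    into seconds, so a timer can kill the request when it expires."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m)", value.strip())
    if not m:
        raise ValueError(f"bad deadline: {value!r}")
    amount, unit = float(m.group(1)), m.group(2)
    return amount * {"ms": 0.001, "s": 1.0, "m": 60.0}[unit]
```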


Manage and own the platform


We use SaltStack to manage both configurations and the infrastructure in general. Dedicated tools for automated management of the platform's computing capacity have not taken off for us yet, although we now understand that we will move in this direction. Given our love for HashiCorp products, it will likely be Nomad.


Our main infrastructure monitoring tools are Nagios checks, but for business entities we mostly configure alerts in Grafana. It is a very convenient tool for defining alert conditions, and the event-based platform model lets us write everything into Elasticsearch and customize the selection conditions.



Our data centers are located in Moscow, where we rent rack space and install and manage all the equipment ourselves. We do not use dark fiber anywhere; we only have Internet uplinks from local providers.


Otherwise, our approaches to monitoring, management, and related services are fairly standard for the industry, and I am not sure that yet another description of integrating these services would be worth a post.


With this article, I will probably wrap up the series of overview posts about how our payment platform is organized.


I think the series turned out quite candid; I have seen few articles that reveal the inner workings of large payment systems in such detail.


In general, I believe a high level of openness and candor is very important for a payment system. This approach not only increases the trust of partners and payers, but also disciplines the team that creates and operates the service.


So, guided by this principle, we recently made publicly available the platform status and uptime history of our services. The entire subsequent history of our uptime, updates and downs is now public and available at https://status.rbk.money/.


I hope you found it interesting, and perhaps someone will find our approaches and the mistakes we described useful. If you are interested in any of the areas mentioned in these posts and would like me to cover them in more detail, please do not hesitate to write in the comments or in a private message.


Thank you for being with us!



P.S. For your convenience, pointers to the previous articles in the series:

