Translator's note: The staff of Tinder, the world-famous service, recently shared some technical details of migrating their infrastructure to Kubernetes. The process took almost two years and resulted in a very large-scale platform on K8s: 200 services running in 48,000 containers. What interesting difficulties did the Tinder engineers run into, and what came of it? Read this translation to find out.
Almost two years ago, Tinder decided to move its platform to Kubernetes. Kubernetes would let the Tinder team containerize and move to production with minimal effort through immutable deployments: the application build, its deployment, and the infrastructure itself would all be uniquely defined in code.
We were also looking for a solution to our scalability and stability problems. When scaling became critical, we often had to wait several minutes for new EC2 instances to launch. The idea of launching containers and serving traffic within seconds instead of minutes was therefore very attractive to us.
The process was not easy. During the migration, in early 2019, the Kubernetes cluster reached critical mass and we began running into various problems caused by traffic volume, cluster size, and DNS. Along the way we solved a lot of interesting problems related to migrating 200 services and operating a Kubernetes cluster of 1,000 nodes, 15,000 pods, and 48,000 running containers.
Starting in January 2018, we went through the various stages of migration. We began by containerizing all of our services and deploying them to Kubernetes test environments. In October, the methodical migration of all existing services to Kubernetes began. By March 2019, the migration was complete, and the Tinder platform now runs exclusively on Kubernetes.
Build images for Kubernetes
We have over 30 source code repositories for microservices operating in the Kubernetes cluster. The code in these repositories is written in different languages (for example, Node.js, Java, Scala, Go) with many runtime environments for the same language.
The build system is designed to provide a fully customizable "build context" for each microservice, usually consisting of a Dockerfile and a list of shell commands. Their contents are entirely up to each service, yet every build context is written in a standardized format. Standardizing build contexts allows a single build system to handle all microservices.
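Such a build context might look roughly like the following. This is a hypothetical sketch: the article does not show Tinder's actual format, and the service name, file names, and commands below are illustrative only.

```yaml
# Hypothetical standardized build-context descriptor: a Dockerfile plus a
# list of shell commands, in a format a single build system can consume.
service: user-profile        # hypothetical service name
dockerfile: Dockerfile.build
commands:
  - npm ci                   # install pinned dependencies
  - npm run build            # compile the service
  - npm test                 # run the unit-test suite
```

Because every repository follows the same shape, the build system only needs to know how to read one descriptor format, regardless of language or runtime.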
Figure 1-1. Standardized build process through Builder
To achieve maximum consistency between runtimes, the same build process is used during development and testing. We faced a very interesting challenge: we had to devise a way to guarantee a consistent build environment across the entire platform. To do this, all build processes run inside a special Builder container.
Its implementation required advanced Docker techniques. Builder inherits the local user ID and the secrets (for example, an SSH key, AWS credentials, etc.) required to access Tinder's private repositories. It mounts local directories containing the source code so that build artifacts are stored naturally. This approach improves performance because it eliminates copying build artifacts between the Builder container and the host, and stored build artifacts can be reused without additional configuration.
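An invocation of such a Builder container might look roughly like this. The image name, subcommand, and mount paths are hypothetical; the flags themselves are standard Docker options for passing through the user ID, SSH agent, credentials, and source tree as described above.

```shell
# Hypothetical Builder invocation (image name and paths are illustrative).
# Runs as the calling user, forwards the SSH agent and AWS credentials,
# and mounts the source tree so build artifacts land on the host.
docker run --rm \
  --user "$(id -u):$(id -g)" \
  -v "$SSH_AUTH_SOCK:/ssh-agent" -e SSH_AUTH_SOCK=/ssh-agent \
  -v "$HOME/.aws:/home/builder/.aws:ro" \
  -v "$PWD:/workspace" -w /workspace \
  tinder/builder build
```

Because artifacts are written straight into the mounted workspace, a repeat build can pick them up without copying anything out of the container.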
For some services, we had to create yet another container to match the compilation environment to the runtime environment (for example, the Node.js bcrypt library generates platform-specific binary artifacts during installation). Compilation requirements can differ from service to service, and the final Dockerfile is composed on the fly.
Kubernetes Cluster Architecture and Migration
Cluster Size Management
We decided to use kube-aws for automated cluster deployment on Amazon EC2 instances. At first, everything ran in a single shared pool of nodes. We quickly realized that we needed to separate workloads by instance size and type to use resources more efficiently. The logic was that running several heavily loaded multi-threaded pods turned out to be more predictable in performance than having them coexist with a large number of single-threaded pods.
As a result, we stopped at:
- m5.4xlarge - for monitoring (Prometheus);
- c5.4xlarge - for Node.js workload (single-threaded workload);
- c5.2xlarge - for Java and Go (multithreaded workload);
- c5.4xlarge - for the control plane (3 nodes).
One of the preparatory steps for migrating from the old infrastructure to Kubernetes was to redirect the existing direct service-to-service communication to the new Elastic Load Balancers (ELBs). They were created in a specific subnet of a virtual private cloud (VPC), and that subnet was connected to the Kubernetes VPC. This allowed us to migrate services gradually, without worrying about the specific order of service dependencies.
These endpoints were created using weighted DNS record sets with a CNAME pointing at each new ELB. To switch over, we added a new record pointing at the new Kubernetes service ELB with a weight of 0. We then set the record set's Time To Live (TTL) to 0, after which the old and new weights were slowly adjusted until eventually 100% of the load was going to the new service. Once the switch was complete, the TTL was raised back to a more reasonable value.
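In Route53 terms, the step that introduces the new weighted record might look roughly like the following change batch. The domain name, ELB hostname, and set identifier are illustrative, not Tinder's actual values.

```json
{
  "Comment": "Hypothetical weighted switch: add the Kubernetes ELB at weight 0",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "some-service.example.com",
      "Type": "CNAME",
      "SetIdentifier": "k8s-elb",
      "Weight": 0,
      "TTL": 0,
      "ResourceRecords": [{ "Value": "new-k8s-elb.us-east-1.elb.amazonaws.com" }]
    }
  }]
}
```

Raising this record's weight while lowering the legacy record's weight shifts traffic gradually; because the TTL is 0, clients pick up each adjustment almost immediately.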
Our Java modules coped with the low-TTL DNS, but our Node applications did not. One of the engineers rewrote part of the connection-pool code, wrapping it in a manager that refreshed the pools every 60 seconds. This approach worked very well and caused no noticeable performance degradation.
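The pattern can be sketched roughly as follows. This is a hypothetical reconstruction: Tinder's actual code is not public, and the class and method names are invented for illustration.

```javascript
// Hypothetical sketch of the connection-pool manager described above: it
// rebuilds the pool on a fixed interval so that long-lived keepalive
// connections do not pin the application to stale DNS answers.
class RefreshingPool {
  // createPool: factory that resolves the target host and returns a pool;
  // intervalMs: how often to rebuild (the article mentions 60 seconds).
  constructor(createPool, intervalMs = 60000) {
    this.createPool = createPool;
    this.pool = createPool();
    this.timer = setInterval(() => this.refresh(), intervalMs);
    if (this.timer.unref) this.timer.unref(); // don't keep the process alive
  }
  refresh() {
    const old = this.pool;
    this.pool = this.createPool(); // the new pool resolves DNS afresh
    if (old.drain) old.drain();    // let in-flight requests finish, then close
  }
  acquire() { return this.pool.acquire(); }
  stop() { clearInterval(this.timer); }
}
```

Callers always go through `acquire()`, so after each refresh they transparently receive connections built against the latest DNS answer.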
Network Fabric Limits
Early in the morning on January 8, 2019, the Tinder platform unexpectedly went down. In response to an unrelated increase in platform latency earlier that morning, the number of pods and nodes in the cluster had grown. This exhausted the ARP cache on all of our nodes.
There are three Linux kernel parameters associated with the ARP cache: gc_thresh1 (a soft minimum below which garbage collection never runs), gc_thresh2 (a soft maximum), and gc_thresh3 (a hard maximum). "neighbor table overflow" entries in the log meant that even after synchronous garbage collection (GC), there was not enough room in the ARP cache to store a neighbor entry. In that case, the kernel simply dropped the packet entirely.
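The three thresholds and their stock kernel defaults are shown below. Note that these are the Linux defaults, not Tinder's production settings, which the article does not disclose; on a large cluster the values need to be raised well above the expected number of neighbor entries.

```
# /etc/sysctl.d/arp.conf -- ARP-cache thresholds (Linux default values shown)
net.ipv4.neigh.default.gc_thresh1 = 128   # below this, GC never runs
net.ipv4.neigh.default.gc_thresh2 = 512   # soft maximum; entries GC'd after ~5s
net.ipv4.neigh.default.gc_thresh3 = 1024  # hard maximum; new entries are dropped
```

With defaults this low, a cluster of several hundred nodes, each contributing multiple neighbor entries, can hit the hard limit quickly.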
We use Flannel as the network fabric in Kubernetes. Packets are transmitted via VXLAN, an L2 tunnel raised on top of an L3 network. The technology uses MAC-in-UDP (MAC Address-in-User Datagram Protocol) encapsulation and allows Layer 2 network segments to be extended. The transport protocol in the physical data-center network is IP plus UDP.
Figure 2-1. Flannel diagram ( source )
Figure 2-2. VXLAN package ( source )
Each Kubernetes worker node allocates a virtual address space with a /24 mask out of a larger /9 block. For each node, this means one entry in the routing table, one entry in the ARP table (on the flannel.1 interface), and one entry in the forwarding database (FDB). These are added when a worker node first starts or when each new node is discovered.
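On a node running Flannel, these three per-peer entries can be inspected directly; the commands below are standard iproute2 tooling, shown as a diagnostic sketch rather than anything Tinder-specific.

```shell
# Inspect the three kinds of per-node entries Flannel maintains
ip route show | grep flannel.1     # one route per remote node's /24
ip neigh show dev flannel.1        # one ARP entry per remote node
bridge fdb show dev flannel.1      # one FDB entry per remote node
```

Counting the lines of `ip neigh` output against the gc_thresh values is a quick way to see how close a node is to the ARP-cache limit.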
In addition, node-to-pod (or pod-to-pod) traffic ultimately flows through the eth0 interface (as shown in the Flannel diagram above). This results in an additional ARP table entry for each corresponding source and destination node.
In our environment, this type of communication is very common. For Kubernetes service objects, an ELB is created and Kubernetes registers each node with it. The ELB knows nothing about pods, and the chosen node may not be the packet's final destination: when a node receives a packet from the ELB, it evaluates it against the iptables rules for the particular service and randomly selects a pod, possibly on another node.
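This random selection is implemented with the iptables statistic module. The NAT rules kube-proxy generates for a two-pod service look roughly like the following sketch (chain suffixes and pod addresses are illustrative):

```
# Simplified kube-proxy NAT rules: each packet addressed to the service is
# DNAT'ed to one backend pod, chosen at random via the statistic module.
-A KUBE-SVC-XYZ -m statistic --mode random --probability 0.5 -j KUBE-SEP-A
-A KUBE-SVC-XYZ -j KUBE-SEP-B
-A KUBE-SEP-A -p tcp -j DNAT --to-destination 10.2.1.5:8080
-A KUBE-SEP-B -p tcp -j DNAT --to-destination 10.2.7.3:8080
```

Because the chosen pod frequently lives on another node, each such hop adds ARP traffic and neighbor entries between node pairs.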
At the time of the failure, the cluster had 605 nodes. For the reasons described above, this was enough to exceed the default gc_thresh3 value. When that happens, not only do packets start being dropped, but entire Flannel /24 virtual address spaces disappear from the ARP table. Node-to-pod communication and DNS lookups fail (DNS is hosted in the cluster; see later in this article for details).
To solve this problem, you need to increase the gc_thresh1, gc_thresh2, and gc_thresh3 values and restart Flannel to re-register the missing networks.
Unexpected DNS Scaling
During the migration, we actively used DNS to manage traffic and gradually move services from the old infrastructure to Kubernetes. We set relatively low TTL values on the associated RecordSets in Route53. When the old infrastructure ran on EC2 instances, our resolver configuration pointed at Amazon's DNS. We took this for granted, and the impact of the low TTLs on our services and on Amazon services (e.g. DynamoDB) went almost unnoticed.
As we moved services to Kubernetes, we found ourselves serving 250,000 DNS requests per second. As a result, applications began experiencing constant and serious DNS lookup timeouts. This happened despite incredible optimization efforts and despite switching the DNS provider to CoreDNS (which at peak load reached 1,000 pods running on 120 cores).
While investigating other possible causes and solutions, we found an article describing race conditions affecting netfilter, the packet-filtering framework in Linux. The timeouts we were observing, together with a growing insert_failed counter on the Flannel interface, matched the article's findings.
The problem occurs during Source and Destination Network Address Translation (SNAT and DNAT) and the subsequent insertion into the conntrack table. One of the workarounds discussed internally and proposed by the community was to move DNS onto the worker node itself. In this case:
- SNAT is not needed, because the traffic stays inside the node. It does not have to pass through the eth0 interface.
- DNAT is not needed, because the destination IP is local to the node rather than a pod randomly selected by iptables rules.
We decided to go with this approach. CoreDNS was deployed as a DaemonSet in Kubernetes, and we injected the node's local DNS server into each pod's resolv.conf by setting the --cluster-dns flag on the kubelet. This solution proved effective against the DNS timeouts.
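A minimal sketch of this setup is shown below. The namespace, labels, image version, and Corefile path are illustrative, and the exact manifests Tinder used are not in the article; the essential points are `hostNetwork: true` (so CoreDNS listens on the node's own address) and pointing the kubelet's `--cluster-dns` flag at that node-local address.

```yaml
# Abridged, hypothetical node-local CoreDNS deployment. The kubelet on each
# node is then started with:  kubelet --cluster-dns=<node-local-IP> ...
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: coredns
  namespace: kube-system
spec:
  selector:
    matchLabels: {k8s-app: coredns}
  template:
    metadata:
      labels: {k8s-app: coredns}
    spec:
      hostNetwork: true   # serve DNS on the node's own address
      containers:
      - name: coredns
        image: coredns/coredns:1.6.2   # illustrative version
        args: ["-conf", "/etc/coredns/Corefile"]
```

With a resolver on every node, pod DNS queries never leave the node, which is what removes the SNAT/DNAT steps for that traffic.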
However, we still observed packet loss and a growing insert_failed counter on the Flannel interface. This persisted even after the workaround was in place, because we had only eliminated SNAT and/or DNAT for DNS traffic; the race conditions remained for other types of traffic. Fortunately, most of our packets are TCP, so when the problem occurs they are simply retransmitted. We are still looking for a suitable solution for all types of traffic.
Using Envoy for Better Load Balancing
As backend services migrated to Kubernetes, we began to suffer from unbalanced load across pods. We discovered that, because of HTTP Keepalive, ELB connections stuck to the first pods of each rolling deployment to become ready, so the bulk of the traffic flowed through a small percentage of the available pods. The first solution we tried was setting maxSurge to 100% on new deployments for the worst offenders. The effect was minor and unpromising for larger deployments.
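The maxSurge setting in question is part of the standard Deployment rolling-update strategy; a sketch of the relevant fragment:

```yaml
# Rolling-update strategy with maxSurge at 100%: a full replacement set of
# pods is started at once, so incoming connections are spread across roughly
# twice as many freshly started pods during the rollout.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 0
```

This only widens the pool of pods momentarily; once keepalive connections settle, the imbalance returns, which is why the effect was limited.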
Another solution we used was to artificially inflate resource requests for critical services, so that the neighboring pods would have more headroom compared with other heavy pods. In the long run this would not work either, because it wastes resources. In addition, our Node applications were single-threaded and could therefore only use one core. The only real solution was better load balancing.
We had long wanted to properly evaluate Envoy. The situation at hand allowed us to roll it out in a very limited way and get immediate results. Envoy is a high-performance, open-source Layer 7 proxy designed for large SOA applications. It supports advanced load-balancing techniques, including automatic retries, circuit breaking, and global rate limiting. (Translator's note: for more information, read the recent article about Istio, the service mesh built on Envoy.)
We came up with the following configuration: an Envoy sidecar for each pod with a single route, and a cluster pointing at the local container port. To minimize potential cascading failures and keep a small blast radius, we used a fleet of Envoy front-proxy pods, one per Availability Zone (AZ) for each service. They relied on a simple service-discovery mechanism written by one of our engineers that simply returned the list of pods in each AZ for a given service.
The service front-Envoys then used this service-discovery mechanism with one upstream cluster and route. We set reasonable timeouts, increased all circuit-breaker settings, and added a minimal retry configuration to help with transient failures and ensure smooth deployments. In front of each of these service front-Envoys we placed a TCP ELB. Even when keepalive connections from our main proxy layer stuck to some Envoy pods, those pods could handle the load far better, and they were configured to balance to the backend via least_request.
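The least_request policy is set per upstream cluster in the Envoy configuration; an abridged, illustrative sketch (names, timeout, and address are hypothetical):

```yaml
# Abridged Envoy cluster using least-request balancing
clusters:
- name: service_backend
  connect_timeout: 1s
  type: STRICT_DNS
  lb_policy: LEAST_REQUEST   # prefer hosts with the fewest in-flight requests
  load_assignment:
    cluster_name: service_backend
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: {address: backend.example.local, port_value: 8080}
```

Unlike round-robin, least_request reacts to the actual number of outstanding requests per host, which is what evens out load when some pods are pinned by keepalive connections.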
For deployments, we used a preStop hook on both the application pods and the sidecar pods. The hook triggered a health-check failure on the sidecar's admin endpoint and then slept for a while to let active connections drain.
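A hook of this kind might be sketched as follows. The admin port, endpoint, and sleep duration are illustrative assumptions (9901 and /healthcheck/fail are Envoy's conventional admin defaults), not values taken from the article.

```yaml
# Illustrative preStop hook on the Envoy sidecar: fail the health check so
# the pod is pulled out of rotation, then sleep so connections can drain.
lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - wget -qO- --post-data='' http://127.0.0.1:9901/healthcheck/fail; sleep 20
```

Failing the health check first means upstream balancers stop sending new requests before the container actually terminates.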
One of the reasons we were able to make progress so quickly was the detailed metrics, which we could easily integrate into our standard Prometheus installation. They let us see exactly what was happening while we tuned configuration parameters and redistributed traffic.
The results were immediate and obvious. We started with the most unbalanced services, and Envoy now runs in front of the 12 most important services in the cluster. This year we plan to move to a full service mesh with more advanced service discovery, circuit breaking, outlier detection, rate limiting, and tracing.
Figure 3-1. CPU convergence of one service during the transition to Envoy
Thanks to the experience gained and additional research, we have built a strong infrastructure team with solid skills in designing, deploying, and operating large Kubernetes clusters. All Tinder engineers now have the knowledge and experience to package containers and deploy applications to Kubernetes.
When additional capacity was needed on the old infrastructure, we had to wait several minutes for new EC2 instances to launch. Now containers start and begin serving traffic within seconds instead of minutes. Scheduling multiple containers on a single EC2 instance also improves horizontal density. As a result, in 2019 we expect a significant reduction in EC2 costs compared with last year.
The migration took almost two years, but we completed it in March 2019. Today, the Tinder platform runs exclusively on a Kubernetes cluster of 200 services, 1,000 nodes, 15,000 pods, and 48,000 running containers. Infrastructure is no longer the sole domain of operations teams. All of our engineers share this responsibility and control the building and deployment of their applications with nothing but code.
P.S. from the translator
Also read this series of articles on our blog: