Next-generation billing architecture: transformation with the move to Tarantool


Why would a corporation like MegaFon need Tarantool in billing? From the outside, it seems that a vendor usually comes in, brings some large box, sticks the plug into the outlet, and that's the billing! It used to be that way, but now it is archaic, and such dinosaurs have become extinct or are dying out. Originally, billing was a system for issuing bills: a counting machine or calculator. In modern telecom, it is a system that automates the entire life cycle of interaction with a subscriber, from signing a contract to terminating it, including real-time charging, payment acceptance and much more. Billing in telecom companies is like a combat robot: big, powerful and hung with weapons.



So where does Tarantool come in? Oleg Ivlev and Andrey Knyazev will explain. Oleg is the chief architect of MegaFon with extensive experience in foreign companies, and Andrey is the director of business systems. From the transcript of their talk at Tarantool Conference 2018 you will learn why corporations need R&D, what Tarantool is, how the dead end of vertical scaling and globalization became prerequisites for this database appearing in the company, what the technological challenges were, how the architecture was transformed, and how MegaFon's tech stack resembles those of Netflix, Google and Amazon.



Unified Billing Project


The project in question is called “Unified Billing”. This is where Tarantool showed its best qualities.



The performance growth of Hi-End equipment did not keep pace with the growth of the subscriber base and the number of services. Further growth in subscribers and services was expected due to M2M and IoT, and branch-specific features were worsening time-to-market. The company decided to build a single business system with a unique, world-class modular architecture in place of the 8 different billing systems it had.

MegaFon is eight companies in one. In 2009, the reorganization was completed: branches throughout Russia merged into a single company, OJSC MegaFon (now PJSC). As a result, the company had 8 billing systems, each with its own custom solutions, branch-specific features, and different organizational structures, IT and marketing.

Everything was fine until we had to launch one common federal product. Many difficulties appeared: one branch billed with rounding up, another with rounding down, and a third by the arithmetic mean. There were thousands of such details.

Even though all branches ran the same version of the billing system from the same supplier, the settings had diverged so far that gluing them back together would take a long time. We tried to reduce their number and ran into a second problem, one familiar to many corporations.

Vertical scaling. Even the most powerful hardware of the time did not meet our needs. We used Hewlett-Packard equipment from the Superdome Hi-End line, but it could not handle the needs of even two branches. We wanted horizontal scaling without large operating costs and capital investments.

Expected growth in the number of subscribers and services. Consultants have long been bringing the telecom world stories about IoT and M2M: the time will come when there is a SIM card in every phone and every iron, and two in the fridge. Today we have one number of subscribers, and in the near future there will be many more.

Technological Challenges


These four reasons pushed us toward major changes. The choice was between upgrading the system and designing from scratch. We thought for a long time, made serious decisions and held tenders. In the end, we decided to design from scratch and took on some interesting technological challenges.

Scalability


Earlier there were, say, 8 billing systems with 15 million subscribers each, but now a single system had to serve 100 million subscribers and more, so the load is much higher.

We became comparable in scale to major online players like Mail.ru or Netflix.

But further movement to increase the load and the subscriber base has set serious tasks for us.

Geography of our immense country


Between Kaliningrad and Vladivostok there are 7,500 km and 10 time zones. The speed of light is finite, and at such distances the delays are already significant. 150 ms on the best modern optical channels is a bit too much for real-time billing, especially the kind used in Russian telecom today. In addition, an update has to fit into one working day, and with different time zones this is a problem.
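
A rough back-of-the-envelope check of why distance alone matters can be sketched as follows. The refractive index of fiber (about 1.47) and the route-stretch factor are assumptions chosen for illustration, not figures from the talk; only the 7,500 km distance comes from the text.

```python
# Back-of-the-envelope propagation delay between Kaliningrad and Vladivostok.
SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in vacuum
FIBER_REFRACTIVE_INDEX = 1.47       # assumption: typical value for silica fiber
DISTANCE_KM = 7_500                 # figure quoted in the talk


def one_way_delay_ms(distance_km: float, route_stretch: float = 1.0) -> float:
    """Propagation delay over fiber in milliseconds.

    route_stretch is a hypothetical factor for real cable routes being
    longer than the straight-line distance.
    """
    speed_in_fiber = SPEED_OF_LIGHT_KM_S / FIBER_REFRACTIVE_INDEX
    return distance_km * route_stretch / speed_in_fiber * 1000


ideal_rtt = 2 * one_way_delay_ms(DISTANCE_KM)
stretched_rtt = 2 * one_way_delay_ms(DISTANCE_KM, route_stretch=1.5)
print(f"ideal round trip: {ideal_rtt:.1f} ms")
print(f"round trip with a 1.5x longer route: {stretched_rtt:.1f} ms")
```

Under these assumptions the round trip is already in the range of roughly 70 to 110 ms before any switching, queuing or processing is added, which is the same order of magnitude as the 150 ms mentioned above.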

We do not just provide services for a monthly fee: we have complex tariffs, packages and various modifiers. We cannot simply allow or forbid a subscriber to talk; we have to give them a certain quota and charge calls and other actions in real time so that they do not notice anything.
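
The quota idea can be illustrated with a minimal sketch: instead of a yes/no answer, the system reserves money for a slice of service, and after the call it charges what was actually used and releases the rest. The class and function names here are hypothetical and the tariff logic is deliberately trivial; this is not MegaFon's charging algorithm.

```python
from dataclasses import dataclass


@dataclass
class Account:
    balance: float          # money available to the subscriber
    reserved: float = 0.0   # money currently held by granted quotas


def grant_quota(account: Account, price_per_unit: float, wanted_units: int) -> int:
    """Reserve money for up to wanted_units and return the units granted."""
    available = account.balance - account.reserved
    affordable = int(available // price_per_unit)
    granted = max(0, min(wanted_units, affordable))
    account.reserved += granted * price_per_unit
    return granted


def settle_quota(account: Account, price_per_unit: float,
                 granted: int, used: int) -> None:
    """Charge the units actually used and release the whole reservation."""
    used = min(used, granted)
    account.reserved -= granted * price_per_unit
    account.balance -= used * price_per_unit


account = Account(balance=10.0)
granted = grant_quota(account, price_per_unit=1.5, wanted_units=60)
settle_quota(account, price_per_unit=1.5, granted=granted, used=4)
print(granted, account.balance, account.reserved)
```

In this example the subscriber asks for 60 units but can afford only 6, uses 4 of them, and is charged for exactly those 4.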

Fault tolerance


This is the flip side of centralization.

If we gather all subscribers in one system, then any emergency or disaster is catastrophic for the business. Therefore, the system is designed so that an accident cannot affect the entire subscriber base.

This is also a consequence of abandoning vertical scaling. When we moved to horizontal scaling, the number of servers grew from hundreds to thousands. They need to be managed and made interchangeable, and the IT infrastructure has to be backed up and the distributed system restored automatically.

These were the interesting challenges we faced. While designing the system, we tried to find global best practices to check how up to date we were and how closely we were following advanced technologies.

Worldwide Experience


Surprisingly, we did not find a single reference implementation in world telecom.

Europe dropped out because of its subscriber numbers and scale, the USA because of its flat tariffs. We found something in China and something in India, and brought in experts from Vodafone India.

To analyze the architecture, we assembled a Dream Team led by IBM: architects from different fields. These people could adequately assess what we were doing and bring some knowledge into our architecture.

Scale


A few numbers to illustrate.

We are designing the system for 80 million subscribers with headroom for a billion or more. This way we remove future thresholds. It is not because we are going to take over China, but because of the pressure of IoT and M2M.

300 million documents are processed in real time. Although we have 80 million subscribers, we also work with potential customers and with those who have left us, in case receivables need to be collected. Therefore, real volumes are noticeably larger.

2 billion transactions change balances every day: payments, charges, calls and other events. 200 TB of data changes actively and 8 PB changes a little more slowly, and this is not an archive but live data in a single billing system. The data center scale is 5 thousand servers at 14 sites.
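
To put these figures into per-second terms, here is a small arithmetic sketch. The daily transaction count, server count and site count come from the talk; the peak factor is a purely hypothetical assumption added for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

balance_tx_per_day = 2_000_000_000   # from the talk
servers = 5_000                      # from the talk
sites = 14                           # from the talk

average_tps = balance_tx_per_day / SECONDS_PER_DAY
servers_per_site = servers / sites

# Assumption: the daily peak is some multiple of the average.
# The factor 3 is hypothetical and not taken from the talk.
PEAK_FACTOR = 3
peak_tps = average_tps * PEAK_FACTOR

print(f"average load: {average_tps:,.0f} balance changes per second")
print(f"assumed peak: {peak_tps:,.0f} balance changes per second")
print(f"average servers per site: {servers_per_site:.0f}")
```

The average works out to a little over 23 thousand balance changes per second, around the clock.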

Technological Stack


When we planned the architecture and set about assembling the system, we adopted the most interesting and advanced technologies. The result is a technology stack familiar to any Internet player and to corporations that build high-load systems.



The stack is similar to stacks of other major players: Netflix, Twitter, Viber. It consists of 6 components, but we want to reduce and unify it.

Flexibility is good, but a large corporation cannot do without unification.

We are not going to replace, say, Oracle with Tarantool. In the reality of large companies that is a utopia, or a crusade lasting 5-10 years with an unclear outcome. But Cassandra and Couchbase can easily be replaced with Tarantool, and that is what we are aiming for.

Why Tarantool?


There are 4 simple criteria behind our choice of this database.

Speed. We ran load tests on MegaFon production systems. Tarantool won: it showed the best performance.

This does not mean that other systems fail to meet MegaFon's needs. Today's in-memory solutions are so fast that their performance margin is more than enough for the company. But we want to work with the leader, not with someone lagging behind, and that includes load tests.

Tarantool covers the company's needs even in the long term.

TCO. Couchbase support at MegaFon's scale costs astronomical money, whereas with Tarantool the situation is much more pleasant, and functionally they are close.

Another nice feature that slightly influenced our choice is that Tarantool works with memory better than other databases. It shows maximum efficiency.

Reliability. MegaFon invests in reliability probably like no one else. So when we looked at Tarantool, we realized that we had to bring it up to our requirements.

We invested our time and finances, and together with Mail.ru we created an enterprise version, which is already used by several other companies.

Tarantool-enterprise fully satisfied us in terms of security, reliability and logging.

Partnerships


The most important thing for me is direct contact with the developer. This is exactly what won us over with the Tarantool team.

If you come to a vendor, especially one that works with an anchor client, and say that you need the database to do this, this and that, the usual answer is:

- Fine, put your requirements at the bottom of that pile; someday we will probably get to them.

Many vendors have a roadmap for the next 2-3 years that is almost impossible to get into, whereas the Tarantool developers win you over with their openness, not only toward MegaFon, and adapt their system to the customer. It's cool and we really like it.

Where we applied Tarantool


We use Tarantool in several places. The first is the pilot, which we ran on the address directory system. At one time I wanted it to be a system similar to Yandex.Maps and Google Maps, but it turned out a little differently.

Take, for example, the address directory in the sales interface. On Oracle, finding the right address takes 12-13 seconds, which are uncomfortable numbers. When we switch to Tarantool, replacing Oracle with another database in the console, and run the same search, we get a 200-fold speedup! The city pops up after the third letter. We are now adapting the interface so that this happens after the first one. The response time is completely different: milliseconds instead of seconds.
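
The "city pops up after the third letter" behaviour is essentially a prefix lookup over an ordered in-memory index. The sketch below illustrates that general idea in plain Python; it is not Tarantool code and says nothing about how the actual address directory is implemented. The class name and sample data are hypothetical.

```python
import bisect


class PrefixIndex:
    """Sorted in-memory index that answers 'starts with' queries."""

    def __init__(self, names):
        # Keep lowercase keys next to the original spelling, sorted by key.
        self._entries = sorted((name.lower(), name) for name in names)
        self._keys = [key for key, _ in self._entries]

    def search(self, prefix: str, limit: int = 10):
        """Return up to `limit` names that start with `prefix`."""
        prefix = prefix.lower()
        # Binary search finds the first key that is >= prefix.
        start = bisect.bisect_left(self._keys, prefix)
        results = []
        for key, name in self._entries[start:start + limit]:
            if not key.startswith(prefix):
                break
            results.append(name)
        return results


cities = ["Moscow", "Mozhaysk", "Murmansk", "Kazan", "Kaliningrad", "Vladivostok"]
index = PrefixIndex(cities)
print(index.search("mo"))
print(index.search("Ka"))
```

Because the lookup is a binary search followed by a short scan, its cost grows with the logarithm of the directory size rather than with the size itself, which is one reason an in-memory index can answer in milliseconds.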

The second application is a trendy topic called two-speed IT, trendy because consultants keep saying from every corner that corporations should go there.



There is an infrastructure layer, and above it domains such as the billing system (for a telecom), corporate systems and corporate reporting. This is the core that should not be touched. Of course it can be touched, but only with paranoid attention to quality, because it brings the corporation its money.

Next comes the microservice layer, the part that differentiates the operator or any other player. Microservices can be created quickly on top of caches, pulling in data from different domains. This is a field for experiments: if something did not work out, you close one microservice and open another. This genuinely improves time-to-market and increases the company's reliability and speed.

Microservices are perhaps the main role of Tarantool in MegaFon.

Where do we plan to apply Tarantool


Compared with the transformation programs at Deutsche Telekom, Svyazkom and Vodafone India, our successful billing project is surprisingly dynamic and creative. In the course of this project, not only were MegaFon and its structure transformed, but Tarantool-enterprise appeared at Mail.ru, and our vendor Nexign (formerly Peter-Service) got BSS Box, a boxed billing solution.

This is, in a sense, a historic project for the Russian market. It can be compared with what Frederick Brooks describes in the book “The Mythical Man-Month”. Back in the 60s, IBM brought in 5,000 people to develop the OS/360 operating system for its mainframes. We have fewer, 1,800, but ours are a tough crew, and thanks to open source and new approaches we work more productively.

Below are the billing domains or, more broadly, the business systems. Enterprise people know CRM well. The other systems should already be familiar to everyone: Open API, API Gateway.



Open API


Let's look at the numbers again and at how the Open API works now. Its load is 10,000 transactions per second. Since we plan to actively develop the microservice layer and build a public MegaFon API, we expect this part to grow further. There will definitely be 100,000 transactions.

I don't know whether we can compare with Mail.ru in terms of SSO: they seem to have 1,000 0000 transactions per second. We are extremely interested in their solution and plan to learn from their experience, for example by building a functional SSO standby on Tarantool. The Mail.ru developers are now doing this with us.

CRM


CRM is those same 80 million subscribers that we want to bring to a billion; there are already 300 million documents there, including a three-year history. We are really looking forward to new services, and the growth point here is connected services. This snowball will keep growing because there will be more and more services. Accordingly, we will need the history, and we do not want to stumble over it.

Billing proper, in the sense of invoicing and working with customer receivables, has been moved into a separate domain. To improve performance, a domain architecture pattern is applied.

The system is divided into domains, the load is distributed and fault tolerance is provided. Additionally, we conducted work with a distributed architecture.

Everything else is enterprise-level solutions. Call storage holds 2 billion records per day, 60 billion per month. Sometimes a whole month has to be recalculated, and preferably quickly. Financial monitoring is that same 300 million, which keeps growing: subscribers often move between operators, which enlarges this part.

The most telecom-specific component of mobile communications is online charging. These are the systems that allow or deny a call, making the decision in real time. The load here is 30,000 transactions per second, but given the growth in data traffic we are planning for 250,000 transactions, which is why we are very interested in Tarantool.

The previous picture shows the domains where we are going to use Tarantool. CRM itself is, of course, broader, and we are going to use Tarantool in its core as well.

Our design figure of 100 million subscribers bothers me as an architect: what if there are 101 million? Redo everything again? To prevent this, we use caches, which at the same time improve availability.



In general, there are two approaches to using Tarantool. The first is to build all caches at the microservice level. As far as I understand, VimpelCom follows this path, creating a client cache.

We are less dependent on vendors and are changing the BSS core, so we get a single customer file out of the box. But we want to remove its bottlenecks. Therefore, we use a slightly different approach: we build caches inside the systems.

This way there is less desynchronization: one system is responsible for both the cache and the main master source.
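
The "cache inside the system" idea can be sketched as a write-through cache: a single component owns both the master records and their cached copies, so every update passes through one place. This is a minimal illustration with hypothetical names, using plain dictionaries in place of a real master database and a real in-memory store.

```python
class CustomerStore:
    """One component that owns both the master records and their cache."""

    def __init__(self):
        self._master = {}       # stands in for the system's master database
        self._cache = {}        # stands in for an in-memory cache
        self.master_reads = 0   # counts how often reads fall through to master

    def save(self, customer_id, record):
        # Write-through: update master and cache together, in one place.
        self._master[customer_id] = dict(record)
        self._cache[customer_id] = dict(record)

    def get(self, customer_id):
        if customer_id in self._cache:
            return dict(self._cache[customer_id])
        # Cache miss: read from master and remember the result.
        self.master_reads += 1
        record = self._master.get(customer_id)
        if record is None:
            return None
        self._cache[customer_id] = dict(record)
        return dict(record)


store = CustomerStore()
store.save(1, {"name": "Ivan", "tariff": "basic"})
store.save(1, {"name": "Ivan", "tariff": "premium"})
print(store.get(1), store.master_reads)
```

Since no other system writes to this cache, a reader never sees a value that the master has already replaced, which is the desynchronization the approach aims to avoid.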

This method fits well with Tarantool's transactional-skeleton approach, where only the parts related to data changes are kept up to date. Everything else can be stored elsewhere. There is no huge data lake and no unmanaged global cache. Caches are designed for a specific system, for products, for customers, or to make life easier for customer service. When a subscriber is upset about quality, we want to serve them well.

RTO and RPO


There are two terms in IT: RTO and RPO.

Recovery time objective is the time it takes to restore a service after a failure. RTO = 0 means that even if something fails, the service continues to work.

Recovery point objective is the amount of data, measured as a period of time, that we can afford to lose in a failure. RPO = 0 means we do not lose data.

Tarantool Task


Let's try to solve a problem for Tarantool.

Given: a shopping cart of orders of the kind familiar to everyone, for example from Amazon or elsewhere. The cart must work 24 hours a day, 7 days a week, or 99.99% of the time. Incoming orders must keep their sequence, because we cannot switch a subscriber's service on or off in random order: everything must be strictly sequential. The previous subscription affects the next one, so the data is important and nothing may be lost.
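
To make the 99.99% requirement tangible, the following arithmetic sketch converts an availability percentage into a yearly downtime budget. The function name is hypothetical and the year is taken as 365 days for simplicity.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be down at the given availability."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} minutes per year")
```

At 99.99% the budget is under an hour per year, which leaves little room for manual recovery and is one motivation for automatic switching between data centers.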

Solution. You can try to solve it head-on and ask the database developers, but the problem has no mathematical solution. We could recall theorems, conservation laws and quantum physics, but why bother: it cannot be solved at the database level.

The good old architectural approach works here: you need to know the subject area well and use that knowledge to solve the puzzle.



Our solution is to create a distributed order registry on Tarantool: a geo-distributed cluster. In the diagram there are three data centers, two west of the Urals and one east of them, and we distribute all requests among these centers.

Netflix, now considered one of the leaders in IT, had only one data center until 2012. On December 24, Catholic Christmas Eve, that data center went down. Users in Canada and the United States were left without their favorite movies, were very upset and wrote about it on social networks. Netflix now has three data centers on the west and east coasts and one in western Europe.

We are building a geo-distributed solution from the start, because fault tolerance is important to us.

So, we have a cluster, but what about RPO = 0 and RTO = 0? The solution is simple, and it depends on the subject area.

What matters in orders? There are two parts: filling the cart BEFORE the purchase decision, and what happens AFTER. In telecom, the BEFORE part is usually called order capturing or order negotiation. It can be much more complicated than in an online store, because you have to serve the customer and offer 5 options; all this takes some time while the cart is being filled. A failure is possible at this point, but it is not critical, because the process is interactive and supervised by a person.

If the Moscow data center suddenly fails, we switch automatically to another data center and continue to work. In theory, one item in the cart may be lost, but you can see that, fill the cart again and carry on. In this case, RTO = 0.

Then there is the second part: once we have clicked “submit”, we want the data not to be lost. From this point on, automation takes over, and this is RPO = 0. We apply two different patterns: in one case it can be simply a geo-distributed cluster with one switchable master, in the other some kind of quorum write. The patterns may vary, but the problem gets solved.

Further, having a distributed order registry, we can also scale all of this by having many dispatchers and executors that access the registry.
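
The difference between the two patterns can be shown with a toy model. The sketch below uses in-memory lists as stand-ins for data centers; all names are hypothetical and it does not use or represent Tarantool's replication API. It only contrasts "acknowledge once the acting master has the entry" with "acknowledge once a majority has it".

```python
class Replica:
    """Toy stand-in for one data center holding a copy of the order log."""

    def __init__(self, name: str, online: bool = True):
        self.name = name
        self.online = online
        self.log = []

    def append(self, entry) -> bool:
        """Store the entry if this replica is reachable."""
        if not self.online:
            return False
        self.log.append(entry)
        return True


def write_to_master(replicas, entry) -> bool:
    """Draft-cart pattern: the first reachable replica acts as master.

    The write is acknowledged as soon as that one replica has it, so the
    service keeps working after a failover, but a recent entry can be lost.
    """
    for replica in replicas:
        if replica.append(entry):
            return True
    return False


def write_with_quorum(replicas, entry) -> bool:
    """Submitted-order pattern: acknowledge only if a majority stored it.

    Simplification: a failed quorum write is not rolled back here.
    """
    acks = sum(1 for replica in replicas if replica.append(entry))
    return acks >= len(replicas) // 2 + 1


centers = [Replica("moscow", online=False), Replica("west"), Replica("east")]
print(write_to_master(centers, "cart item"))        # served by the next center
print(write_with_quorum(centers, "submitted order"))  # 2 of 3 centers confirm
```

With one of three centers down, both writes succeed; with two down, the draft-cart write would still be accepted while the quorum write would be refused, which mirrors the trade-off between RTO = 0 and RPO = 0 described above.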



Cassandra and Tarantool together


There is another case: the "balance showcase". It is an interesting example of using Cassandra and Tarantool together.

We use Cassandra because 2 billion calls a day is not the limit, and there will be more. Marketers love to color traffic by source, and there is more and more detail, for social networks for example. All of this swells the history.

Cassandra allows you to scale horizontally to any volume.

We are comfortable with Cassandra, but it has one problem: it is not good at reading. Writing is fine, 30,000 per second is not a problem; reading is the problem.

So the cache topic came up, and along the way we solved another problem. In the old traditional setup, data from the switches and from online billing arrives in files that we load into Cassandra. We struggled with loading these files reliably, even taking advice from IBM on managed file transfer: there are solutions that transfer files efficiently using, for example, UDP rather than TCP. That is good, but it still takes minutes, and until everything is loaded, the call center operator cannot tell the customer what happened to their balance and has to wait.

To prevent this, we use a parallel functional standby. By sending each event through Kafka to Tarantool and recalculating aggregates in real time, for example for the current day, we get a balance cache that can serve balances at any speed, for example 100 thousand transactions per second, within those same 2 seconds.

The goal is that 2 seconds after a call is made, the personal account shows not only the changed balance but also information about why it changed.
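
The balance cache can be pictured as a small in-memory aggregate that is updated event by event and remembers the reason for each change. In the sketch below a plain Python list stands in for the Kafka stream and a dictionary for Tarantool; the event field names are hypothetical.

```python
from collections import defaultdict, deque


class BalanceCache:
    """In-memory balances with the most recent reasons for each change."""

    def __init__(self, history_size: int = 5):
        self._balances = defaultdict(float)
        self._history = defaultdict(lambda: deque(maxlen=history_size))

    def apply(self, event: dict) -> None:
        """Apply one balance-changing event as soon as it arrives."""
        subscriber = event["subscriber"]
        self._balances[subscriber] += event["amount"]
        self._history[subscriber].append((event["amount"], event["reason"]))

    def balance(self, subscriber: str):
        """Return the current balance and the latest changes with reasons."""
        history = self._history.get(subscriber, ())
        return self._balances.get(subscriber, 0.0), list(history)


# Stand-in for the event stream; field names are illustrative only.
events = [
    {"subscriber": "79260000001", "amount": 100.0, "reason": "payment"},
    {"subscriber": "79260000001", "amount": -3.5, "reason": "call"},
]

cache = BalanceCache()
for event in events:
    cache.apply(event)

print(cache.balance("79260000001"))
```

Because the aggregate is updated per event rather than per loaded file, a call center operator could read both the new balance and the reason for the change without waiting for the batch load to finish.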

Conclusion


These were examples of using Tarantool. We really liked the openness of Mail.ru, their willingness to consider different cases.

It is already difficult for consultants from BCG or McKinsey, Accenture or IBM to surprise us with anything new: much of what they offer we are already doing, have done, or are planning to do. I think Tarantool will take a worthy place in our technology stack and replace many of the existing technologies. We are in the active phase of developing this project.

The talk by Oleg and Andrey was one of the best at last year's Tarantool Conference, and on June 17 Oleg Ivlev will speak at T+ Conference 2019 with the talk "Why Tarantool in Enterprise". Alexander Deulin from MegaFon will also give a talk, "Tarantool caches and replication from Oracle". We will find out what has changed and which plans have been realized. Join us: the conference is free, you only need to register. All talks have been accepted and the conference program is set: new cases, new experience with Tarantool, architecture, enterprise, tutorials and microservices.
