[From the sandbox] Five problems in the processes of operation and support of Highload IT systems


Hello, Habr! For ten years I have been supporting highload IT systems. In this article I will not write about tuning nginx to handle 1000+ RPS or other technical matters. Instead, I will share observations about the process problems that arise in the support and operation of such systems.

Monitoring


Technical support should not wait for a ticket that reads "What... why is the site down again?" A minute after the site goes down, support should already see the problem and be working on it. But the site is only the tip of the iceberg: its availability is one of the first things put under monitoring.

What do you do when stock balances for an online store stop arriving from the ERP system? Or when the CRM system that calculates customer discounts stops responding? Meanwhile, the site appears to work. Zabbix, or whatever you use, gets its 200 response. The shift on duty receives no notification from monitoring and happily watches the first episode of the new Game of Thrones season.

Often, monitoring is limited to measuring disk space, RAM, and CPU load on the servers. But for the business it is more important that product availability reaches the site. If, say, one virtual machine in a cluster goes down, traffic simply stops going to it and the load on the other servers rises. The company will not lose money.

Therefore, in addition to monitoring the technical parameters of the operating systems on the servers, you need to set up business metrics, that is, metrics that directly affect money: interactions with external systems (CRM, ERP and others), the number of orders over a given period, successful and failed customer logins, and so on.
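
As an illustration, here is a minimal sketch of what a business-metric check might look like. Everything in it is an assumption made for the example: the `BusinessSnapshot` structure, the function name, and all thresholds are hypothetical, and a real setup would feed such values into Zabbix or another monitoring system rather than into a standalone script.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class BusinessSnapshot:
    """Hypothetical inputs that a monitoring job would collect from the shop."""
    order_times: List[datetime]          # timestamps of recent orders
    last_stock_sync: Optional[datetime]  # last successful stock update from the ERP
    auth_ok: int                         # successful logins in the window
    auth_failed: int                     # failed logins in the window


def check_business_metrics(
    snap: BusinessSnapshot,
    now: datetime,
    window: timedelta = timedelta(minutes=15),         # assumed window
    min_orders: int = 5,                               # assumed threshold
    max_sync_age: timedelta = timedelta(minutes=30),   # assumed threshold
    max_auth_fail_ratio: float = 0.5,                  # assumed threshold
) -> List[str]:
    """Return a list of alert messages; an empty list means every check passed."""
    alerts = []

    # Too few orders in the window: the site may answer 200 while nobody can buy.
    recent = [t for t in snap.order_times if now - window <= t <= now]
    if len(recent) < min_orders:
        alerts.append(f"orders: only {len(recent)} in the last {window}")

    # Stock balances have not arrived from the ERP for too long.
    if snap.last_stock_sync is None or now - snap.last_stock_sync > max_sync_age:
        alerts.append("erp: stock balances are stale")

    # An unusually high share of failed logins.
    total_auth = snap.auth_ok + snap.auth_failed
    if total_auth > 0 and snap.auth_failed / total_auth > max_auth_fail_ratio:
        alerts.append("auth: too many failed logins")

    return alerts
```

The point of the sketch is that each check is tied to money rather than to server health: the duty shift gets an alert even when every server is up and the home page returns 200.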

Interaction with external systems


Any website or mobile application with an annual turnover of more than a billion rubles interacts with external systems, from the aforementioned CRM and ERP to an external Big Data system that receives sales data in order to analyze purchases and offer the customer a product he will definitely buy (in reality, no). Each such system has its own support team, and communicating with them is often painful, especially when the problem is global and has to be investigated across several systems.

Some systems give out a phone number or Telegram contact for their admins. For others you have to write emails to managers or go to the bug trackers of those external systems. Even within one large company, different systems often use different ticket trackers. Tracking the status of a ticket sometimes becomes impossible. You receive a request in one Jira. Then, in the comments of that first Jira, you post a link to a task in another Jira. In the second Jira, someone then comments on the ticket that you need to call some admin named Andrew to resolve the issue. And so on.

The best solution to this problem would be a single space for communication, for example in Slack, with everyone involved in operating the external systems invited, plus a single tracker so that tickets are not duplicated. Tickets should be tracked in one place, from the monitoring alert to the bug fix being deployed to production. You may say that this is unrealistic, that historically we work in one tracker and they work in another, that the systems appeared at different times and each had its own autonomous IT team. I agree, and that is why the problem has to be solved from above, at the level of the CIO or the product owner.

Each system you interact with should provide support as a service, with a clear SLA for resolving problems by priority, and not whenever some admin named Andrew has a minute for you.
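
To make "a clear SLA by priority" concrete, here is a small sketch. The priority names and resolution times are purely hypothetical placeholders; real values are whatever the teams agree on.

```python
from datetime import datetime, timedelta

# Hypothetical resolution targets per priority; real values come from the agreed SLA.
SLA_TARGETS = {
    "critical": timedelta(hours=1),
    "high": timedelta(hours=4),
    "normal": timedelta(days=1),
    "low": timedelta(days=5),
}


def sla_deadline(created_at: datetime, priority: str) -> datetime:
    """Return the moment by which a ticket of this priority should be resolved."""
    return created_at + SLA_TARGETS[priority]


def is_breached(created_at: datetime, priority: str, now: datetime) -> bool:
    """Return True if the ticket is still open past its SLA deadline."""
    return now > sla_deadline(created_at, priority)
```

With such a table in place, "when will it be fixed?" has an answer that does not depend on who happens to be available.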

The human bottleneck


Doesn't every project (or product) have a person whose vacation sends management into convulsions? It may be a DevOps engineer, an analyst or a developer. After all, only the DevOps engineer knows which containers run on which servers and how to restart a container when something goes wrong, and in general no complex problem can be solved without him. The analyst is the only one who knows how your complex mechanism works: which data streams go where, and which request parameters sent to which services produce which responses.
Who will quickly make sense of the errors in the logs and promptly fix a critical bug in production? Of course, that same developer. There are others, but for some reason only he understands how the different modules of the system are put together.

The root of this problem is the lack of documentation. If all the services of your system were documented, you could deal with a problem without the analyst. If the DevOps engineer set aside a couple of days from his busy schedule and described all the servers, services and instructions for solving typical problems, then a problem that arises in his absence could be solved without him. Nobody would have to gulp down a beer on the beach and go looking for Wi-Fi to fix the problem while on vacation.

Competence and responsibility of support staff


On large projects, companies do not skimp on developer salaries. They hunt for expensive middle and senior developers from similar projects. With support, the situation is somewhat different: companies try to cut these costs in every way. They hire inexpensive former help-desk generalists and boldly go into battle. Such a strategy can work if we are talking about the website of some factory in Zelenograd.

If we are talking about a large online store, then every hour of downtime costs more than the monthly salary of such an admin. Take 1 billion rubles in annual turnover as a starting point. This is the minimum turnover of any online store from the TOP-100 ranking for 2018. Divide this amount by the number of hours in a year and we get more than 100,000 rubles of lost turnover per hour. And if we exclude the night hours, we can safely double that amount.
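
The arithmetic is easy to verify. In the sketch below, the function name is made up for the example, and the figure of 12 selling hours per day is an assumption chosen to match the "double the amount" estimate, not a measured value.

```python
def turnover_per_hour(annual_turnover: float, selling_hours_per_day: float = 24.0) -> float:
    """Average turnover per selling hour, assuming sales are spread evenly."""
    return annual_turnover / (365 * selling_hours_per_day)


ANNUAL_TURNOVER_RUB = 1_000_000_000  # the 1 billion rubles from the text

# Sales spread over all 8,760 hours of the year.
print(round(turnover_per_hour(ANNUAL_TURNOVER_RUB)))      # 114155

# Assumption: sales happen only during 12 hours a day (night hours excluded).
print(round(turnover_per_hour(ANNUAL_TURNOVER_RUB, 12)))  # 228311
```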

But money is not the main thing, is it? (No, of course it is the main thing.) There are also reputational losses. An hour of downtime at a well-known online store can trigger both a wave of comments on social networks and articles in the trade media. And kitchen conversations among friends along the lines of "Don't buy anything there, their site is always down" cannot be measured at all.

Now about responsibility. In my practice, there was a case when the administrator on duty did not react in time to a monitoring notification that the site was unavailable. On a pleasant summer Friday evening, the site of an online store well known in Moscow quietly lay down. On Saturday morning, the product owner of this site could not understand why the site would not open while the support chats and the urgent alerts in Slack were silent. The error cost us a six-figure sum, and it cost the duty officer as well.

Responsibility is a quality that is difficult to develop. A person either has it or does not. Therefore, in interviews I try to detect it with various questions that indirectly show whether a person is used to taking responsibility. If a candidate says that he chose his university because his parents told him to, or that he is changing jobs because his wife says he earns too little, it is better not to get involved with him.

Interaction with the development team


When users run into simple problems in production, support solves them on its own: it tries to reproduce the problem, analyzes the logs and so on. But what do you do when a bug surfaces in production? In that case, support creates a task for the developers, and this is where the fun begins.

Developers are constantly overloaded. They are building new features. Fixing bugs from production is, let's say, not the most interesting work. The deadline for the next sprint is looming. And here come the unpleasant support people and say: “Immediately leave everything, we have problems.” The priority of such tasks is minimal, especially when the problem is not the most critical and the main functionality of the site works, and when the release manager is not running around with bulging eyes and writing: “Urgently bring this task to the next release or hotfix.”

Tasks with normal or low priority drift from release to release. To the question “When will the task be completed?” you get answers along the lines of: “Sorry, there are a lot of tasks right now, ask the team leads or the release manager.”

Problems in production should have a higher priority than building new features. Bad reviews will not be long in coming if users constantly run into bugs, and a damaged reputation is difficult to restore.

DevOps addresses the interaction between development and support. The term is often used to mean a specific person who helps create test environments for development, builds the CI/CD pipeline and quickly rolls tested code out to production. In fact, DevOps is an approach to software development in which all participants in the process work closely together and help create and update software products and services faster. I mean analysts, developers, testers and support.

In this approach, support and development are not separate departments with their own goals and objectives. Development is involved in operations and vice versa. The famous phrase of distributed teams, "The problem is not on my side," shows up in chats less often, and end users are a bit happier.
