DevOps is at the peak of its hype. Everyone who can be bothered is implementing a continuous integration and delivery (CI/CD) pipeline. But most teams do not pay due attention to the reliability of their information systems at the various stages of that pipeline. In this article, I would like to share my experience in automating software quality checks and implementing possible scenarios for its “self-healing.”
I work as an engineer in the IT services management department at LANIT-Integration. My specialty is implementing systems for monitoring application performance and availability. I often talk with IT customers from different market segments about current issues in monitoring the quality of their IT services. Their main goal is to shorten the release cycle and release more often. This, of course, is good: more releases - more new features - more satisfied users - more profit. But in practice things do not always turn out well. At a very high deployment rate, the question of release quality comes up immediately. Even with a fully automated pipeline, one of the biggest problems is moving services from testing to production without affecting uptime or users' interaction with the application.
Following numerous conversations with customers, I can say that the quality control of releases, the problem of the reliability of the application and the possibility of its “self-healing” (for example, rolling back to a stable version) at various stages of the CI/CD pipeline are among the most exciting and relevant topics.
Recently, I worked on the customer side myself, in the application support team of an online bank. Our application's architecture used a large number of in-house microservices. The saddest thing is that not all developers could keep up with the high pace of development, so the quality of some microservices suffered, which earned them and their creators ridiculous nicknames. There were jokes about what material these products were made of.
The high frequency of releases and the large number of microservices make it difficult to understand how the application works as a whole, both at the testing and operation stages. Changes occur constantly, and it is very difficult to keep track of them without good monitoring tools. Often, on the morning after a nightly release, the developers sit as if on a powder keg and hope that nothing breaks, although all checks passed at the testing stage.
There is one more thing. At the testing stage, the software operability is checked: the execution of the main functions of the application and the absence of errors. Qualitative performance estimates are either missing or do not take into account all aspects of the application and the integration layer. Some metrics may not be checked at all. As a result, when a breakdown occurs in a production environment, the technical support department only learns about this when real users start complaining. I want to minimize the impact of low-quality software on end users.
One solution is to implement software quality control at the various stages of the CI/CD pipeline and to add scenarios for restoring the system after failures. Also remember that this is DevOps. The business expects to receive a new product as quickly as possible. Therefore, all our checks and scripts should be automated.
The task is divided into two components:
- quality control of builds at the testing stage (automating the detection of substandard builds);
- software quality control in the prod environment (mechanisms for automatic detection of problems and possible scenarios for their self-recovery).
Tool for monitoring and collecting metrics
To accomplish these tasks, we need a monitoring system that can detect problems and pass them to automation systems at the various stages of the CI/CD pipeline. It is also a plus if this system provides useful metrics for the various teams: development, testing, and operations. And better still if it does so for the business as well.
A combination of different systems (Prometheus, ELK Stack, Zabbix, etc.) can be used to collect metrics, but, in my opinion, APM-class solutions (Application Performance Monitoring) are the best fit for these tasks and can greatly simplify your life.
While working in the support service, I started a similar project using the APM solution from Dynatrace. Now, working for an integrator, I know the market of monitoring systems quite well. My subjective opinion: Dynatrace is best suited to solving such problems.
Dynatrace provides a horizontal view of each user operation with a deep level of detail down to the code execution level. You can track the entire chain of interaction between various information services: from the front-end levels of web and mobile applications, back-end application servers, integration bus to a specific call to the database.
Automatic construction of all dependencies between system components. Source
Automatic detection and construction of the path of the service operation. Source
We also remember that we need to integrate with various automation tools. Here the solution has a convenient API that allows you to send and receive various metrics and events.
Next, we turn to a more detailed consideration of how to solve the tasks with the help of the Dynatrace system.
Task 1. Automation of assembly quality control at the testing stage
The first task is to find problems as early as possible at the stages of the application delivery pipeline. Only “good” code builds should reach the prod environment. To do this, in your pipeline at the testing stage, additional monitors should be included to check the quality of your services.
Let's consider step by step how to implement and automate this process. The figure shows the flow of automated software quality control steps:
- deploying the monitoring system (installing agents);
- defining your software quality assessment events (metrics and threshold values) and passing them to the monitoring system;
- generating load and running performance tests;
- collecting performance and availability data in the monitoring system;
- transferring test data, based on the software quality assessment events, from the monitoring system to the CI/CD system for automated build analysis.
Step 1. Deploy monitoring system
First, you need to install agents in your test environment. Here Dynatrace has a nice feature: it uses the universal OneAgent, which is installed on an OS instance (Windows, Linux, AIX), automatically detects your services, and starts collecting monitoring data on them. You do not need to configure a separate agent for each process. The same applies to cloud and container platforms. You can also automate agent installation. Dynatrace fits well into the concept of “infrastructure as code” (IaC): there are ready-made scripts and instructions for all popular platforms. Embed the agent in the configuration of your service, and when it is deployed, you immediately get a new service with an already working agent.
Step 2. Define your software quality assessment events
Now you need to decide on a list of services and business operations. It is important to cover exactly those user operations that are business-critical for your service. Here I recommend consulting with business and system analysts.
Next, you need to determine which metrics you want to include in the check for each of the levels. For example, these may be execution time (with division into mean, median, percentile, etc.), errors (logical, service, infrastructure, etc.) and various infrastructure metrics (memory heap, garbage collector, thread count, etc.).
To make this automatable and convenient for the DevOps team, we introduce the concept of “monitoring as code.” What I mean by this is that the developer or tester can write a simple JSON file that defines the indicators for software quality checks.
Let's look at an example of such a JSON file. Objects from the Dynatrace API are used as key/value pairs (a description of the API can be found in the Dynatrace API documentation).
The file is an array of time series definitions (timeseries):
- timeseriesId - the metric to check, for example, Response Time, Error count, Memory used, etc.;
- aggregation - the aggregation level of the metric, in our case avg, but you can use whatever you need (avg, min, max, sum, count, percentile);
- tags - an object tag in the monitoring system, or you can specify a specific object identifier;
- severe and warning - the threshold values for our metrics; if the test value exceeds the severe threshold, the build is marked as unsuccessful.
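As a minimal sketch, a file of this shape could be generated and sanity-checked as follows. The top-level "timeseries" key, the metric identifiers, the tag, the units and the threshold values are illustrative assumptions, not values from a real project, and the exact field set accepted by your integration may differ.

```python
import json

# Hypothetical quality-gate definition. Field names follow the list above;
# metric IDs, tag, units and threshold values are made up for illustration.
SPEC = {
    "timeseries": [
        {
            "timeseriesId": "com.dynatrace.builtin:service.responsetime",
            "aggregation": "avg",
            "tags": "Frontend",
            "warning": 250000,   # assumed unit: microseconds
            "severe": 1000000,
        },
        {
            "timeseriesId": "com.dynatrace.builtin:service.failurerate",
            "aggregation": "avg",
            "tags": "Frontend",
            "warning": 1,        # assumed unit: percent
            "severe": 5,
        },
    ]
}

REQUIRED_KEYS = ("timeseriesId", "aggregation", "tags", "warning", "severe")
ALLOWED_AGGREGATIONS = {"avg", "min", "max", "sum", "count", "percentile"}


def validate_spec(spec):
    """Return a list of problems found in the spec; an empty list means it looks sane."""
    problems = []
    for index, entry in enumerate(spec.get("timeseries", [])):
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            problems.append(f"entry {index}: missing {', '.join(missing)}")
            continue
        if entry["aggregation"] not in ALLOWED_AGGREGATIONS:
            problems.append(f"entry {index}: unknown aggregation {entry['aggregation']}")
        if entry["warning"] > entry["severe"]:
            problems.append(f"entry {index}: warning threshold is above severe")
    return problems


# This text is what would be committed next to the service code.
spec_text = json.dumps(SPEC, indent=2)
```

Keeping the file in the same repository as the service means threshold changes go through the same review as code changes.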
The following figure shows an example of using such thresholds.
Step 3. Load Generation
After we have defined the quality levels of our service, we need to generate a test load. You can use whichever testing tool is convenient for you, for example, JMeter, Selenium, Neotys, Gatling, etc.
The Dynatrace monitoring system allows you to capture various metadata from your tests and recognize which of the tests relates to which release cycle and which service. It is recommended to add additional headers to HTTP test requests.
The following figure shows an example where, using the optional X-Dynatrace-Test header, we mark that this test refers to testing the operation of adding a product to the cart.
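A small sketch of how a load script might build such a header value. The key names (LTN, LSN, TSN, VU) and the semicolon-separated key=value layout are my assumptions about the convention; check them against the Dynatrace documentation for your version. The test and script names are invented.

```python
def build_test_header(fields):
    """Serialize key/value pairs as 'K1=v1;K2=v2' for the X-Dynatrace-Test header."""
    return ";".join(f"{key}={value}" for key, value in fields.items())


# Assumed keys: LTN = load test name, LSN = load script name,
# TSN = test step name, VU = virtual user ID. All values are illustrative.
header_value = build_test_header({
    "LTN": "nightly-load-test",
    "LSN": "checkout-scenario",
    "TSN": "AddToCart",
    "VU": 1,
})

# Headers to attach to every HTTP request issued by the test step.
headers = {"X-Dynatrace-Test": header_value}
```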
Each time you run a load test, you send additional contextual information to Dynatrace from the CI/CD server using the event API. This lets the system tell the different tests apart.
Event in the monitoring system about the start of load testing. Source
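A sketch of what the CI/CD server could send at the start of a test. I assume the Dynatrace Events API (POST to /api/v1/events with an Api-Token header) and an annotation-style event attached to services by tag; the field names should be verified against the API documentation for your environment. The host name, token, tag and build number are placeholders. The request is only built here, not sent.

```python
import json
import urllib.request


def build_load_test_event(base_url, api_token, service_tag, test_name,
                          build_number, start_ms, end_ms):
    """Build (but do not send) a request that marks tagged services with a load-test event."""
    payload = {
        "eventType": "CUSTOM_ANNOTATION",
        "start": start_ms,
        "end": end_ms,
        "source": "Jenkins",
        "annotationType": "Load test",
        "annotationDescription": f"{test_name} for build {build_number}",
        "attachRules": {
            "tagRule": [
                {
                    "meTypes": ["SERVICE"],
                    "tags": [{"context": "CONTEXTLESS", "key": service_tag}],
                }
            ]
        },
        "customProperties": {"build": str(build_number), "test": test_name},
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/events",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Api-Token {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_load_test_event(
    "https://dynatrace.example.invalid", "PLACEHOLDER_TOKEN",
    service_tag="Frontend", test_name="AddToCart", build_number=42,
    start_ms=1_700_000_000_000, end_ms=1_700_000_600_000,
)
# Sending is left to the pipeline step: urllib.request.urlopen(request)
```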
Step 4-5. Collect performance data and transfer data to the CI/CD system
Together with the generated test, an event is sent to the monitoring system requesting that data be collected for checking the service quality indicators. The event also references our JSON file, which defines the key metrics.
Software quality check event, generated on the CI/CD server and sent to the monitoring system
In our example, the quality control event is called perfSigDynatraceReport. It comes from a ready-made plugin for integration with Jenkins, developed by the team at T-Systems Multimedia Solutions. Each event that starts a check contains information about the service, the build number, and the time of testing. The plugin collects performance values during the build, evaluates them, and compares the result with previous builds and non-functional requirements.
Event in the monitoring system about the start of the build quality check. Source
After the test is completed, all software quality assessment metrics are transferred back to the continuous integration system, for example, Jenkins, which generates a report on the results.
The result of statistics on builds on the CI/CD server. Source
For each individual build, we see statistics for each metric we specified over the entire test run. We also see whether any threshold values were violated (warning and severe thresholds). Based on the aggregated metrics, the entire build is marked as stable, unstable, or failing. For convenience, you can also add indicators to the report that compare the current build with the previous one.
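The marking logic can be sketched roughly like this. It is my simplified reading, not the plugin's actual code: I assume a value above the warning threshold makes a metric unstable, a value above the severe threshold makes it failing, and the build takes the worst verdict among its metrics. Metric names and numbers are invented.

```python
SEVERITY_ORDER = {"stable": 0, "unstable": 1, "failing": 2}


def classify(value, warning, severe):
    """Map one aggregated metric value to a verdict using its two thresholds."""
    if value > severe:
        return "failing"
    if value > warning:
        return "unstable"
    return "stable"


def evaluate_build(measurements, thresholds):
    """Return (overall verdict, per-metric verdicts); the worst metric decides."""
    verdicts = {
        metric: classify(value, thresholds[metric]["warning"], thresholds[metric]["severe"])
        for metric, value in measurements.items()
    }
    overall = max(verdicts.values(), key=SEVERITY_ORDER.__getitem__, default="stable")
    return overall, verdicts


def percent_change(current, previous):
    """Per-metric change of the current build relative to the previous one, in percent."""
    return {
        metric: (current[metric] - previous[metric]) / previous[metric] * 100
        for metric in current
        if previous.get(metric)  # skip metrics that are missing or zero in the previous build
    }


# Hypothetical thresholds and results for two consecutive builds.
THRESHOLDS = {
    "response_time_ms": {"warning": 250, "severe": 1000},
    "error_count": {"warning": 5, "severe": 20},
}
current_build = {"response_time_ms": 300, "error_count": 2}
previous_build = {"response_time_ms": 240, "error_count": 2}

overall, verdicts = evaluate_build(current_build, THRESHOLDS)
changes = percent_change(current_build, previous_build)
```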
View detailed statistics on builds on the CI/CD server. Source
Detailed Comparison of Two Assemblies
If necessary, you can go to the Dynatrace interface and there you can view statistics on each of your builds in more detail and compare them with each other.
Comparison of build statistics in Dynatrace. Source
As a result, we get “monitoring as a service,” automated in the continuous integration pipeline. The developer or tester only needs to define a list of metrics in the JSON file, and everything else happens automatically. We get transparent quality control of releases: all notifications about performance, resource consumption, or architectural regressions.
Task 2. Automation of software quality control in a production environment
So, we have solved the problem of automating monitoring at the testing stage of the pipeline. This minimizes the percentage of poor-quality builds that reach the prod environment.
But what do we do if bad software still reaches prod, or something simply breaks? Ideally, we want problems to be detected automatically and, where possible, the system to restore itself, at least at night.
To do this, we need, by analogy with the previous section, to provide for automatic checks of software quality in the prod environment and lay out scenarios for self-healing of the system.
Auto-correction as code
Most companies already have an accumulated knowledge base of common problems and a list of corrective actions, such as restarting processes, cleaning up resources, rolling back versions, reverting incorrect configuration changes, increasing or decreasing the number of components in a cluster, switching between the blue and green environments, and so on.
Despite the fact that these use cases have been known for many years to many teams with whom I communicate, few have thought about it and invested in their automation.
If you think about it, there is nothing too complicated in the implementation of processes for self-healing of the application; you need to present the already known scenarios of your admins in the form of code scripts (the concept of “auto correction as code”) that you wrote in advance for each case. Automatic remediation scripts should be aimed at addressing the root cause of the problem. You set the correct incident response actions yourself.
Any metric from your monitoring system can act as a trigger for running a script, as long as these metrics accurately determine that everything is bad, since you would not want to get false positives in a production environment.
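One simple way to reduce false positives is to fire only when the metric stays above its threshold for several consecutive samples. The sketch below illustrates that idea; the class name, the threshold and the sample count are my own illustrative choices, and the action is just a stand-in for a real remediation script.

```python
from collections import deque


class SustainedBreachTrigger:
    """Fire an action once when a metric stays above a threshold for N consecutive samples."""

    def __init__(self, threshold, required_samples, action):
        self.threshold = threshold
        self.action = action
        self.fired = False
        self._recent = deque(maxlen=required_samples)

    def observe(self, value):
        """Record one sample; return True if the action was fired by this sample."""
        breached = value > self.threshold
        self._recent.append(breached)
        if not breached:
            self.fired = False  # metric recovered, re-arm the trigger
            return False
        window_full = len(self._recent) == self._recent.maxlen
        if window_full and all(self._recent) and not self.fired:
            self.fired = True
            self.action()
            return True
        return False


calls = []
trigger = SustainedBreachTrigger(threshold=90, required_samples=3,
                                 action=lambda: calls.append("restart-service"))
results = [trigger.observe(v) for v in (95, 96, 97, 98, 10, 95, 96, 97)]
```

A single spike never fires the script, and a continuing breach fires it only once until the metric recovers.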
You can use any system or set of systems: Prometheus, ELK Stack, Zabbix, etc. But I will give a few examples based on the APM solution (Dynatrace will again be an example), which will also help make your life easier.
First, it covers everything related to application performance. The solution provides hundreds of metrics at various levels that you can use as triggers:
- user level (browsers, mobile applications, IoT devices, user behavior, conversion, etc.);
- level of service and operations (performance, availability, errors, etc.);
- application infrastructure level (OS host metrics, JMX, MQ, web-server, etc.);
- platform level (virtualization, cloud, container, etc.).
Levels of monitoring in Dynatrace. Source
Second, as I said earlier, Dynatrace has an open API, which makes it very convenient to integrate with various third-party systems. For example, it can send a notification to the automation system when control parameters are exceeded.
Below is an example for interacting with Ansible.
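As a rough sketch of such an interaction, the handler below maps an incoming problem notification to a job template launch in Ansible Tower/AWX. I assume the Tower REST endpoint /api/v2/job_templates/<id>/launch/ with a bearer token; the shape of the notification, the problem types and the template IDs are hypothetical. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical mapping of problem types to Ansible Tower job template IDs.
REMEDIATION_TEMPLATES = {"BAD_DEPLOY": 11, "CPU_SATURATION": 12, "DISK_FULL": 13}


def build_tower_launch(tower_url, token, problem):
    """Return a launch request for the matching playbook, or None if no playbook is known."""
    template_id = REMEDIATION_TEMPLATES.get(problem["type"])
    if template_id is None:
        return None  # unknown problem type: leave it to a human
    body = {"extra_vars": {"problem_id": problem["id"], "impacted_host": problem["host"]}}
    return urllib.request.Request(
        url=f"{tower_url.rstrip('/')}/api/v2/job_templates/{template_id}/launch/",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical notification as it might arrive from the monitoring system.
notification = {"id": "P-1234", "type": "CPU_SATURATION", "host": "app-node-3"}
launch = build_tower_launch("https://tower.example.invalid", "PLACEHOLDER_TOKEN", notification)
```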
Here are some examples of the kind of automation you can implement. These are only some of the cases; the list in your environment is limited only by your imagination and the capabilities of your monitoring tools.
1. Bad deploy - version rollback
Even if we check everything thoroughly in a test environment, there is still a chance that a new release will kill your application in the prod environment. The human factor has not gone away.
In the following figure, we see a sharp jump in the execution time of operations on the service. The start of this jump coincides with the deployment time of the application. We pass all this information as events to the automation system. If the service does not return to normal within the time we set, a script is automatically called that rolls back to the previous version.
Performance degradation of operations after deployment. Source
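The decision itself can be sketched as a small function. The correlation window (how soon after the deployment the degradation must start to be blamed on it) and the grace period (how long we wait for the service to normalize) are parameters I introduce for illustration; their defaults are arbitrary.

```python
def should_roll_back(deploy_ts, degradation_start_ts, now_ts, still_degraded,
                     correlation_window_s=300, grace_period_s=600):
    """Decide whether to roll back: degradation began right after the deploy and has not healed.

    All timestamps are seconds since the epoch.
    """
    delay = degradation_start_ts - deploy_ts
    started_after_deploy = 0 <= delay <= correlation_window_s
    grace_expired = now_ts - degradation_start_ts >= grace_period_s
    return started_after_deploy and still_degraded and grace_expired


# Deployment at t=1000, response time jumps at t=1060, still bad at t=1700.
decision = should_roll_back(deploy_ts=1000, degradation_start_ts=1060,
                            now_ts=1700, still_degraded=True)
```

Requiring both conditions keeps the script from rolling back a release because of an unrelated incident or a short warm-up spike.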
2. Resource load near 100% - add node to routing
In the following example, the monitoring system determines that one of the components has a CPU load of 100%.
CPU load at 100%
Several scenarios are possible for this event. For example, the monitoring system additionally checks whether the lack of resources is associated with an increase in the load on the service. If so, a script is executed that automatically adds a node to the routing, thereby restoring the health of the system as a whole.
Scaling out after the incident
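A sketch of this check under simple assumptions: CPU counts as saturated above a fixed limit, and the load counts as increased when the request rate exceeds the usual rate by a given factor. Both limits, the node names and the list-based routing model are illustrative.

```python
def saturation_is_load_driven(cpu_percent, request_rate, baseline_request_rate,
                              cpu_limit=95.0, load_growth_factor=1.5):
    """True if the CPU is saturated and the request rate is well above its usual level."""
    cpu_saturated = cpu_percent >= cpu_limit
    load_increased = request_rate >= baseline_request_rate * load_growth_factor
    return cpu_saturated and load_increased


def add_node_to_routing(active_nodes, spare_nodes):
    """Move one spare node into the active pool; return the new (active, spare) lists."""
    if not spare_nodes:
        return list(active_nodes), []
    return list(active_nodes) + [spare_nodes[0]], list(spare_nodes[1:])


active, spare = ["node-1", "node-2"], ["node-3"]
if saturation_is_load_driven(cpu_percent=100.0, request_rate=900, baseline_request_rate=400):
    active, spare = add_node_to_routing(active, spare)
```

If the CPU is saturated without a matching increase in load, adding nodes would only hide the real cause, so a different scenario should handle that case.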
3. No hard disk space - cleaning the disk
I think many have already automated these processes. With the help of APM, you can also monitor free space on the disk subsystem. If space runs out or the disk is slow, call a script to clean up or add space.
Disk load at 100%
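A minimal cleanup sketch using only the standard library: it measures disk usage and deletes regular files older than a given age from one directory. The age limit and the idea of cleaning a single flat directory are my simplifications; a real script would follow your own retention rules.

```python
import os
import shutil
import time


def disk_used_percent(path):
    """Percentage of used space on the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def remove_old_files(directory, max_age_s, now=None):
    """Delete regular files in directory older than max_age_s; return their names, sorted."""
    now = time.time() if now is None else now
    removed = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue  # leave subdirectories and symlinks alone
            age = now - entry.stat(follow_symlinks=False).st_mtime
            if age > max_age_s:
                os.remove(entry.path)
                removed.append(entry.name)
    return sorted(removed)
```

A trigger could call remove_old_files for a log or temp directory once disk_used_percent crosses a limit, and escalate to a human if usage stays high afterwards.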
4. Low user activity or low conversion - switch between blue and green branches
I often meet customers who use two environments (blue-green deployment) for applications in prod. This lets you switch quickly between branches when delivering new releases. Often, a deployment brings drastic changes that are not immediately noticeable, while no degradation in performance or availability is observed. To respond quickly to such changes, it is better to use metrics that reflect user behavior (number of sessions and user actions, conversion, bounce rate). The following figure shows an example in which switching between software branches occurs when the conversion drops.
Drop in conversion after switching between software branches. Source
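The switching rule can be sketched as follows. The 30% relative drop limit and the environment names are illustrative assumptions; in practice you would also want enough sessions in the sample before trusting the conversion figure.

```python
def conversion_rate(conversions, sessions):
    """Share of sessions that ended in a conversion; 0.0 when there were no sessions."""
    return conversions / sessions if sessions else 0.0


def pick_active_environment(active, standby, baseline_rate, current_rate,
                            max_relative_drop=0.3):
    """Return the environment that should serve traffic after comparing conversion rates."""
    if baseline_rate <= 0:
        return active  # no meaningful baseline, do not switch
    relative_drop = (baseline_rate - current_rate) / baseline_rate
    return standby if relative_drop > max_relative_drop else active


# Green carries the new release; blue still runs the previous version.
baseline = conversion_rate(conversions=50, sessions=1000)   # before the release
current = conversion_rate(conversions=20, sessions=1000)    # after the release
serving = pick_active_environment("green", "blue", baseline, current)
```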
Automatic Problem Detection Mechanisms
Finally, I will give one more example, the one for which I like Dynatrace the most.
In my story about automating build quality control in a test environment, we defined all threshold values manually. For a test environment, this is normal: the tester defines the indicators before each test, depending on the load. In the prod environment, it is desirable for problems to be detected automatically, using various baselining mechanisms.
Dynatrace has interesting built-in artificial intelligence tools. They detect anomalous metric values (baselining), build interaction maps between all components, and compare and correlate events with one another. On this basis they identify anomalies in your service and provide detailed information on each problem and its root cause.
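To illustrate the general idea of a baseline (this is a deliberately naive stand-in, not how Dynatrace computes its baselines), a threshold can be derived from recent history instead of being set by hand. Here I assume a simple rule: a value is anomalous if it exceeds the historical mean by more than three standard deviations.

```python
import statistics


def baseline_threshold(history, sigmas=3.0):
    """Upper bound for 'normal' values: mean of the history plus a few standard deviations."""
    return statistics.fmean(history) + sigmas * statistics.pstdev(history)


def is_anomalous(value, history, sigmas=3.0):
    """True if value lies above the baseline derived from history."""
    return value > baseline_threshold(history, sigmas)


# Hypothetical response times (ms) observed during normal operation.
response_times = [100, 102, 98, 101, 99]
threshold = baseline_threshold(response_times)
```

Real baselining also has to account for seasonality and traffic volume, which is exactly why it is convenient to have it built into the monitoring system.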
By automatically analyzing the dependencies between components, Dynatrace determines not only whether the problem service is the root cause, but also how it depends on other services. In the example below, Dynatrace automatically monitors and evaluates the performance of each service involved in executing the transactions, and identifies the Golang service as the root cause.
Example of determining the root cause of a failure. Source
The following figure shows the process of monitoring problems with your application from the start of the incident.
Visualization of a problem, showing all components and the events on them
The monitoring system collected a complete chronology of events on the problem. In the window below the timeline, we see all the key events on each of the components. According to these events, you can specify procedures for automatic correction in the form of code scripts.
Additionally, I advise you to integrate the monitoring system with the Service Desk or a bug tracker. When a problem arises, developers promptly receive complete information for its analysis at the code level in the prod environment.
As a result, we have a CI/CD pipeline with embedded automated software quality checks. We minimize the number of poor-quality builds, increase the reliability of the system as a whole, and, if the system still malfunctions, we launch mechanisms to restore it.
Automating software quality monitoring is definitely worth the effort. It is not always a quick process, but over time it will bear fruit. After resolving each new incident in the prod environment, I recommend immediately thinking about which monitors to add to the checks in the test environment so that bad builds do not reach prod, and also creating a script to correct such problems automatically.
I hope my examples will help you in your endeavors. I will also be interested to see your examples of the metrics used to implement self-healing systems.