[From the sandbox] How I wrote my own monitoring

I decided to share my story. It may even be useful to someone as a budget solution to a well-known problem.

When I was young and hot-headed and did not know where to put my energy, I decided to do some freelance work. I quickly built up a rating and found a couple of regular customers who asked me to maintain their servers on an ongoing basis.

The first thing I thought about was the need for monitoring. I decided to do what smart people do: not reinvent the wheel, but look at ready-made options such as Munin or Zabbix. It quickly turned out, however, that the web interface requires a good Internet connection, especially when opened for the first time from a phone. If you are relaxing in nature away from the city, a stable connection is hard to get. So I chose console monitoring instead.

For console monitoring, atop and atopsar, the program for reading atop's logs, helped me a lot. Both have already been mentioned on Habr, and atop has even been covered in detail, but almost nothing has been said about atopsar.


Installation is very simple, only three commands.

# CentOS

  yum install atop  

# Debian/ubuntu

  apt-get install atop  

Next, you can tune the monitoring settings for yourself or keep the defaults.

# Debian/Ubuntu/Centos


Default file:

 INTERVAL=60 # Sampling interval in seconds (by default a sample is taken every 10 minutes)
 LOGPATH="/var/log/atop" # Path to the log storage folder
 OUTFILE="$LOGPATH/daily.log" # Name of the log file for today

Enable autostart at boot
# Debian/Ubuntu/Centos

  systemctl enable atop  

Run atop as a daemon
# Debian/Ubuntu/Centos

  systemctl start atop  

For the lazy, everything in one command

# CentOS

  yum install atop && systemctl enable atop && systemctl start atop  

# Debian/ubuntu

  apt-get install atop && systemctl enable atop && systemctl start atop  


atopsar is installed along with atop. It is a convenient console analyzer for the binary logs written by the atop daemon. Of course, you can also read the logs with atop itself, but that is less convenient when you want to cover a long time interval.

A short primer on using atopsar.

When you start atopsar without options, it opens today's log and shows the load on each core separately, along with a summary line for all cores.

The options that I use are:

-A = print all information from the log
-c = CPU load (the default report)
-m = RAM and swap usage
-d = disk activity
-O = top 3 processes by CPU load
-G = top 3 processes by RAM usage
-D = top 3 processes by disk load
-N = top 3 processes by network load
-r = path to the log you want to read, if you need to look at the load on previous days
-b = the time from which to start the output
-e = the time at which to end the output
-M = adds an extra column at the end that marks how critical the row is (+ means high load, * means critical load)
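
To show how these options combine, here is a minimal sketch that reads yesterday's log for a two-hour window. The helper `atop_log_for` is hypothetical, and the sketch assumes GNU `date` and that the daemon writes files named `atop_YYYYMMDD` into the log folder. If your setup uses a different file name, adjust the path accordingly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of the atop log written N days ago.
# Assumes logs are named atop_YYYYMMDD inside the log directory (GNU date).
atop_log_for() {
    local days_ago="${1:-0}"
    local log_dir="${2:-/var/log/atop}"
    printf '%s/atop_%s\n' "$log_dir" "$(date -d "$days_ago days ago" +%Y%m%d)"
}

log="$(atop_log_for 1)"   # yesterday's log

# Only run the reports if atopsar is installed and the log is readable.
if command -v atopsar >/dev/null 2>&1 && [ -r "$log" ]; then
    # CPU and memory load between 10:00 and 12:00, with status markers
    atopsar -c -m -M -r "$log" -b 10:00 -e 12:00
    # Top 3 CPU-consuming processes for the same window
    atopsar -O -r "$log" -b 10:00 -e 12:00
fi
```

Narrowing the window with -b and -e keeps the output short enough to read on a phone screen.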

Thanks to this monitoring, we can work out why the server misbehaved at any given point in time.


So, load monitoring is in place, but it still does not let me find and solve problems quickly. I need notifications about problems.

I’m the only one watching the servers, so notifications have to arrive somewhere I will always see them and can react at least somehow.

In the beginning there was SMS: fast, reliable, free. But then the mobile operators shut down free SMS sending through their gateways.
Email: slow, and there may be problems with delivery.
Messengers: you need to install one on the phone and create bots.

After some searching, I chose Telegram for its simplicity and its convenient phone and desktop apps.

I created my bot using BotFather.
Then I put several scripts on the server that track server load (CPU idle, smartctl and so on), errors such as “oom killer”, errors when creating backups, and other operations that need to be monitored.

The scripts are fairly simple and written in bash. For example, this one checks the load average (LA), taking the number of cores into account, and notifies me when the server is overloaded.

  # LA, LAd, ip, bot_id, bot_key and chat_id are set earlier in the script
  if [ "${LA[0]}" -gt 2000 ] || [ "${LA[1]}" -gt 3000 ] || [ "${LA[2]}" -gt 4000 ]; then
    wget -O /dev/null "https://api.telegram.org/$bot_id:$bot_key/sendMessage?chat_id=$chat_id&text=On server $ip LA $LAd"
    wget -O /dev/null "https://api.telegram.org/$bot_id:$bot_key/sendMessage?chat_id=$chat_id&text=$(top -b -n 1 | grep Cpu)"
    wget -O /dev/null "https://api.telegram.org/$bot_id:$bot_key/sendMessage?chat_id=$chat_id&text=Top 5 processes $(top -b -n 1 | grep -A 5 'PID USER' | tail -5)"
  fi
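
For readers who want a self-contained starting point, the check above could be sketched as follows. This is not the author's script: the function names, the limits and the `BOT_TOKEN` and `CHAT_ID` variables are hypothetical, and curl is swapped in for wget so that the message text is URL-encoded. The load average is divided by the core count and multiplied by 100 so that the shell can compare integers.

```shell
#!/usr/bin/env bash
# Sketch only: function names, limits and variables are hypothetical.

# Print the 1-, 5- and 15-minute load averages divided by the core count
# and multiplied by 100, so they can be compared as integers.
la_per_core_x100() {
    local loadavg_file="${1:-/proc/loadavg}"
    local cores="${2:-$(nproc)}"
    awk -v c="$cores" '{ printf "%d %d %d\n", $1*100/c, $2*100/c, $3*100/c }' "$loadavg_file"
}

# Succeed if any of the three values exceeds its limit.
# The default limits (200, 150, 100) are arbitrary example values.
is_overloaded() {
    [ "$1" -gt "${LIMIT_1:-200}" ] || [ "$2" -gt "${LIMIT_5:-150}" ] || [ "$3" -gt "${LIMIT_15:-100}" ]
}

# Send a text message through the Telegram Bot API.
notify() {
    curl -s -o /dev/null \
        --data-urlencode "chat_id=${CHAT_ID}" \
        --data-urlencode "text=$1" \
        "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage"
}

# Run the check only when the bot credentials are provided.
if [ -n "${BOT_TOKEN:-}" ] && [ -n "${CHAT_ID:-}" ]; then
    set -- $(la_per_core_x100)
    if is_overloaded "$1" "$2" "$3"; then
        notify "Load per core (x100) on $(hostname): $1 $2 $3"
    fi
fi
```

Run from cron every minute or so, a script like this stays silent until one of the limits is crossed.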

The simple syntax allows for many use cases, and anyone with a little programming knowledge can write or extend such scripts.
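
As one more use case, a check for the “oom killer” errors mentioned earlier might look like the sketch below. The helper name `find_oom_events` is hypothetical, and the patterns are an assumption based on the messages the Linux kernel usually logs. Reading the kernel log with `dmesg` may require root on some systems.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read kernel log lines on stdin and print those
# that look like OOM killer events.
find_oom_events() {
    grep -Ei 'invoked oom-killer|out of memory: kill' || true
}

# Example: scan the kernel ring buffer and report the latest events.
if command -v dmesg >/dev/null 2>&1; then
    events="$(dmesg 2>/dev/null | find_oom_events)"
    if [ -n "$events" ]; then
        echo "OOM killer was triggered:"
        echo "$events" | tail -n 3
    fi
fi
```

The `echo` lines can be replaced with whatever notification command the other scripts already use.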

The only caveat: if the server is located in Russia (and has no IPv6), you need to use a proxy. To do this, set the proxy connection string at the beginning of the script:

  export https_proxy=http://login:password@IP.address:port  

That's not all

You are quietly hiking through the mountains with a backpack on your back, taking a break from civilization, and then your phone, having briefly caught a signal, delivers a notification about a problem on your server. What to do? The serene mood is gone with the wind. Call your wife and dictate commands? Ha!

I urgently needed a way to fix such problems quickly and without a good Internet connection. Here the messenger saved me again (#telegrammzhivi). I taught my bot to talk only to me and ignore everyone else. Now, along with the notification, I receive a bit more data that tells me what the source of the problem is, and I can try to solve it remotely. It is enough to write a message to the bot, toss the phone up in the air so that the message gets sent, and voila: the bot goes off to do the work. This way I can, for example, kill a misbehaving process, restart a daemon, block an IP address and so on.
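
The idea of a bot that obeys only its owner can be sketched as a small dispatcher. This is an illustration, not the author's bot: the chat ID, the command names and the list of allowed services are all hypothetical, and fetching incoming messages and sending replies through the Telegram API are left out.

```shell
#!/usr/bin/env bash
# Sketch only: the chat ID, command names and service whitelist are hypothetical.
OWNER_CHAT_ID="${OWNER_CHAT_ID:-111111111}"

# Take the sender's chat ID and the message text, print the reply.
# Messages from anyone but the owner produce no reply at all.
handle_message() {
    local chat_id="$1" text="$2" arg
    [ "$chat_id" = "$OWNER_CHAT_ID" ] || return 0

    case "$text" in
        /ping)
            echo "pong" ;;
        /la)
            cat /proc/loadavg ;;
        "/kill "*)
            arg="${text#/kill }"
            case "$arg" in
                ''|*[!0-9]*) echo "usage: /kill <pid>" ;;
                *) kill "$arg" 2>/dev/null && echo "killed $arg" || echo "could not kill $arg" ;;
            esac ;;
        "/restart "*)
            arg="${text#/restart }"
            case "$arg" in
                # Only services from this fixed list may be restarted.
                nginx|mysql) systemctl restart "$arg" && echo "restarted $arg" || echo "restart of $arg failed" ;;
                *) echo "service not allowed: $arg" ;;
            esac ;;
        *)
            echo "unknown command" ;;
    esac
}

handle_message 111111111 "/ping"     # the owner gets a reply
handle_message 222222222 "/kill 1"   # a stranger is silently ignored
```

Mapping messages to a fixed set of commands, rather than executing the message text directly, limits the damage if the chat ID check is ever bypassed.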

I also moved the requests that clients commonly make into the bot, for example urgent password resets for users (for the “Aaaa, we can't get into the server, we're losing millions!” cases), finding the user who has access to a given folder, turning a site on and off, and so on. Of course, I keep refining the bot's functionality, because the customers' imagination sometimes throws up unexpected requests that I have not provided for. But the main ones are covered.

There is also a version for VK, but it somehow did not catch on.

Now I calmly travel and explore the world without fearing that something will break and I will not be able to find out about it or fix it.
