He does not give you a rook


Given the growing popularity of Rook, I want to talk about its pitfalls and the problems that are waiting for you along the way.

About me: I have administered ceph since the hammer release and founded the t.me/ceph_ru community on Telegram.

To avoid making unfounded claims, I will refer to posts about ceph problems that were well received on Habr (judging by their ratings). I have run into most of the problems described in those posts myself. Links to the material used are at the end of the post.

In a post about Rook we mention ceph for a reason: Rook is essentially ceph wrapped in kubernetes, which means it inherits all of ceph's problems. So let's start with the ceph problems.

Simplified cluster management


One of the advantages of Rook is the convenience of managing ceph through kubernetes.

However, ceph has more than 1000 configuration parameters, while through Rook we can edit only a small fraction of them.
Example on Luminous:
> ceph daemon mon.a config show | wc -l
1401
Rook is positioned as a convenient way to install and update ceph
Installing ceph without Rook is not a problem - an ansible playbook can be written in 30 minutes - but there are plenty of problems with updating.
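As for installing without Rook, here is a minimal sketch of what it can look like with the upstream ceph-ansible project (the playbook and file names belong to ceph-ansible, check them against your version and inventory):

git clone https://github.com/ceph/ceph-ansible.git
cd ceph-ansible
cp site.yml.sample site.yml
# describe your mons/osds/mgrs in the inventory and group_vars/ before running
ansible-playbook -i inventory site.yml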

Quote from the Krok post:

Example: incorrect crush tunables after upgrading from hammer to jewel

> ceph osd crush show-tunables
{
...
  "straw_calc_version": 1,
  "allowed_bucket_algs": 22,
  "profile": "unknown",
  "optimal_tunables": 0,
...
}
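A possible way out here (my sketch, not part of the quoted post) is to reset the tunables to the optimal profile, keeping in mind that this single command triggers a massive data movement across the cluster:

> ceph osd crush tunables optimal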
But even within minor versions there are problems.

Example: updating to 12.2.6 brings the cluster to HEALTH_ERR and leaves PGs conditionally broken
ceph.com/releases/v12-2-8-released

Don't update, wait and test first? But we supposedly use Rook for, among other things, the convenience of upgrades.

The complexity of cluster disaster recovery with Rook


Example: an OSD keeps crashing, scattering errors under its feet. You suspect the problem is one of the config parameters and want to change the config for that specific daemon, but you cannot, because you have kubernetes and a DaemonSet.

There is no alternative: ceph tell osd.Num injectargs does not work either - the OSD is down.
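For comparison, on baremetal this is solved with a per-daemon section in ceph.conf and a restart of that one daemon (the osd id and options below are purely illustrative):

# /etc/ceph/ceph.conf on the affected host
[osd.42]
debug_osd = 10
osd_max_backfills = 1

systemctl restart ceph-osd@42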

Difficulty of debugging


For some settings and performance tests you have to connect directly to the osd daemon's admin socket. With Rook you first have to find the right container, then go into it, then discover that the tools you need for debugging are missing, and get very upset.
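On baremetal this is a couple of commands against the admin socket (the daemon id is an example; with Rook you would first have to kubectl exec into the right osd pod, assuming the default rook-ceph namespace):

ceph daemon osd.0 perf dump                 # latency and throughput counters
ceph daemon osd.0 config show | grep osd_op_queue
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops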

The difficulty of bringing OSDs up one at a time


Example: an OSD goes down with an OOM, a rebalance begins, and then the next one goes down.

Solution: bring the OSDs up one at a time, wait for each to be fully included in the cluster, then bring up the next. (More details in the talk Ceph. Anatomy of a catastrophe.)

With baremetal installations this is done by hand; with Rook and a single OSD per node there is no problem, but issues with sequential startup appear when there is more than one OSD per node.
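Roughly what the by-hand procedure looks like on baremetal (osd ids are examples; a sketch, not a runbook):

ceph osd set noout                 # do not start rebalancing while we restart daemons
systemctl start ceph-osd@1
ceph -s                            # wait until the osd is up and PGs have peered
systemctl start ceph-osd@2
ceph -s
ceph osd unset noout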

Of course, these issues are solvable, but we are bringing in Rook to simplify things, and instead we get complications.

Difficulty in setting limits for ceph daemons


For baremetal ceph installations it is fairly easy to calculate the resources a cluster needs - there are formulas and there are studies. With weak CPUs you will still have to run a series of performance tests and find out what NUMA is, but it is still simpler than with Rook.

With Rook, in addition to the memory limits, which you can calculate, the question of setting CPU limits arises.

And here you will have to sweat over performance tests. If you set the limits too low, you get a slow cluster; if you leave them unlimited, you get heavy CPU usage during rebalance, which will hurt your applications in kubernetes.
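In Rook those limits live in the CephCluster resource. A sketch of setting them, assuming the default rook-ceph namespace and cluster name (the values are illustrative, not recommendations):

kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"spec":{"resources":{"osd":{"requests":{"cpu":"1","memory":"4Gi"},"limits":{"cpu":"2","memory":"4Gi"}}}}}'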

Networking Problems v1


For ceph it is recommended to use a 2x10Gb network: one for client traffic, the other for ceph's internal needs (rebalance). If you run ceph on baremetal, this separation is easy to configure; if you run Rook, the separation by networks causes problems, because far from every cluster config allows feeding two different networks into a pod.
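On baremetal that separation is two lines in ceph.conf (the subnets are examples):

[global]
public_network  = 10.0.0.0/24     # client traffic
cluster_network = 10.0.1.0/24     # replication and rebalance traffic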

Network Connection Problems v2


If you give up separating the networks, then during a rebalance ceph traffic will clog the entire channel and your applications in kubernetes will slow down or fall over. You can reduce the ceph rebalance rate, but then, because of the long rebalance, you get an increased risk of a second node dropping out of the cluster due to disks or OOM, and that already guarantees read only for the cluster.
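Slowing the rebalance down is a one-liner (option names are the Luminous-era ones; the values are illustrative and trade recovery speed for client latency):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-sleep 0.1'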

Long rebalance means long application slowdowns


Quote from the post Ceph. Anatomy of a catastrophe:
Test cluster performance:

A 4 KB write operation takes 1 ms, i.e. 1000 operations per second in a single thread.
A 4 MB operation (the object size) takes 22 ms, i.e. 45 operations per second.

Consequently, if one of the three failure domains fails and the cluster stays degraded for a while, so that half of the hot objects end up with divergent versions, then half of the write operations will start with a forced recovery.

The forced recovery time can be estimated roughly for a write operation to a degraded object:

First we read 4 MB in 22 ms, then write 4 MB in 22 ms, and only then write the 4 KB of actual data in 1 ms. In total, 45 ms per write operation to a degraded object on an SSD, versus a nominal 1 ms - a 45x drop in performance.

The higher the percentage of degraded objects, the worse it gets.
So the rebalance speed turns out to be crucial for the cluster to operate properly.

Server-specific tuning for ceph


ceph needs specific host tuning.

Example: sysctl settings and the same jumbo frames; some of these settings may negatively affect your other workloads.
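Typical examples of such tuning (values are illustrative, not recommendations; on a kubernetes node they apply to every pod on the host, not just ceph):

ip link set dev eth0 mtu 9000             # jumbo frames on the ceph network, interface name is an example
sysctl -w net.core.rmem_max=268435456     # larger socket buffers
sysctl -w net.core.wmem_max=268435456
sysctl -w vm.swappiness=1                 # affects everything else running on the host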

The real need for Rook is questionable


If you are in the cloud, you have storage from your cloud provider, which is much more convenient.

If you are on your own servers, managing ceph will be more convenient without kubernetes.

Renting servers from some low-cost hosting provider? Then you will have plenty of fun with the network, its latency and bandwidth, which clearly hurts ceph.

Bottom line: deploying kubernetes and deploying storage are different tasks with different requirements and different solutions; mixing them means making a potentially dangerous trade-off in favor of one or the other. Combining these solutions is hard enough at the design stage, and then there is still the operational phase.

References:

Post #1: But you say Ceph... is it really good?
Post #2: Ceph. Anatomy of a catastrophe
