While working with customers using Ceph in a variety of ways, we have encountered several more ways to break your Ceph cluster. In that light, here is an update: five more ways to break your Ceph cluster, as a continuation of the original presentation by Wido den Hollander, '10 ways to break your Ceph cluster' ( https://youtu.be/oFDPKqMEb3w ). All of these are based on real events and experiences with our clients, not on controlled testing.
The first way to break your Ceph cluster on the list is: overestimating your automation tool. In the IT world we have been automating everything for years now. However, automation tools work in different ways: they can work well, they can work mediocrely, or they can work badly. Regardless, if you do not understand the tool you are automating (in this case Ceph), it is going to be a problem. In one case we experienced, a script was run in front of the automation. A variable that was supposed to contain a 3 was missing, so it became a 0 instead. Because it became a 0, the three monitors were simply removed. Everything was confirmed by the script, and it just removed everything. To make matters worse, the script very thoroughly removed the directory structures of the monitor databases as well. Fortunately, we were able to recreate the monitor database by scraping all the OSDs and combining that information into a new, working monitor database. We created a new monitor from it and added additional monitors after that.
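Rebuilding a monitor store from the OSDs is a documented Ceph disaster-recovery procedure. A rough sketch of the idea (paths, the OSD glob, and the keyring location are illustrative; the full procedure in the Ceph documentation also covers gathering maps from OSDs on multiple hosts and putting the rebuilt store into place):

```shell
ms=/tmp/mon-store
mkdir -p "$ms"
# Collect cluster maps from every OSD on this host into the new store
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path "$osd" --no-mon-config \
        --op update-mon-db --mon-store-path "$ms"
done
# Rebuild the monitor store from the collected maps
ceph-monstore-tool "$ms" rebuild -- --keyring /path/to/admin.keyring
```

This only reconstructs the monitor database; keyrings and monitor configuration still have to be restored separately.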
In another, similar case, a user went into the backend and removed the monitors using Docker commands, which resulted in a cluster without monitors. This was an administrative mistake; no blame to 'cephadm'. After some digging, we were fortunately able to find the original monitor directories, so we could restore monitor functionality. In short: understand what your automation tool is doing, why it is doing that, and definitely understand what it should not do. Also be aware that an automation tool should help you execute what you already know. If you want to use Ceph, you should understand Ceph. For example, if you want to use Ceph with ceph-ansible, you should understand both Ceph and Ansible.
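The unset-variable failure described above is a generic scripting hazard, not something specific to Ceph. A minimal sketch of a defensive pattern (the function name and messages are hypothetical, not from any real tool): validate the input and fail fast, instead of silently acting on 0:

```shell
# Wrap the dangerous action in a function that validates its input,
# so an unset or empty monitor count can never silently become 0.
remove_mons() {
    count="${1:?monitor count is required}"  # abort with an error if missing
    if [ "$count" -lt 3 ]; then
        echo "refusing: expected at least 3 monitors, got '$count'" >&2
        return 1
    fi
    echo "ok: acting on $count monitors"
}
```

Combined with `set -u` at the top of a script, an unset variable becomes a hard error instead of an empty string.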
The second way to break your Ceph cluster is an old but important one, which we have encountered a couple of times. We used to call it 'running with size=2'; we have since revised it to 'running with min_size=1'. In short: we still recommend running with at least three copies. But even if you can't, never decrease min_size to 1, because that means accepting writes with zero redundant copies. Make sure every write always has at least one redundant copy. This is also true for erasure coding.
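Checking and correcting this is one command per pool. A sketch, assuming a replicated pool named 'mypool' (the pool name is ours; for erasure-coded pools, min_size should be at least k+1):

```shell
ceph osd pool get mypool size       # expect 3
ceph osd pool get mypool min_size   # expect 2 on a replicated size=3 pool
ceph osd pool set mypool min_size 2
```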
The third way is an interesting one: not fully completing an upgrade. The Ceph documentation is sometimes a little, let's say, 'petit'. There are cases where, according to the generic documentation, the upgrade is fully done; yet if you look at the release notes, a lot of steps were skipped or never even touched. We have had at least four such cases in the last couple of years, for instance with upgrades to Nautilus: messenger version 2 had already been enabled, yet nobody raised the minimum OSD version (require_osd_release). The result: a lot of daemons down, a lot of PGs peering, remapped or stale; the cluster is really not doing anything useful at that point. It only takes one setting to fix it all: raising your minimum OSD version to Nautilus, or to Octopus if your cluster miraculously survived into the Octopus upgrade, which sometimes happens.
So, check your clusters:
· ceph versions
· ceph osd dump | grep require_osd_release
· ceph mon dump
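The one-setting fix mentioned above is then a single command (assuming all OSDs are in fact already running Nautilus or newer):

```shell
ceph osd require-osd-release nautilus
```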
The fourth way is completing an upgrade too soon. This one we experienced recently. We had a client that was overly enthusiastic and turned off the 'auth_allow_insecure_global_id_reclaim' setting before all clients and daemons had been upgraded, which meant that some of the clients and RGWs could no longer connect. Please make sure you don't finish an upgrade too soon, and really follow all the steps.
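A safer order of operations, sketched below; only lock out unpatched clients once the corresponding health warning is gone:

```shell
# Should report nothing before you proceed
ceph health detail | grep AUTH_INSECURE_GLOBAL_ID_RECLAIM
# Only then disable the insecure fallback
ceph config set mon auth_allow_insecure_global_id_reclaim false
```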
The fifth way is a big one: running multiple RGWs with the same ID behind a load balancer. We had a customer who ran nine RGW processes but had only configured three IDs, so three processes were running under each name, constantly taking over from one another. Behind a load balancer this resulted in many millions of partial uploads that were never finished, and their entire cluster filled up. They had added new hardware and were still adding more. Luckily, we figured out that this was the problem. To solve it we renamed six of the RGWs, inventoried all the objects, and made sure everything was cleaned up.
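Incomplete multipart uploads can also be inspected per bucket through the S3 API. A sketch using the AWS CLI against a hypothetical endpoint and bucket (both names are ours):

```shell
aws --endpoint-url http://rgw.example.com \
    s3api list-multipart-uploads --bucket mybucket
```

An S3 lifecycle rule with AbortIncompleteMultipartUpload can keep such leftovers from accumulating in the first place.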
And a bonus one: blindly trusting your PG autoscaler. We at 42on love the PG autoscaler; it's a great addition to a Ceph cluster. But early on we saw trouble with it, where users were installing clusters with default settings. It's not until you inject data that you find out the PGs weren't configured well: you are injecting a lot of data while splitting all your placement groups at the same time, resulting in very poor performance and, eventually, unhappy users. There is no real way out other than taking action: tell the autoscaler up front how big your pools will become, or size the PGs yourself.
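Both options are a single command per pool. A sketch with a hypothetical pool 'mypool' (target_size_ratio and pg_autoscale_mode are real pool properties):

```shell
# Tell the autoscaler this pool will hold roughly 80% of the cluster's
# data, so the PGs are pre-split before the data arrives
ceph osd pool set mypool target_size_ratio 0.8
# Or opt out and manage pg_num yourself
ceph osd pool set mypool pg_autoscale_mode off
```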
Of the original 10 ways to break your Ceph cluster (I recommend looking up that presentation), there is one that is 100% solved thanks to BlueStore: mounting XFS with the 'nobarrier' option. With BlueStore, XFS is no longer used as the OSD backend. The other nine, sadly, we still encounter from time to time.
Do you know any more ways to break your Ceph cluster from your own experiences? Please let us know in the comments.