How to break your Ceph cluster

How it began

In the past we have talked about how to break your Ceph cluster. It began with the talk “10 ways to break your Ceph cluster”, presented by Wido den Hollander, the founder of 42on, at a Ceph conference in Germany. You can watch that talk through the following link:

It continued with “5 more ways to break your Ceph cluster”, a blog written by Wout van Heeswijk, 42on’s CTO, as a continuation of the first talk. You can read this blog through the following link:

This time, we have combined the talk and the blog into a single post to give you a clear overview of the do’s and don’ts when it comes to your Ceph cluster.

10 ways to break your Ceph cluster

Here is a summary of the original 10 ways to break your Ceph cluster.

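1. Wrong CRUSH failure domain. If the CRUSH failure domain is configured incorrectly (for example, OSD where host was intended), all copies of a piece of data can end up on hardware that fails together. Always test on your cluster to verify that failures are handled as intended.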

2. Decommissioning a host. An example is a cluster that was running size = 2, min_size = 1 when some hardware needed to be replaced. With that host out, only one disk held the remaining copy of various placement groups, and when that disk was lost the data was lost with it. The Ceph cluster had to be abandoned and rebuilt from scratch.

3. Removing ‘log’ files in MON’s data directory. Always make sure monitors have enough disk space and never manually remove files from their data directory.

4. Removing the wrong pool. Double check before removing a pool. Better yet, ask somebody else to take a look at it before removing a pool.

5. Setting the noout flag for a long time. Always aim for a cluster running HEALTH_OK and take a look at the cluster if it’s in HEALTH_WARN for a longer period.

6. Mounting XFS with the nobarrier option. This one is 100% solved by BlueStore: with BlueStore, OSDs don’t use XFS anymore.

7. Enabling writeback on HBA without BBU. Never turn on writeback caching in your HBA without a Battery Backup Unit present.

8. Creating too many placement groups. Be cautious when creating placement groups. Too many of them can harm the cluster when it needs to re-peer all placement groups.

9. Using 2x replication. Imagine a host being taken down for maintenance. A portion of the data now relies on a single disk. If this disk fails, that data is lost.

10. Underestimating monitors. It is recommended to use dedicated hardware for monitors.
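
The replication risk described in items 2 and 9 can be checked mechanically. The following is a minimal sketch, not an official tool: it assumes you have already loaded pool settings into a list of dictionaries with the keys pool_name, size and min_size (chosen here to resemble what `ceph osd pool ls detail --format json` reports; verify the field names against your own cluster). The function name and the sample data are hypothetical, and the thresholds only make sense for replicated pools, not erasure-coded ones.

```python
from typing import Dict, List


def find_risky_pools(pools: List[Dict]) -> List[str]:
    """Return warnings for replicated pools with thin redundancy.

    Assumption: each dict has 'pool_name', 'size' and 'min_size' keys.
    """
    warnings = []
    for pool in pools:
        name = pool["pool_name"]
        size = pool["size"]
        min_size = pool["min_size"]
        if size < 3:
            # With 2 copies, taking one host down leaves a single copy.
            warnings.append(f"{name}: size={size}, fewer than 3 copies")
        if min_size < 2:
            # min_size=1 lets I/O continue with only one copy available.
            warnings.append(f"{name}: min_size={min_size}, I/O allowed on a single copy")
    return warnings


# Hypothetical sample data, not taken from a real cluster.
sample_pools = [
    {"pool_name": "volumes", "size": 2, "min_size": 1},
    {"pool_name": "cephfs_data", "size": 3, "min_size": 2},
]

for line in find_risky_pools(sample_pools):
    print(line)
```

In this sample only the first pool is reported, once for its size and once for its min_size. A check like this is a reminder, not a safeguard: it tells you where a single disk failure during maintenance could cost you data.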

5 more ways to break your Ceph cluster

And here is a summary of the updated 5 more ways:

11. Overestimating your automation tool. In short, understand what your automation tool is doing, why it is doing that and definitely understand what it shouldn’t do. You should also be aware that your automation tool should help you execute what you already know. If you want to use Ceph, you should understand Ceph. As an example, if you want to use Ceph and ceph-ansible you should understand Ceph and Ansible.

12. Running min_size=1. We recommend you run with at least three copies. But if you can’t, don’t ever decrease your min_size to 1; in other words, never let the number of redundant objects drop to ‘0’. Make sure you always write at least 1 redundant object. This is also true for Erasure Coding.

13. Not fully completing an upgrade. The Ceph documentation is sometimes a little bit, let’s say, ‘petit’. There are cases where the upgrade is fully done according to the generic Ceph documentation, while the release notes list steps that the generic documentation skips or doesn’t touch on at all. So double-check the release notes to confirm that the upgrade has really been completed.

14. Completing an upgrade too soon. Make sure you don’t finish an upgrade too soon, and that you really follow all the steps before you do.

15. Running multiple RGWs with the same id behind a load balancer. We had a customer who ran 9 RGWs but had configured only 3 ids, so for each named RGW there were 3 processes running. They kept displacing one another as the active instance. Behind a load balancer this resulted in many millions of partial uploads that were never finished. Their entire cluster filled up, and they kept adding new hardware to keep up. Luckily, we figured out this was the problem. To solve it we renamed 6 RGWs, inventoried all the objects and made sure everything was cleaned up.

16. Bonus: blindly trusting your PG autoscaler. It’s not until you start writing data that you really find out that the PGs weren’t configured well. You are ingesting a lot of data while all your placement groups are splitting at the same time, which results in very poor performance and, eventually, unhappy users. This does not fix itself; you have to take action.
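
For the placement group items (8 and 16), one way to take action is to estimate a sensible PG count before loading data and compare it with what the pool actually has. The sketch below uses a commonly cited rule of thumb, roughly 100 PGs per OSD divided by the replica count and rounded up to a power of two. That rule, the default target and the function name are assumptions for illustration; they do not come from the original talk or blog, and they only apply to a single pool that will hold most of the cluster’s data.

```python
def suggest_pg_count(num_osds: int, replica_size: int,
                     target_pgs_per_osd: int = 100) -> int:
    """Estimate a power-of-two PG count for one dominant replicated pool.

    Rule of thumb (an assumption, not a Ceph API):
    (OSDs * target PGs per OSD) / replica size, rounded up to a power of two.
    """
    if num_osds <= 0 or replica_size <= 0 or target_pgs_per_osd <= 0:
        raise ValueError("all arguments must be positive")
    raw = num_osds * target_pgs_per_osd / replica_size
    pg_count = 1
    # Double until we reach or exceed the raw estimate.
    while pg_count < raw:
        pg_count *= 2
    return pg_count


# Hypothetical cluster: 12 OSDs, 3 copies.
print(suggest_pg_count(num_osds=12, replica_size=3))
```

If a pool that is about to receive a large data load sits far below such an estimate, the placement groups will have to split during ingestion, which is the situation described in the bonus item. If it sits far above, item 8 applies instead.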

There you have it: in total, 15 ways (plus a bonus) to break your Ceph cluster. Did we miss something? Let us know through a message. For more insights about Ceph you can also visit our LinkedIn account through the following link:
