Ceph Monitors are laggy or clock might be skewed

This weekend I got to investigate a Ceph cluster where the Monitors were constantly holding new elections. After some investigation, it turned out that one of the three monitors was eating 100% CPU on a single core and kept printing this in the logs: mon.charlie@2(peon).paxos(paxos updating c 106399655..106400232) lease_expire from mon.0 [2a00:XXX:121:XXX::6789:1]:6789/0 is 2.380296 seconds in […]

Safely backing up your Ceph monitors

So you might wonder: Why do I need to make a backup of my Ceph monitors? I have multiple monitors. That’s true, but should you run into the very unfortunate situation where you lose all your monitors, you lose all your data. The monitors contain very important metadata (pgmap, osdmap, crushmap) needed to run your cluster. […]