Ceph Monitors are laggy or clock might be skewed

This weekend I got to investigate a Ceph cluster where the Monitors were constantly performing new elections. After some investigation it turned out that one of the three monitors was eating 100% CPU on a single core and kept printing this in the logs: mon.charlie@2(peon).paxos(paxos updating c 106399655..106400232) lease_expire from mon.0 [2a00:XXX:121:XXX::6789:1]:6789/0 is 2.380296 seconds in […]

Redundant Ceph monitors with Round Robin DNS

One of the unique features of Ceph is that it can be built without any Single Point of Failure. When designed properly, no single machine will take your cluster down. Ceph’s monitors play a crucial part in this. To make them redundant you want an odd number of monitors, where 3 is more than sufficient […]