On April 3rd, 2019, I posted a LinkedIn article about the basics of Ceph, a cloud storage platform often used in combination with OpenStack. Since then, I have received a lot of requests to explain a bit more about the mechanics of Ceph. As Fairbanks has also incorporated Ceph service provider 42on, I started to dive a little deeper into the subject. In this article I share my understanding of the techniques to answer some of your questions. I hope it helps you understand more about Ceph and why it is such an interesting storage platform. If you have any questions, please feel free to comment or e-mail me. For a truly in-depth understanding of Ceph, I will put you in contact with our new colleagues at 42on.
What does “Ceph is scalable object, block and file system storage” mean?
As explained in my post in April, Ceph is a software-defined storage solution that can scale in both performance and capacity. Ceph is used to build multi-petabyte storage clusters. The basic building blocks of a Ceph storage cluster are the storage nodes. These storage nodes are commodity servers containing commodity hard drives and/or flash storage. Ceph is ‘self-healing’, provides redundancy and is designed to scale without a fixed upper limit, growing in a linear way both physically and financially. Financial scalability means that you invest in the amount of storage you need at this moment, not the amount you might need in, for example, five years.
Ceph is designed for scale, and you scale by adding storage nodes. You will need multiple servers to satisfy your capacity, performance and resiliency requirements. As you expand the cluster with extra storage nodes, the capacity, performance and (if needed) resiliency all increase at the same time.
You don’t need to start with petabytes of storage though. You can actually start small, with just a few storage nodes, and expand as your needs increase. Because Ceph manages redundancy in software, you don’t need a RAID controller, so a generic server is sufficient. The hardware is simple and all the intelligence resides in software. These servers can come from different hardware brands and/or generations, so you can expand your Ceph environment at your own pace. Altogether this means that the risk of hardware vendor lock-in is mitigated. You are not tied to any particular proprietary storage hardware.
What makes Ceph so special?
At the heart of the Ceph storage cluster is the CRUSH algorithm, developed by Sage Weil, the co-creator of Ceph. The CRUSH algorithm allows storage clients to calculate which storage node needs to be contacted for retrieving or storing data. The storage client can therefore determine by itself where to store data or where to get it.
Ceph is unique because there is no centralised ‘registry’ that keeps track of the location of data on the cluster (metadata). Such a centralised registry can become a performance bottleneck that prevents further expansion, or a single point of failure. This is why Ceph can scale in capacity and performance while assuring availability. At the core of the CRUSH algorithm is the CRUSH map. That map contains information about the storage nodes in the cluster and the rules for storing data. It is the basis for the calculations the storage client needs to perform in order to decide which storage node to contact.
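To make the idea of ‘calculating instead of looking up’ a bit more concrete, here is a minimal Python sketch. Please note that this is not the CRUSH algorithm itself: it uses a much simpler technique (rendezvous hashing) as a stand-in. The node names, the object name and the functions score and locate are made up for illustration.

```python
import hashlib


def score(object_name: str, node: str) -> int:
    """Deterministic pseudo-random score for an (object, node) pair."""
    digest = hashlib.sha256(f"{object_name}:{node}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def locate(object_name: str, nodes: list[str], copies: int = 3) -> list[str]:
    """Return the nodes that should hold the object, highest score first."""
    ranked = sorted(nodes, key=lambda node: score(object_name, node), reverse=True)
    return ranked[:copies]


# Hypothetical cluster map; in Ceph the clients receive the real map
# from the monitor nodes.
cluster_map = ["node-a", "node-b", "node-c", "node-d", "node-e"]

# Two independent clients holding the same map compute the same answer,
# without asking any central registry where the object lives.
client_1 = locate("invoice-2019-04.pdf", cluster_map)
client_2 = locate("invoice-2019-04.pdf", list(reversed(cluster_map)))
print(client_1)
print(client_1 == client_2)
```

The point of the sketch is only that the location follows from the object name plus the map. Every client that has the same map arrives at the same storage nodes, so nothing has to be looked up centrally.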
The CRUSH map is distributed across the cluster from special servers: the ‘monitor’ nodes. Those nodes are contacted by both the storage nodes and the storage clients.
It’s important to keep in mind that while the Ceph monitor nodes are an essential part of your Ceph cluster, they are not in the data path. They do not store or process client data. They only keep track of the cluster state for both clients and individual storage nodes. Data always flows directly between the storage node and the client.
So there is no central bottleneck
A storage client will contact the appropriate storage node directly to store or retrieve data. There are no components in between, except for the network, which you will need to size accordingly. Because there are no intermediate components or proxies that could potentially create a bottleneck, a Ceph cluster can really scale horizontally in both capacity and performance. And while scaling storage and performance, data is protected by redundancy.
How does Ceph provide data redundancy?
To provide a redundant and safe storage infrastructure, Ceph offers both replication and erasure coding. For replication, Ceph distributes copies of the data and ensures that the copies are stored on different storage nodes.
You can configure as many replicas as you like. The only downside of storing more replicas is the cost of the extra hardware you need to provide the extra raw storage capacity. You may decide that data durability and availability are so important that you are willing to sacrifice space and absorb the cost, but in general 3 replicas are advised as a minimum for Ceph.
Does Ceph also support erasure coding?
So what if you think having 3 replicas is too costly? How does Ceph protect your data then?
To explain this technique, it is easiest to compare it to RAID technologies. In that comparison, RAID1 resembles Ceph ‘replication’: both offer the best overall performance, but neither is the most space-efficient, especially as you need more than one extra copy of the data to achieve the level of redundancy you need.
This is why RAID5 and RAID6 were introduced in the past as alternatives to RAID1 or RAID10. Parity RAID assures redundancy with much less storage overhead. As always in IT, this comes at a price though: in this case at the cost of storage performance (mostly write performance). Ceph and RAID5/RAID6 both use a type of ‘erasure coding’ to achieve comparable results. As an example of erasure coding, you can tell Ceph to chop up the data into 8 data segments and 4 parity segments.
With 8 data segments and 4 parity segments, only 33% of the raw capacity is used for redundancy (4 out of every 12 segments), instead of the 50% (two copies) or even more that replication costs, depending on how many copies you want. This example does assume that you have at least 8 + 4 = 12 storage nodes. Other schemes work as well: you could use 6 data segments + 2 parity segments (comparable to RAID6) with only 8 hosts.
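The arithmetic behind these percentages can be checked with a few lines of Python. This is just a small calculation sketch; the function names replication_usage and erasure_usage are my own and are not part of Ceph.

```python
def replication_usage(copies: int) -> tuple[float, float]:
    """Return (raw bytes per usable byte, share of raw capacity spent on redundancy)."""
    return float(copies), (copies - 1) / copies


def erasure_usage(data_segments: int, parity_segments: int) -> tuple[float, float]:
    """Same two figures for a scheme with data and parity segments."""
    total = data_segments + parity_segments
    return total / data_segments, parity_segments / total


schemes = {
    "2 replicas": replication_usage(2),
    "3 replicas": replication_usage(3),
    "8 data + 4 parity": erasure_usage(8, 4),
    "6 data + 2 parity": erasure_usage(6, 2),
}

for name, (raw_per_usable, redundancy_share) in schemes.items():
    print(f"{name:18} raw per usable byte: {raw_per_usable:.2f}  "
          f"redundancy share: {redundancy_share:.0%}")
```

So for every usable terabyte you need 3 TB of raw capacity with 3 replicas, but only 1.5 TB with 8 data + 4 parity segments.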
What failure domains does Ceph protect against?
Ceph is datacenter-aware: the CRUSH map can represent your physical datacenter topology, consisting of racks, rows, rooms, floors, datacenters and so on. You can customise your topology. This allows you to create very clear data storage policies that Ceph will use to assure that the cluster can tolerate failures across certain boundaries. An example of a Ceph infrastructure:
If you want, you can be protected against the loss of a whole rack, or even a whole row of racks, and the cluster will still be fully operational, although performance and capacity are reduced. That much redundancy may cost so much storage that you may not want to employ it for all of your data. That’s no problem. You can create multiple storage pools that each have their own protection level and thus cost.
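A policy such as ‘never put two copies in the same rack’ can be sketched in a few lines of Python. Again, this is a simplified illustration and not how CRUSH rules are actually written or evaluated. The topology, the names and the function place are assumptions of mine.

```python
import hashlib


def score(key: str, item: str) -> int:
    """Deterministic pseudo-random score for a (key, item) pair."""
    digest = hashlib.sha256(f"{key}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_name: str, topology: dict[str, list[str]],
          copies: int = 3) -> list[tuple[str, str]]:
    """Pick one storage node in each of `copies` different racks."""
    if copies > len(topology):
        raise ValueError("not enough racks for the requested number of copies")
    # First choose the racks, then one node inside each chosen rack.
    racks = sorted(topology, key=lambda rack: score(object_name, rack), reverse=True)
    return [
        (rack, max(topology[rack], key=lambda node: score(object_name, node)))
        for rack in racks[:copies]
    ]


# Hypothetical topology: four racks with two storage nodes each.
topology = {
    "rack-1": ["node-a", "node-b"],
    "rack-2": ["node-c", "node-d"],
    "rack-3": ["node-e", "node-f"],
    "rack-4": ["node-g", "node-h"],
}

chosen = place("invoice-2019-04.pdf", topology)
for rack, node in chosen:
    print(f"{rack}: {node}")
```

Because every copy ends up in a different rack, losing one complete rack can never take away more than one copy of an object.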
What is this Object Storage Daemon (OSD) I always read about?
If you read about Ceph, you read a lot about the OSD. This is a service that runs on the storage node. The OSD is the actual workhorse of Ceph: it serves the data from the drive, or ingests data and stores it on the drive. The OSD also assures storage redundancy by replicating data to other OSDs based on the CRUSH map. For every hard drive or solid state drive in the storage node, an OSD will be active. A Ceph environment with 24 hard drives runs 24 OSDs.
When a drive goes down, its OSD will go down too and the monitor nodes will distribute an updated CRUSH map, so the clients are aware and know where to get the data. The OSDs also respond to this update: because one replica of some of the data is lost, they will start to replicate the affected data to make it redundant again (across fewer drives though). After this automatic process the cluster is fully healthy again. This is comparable to having a ‘hot-spare’ without the need for ‘hot-spares’.
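The recovery behaviour described above can be imitated with a small Python simulation. It is a sketch under simplifying assumptions: placement is done with rendezvous hashing instead of CRUSH, and the OSD names, the object names and the function placement are invented for the example.

```python
import hashlib


def score(object_name: str, osd: str) -> int:
    """Deterministic pseudo-random score for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{object_name}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def placement(object_name: str, osds: list[str], copies: int = 3) -> list[str]:
    """Return the OSDs that should hold the object."""
    return sorted(osds, key=lambda osd: score(object_name, osd), reverse=True)[:copies]


osds = [f"osd.{i}" for i in range(8)]
objects = [f"object-{i}" for i in range(1000)]

# Placement while all OSDs are healthy.
before = {obj: placement(obj, osds) for obj in objects}

# One drive fails; everyone recalculates with the updated map.
failed = "osd.3"
surviving = [osd for osd in osds if osd != failed]
after = {obj: placement(obj, surviving) for obj in objects}

affected = [obj for obj in objects if failed in before[obj]]
moved = [obj for obj in objects if set(before[obj]) != set(after[obj])]

print(f"objects with a copy on {failed}: {len(affected)}")
print(f"objects whose placement changed: {len(moved)}")
print(f"every object has 3 copies again: {all(len(p) == 3 for p in after.values())}")
```

In this simulation only the objects that had a copy on the failed OSD get a new location. Their two remaining copies stay where they are, and a third copy is created on one of the surviving OSDs.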
When the drive is replaced, the cluster will revert to its original state. This means that the replaced drive will be filled with data once again, to make sure data is spread evenly across all drives within the cluster.