You might be asking yourself: why talk about QD=1 bs=4k? In the experience of Wido den Hollander, our colleague at 42on, single-thread IO latency is very important for many applications. What we see with benchmarks is that people focus on high bandwidth and large numbers of IOps. They simply go multi-threaded with a high queue depth (QD) and then say they reached millions of IOps with Ceph. However, if you look at the latency of a single IO, it can still be pretty high. A PHP web server, a MariaDB database or a Redis cache flushing to disk, for example, all do single-threaded IO, so the latency of that single-threaded IO starts to matter. A QD of 1 is where you notice how snappy applications will feel. Therefore, it’s highly advised to start all benchmarking with QD=1 bs=4k, meaning a queue depth of 1 and a block size of 4k. From there you can start increasing the QD and BS to get more information on the performance of the cluster. In short, it all starts with QD=1 bs=4k.
Moving on to a related subject: low latency in Ceph. What you should know is that Ceph itself will never provide you with the lowest latency possible. This is because Ceph was designed for things other than latency: redundancy, scalability and data safety. If you take a local NVMe drive and put it in your laptop or server, it will get far better latency than Ceph will ever provide. The reason is that with Ceph every IO has to go over the network, over TCP/IP, and then through the CPU, where the Ceph code does its thing. On top of that, a write goes to multiple nodes, because we are usually replicating two or three times. So, writing a block in Ceph will be slower than on other types of storage.
Ceph provides you with three major properties: redundancy, scalability and data safety. At 42on we always say we have never seen Ceph lose data; when data is lost, there is always something else happening, like hardware failure. Ceph itself cares about your data. The safety of your data is the number one priority for Ceph, and performance comes third. So, what can we achieve with Ceph in terms of IOps? For this, Wido did benchmarking with FIO, using a simple configuration. He took the RBD IO engine and used the pool named RBD. In one case there was an image called FIO1. Make sure you run these tests multiple times: you have to prepopulate the RBD image by running the test a couple of times. Then you simply set an IO depth of 1 and a block size of 4K. He ran the test, and after 60 seconds it reported the latency of the Ceph system.
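
The FIO setup described above can be sketched as a small Python helper that writes out a job file. This is only an illustration, not Wido's exact configuration: the option names (`ioengine=rbd`, `clientname`, `pool`, `rbdname`, `iodepth`, `bs`, `runtime`, `time_based`) are standard FIO options, but the `admin` client name, the random-write pattern and the helper name `build_fio_job` are assumptions made here.

```python
def build_fio_job(pool="rbd", image="fio1", iodepth=1, block_size="4k", runtime_s=60):
    """Return the text of a FIO job file for a QD=1 bs=4k RBD write test."""
    lines = [
        "[global]",
        "ioengine=rbd",        # FIO's librbd engine, talks to the cluster directly
        "clientname=admin",    # assumption: the client.admin keyring is used
        f"pool={pool}",        # pool name from the text
        f"rbdname={image}",    # image must already exist and be prepopulated
        "rw=randwrite",        # assumption: random writes, the text only says "write"
        f"bs={block_size}",
        f"iodepth={iodepth}",  # a single IO in flight at any time
        "numjobs=1",           # a single thread
        f"runtime={runtime_s}",
        "time_based",          # keep running for the full runtime
        "",
        f"[{image}-qd{iodepth}]",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Redirect to a file, e.g. `python job.py > qd1.fio`, then run `fio qd1.fio`.
    print(build_fio_job(), end="")
```

Raising `iodepth` or `block_size` in later runs gives the follow-up measurements, while the QD=1 bs=4k run stays the baseline.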
Moving on to the hardware setup. The following tests were also conducted by Wido. He took some SuperMicro systems with an AMD EPYC 16-core CPU, 256 GB of memory, 4 Samsung PM SSDs and 100 Gbit Mellanox networking. The main performance gain you’re going to get comes from pinning your CPU C-state to 1, which is a kernel parameter, and from setting the performance profile of the CPUs to performance. If you look it up on Google, you can find how to tune both. That means the CPU will always run at its maximum clock speed, in this case 2.4 GHz. With that you will get the lowest possible latency from the code.
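
To check whether a node is actually tuned this way, something like the following could be used. It is a rough sketch under assumptions: it presumes the C-state limit was set with the `processor.max_cstate` kernel parameter (the text does not name the exact parameter) and that the kernel exposes the frequency governor through the usual cpufreq files in sysfs. The function names are made up for this example.

```python
import glob
import re


def max_cstate_from_cmdline(cmdline):
    """Return the processor.max_cstate value from a kernel command line, or None."""
    match = re.search(r"(?:^|\s)processor\.max_cstate=(\d+)", cmdline)
    return int(match.group(1)) if match else None


def read_governors(pattern="/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"):
    """Return the set of frequency governors in use across all CPUs."""
    governors = set()
    for path in glob.glob(pattern):
        with open(path) as fh:
            governors.add(fh.read().strip())
    return governors


if __name__ == "__main__":
    with open("/proc/cmdline") as fh:
        print("processor.max_cstate:", max_cstate_from_cmdline(fh.read()))
    # A tuned node would be expected to report only "performance" here.
    print("governors in use:", read_governors() or "cpufreq not exposed")
```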
In this case having 100 Gbit networking doesn’t matter. The bandwidth used in this test is a matter of megabits, not gigabits, per second. 25 Gbit networking works, and 10 Gbit works too, although it is slightly slower. Most of the time is spent in the CPU, where the Ceph code is running, so the C-state pinning and the performance profile of the CPU are what matter. Software-wise, Ubuntu 18.04 and Ceph v15.2.8 were used, and all logging was turned off with debug_X = 0 etc.
Now, what can we achieve? Wido was able to get 1364 IOps with this hardware. That is a write latency of 0.73 milliseconds for a 4K block being written to 3 nodes at the same time. This includes all the replication: the block has been written to 3 different NVMe drives in 3 different nodes, within 1 millisecond. All in all, that’s fairly good performance if you consider where Ceph comes from. We replicate over the network, it’s a distributed storage system that can scale out, and we still get decent performance.
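
The relation between these numbers is simple arithmetic: with a queue depth of 1 only one IO is in flight at a time, so the average latency is just one second divided by the IOps. The short sketch below works that out, along with the client-side bandwidth such a test generates. It assumes 4K means 4096 bytes and counts only the client payload, not the replication traffic between nodes.

```python
def latency_ms_at_qd1(iops):
    """Average latency per IO in milliseconds when only one IO is in flight."""
    return 1000.0 / iops


def client_bandwidth_mbit(iops, block_size_bytes=4096):
    """Client payload bandwidth in megabits per second (assumes 4K = 4096 bytes)."""
    return iops * block_size_bytes * 8 / 1_000_000


iops = 1364  # result reported in the text
print(f"{latency_ms_at_qd1(iops):.2f} ms per write")                   # 0.73 ms per write
print(f"{client_bandwidth_mbit(iops):.1f} Mbit/s of client payload")   # 44.7 Mbit/s of client payload
```

This also shows why the network bandwidth is not the bottleneck here: the test moves tens of megabits per second, far below what even a 10 Gbit link offers.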
But we always want more. So, what can the future bring us? The Ceph Crimson project, which redesigns the OSD, should provide us with better latency. At the moment, however, Crimson is not focused on providing lower latency; they’re just revisiting the code. In the future it should provide lower latency. Then there is the RBD persistent write-back cache, which uses a local NVMe drive inside a hypervisor to cache IOs. Wido tested this with v16.2.4, but it has not been stable enough to provide real results. He did see an increase of about 3500 IOps with a much better write latency. He will revisit this at a later stage when the code is more stable. Until then, if you want lower latency: faster, higher-clocked CPUs will gain you more in terms of latency than more cores.
Going back to the hardware, the reason Wido chose 16 cores is that this specific SuperMicro system can hold 10 NVMe drives, so you then have 16 cores for 10 NVMe drives. You could also say that if there were a CPU with 10 or 8 cores, you could go with 8 cores at an even higher clock speed, which would bring down the latency. But it will not give you more IO for the total cluster, because that still relies on the number of cores.
So, it’s all a balance. If you’re looking for lower latency you need faster CPUs, and if you need more total IOps from the system, you need more CPU cores in the whole cluster.