When we work on Ceph storage infrastructures, we always ask clients to run a diagnostic tool first. For these diagnostics we developed a tool called Ceph-Collect, which gathers information from a Ceph cluster. After it runs, it produces an in-depth report on the storage architecture, together with build and setup details. 42on wrote this tool to help our clients more quickly and efficiently.
The tool itself is very broad in usage. We use it for general issues, consultancy requests, architecture designs, health checks, emergency cases, et cetera. For instance, when an emergency or issue occurs within a cluster, the first thing we request is a Ceph-Collect report. Before meeting with your team, we go through that report, which allows us to form an initial assessment of the problem.
Apart from using Ceph-Collect reactively as described above, we also use it proactively. For example, when we take on new support clients, we first want to get to know their clusters and see whether they are healthy and performing well. We have also found that running the diagnostics a couple of times per year often lets us prevent outages before they happen.
When it comes to using the tool, the first step is to get it from GitHub. You can either clone the Git repository or download and run the tool directly via the command line:
- curl -SL https://raw.githubusercontent.com/42on/ceph-collect/master/ceph-collect | python
To run the tool, the following requirements need to be met on the client where you run Ceph-Collect:
- Python 2.7 or higher
- The python-rados, ceph and ceph-common RPM or DEB packages installed
- client.admin keyring present in /etc/ceph
- /etc/ceph/ceph.conf configured to connect to Ceph cluster
- You can test this by running the command: ceph health
This should output either HEALTH_OK, HEALTH_WARN or HEALTH_ERR.
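The requirements above can be verified with a small pre-flight script before running Ceph-Collect. This is only an illustrative sketch based on the list; the keyring filename below is the common default and may differ in your setup:

```shell
#!/bin/sh
# Pre-flight check for Ceph-Collect, based on the requirements above.
# /etc/ceph/ceph.client.admin.keyring is the usual default keyring path;
# adjust it if your cluster uses a different name.

check_cmd() {
    # Succeeds if the given command is available in PATH
    command -v "$1" >/dev/null 2>&1
}

preflight() {
    rc=0
    check_cmd python || check_cmd python3 || { echo "missing: python"; rc=1; }
    check_cmd ceph || { echo "missing: ceph CLI"; rc=1; }
    [ -f /etc/ceph/ceph.conf ] || { echo "missing: /etc/ceph/ceph.conf"; rc=1; }
    [ -f /etc/ceph/ceph.client.admin.keyring ] || { echo "missing: client.admin keyring"; rc=1; }
    return $rc
}

if preflight; then
    # Should print HEALTH_OK, HEALTH_WARN or HEALTH_ERR
    ceph health
else
    echo "fix the missing requirements above before running ceph-collect"
fi
```

If any line starting with "missing:" is printed, install the listed package or put the file in place before continuing.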
If the tool does not work, there are other ways to run it; see the installation instructions on the GitHub page: https://github.com/42on/ceph-collect#usage
Usually, the tool finishes within a few seconds, but if a Ceph cluster is experiencing issues, it might take up to five minutes.
While running, it prints output like this:
root@mon01:~# ./ceph-collect --debug
DEBUG:root:Using Ceph configuration file: /etc/ceph/ceph.conf
DEBUG:root:Setting client_mount_timeout to: 10
DEBUG:root:Connecting to Ceph cluster
DEBUG:root:Using temporary directory: /tmp/tmpMpFk3n
INFO:root:Gathering overall Ceph information
INFO:root:Gathering Health information
INFO:root:Gathering MON information
INFO:root:Gathering OSD information
INFO:root:Gathering PG information
INFO:root:Gathering MDS information
INFO:root:Outputted Ceph information to /tmp/ceph-collect_20160729_150304.tar.gz
DEBUG:root:Cleaning up temporary directory: /tmp/tmpMpFk3n
After the tool finishes, a ‘tarball’ containing all the information will be placed in /tmp. This tarball should be just a few kilobytes in size, for example: /tmp/ceph-collect_20210901_085930.tar.gz. Send this tarball to email@example.com for analysis.
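If you want to verify what you are about to send, you can list the tarball's contents without extracting it. The filename below is only the example name from above; use the actual file your run produced:

```shell
#!/bin/sh
# Example filename from above; substitute the tarball your run produced.
TARBALL=/tmp/ceph-collect_20210901_085930.tar.gz

if [ -f "$TARBALL" ]; then
    # -t lists the archive's contents, -z handles gzip, -f names the file
    tar -tzf "$TARBALL"
else
    echo "tarball not found: $TARBALL"
fi
```

This lets you confirm for yourself that the archive contains only cluster metadata before mailing it.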
Once all the information is gathered, the output of the tool is used to assist customers with questions, support or emergency situations. The ‘tarball’ the tool creates contains vital information for our engineers to support our customers. However, the tool does NOT collect any user (object) data contents or authentication credentials from a Ceph cluster.
So, Ceph-Collect is the tool that helps us start working together with your teams to ensure stability or identify issues. If you are having problems, we can help you in two ways. The first option is that you solve the issue yourself with the help of 42on; in this case we explain how to solve the issue and walk you through it. The second option is that we solve the issue ourselves, directly on your cluster. For this option we will ask whether it is possible to get access to the cluster.
The tool is free to use and licensed under the GPLv2. If you would like us to have a look at your Ceph cluster(s) to see whether they are built correctly and how they perform, send me a message to get in touch.