Running stateful workloads on Kubernetes with Rook Ceph

Date: 2023-12-26

The source code for this lab exercise is available on GitHub.

In our last article, Investigating a failed VolumeSnapshot with NFS on Kubernetes, we saw that NFS is not a suitable storage backend for stateful workloads on Kubernetes in a production context due to fundamental limitations of NFS itself. So how are we to run our stateful applications on Kubernetes, if at all?

A common deployment model for running stateful applications on Kubernetes is the cloud native hybrid architecture: stateful components of an application such as a database or object storage run on virtual machines (VMs) for optimal stability and performance, while stateless components such as the frontend web UI or REST API server run on Kubernetes for maximal resiliency, elasticity and high availability. While this deployment model combines the best of both worlds, configuration is more complex than deploying the entire application in-cluster, since the stateless components running on Kubernetes must be explicitly configured to point to the out-of-cluster stateful components instead of leveraging the native service discovery mechanisms offered by Kubernetes, as the sketch below illustrates.
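
Here is a minimal sketch of such explicit wiring: a selector-less Service backed by manually managed Endpoints gives in-cluster components a stable DNS name for an external database VM. The service name, port and IP address are purely illustrative.

---
apiVersion: v1
kind: Service
metadata:
  name: postgres          # in-cluster components connect to "postgres:5432"
spec:
  ports:
    - port: 5432
      targetPort: 5432
---
apiVersion: v1
kind: Endpoints
metadata:
  name: postgres          # must match the Service name
subsets:
  - addresses:
      - ip: 10.0.0.42     # placeholder IP of the out-of-cluster database VM
    ports:
      - port: 5432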

Another option is to leverage a Kubernetes-native distributed storage solution such as Rook Ceph as the storage backend for stateful components running on Kubernetes. This has the benefit of simplifying application configuration while addressing business requirements for data backup and recovery such as the ability to take volume snapshots at a regular interval and perform application-level data recovery in case of a disaster.

Rook is a CNCF Graduated project. As its name might suggest, Rook Ceph consists of two major components: Ceph and Rook. Ceph is the distributed storage system itself, while Rook is the Kubernetes operator that automates the setup and lifecycle management of Ceph clusters, greatly simplifying their deployment and administration on Kubernetes.

In the lab to follow, we’ll quickly provision a 3-node kubeadm cluster (1 master, 2 workers) on the cloud provider of your choice using an automation stack comprised of OpenTofu and Ansible, then deploy Rook Ceph using the official Helm charts and confirm that we are now able to successfully create CSI volume snapshots from PVCs by reusing the MinIO example from our last article.

Lab: Deploying Rook Ceph to a 3-node kubeadm cluster

This lab has been tested with Kubernetes v1.29.0 (Mandala).

Prerequisites

Familiarity with Kubernetes cluster administration is assumed. Furthermore, it is assumed that you are already familiar with Kubernetes CSI and related API objects such as VolumeSnapshots and VolumeSnapshotClasses. If not, it is recommended to follow through the lab at Investigating a failed VolumeSnapshot with NFS on Kubernetes before attempting this lab.

Setting up your environment

A Unix/Linux environment is required. If on Windows, make sure WSL2 is installed and follow the lab within WSL2.

The automated 3-node kubeadm cluster setup relies on 2 tools: OpenTofu and Ansible. OpenTofu is an open source, Linux Foundation-governed drop-in replacement for the well-known Terraform infrastructure-as-code (IaC) tool, while Ansible is an automation and configuration management tool. We’ll use OpenTofu to provision the necessary infrastructure on the cloud provider of your choice (currently AWS and Alibaba Cloud are supported), then switch over to Ansible to run the scripts required to set up a bare-bones Kubernetes cluster on that infrastructure.

In case you do not have an AWS or Alibaba Cloud account and do not wish to create one, it is still possible to follow through this lab by manually provisioning the required infrastructure on the cloud provider of your choice, on-premises as VMs or on bare metal: 3 Ubuntu nodes (1 master, 2 workers), each with 8 vCPUs, 32 GiB of memory and a clean (unformatted and unpartitioned) 64 GiB data disk, reachable over SSH.

You are then free to skip the step involving OpenTofu and invoke ansible-playbook directly, after customizing the 2 files detailed below with configuration values matching your own infrastructure.

ansible/ansible.cfg
[defaults]
inventory = ./hosts.yaml
remote_user = ubuntu
private_key_file = /path/to/your/key.pem
host_key_checking = False
ansible/hosts.yaml
masters:
  hosts:
    master0:
      ansible_host: x.x.x.x
      private_ip: x.x.x.x
workers:
  hosts:
    worker0:
      ansible_host: x.x.x.x
    worker1:
      ansible_host: x.x.x.x
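
Once you have cloned the project repository (see “Deploying the automation stack” below) and customized the two files above to match your infrastructure, a quick connectivity check from the repository root confirms that Ansible can reach all 3 nodes over SSH before you run the playbook:

export ANSIBLE_CONFIG="${PWD}/ansible/ansible.cfg"
ansible all -m ping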

The rest of the instructions will assume an AWS environment, though note that Alibaba Cloud is also supported with minimal modification, namely by setting the following environment variables:

export CLOUD_PROVIDER="aliyun"
export TF_VAR_aliyun_access_key="XXXXXXXXXXXXXXXXXXXXXXXX" # replace me!
export TF_VAR_aliyun_secret_key="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" # replace me!

Setting up your AWS account

You’ll need to set up your AWS account and create an IAM administrator, then generate an access key and secret key for that IAM administrator so you can set up AWS CLI v2 in the next step. Consult the official AWS documentation if in doubt.

Setting up AWS CLI v2

First make sure ~/.local/bin/ exists and is in your PATH so sudo is not required for installing the various command-line tools in this lab on your local laptop / desktop / workstation.

mkdir -p "$HOME/.local/bin/"
echo "export PATH=\"\$HOME/.local/bin:\$PATH\"" >> "$HOME/.bashrc"
source "$HOME/.bashrc"

Now download AWS CLI v2 from the official website and install it using the provided script:

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip -n awscliv2.zip
./aws/install --bin-dir "$HOME/.local/bin/" --install-dir "$HOME/.local/aws-cli/"

You’ll also need to configure AWS CLI v2 with your access and secret keys when prompted:

aws configure

Confirm that the setup is functional:

aws ec2 describe-instances

Sample output:

{
    "Reservations": []
}

Installing OpenTofu

Now install OpenTofu using the official release binaries. The latest version is 1.6.0-rc1 at the time of writing.

wget https://github.com/opentofu/opentofu/releases/download/v1.6.0-rc1/tofu_1.6.0-rc1_linux_amd64.zip
unzip -n tofu_1.6.0-rc1_linux_amd64.zip
mv tofu "$HOME/.local/bin/"

Confirm that OpenTofu is correctly installed:

tofu -version

Sample output:

OpenTofu v1.6.0-rc1
on linux_amd64

Installing Ansible

The recommended installation method is via pipx, which needs to be installed first.

python3 -m pip install --user pipx
python3 -m pipx ensurepath

Now install Ansible with pipx. The latest version is 2.16.2 at the time of writing.

python3 -m pipx install --include-deps ansible

Confirm that Ansible is correctly installed:

ansible-playbook --version

Sample output:

ansible-playbook [core 2.16.2]
  config file = None
  configured module search path = ['/home/donaldsebleung/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/donaldsebleung/.local/share/pipx/venvs/ansible/lib/python3.10/site-packages/ansible
  ansible collection location = /home/donaldsebleung/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/donaldsebleung/.local/bin/ansible-playbook
  python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (/home/donaldsebleung/.local/share/pipx/venvs/ansible/bin/python)
  jinja version = 3.1.2
  libyaml = True

Deploying the automation stack

Clone the project repository and make it your working directory.

git clone https://github.com/DonaldKellett/kubeadm-1m2w.git
cd kubeadm-1m2w/

Now invoke OpenTofu to provision the necessary Amazon EC2 instances. Note that Rook Ceph has nontrivial resource requirements: 8 vCPUs and 32 GiB of memory per node, which the t3.2xlarge instance type satisfies.

export CLOUD_PROVIDER="aws"
export TF_VAR_instance_type="t3.2xlarge"
tofu -chdir="opentofu/${CLOUD_PROVIDER}/" init
tofu -chdir="opentofu/${CLOUD_PROVIDER}/" apply

Answer yes when prompted. Once the provisioning is complete, make note of the k8s-master0-public-ip - you’ll need this to SSH into the master node for running kubectl commands.
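
If you lose track of the IP, you can print it again at any time (assuming the OpenTofu output is named k8s-master0-public-ip as shown in the apply summary):

tofu -chdir="opentofu/${CLOUD_PROVIDER}/" output k8s-master0-public-ip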

Now wait a few seconds for the instances to stabilize and run the provided Ansible playbook, which should take no longer than 5 minutes to complete:

export ANSIBLE_CONFIG="${PWD}/ansible/ansible.cfg"
ansible-playbook "${PWD}/ansible/playbook.yaml"

Checking that everything is set up correctly

Now SSH into the master node using the public IP you noted earlier. A typical invocation looks something like the following, assuming the Ubuntu cloud image’s default user and the key file referenced in ansible/ansible.cfg:
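
ssh -i /path/to/your/key.pem ubuntu@<k8s-master0-public-ip>

Once logged in, check that all 3 nodes are up and ready: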

kubectl get nodes

Sample output:

NAME      STATUS   ROLES           AGE     VERSION
master0   Ready    control-plane   3m43s   v1.29.0
worker0   Ready    <none>          102s    v1.29.0
worker1   Ready    <none>          102s    v1.29.0

Let’s also take a look at the disks available to the master node (also available to each worker node):

sudo lsblk -f

Sample output:

NAME         FSTYPE FSVER LABEL           UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
loop0                                                                                0   100% /snap/amazon-ssm-agent/7628
loop1                                                                                0   100% /snap/core18/2812
loop2                                                                                0   100% /snap/core20/2015
loop3                                                                                0   100% /snap/lxd/24322
loop4                                                                                0   100% /snap/snapd/20290
nvme1n1                                                                                       
nvme0n1                                                                                       
├─nvme0n1p1  ext4   1.0   cloudimg-rootfs 9e71e708-e903-4c26-8506-d85b84605ba0     11G    28% /
├─nvme0n1p14                                                                                  
└─nvme0n1p15 vfat   FAT32 UEFI            A62D-E731                              98.3M     6% /boot/efi

Notice that the 64 GiB data disk nvme1n1 is unformatted and unpartitioned. This is important, as our Ceph cluster will later use this data disk on each node to provide storage to our stateful Kubernetes applications.

All subsequent commands in this lab should be executed on the Kubernetes master node unless otherwise instructed.

Installing Rook Ceph

Before installing Rook Ceph, we’ll need to install a few dependencies: the external snapshotter CRDs, a snapshot controller and Helm.

Install the CRDs from the Kubernetes CSI external-snapshotter project:

kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml

Confirm the CRDs are correctly installed:

kubectl api-resources | grep volumesnapshot

Sample output:

volumesnapshotclasses             vsclass,vsclasses   snapshot.storage.k8s.io/v1        false        VolumeSnapshotClass
volumesnapshotcontents            vsc,vscs            snapshot.storage.k8s.io/v1        false        VolumeSnapshotContent
volumesnapshots                   vs                  snapshot.storage.k8s.io/v1        true         VolumeSnapshot

Now ensure ~/.local/bin/ is in your PATH so sudo is not required for installing Helm:

mkdir -p "$HOME/.local/bin/"
echo "export PATH=\"\$HOME/.local/bin:\$PATH\"" >> "$HOME/.bashrc"
source "$HOME/.bashrc"

Next, install Helm:

wget https://get.helm.sh/helm-v3.13.3-linux-amd64.tar.gz
tar xvf helm-v3.13.3-linux-amd64.tar.gz
mv linux-amd64/helm "$HOME/.local/bin/"

Confirm Helm is correctly installed:

helm version

Sample output:

version.BuildInfo{Version:"v3.13.3", GitCommit:"c8b948945e52abba22ff885446a1486cb5fd3474", GitTreeState:"clean", GoVersion:"go1.20.11"}

Now we can install the snapshot controller via a Helm chart using the default values.

helm repo add democratic-csi https://democratic-csi.github.io/charts/
helm repo update
helm -n kube-system install \
    snapshot-controller \
    democratic-csi/snapshot-controller \
    --version 0.2.4
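
Optionally, confirm that the snapshot controller pod is up and running; the pod name is assumed here to contain the Helm release name snapshot-controller:

kubectl -n kube-system get pods | grep snapshot-controller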

Installing the Rook operator

The Rook operator performs lifecycle management of Ceph clusters and must be installed before the Ceph cluster itself. By default, it expects to be deployed in the rook-ceph namespace, though this behavior is configurable.

Remember to set enableDiscoveryDaemon=true, which deploys a per-node discovery daemon that regularly scans for clean (unformatted and unpartitioned), newly attached data disks to be added to the storage pool.

helm repo add rook-release https://charts.rook.io/release
helm repo update
helm -n rook-ceph install \
    rook-ceph \
    rook-release/rook-ceph \
    --set enableDiscoveryDaemon=true \
    --create-namespace

Now wait for the operator and per-node discovery daemon to enter a Ready state:

kubectl -n rook-ceph wait \
    --for=condition=Ready \
    pods \
    -l 'app in (rook-ceph-operator, rook-discover)' \
    --timeout=180s

Sample output:

pod/rook-ceph-operator-775858d6b-98qsk condition met
pod/rook-discover-ln4gp condition met
pod/rook-discover-tgzdn condition met
pod/rook-discover-v779r condition met

If you only see the first line of output, re-run the command above since the discovery daemon pods are not created until the operator is up and running.

Installing the Ceph cluster

Now install the Ceph cluster - we’ll briefly go through each of the key components in a moment.

Ensure that the following Helm chart values are set so that a VolumeSnapshotClass is created for both the RBD (block) and CephFS (filesystem) provisioners:

helm -n rook-ceph install \
    rook-ceph-cluster \
    rook-release/rook-ceph-cluster \
    --set cephBlockPoolsVolumeSnapshotClass.enabled=true \
    --set cephFileSystemVolumeSnapshotClass.enabled=true

Now wait for the Ceph cluster to enter a Ready state - this may take up to 15 minutes:

kubectl -n rook-ceph wait --for=condition=Ready cephclusters --all --timeout=900s

Sample output:

cephcluster.ceph.rook.io/rook-ceph condition met

Let’s take a closer look at our Ceph cluster.

kubectl -n rook-ceph get cephcluster rook-ceph

Sample output:

NAME        DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                        HEALTH      EXTERNAL   FSID
rook-ceph   /var/lib/rook     3          45m   Ready   Cluster created successfully   HEALTH_OK              5d6cc90c-87f1-4de7-8b6d-8c18562a9fa0

HEALTH_OK indicates that our Ceph cluster is healthy. To really see what’s going on, it may help to view the created pods.

kubectl -n rook-ceph get pods -l 'app,app notin (rook-ceph-operator, rook-discover)'

Sample output:

NAME                                                READY   STATUS      RESTARTS      AGE
csi-cephfsplugin-4jrkb                              2/2     Running     1 (46m ago)   47m
csi-cephfsplugin-provisioner-cd88d8c4-fcm9g         5/5     Running     1 (46m ago)   47m
csi-cephfsplugin-provisioner-cd88d8c4-jbwcb         5/5     Running     0             47m
csi-cephfsplugin-pzbgx                              2/2     Running     0             47m
csi-cephfsplugin-w6f6h                              2/2     Running     1 (46m ago)   47m
csi-rbdplugin-gbrc8                                 2/2     Running     1 (46m ago)   47m
csi-rbdplugin-kl6cd                                 2/2     Running     1 (46m ago)   47m
csi-rbdplugin-provisioner-f6d4c9775-wlb9n           5/5     Running     1 (46m ago)   47m
csi-rbdplugin-provisioner-f6d4c9775-xvvwq           5/5     Running     0             47m
csi-rbdplugin-s89gt                                 2/2     Running     0             47m
rook-ceph-crashcollector-master0-5787b75899-xh62s   1/1     Running     0             41m
rook-ceph-crashcollector-worker0-6d859cbd7-tpcvm    1/1     Running     0             42m
rook-ceph-crashcollector-worker1-6f8d676454-hrlfx   1/1     Running     0             37m
rook-ceph-mds-ceph-filesystem-a-5977dbcfc8-m6tp7    2/2     Running     0             42m
rook-ceph-mds-ceph-filesystem-b-57d57cc6d4-v542c    2/2     Running     0             41m
rook-ceph-mgr-a-6d48d5cd4-l5jq5                     3/3     Running     0             45m
rook-ceph-mgr-b-7b944ddd97-skknv                    3/3     Running     0             45m
rook-ceph-mon-a-764cbbbd9f-hdmtx                    2/2     Running     0             47m
rook-ceph-mon-b-5b6cd88d59-mh4p4                    2/2     Running     0             45m
rook-ceph-mon-c-795c785489-p65nr                    2/2     Running     0             45m
rook-ceph-osd-0-bb954988-86rx6                      2/2     Running     0             44m
rook-ceph-osd-1-766d6f9dd5-ct4np                    2/2     Running     0             44m
rook-ceph-osd-2-6f9df74867-ntjjg                    2/2     Running     0             44m
rook-ceph-osd-prepare-master0-4gpbg                 0/1     Completed   0             43m
rook-ceph-osd-prepare-worker0-zkvdm                 0/1     Completed   0             43m
rook-ceph-osd-prepare-worker1-nlq89                 0/1     Completed   0             43m
rook-ceph-rgw-ceph-objectstore-a-86cc9554df-476vf   2/2     Running     0             37m

That’s a lot of pods! The key ones are the Ceph monitors (rook-ceph-mon-*), which maintain the cluster map and quorum; the managers (rook-ceph-mgr-*), which expose management and monitoring endpoints; the OSDs (rook-ceph-osd-*), one per data disk, which store the actual data; the MDS pods (rook-ceph-mds-*) backing the CephFS shared filesystem; the RGW pod (rook-ceph-rgw-*) serving the S3-compatible object store; and the CSI provisioner and per-node plugin pods (csi-rbdplugin-*, csi-cephfsplugin-*) that expose RBD and CephFS volumes to Kubernetes. The remaining pods (crash collectors and completed OSD prepare jobs) we can safely skip over.

Now that our Ceph cluster is running, let’s reuse the MinIO example from our last article and confirm that we are now able to successfully create a VolumeSnapshot from a PVC. But before that, let’s confirm that we have a working StorageClass for creating PVCs and a working VolumeSnapshotClass for creating volume snapshots from our PVCs.

kubectl get storageclass
kubectl get volumesnapshotclass

Sample output:

NAME                   PROVISIONER                     RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ceph-block (default)   rook-ceph.rbd.csi.ceph.com      Delete          Immediate           true                   66m
ceph-bucket            rook-ceph.ceph.rook.io/bucket   Delete          Immediate           false                  66m
ceph-filesystem        rook-ceph.cephfs.csi.ceph.com   Delete          Immediate           true                   66m
NAME              DRIVER                          DELETIONPOLICY   AGE
ceph-block        rook-ceph.rbd.csi.ceph.com      Delete           66m
ceph-filesystem   rook-ceph.cephfs.csi.ceph.com   Delete           66m

Observe that ceph-block (Ceph RBD) is marked as the default StorageClass and has a corresponding VolumeSnapshotClass, so we expect to be able to create snapshots from our PVCs normally.

Installing MinIO and creating a VolumeSnapshot

This is much the same as with our previous lab.

Add the Bitnami repo and install MinIO:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm -n minio install \
    minio \
    bitnami/minio \
    --version 12.9.4 \
    --create-namespace

Now wait for it to become ready, which should take no longer than 5 minutes:

kubectl -n minio wait --for=condition=Ready pods --all --timeout=300s

Observe that a PVC minio is created and bound:

kubectl -n minio get pvc minio

Now create a snapshot from our MinIO PVC:

kubectl -n minio apply -f - << EOF
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: minio
spec:
  volumeSnapshotClassName: ceph-block
  source:
    persistentVolumeClaimName: minio
EOF

Wait for our snapshot to become ready:

kubectl -n minio wait \
    --for=jsonpath='{.status.readyToUse}'=true \
    volumesnapshot \
    minio \
    --timeout=180s

Sample output:

volumesnapshot.snapshot.storage.k8s.io/minio condition met
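
As an optional extra check that the snapshot is actually usable, you can restore it into a new PVC by referencing the snapshot as a dataSource. The PVC name below is made up, and the requested size must be at least that of the original minio PVC:

kubectl -n minio apply -f - << EOF
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-restored
spec:
  storageClassName: ceph-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi # adjust to match (or exceed) the size of the original minio PVC
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: minio
EOF

kubectl -n minio get pvc minio-restored should show the new PVC reach the Bound state shortly, backed by a clone of the snapshot data.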

Cleaning up

Unless you wish to continue experimenting with your cluster, log out of the master node now. Then, on your own laptop / desktop / workstation, make sure your working directory is the kubeadm-1m2w/ project you cloned earlier and run the following commands to clean up the lab resources and save costs:

export CLOUD_PROVIDER="aws"
tofu -chdir="opentofu/${CLOUD_PROVIDER}/" destroy

Answer yes when prompted.

Concluding remarks and going further

By deploying the production-grade Rook Ceph CSI distributed storage backend to Kubernetes, it is possible to run stateful applications directly on Kubernetes in a production context. Combined with integrated backup and disaster recovery (DR) solutions capable of cross-cluster, application-level data recovery such as Velero and Kasten K10, this lets you meet your business continuity, legal, audit and compliance requirements, all without sacrificing Kubernetes’ powerful features and abstractions, such as built-in service discovery, for managing your entire application. However, in rare cases where performance, stability and permanence* are of utmost importance, it may still make sense to deploy specific stateful component(s) outside of Kubernetes and adopt a partial cloud native hybrid model just for those components.

Storage (CSI) on Kubernetes has always been a tricky subject and has taken longer to mature than other fundamental capabilities such as compute (CRI) and networking (CNI), leaving IT practitioners and managers with the impression that “Kubernetes is best for running stateless workloads, run stateful workloads in VMs”. However, as Kubernetes’ storage capabilities mature, this notion is increasingly being challenged. So, what do you think?

* From the GitLab 3k reference architecture with a cloud native hybrid deployment model:

Hybrid installations leverage the benefits of both cloud native and traditional compute deployments. With this, stateless components can benefit from cloud native workload management benefits while stateful components are deployed in compute VMs with Linux package installations to benefit from increased permanence.
