[Write-Up: Backup for ClickHouse]

Overview

In a previous article, we explored the installation process of a ClickHouse cluster within a GKE environment using the Altinity Clickhouse Operator. Nevertheless, when deploying this cluster for production use, it is important to create a reliable backup configuration. 

This article will guide you through the essential steps and tools necessary for configuring backups effectively.

Altinity ClickHouse operator doesn't provide a backup functionality "out-of-the-box". However, Altinity team has a project called clickhouse-backup (https://github.com/Altinity/clickhouse-backup) for easy ClickHouse backup and restore with support for many cloud and non-cloud storage types.

It has a rich functionality:

  • Easy creating and restoring backups of all or specific tables.
  • Efficient storing of multiple backups on the file system.
  • Uploading and downloading with streaming compression.
  • Works with AWS, GCS, Azure, Tencent COS, FTP, SFTP.
  • Support for Atomic Database Engine.
  • Support for multi disks installations.
  • Support for custom remote storage types via rclone, kopia, restic, rsync etc.
  • Support for incremental backups on remote storage.

Let's dive deeper, but before it we need to install required tools:

  • gcloud
  • gsutil
  • kubectl

We are going to keep backups using GCS. GCS is a cloud object storage service provided by Google Cloud Platform (GCP). Google Cloud Storage allows users to store and retrieve data in a highly scalable and durable manner. It is commonly used for various purposes, including hosting static websites, storing large datasets, and serving as a backend for applications that require reliable and scalable storage.

We will use GCP Workload Idenity to allow pods from a GKE cluster to work with GCS. Google Cloud Platform (GCP) Workload Identity is a feature that enhances the security and manageability of applications and services running on GCP. Workload Identity provides a way to securely delegate permissions to Google Cloud services and resources without the need for long-lived service account credentials.

How to backup a cluster

So, now we are ready to go through the whole process of configuring backups step-by-step.

  1. Configure gcloud to work with the right project
  2. Make sure you have access to the GKE cluster with installed the Clickhouse operator
  3. Make sure Workload identity federation for GKE is enabled at the cluster and nodepool levels:
  • Cluster settings:
img1
  • Nodepool settings:
GCE instance metadata

4. Create a GCS bucket:

gcloud storage buckets create gs://my-bucket-to-keep-clickhouse-backups

where  my-bucket-to-keep-clickhouse-backups is the name you want to give your bucket.

5. Create a Kubernetes serviceaccount in the namespace with the ClickHouse operator:

kubectl create serviceaccount clickhouse-gcp --namespace clickhouse

where clickhouse-gcp – the name of your new Kubernetes service account; clickhouse - the name of the Kubernetes namespace for the service account.

6. Create a GCP service account:

gcloud iam service-accounts create clickhouse-cluster

where clickhouse-cluster – the name of the new IAM service account.

7. Grant permissions for the clickhouse-cluster serviceaccount to work with the GCS bucket:

gsutil iam ch serviceAccount:clickhouse-cluster@GCP_PROJECT_ID.gserviceaccount.com:roles/storage.objectViewer gs://my-bucket-to-keep-clickhouse-backups

8. Create a binding between the Kubernetes service account and GCP service account. It allows the Kubernetes service account to act as the IAM service account:

gcloud iam service-accounts add-iam-policy-binding GSA_NAME@GSA_PROJECT.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"

In our case it looks like:

gcloud iam service-accounts add-iam-policy-binding clickhouse-cluster@GCP_PROJECT_ID.iam.gserviceaccount.com --role roles/iam.workloadIdentityUser --member "serviceAccount:GCP_PROJECT_ID.svc.id.goog[clickhouse/clickhouse-gcp]"

9. Annotate existing Kubernetes service account:

kubectl annotate serviceaccount clickhouse-gcp --namespace clickhouse iam.gke.io/gcp-service-account=clickhouse-cluster@GCP_PROJECT_ID.iam.gserviceaccount.com

10. Add the clickhouse-backup as a sidecar container in the podTemplate section:

- name: clickhouse-backup
              image: altinity/clickhouse-backup:2.4.25
              imagePullPolicy: Always
              command:
                - bash
                - -xc
                - "/bin/clickhouse-backup server"
              env:
                - name: LOG_LEVEL
                  value: "debug"
                - name: ALLOW_EMPTY_BACKUPS
                  value: "true"
                - name: API_LISTEN
                  value: "0.0.0.0:7171"
                - name: BACKUPS_TO_KEEP_REMOTE
                  value: "14"
                - name: REMOTE_STORAGE
                  value: "gcs"
                - name: GCS_BUCKET
                  value: "my-bucket-to-keep-clickhouse-backups"
                - name: GCS_PATH
                  value: backup/shard-{shard}
                - name: CLICKHOUSE_SKIP_TABLES
                  value: "system.*,INFORMATION_SCHEMA.*,default.*" # All tables except these will be backed up 
              ports:
                - name: backup-rest
                  containerPort: 7171
              volumeMounts:
                - mountPath: /var/lib/clickhouse
                  name: server-storage

What this configuration does:

  • log level is DEBUG;
  • REST API listens on all interfaces;
  • remote storage is GCS;
  • a volume with Clickhouse data is shared between containers. Otherwise, clickhouse-backup doesn’t backup data and backups only metadata;
  • backup all tables except these "system.,INFORMATION_SCHEMA.*,default.*";
  • backups stored for 14 days.

11. Create a Kubernetes cronjob to take a backup everyday.

  • Create a Kubernetes manifest (cronjob.yaml) with the next content:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: clickhouse-backup
spec:
  # every day at 00:00
  schedule: "0 0 * * *"
  concurrencyPolicy: "Forbid"
  jobTemplate:
    spec:
      backoffLimit: 1
      completions: 1
      parallelism: 1
      template:
        metadata:
          labels:
            app: clickhouse-backup
        spec:
          restartPolicy: Never
          containers:
            - name: run-backup
              image: bash
              imagePullPolicy: IfNotPresent
              env:
                # use first replica in each shard, use `kubectl get svc -n clickhouse | grep chi-dc1`
                - name: CLICKHOUSE_SERVICES
                  value: chi-dc1-cluster1-0-0
              command:
                - bash
                - -ec
                - CLICKHOUSE_SERVICES=$(echo $CLICKHOUSE_SERVICES | tr "," " ");
                  BACKUP_DATE=$(date +%Y-%m-%d-%H-%M-%S);
                  DIFF_FROM[$SERVER]="";
                  declare -A BACKUP_NAMES;

                  apk add curl jq;

                  for SERVER in $CLICKHOUSE_SERVICES; do
                    BACKUP_NAMES[$SERVER]="full-$BACKUP_DATE";
                    echo "set backup name on $SERVER = ${BACKUP_NAMES[$SERVER]}";
                  done;
                  for SERVER in $CLICKHOUSE_SERVICES; do
                    echo "create ${BACKUP_NAMES[$SERVER]} on $SERVER";
                    curl -s "$SERVER:7171/backup/create?rbac=true&name=${SERVER}-${BACKUP_NAMES[$SERVER]}" -X POST;
                  done;
                  for SERVER in $CLICKHOUSE_SERVICES; do
                    while [[ "in progress" == $(curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"create --rbac ${SERVER}-${BACKUP_NAMES[$SERVER]}\") | .status") ]]; do
                      echo "still in progress ${BACKUP_NAMES[$SERVER]} on $SERVER";
                      sleep 1;
                    done;
                    if [[ "success" != $(curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"create --rbac ${SERVER}-${BACKUP_NAMES[$SERVER]}\") | .status") ]]; then
                      echo "error create --rbac ${BACKUP_NAMES[$SERVER]} on $SERVER";
                      curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"create --rbac ${SERVER}-${BACKUP_NAMES[$SERVER]}\") | .error";
                      exit 1;
                    fi;
                  done;
                  for SERVER in $CLICKHOUSE_SERVICES; do
                    echo "upload ${DIFF_FROM[$SERVER]} ${BACKUP_NAMES[$SERVER]} on $SERVER";
                    curl -s "$SERVER:7171/backup/upload/${SERVER}-${BACKUP_NAMES[$SERVER]}?${DIFF_FROM[$SERVER]}" -X POST;
                  done;
                  if [[ ${DIFF_FROM[$SERVER]} == "" ]]; then
                    UPLOAD_COMMAND="upload ${SERVER}-${BACKUP_NAMES[$SERVER]}";
                  else
                    UPLOAD_COMMAND="upload ${DIFF_FROM[$SERVER]} ${SERVER}-${BACKUP_NAMES[$SERVER]}";
                  fi;
                  for SERVER in $CLICKHOUSE_SERVICES; do
                    while [[ "in progress" == $(curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"${UPLOAD_COMMAND}\") | .status") ]]; do
                      echo "upload still in progress ${BACKUP_NAMES[$SERVER]} on $SERVER";
                      sleep 5;
                    done;
                    if [[ "success" != $(curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"${UPLOAD_COMMAND}\") | .status") ]]; then
                      echo "error ${BACKUP_NAMES[$SERVER]} on $SERVER";
                      curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"${UPLOAD_COMMAND}\") | .error";
                      exit 1;
                    fi;
                    curl -s "$SERVER:7171/backup/delete/local/${SERVER}-${BACKUP_NAMES[$SERVER]}" -X POST | jq .;
                  done;
                  echo "BACKUP CREATED"
  • Create a Kubernetes cronjob:
kubectl apply -f cronjob.yaml -n clickhouse

That's it. Now we can sleep peacefully :)

How to restore a cluster

Let's consider the recovery process assuming that we've lost the entire cluster. 

These are steps to restore it:

1. Get a list of backups and copy the part of a path after shard-0:

gsutil ls gs://clickhouse-backups-dev/backup/shard-0

2. Create a manifest (job.yaml) with the next content to deploy a Kuberentes Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: clickhouse-restore
spec:
  backoffLimit: 1
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: clickhouse-restore
    spec:
      restartPolicy: Never
      containers:
        - name: run-restore
          image: bash
          imagePullPolicy: IfNotPresent
          env:
            - name: BACKUP_NAME
              value: BACKUP_NAME # Set backup_name that we need to restore
            # use all replicas in each shard to restore schema
            - name: CLICKHOUSE_SCHEMA_RESTORE_SERVICES
              value: chi-dc1-cluster1-0-0,chi-dc1-cluster1-0-1
            # use only first replica in each shard to restore data
            - name: CLICKHOUSE_DATA_RESTORE_SERVICES
              value: chi-dc1-cluster1-0-0
          command:
            - bash
            - -ec
            - |
              declare -A BACKUP_NAMES;
              CLICKHOUSE_SCHEMA_RESTORE_SERVICES=$(echo $CLICKHOUSE_SCHEMA_RESTORE_SERVICES | tr "," " ");
              CLICKHOUSE_DATA_RESTORE_SERVICES=$(echo $CLICKHOUSE_DATA_RESTORE_SERVICES | tr "," " ");
              #Install extra components
              apk add curl jq;

              for SERVER in $CLICKHOUSE_SCHEMA_RESTORE_SERVICES; do
                BACKUP_NAMES[$SERVER]="${BACKUP_NAME}";
                curl -s "$SERVER:7171/backup/download/${BACKUP_NAMES[$SERVER]}?schema" -X POST | jq .;
              done;

              for SERVER in $CLICKHOUSE_SCHEMA_RESTORE_SERVICES; do
                while [[ "in progress" == $(curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"download --schema ${BACKUP_NAMES[$SERVER]}\") | .status") ]]; do
                  echo "Download is still in progress ${BACKUP_NAMES[$SERVER]} on $SERVER";
                  sleep 1;
                done;
                if [[ "success" != $(curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"download --schema ${BACKUP_NAMES[$SERVER]}\") | .status") ]]; then
                  echo "error download --schema ${BACKUP_NAMES[$SERVER]} on $SERVER";
                  curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"download --schema ${BACKUP_NAMES[$SERVER]}\") | .error";
                  exit 1;
                fi;
              done;

              for SERVER in $CLICKHOUSE_SCHEMA_RESTORE_SERVICES; do
                curl -s "$SERVER:7171/backup/restore/${BACKUP_NAMES[$SERVER]}?schema&rm&rbac" -X POST | jq .;
              done;

              for SERVER in $CLICKHOUSE_SCHEMA_RESTORE_SERVICES; do
                while [[ "in progress" == $(curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"restore --schema --rm --rbac ${BACKUP_NAMES[$SERVER]}\") | .status") ]]; do
                  echo "Restore is still in progress ${BACKUP_NAMES[$SERVER]} on $SERVER";
                  sleep 1;
                done;
                if [[ "success" != $(curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"restore --schema --rm --rbac ${BACKUP_NAMES[$SERVER]}\") | .status") ]]; then
                  echo "error restore --schema --rm --rbac ${BACKUP_NAMES[$SERVER]} on $SERVER";
                  curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"restore --schema --rm --rbac ${BACKUP_NAMES[$SERVER]}\") | .error";
                  exit 1;
                fi;
                curl -s "$SERVER:7171/backup/delete/local/${BACKUP_NAMES[$SERVER]}" -X POST | jq .;
              done;

              for SERVER in $CLICKHOUSE_DATA_RESTORE_SERVICES; do
                BACKUP_NAMES[$SERVER]="${BACKUP_NAME}";
                curl -s "$SERVER:7171/backup/download/${BACKUP_NAMES[$SERVER]}" -X POST | jq .;
              done;

              for SERVER in $CLICKHOUSE_DATA_RESTORE_SERVICES; do
                while [[ "in progress" == $(curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"download ${BACKUP_NAMES[$SERVER]}\") | .status") ]]; do
                  echo "Download is still in progress ${BACKUP_NAMES[$SERVER]} on $SERVER";
                  sleep 1;
                done;
                if [[ "success" != $(curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"download ${BACKUP_NAMES[$SERVER]}\") | .status") ]]; then
                  echo "error download ${BACKUP_NAMES[$SERVER]} on $SERVER";
                  curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"download ${BACKUP_NAMES[$SERVER]}\") | .error";
                  exit 1;
                fi;
              done;

              for SERVER in $CLICKHOUSE_DATA_RESTORE_SERVICES; do
                curl -s "$SERVER:7171/backup/restore/${BACKUP_NAMES[$SERVER]}?data" -X POST | jq .;
              done;

              for SERVER in $CLICKHOUSE_DATA_RESTORE_SERVICES; do
                while [[ "in progress" == $(curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"restore --data ${BACKUP_NAMES[$SERVER]}\") | .status") ]]; do
                  echo "Restore is still in progress ${BACKUP_NAMES[$SERVER]} on $SERVER";
                  sleep 1;
                done;
                if [[ "success" != $(curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"restore --data ${BACKUP_NAMES[$SERVER]}\") | .status") ]]; then
                  echo "error restore --data ${BACKUP_NAMES[$SERVER]} on $SERVER";
                  curl -s "$SERVER:7171/backup/actions" | jq -r ". | select(.command == \"restore --data ${BACKUP_NAMES[$SERVER]}\") | .error";
                  exit 1;
                fi;
                curl -s "$SERVER:7171/backup/delete/local/${BACKUP_NAMES[$SERVER]}" -X POST | jq .;
              done;

              echo "RESTORE FINISHED"
  • Please review and set correct values for the BACKUP_NAME, CLICKHOUSE_SCHEMA_RESTORE_SERVICES, CLICKHOUSE_DATA_RESTORE_SERVICES environment variables.

3. Delete a clickhouseInstallation and a clickhouse-keeper statefulset:

kubectl delete clickhouseInstallation dc1 -n clickhouse
kubectl delete statefulset clickhouse-keeper -n clickhouse
kubectl delete pvc -l app=clickhouse-keeper -n clickhouse

4. Deploy a clickhouseInstallation resource (custom resource you used to deploy the Clickhouse-cluster into the kubernetes).

5. Apply the job.yaml manifest to start restoring process.

kubectl apply -f job.yaml -n clickhouse

6. Wait for ClickHouse cluster's pods are Up and Running.

7. If you had manually created ClickHouse users, then restart (delete) ClickHouse pods one-by-one to run a process of restoring Users/Permissions. 

Conclusion

In summary, understanding how to back up and restore your ClickHouse cluster is crucial for protecting your data and ensuring your system can recover from any disasters. With the user-friendly clickhouse-backup tool, we've covered a step-by-step guide on effectively securing your cluster's information. By choosing Google Cloud Storage (GCS) as a reliable storage solution, you not only boost the safety of your backups but also make sure you can easily access important data when necessary.