Cluster @%#’d – How to Recover a Broken Kubernetes Cluster

Kubernetes deployments have three distinct types of nodes: master nodes, ETCD nodes, and worker nodes. In high availability (HA) setups, each of these node types is replicated, so the failure of an individual node is not catastrophic. You still need to get your cluster healthy as quickly as possible, though, before additional failures cost you quorum or capacity.

Recovering from events like this can be extremely difficult. Not only are your nodes (and production software) down, but you must expend IT and engineering effort to fix them. What’s worse, there aren’t any comprehensive guides for getting your cluster back up and running. Until now.

An ETCD Node has Failed

How to tell?

The kubectl command has a resource, Component Statuses (cs for short), that shows the health of ETCD:

$ kubectl get cs
NAME                 STATUS      MESSAGE
scheduler            Healthy     ok                                                               
controller-manager   Healthy     ok                                                               
etcd-2               Healthy     {"health": "true"}                                                               
etcd-0               Healthy     {"health": "true"}                                                               
etcd-1               Unhealthy   Client.Timeout exceeded while awaiting headers

HA Recovery Steps

If Kubernetes was properly configured in HA mode, then the cluster should be able to handle losing a single ETCD node. Create a new node to replace the failed ETCD node. Record the IP Address of the new node, but do not start ETCD yet.

On one of the working ETCD Nodes, remove the failed ETCD node from the cluster and add the IP Address of the new node:

$ etcdctl --endpoints=http://127.0.0.1:2379 member list
10b576500ed3ae71: name=kube-etcd-1 peerURLs=https://10.0.0.1:2380 clientURLs=https://10.0.0.1:2379 isLeader=false
30bcb5f2f4c17805: name=kube-etcd-2 peerURLs=https://10.0.0.2:2380 clientURLs=https://10.0.0.2:2379 isLeader=false
a908b0f9f07a7127: name=kube-etcd-3 peerURLs=https://10.0.0.3:2380 clientURLs=https://10.0.0.3:2379 isLeader=true

$ etcdctl --endpoints=http://127.0.0.1:2379 member remove 30bcb5f2f4c17805
Removed member 30bcb5f2f4c17805 from cluster

$ etcdctl member add kube-etcd-4 --peer-urls=https://[new node IP]:2380
Member 2be1eb8f84b7f63e added to cluster ef37ad9dc622a7c4

Configure the new ETCD node to connect to the existing cluster:

export ETCD_NAME="kube-etcd-4"
export ETCD_INITIAL_CLUSTER="kube-etcd-1=https://10.0.0.1:2380,kube-etcd-3=https://10.0.0.3:2380,kube-etcd-4=https://[new node IP]:2380"
export ETCD_INITIAL_CLUSTER_STATE=existing
etcd [flags]
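
If you prefer flags to environment variables, the equivalent invocation might look like the sketch below. The data directory and listen URLs are assumptions and should match what the other members use, and you will also need whatever TLS certificate flags the rest of the cluster was started with:

etcd --name kube-etcd-4 \
  --data-dir /var/lib/etcd \
  --listen-peer-urls https://[new node IP]:2380 \
  --listen-client-urls https://[new node IP]:2379,https://127.0.0.1:2379 \
  --initial-advertise-peer-urls https://[new node IP]:2380 \
  --advertise-client-urls https://[new node IP]:2379 \
  --initial-cluster kube-etcd-1=https://10.0.0.1:2380,kube-etcd-3=https://10.0.0.3:2380,kube-etcd-4=https://[new node IP]:2380 \
  --initial-cluster-state existing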

Finally, log in to each Kubernetes master and update the kube-apiserver component’s --etcd-servers= option to point to the new ETCD node.
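
For example, if kube-apiserver runs as a static pod, the updated flag in its manifest would look something like this (the manifest path shown is a common default, but an assumption here):

$ grep etcd-servers /etc/kubernetes/manifests/kube-apiserver.yaml
    - --etcd-servers=https://10.0.0.1:2379,https://10.0.0.3:2379,https://[new node IP]:2379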

Non-HA Recovery Steps

If Kubernetes was not running in HA mode and the only ETCD node has failed, the cluster will be down.

If you still have access to the disk or a snapshot, attempt to recover the ETCD data directory; typically this is /var/lib/etcd. Build a new ETCD node with the recovered data and the same flags as the failed ETCD node, except this time set ETCD_INITIAL_CLUSTER_STATE=existing.
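
If what you have is an ETCD v3 snapshot rather than the raw data directory, another option is to restore it into a fresh data directory. A minimal sketch, assuming a single-member cluster and a hypothetical snapshot file named backup.db:

$ ETCDCTL_API=3 etcdctl snapshot restore backup.db \
    --name kube-etcd-1 \
    --data-dir /var/lib/etcd \
    --initial-cluster kube-etcd-1=https://10.0.0.1:2380 \
    --initial-advertise-peer-urls https://10.0.0.1:2380

Then start ETCD against the restored data directory.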

Log in to the Kubernetes master and update the kube-apiserver component’s --etcd-servers= option to point to the new ETCD node.

A Kubernetes Master Node has Failed

How to tell?

The kubectl get nodes command will report master nodes that are unreachable as NotReady:

$ kubectl get nodes
NAME            STATUS     AGE       VERSION
kube-master-1   Ready      21h       v1.7.2
kube-master-2   NotReady   20h       v1.7.2
kube-master-3   Ready      20h       v1.7.2
kube-worker-1   Ready      17h       v1.7.2
kube-worker-2   Ready      17h       v1.7.2

HA Recovery Steps

Create a new Kubernetes node and join it to the working cluster. Add the appropriate master labels and taints, either in the kubelet configuration, or by using kubectl:

kubectl label nodes kube-master-4 node-role.kubernetes.io/master=
kubectl taint nodes kube-master-4 node-role.kubernetes.io/master=:NoSchedule

Log in to one of the functioning Kubernetes masters and copy its configuration to the new Kubernetes master. Usually this can be accomplished by copying the /etc/kubernetes directory from a working master to the new master.
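
A minimal sketch of that copy, run from a working master and assuming root SSH access between the nodes (the hostname is taken from the examples above):

$ scp -r /etc/kubernetes root@kube-master-4:/etc/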

Certificates and the kube-apiserver, kube-controller-manager, and kube-scheduler configurations should be copied to the new master. If they are defined as systemd services, ensure that all of the services have started properly. If they run as kubelet manifests in the /etc/kubernetes/manifests directory, restart the kubelet service and use the docker ps -a and docker logs commands to confirm that the services have started properly.
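
For the manifest-based case, a quick verification pass might look like this; the name filter is an assumption about how your containers are named, and the container ID is a placeholder:

$ systemctl restart kubelet
$ docker ps -a --filter name=kube-apiserver
$ docker logs [container ID]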

Once the new master is working, remove the failed master:

$ kubectl delete nodes kube-master-2
node "kube-master-2" deleted

Update your Kubernetes API load balancer by removing the IP Address of the failed master node and adding the IP Address of the new master node.
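
As an illustration only, if the load balancer were HAProxy (an assumption; yours may be a cloud load balancer or nginx), the change would be a one-line swap in the backend, with all addresses below being placeholders:

backend kube-apiserver
    mode tcp
    server kube-master-1 10.0.1.1:6443 check
    # server kube-master-2 10.0.1.2:6443 check   <- failed master, removed
    server kube-master-3 10.0.1.3:6443 check
    server kube-master-4 [new master IP]:6443 check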

Non-HA Recovery Steps

If Kubernetes was not running in HA mode and the only Kubernetes master node has failed, the cluster will be down.

If you still have access to the disk or a snapshot, attempt to recover the Kubernetes configuration directory containing the original Kubernetes master certificates. Oftentimes this is the /etc/kubernetes directory.

Assuming that ETCD is still intact, create a new Kubernetes master pointing to the existing ETCD cluster. Use the recovered certificates on the new master. Ensure that kube-apiserver, kube-controller-manager, and kube-scheduler are running on the new master.

Log in to each worker node and update the kubelet configuration to point to the new Kubernetes master; oftentimes this file is found at /etc/kubernetes/kubelet.conf. Restart the kubelet service after making this change.
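
The relevant setting is the server field in that kubeconfig. A hypothetical excerpt, with the address as a placeholder:

clusters:
- cluster:
    certificate-authority-data: [unchanged]
    server: https://[new master IP]:6443
  name: kubernetes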

A Kubernetes Worker Node has Failed

How to tell?

Just like detecting a Kubernetes master failure, the kubectl get nodes command will report worker nodes that are unreachable as NotReady:

$ kubectl get nodes
NAME            STATUS     AGE       VERSION
kube-master-1   Ready      21h       v1.7.2
kube-master-2   Ready      20h       v1.7.2
kube-master-3   Ready      20h       v1.7.2
kube-worker-1   Ready      17h       v1.7.2
kube-worker-2   NotReady   17h       v1.7.2

Recovery Steps

Kubernetes will automatically reschedule failed pods onto other nodes in the cluster. Create a new worker to replace the failed node and join it to the Kubernetes cluster (a join sketch follows the delete command below). Once the new worker is up and Ready, remove the failed worker:

$ kubectl delete nodes kube-worker-2
node "kube-worker-2" deleted

Wrapping Up: Deploying to Kubernetes

Fixing clusters is one issue, but how about actually getting your application deployed to Kubernetes? Codefresh makes it very easy to deploy applications to Kubernetes clusters. Codefresh supports any Kubernetes cluster, whether it’s on Google Container Engine, AWS, Azure, Rackspace, IBM Bluemix, or even your own datacenter.

Try it out.
