
Cluster @%#’d – How to Recover a Broken Kubernetes Cluster

Kubernetes Tutorial | July 28, 2017

Kubernetes deployments have 3 distinct types of nodes: master nodes, ETCD nodes, and worker nodes. In high availability (HA) setups, all of these node types are replicated. Failures of individual nodes will not cause catastrophic consequences, but you need to get your cluster healthy as quickly as possible to prevent further failures.

Recovering from events like this can be extremely difficult. Not only are your nodes (and production software) down, but you must expend IT and engineering effort to fix them. What’s worse, there aren’t any comprehensive guides to get your cluster back up and running. Until now.

An ETCD Node has Failed

How to tell?

The kubectl command has a resource, Component Statuses, that will show the health of ETCD:
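For example (the output below is illustrative; member names and addresses will differ in your cluster), an unhealthy ETCD member shows up with an error instead of ok:

    kubectl get componentstatuses

    NAME                 STATUS      MESSAGE              ERROR
    controller-manager   Healthy     ok
    scheduler            Healthy     ok
    etcd-0               Healthy     {"health": "true"}
    etcd-1               Unhealthy   Get http://10.0.0.3:2379/health: dial tcp 10.0.0.3:2379: connection refused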

HA Recovery Steps

If Kubernetes was properly configured in HA mode, then the cluster should be able to handle losing a single ETCD node. Create a new node to replace the failed ETCD node. Record the IP Address of the new node, but do not start ETCD yet.

On one of the working ETCD Nodes, remove the failed ETCD node from the cluster and add the IP Address of the new node:
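A sketch of those steps with etcdctl (the member ID, the etcd-new name, and the 10.0.0.4 address are placeholders for your own values):

    # On a healthy ETCD node: find the ID of the failed member
    etcdctl member list
    # Remove the failed member by its ID
    etcdctl member remove 8211f1d0f64f3269
    # Register the replacement node's peer URL
    etcdctl member add etcd-new http://10.0.0.4:2380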

Configure the new ETCD node to connect to the existing cluster:
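A minimal sketch, assuming ETCD is configured through environment variables (the name, IPs, and data directory are placeholders). The key detail is ETCD_INITIAL_CLUSTER_STATE=existing, which tells the new node to join the running cluster instead of bootstrapping a new one:

    ETCD_NAME=etcd-new
    ETCD_DATA_DIR=/var/lib/etcd
    ETCD_LISTEN_PEER_URLS=http://10.0.0.4:2380
    ETCD_LISTEN_CLIENT_URLS=http://10.0.0.4:2379
    ETCD_INITIAL_ADVERTISE_PEER_URLS=http://10.0.0.4:2380
    ETCD_ADVERTISE_CLIENT_URLS=http://10.0.0.4:2379
    ETCD_INITIAL_CLUSTER=etcd-0=http://10.0.0.2:2380,etcd-1=http://10.0.0.3:2380,etcd-new=http://10.0.0.4:2380
    ETCD_INITIAL_CLUSTER_STATE=existing

Once these are in place, start the ETCD service on the new node and confirm it has joined with etcdctl cluster-health.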

Finally, log in to each Kubernetes master and update the kube-apiserver component’s --etcd-servers= option to point to the new ETCD node.
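For example, if the new ETCD node is 10.0.0.4 (a placeholder), the flag would end up looking something like:

    --etcd-servers=http://10.0.0.2:2379,http://10.0.0.3:2379,http://10.0.0.4:2379

Restart kube-apiserver after the change so it picks up the new endpoint list.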

Non-HA Recovery Steps

If Kubernetes was not running in HA mode and the only ETCD node has failed, the cluster will be down.

If you still have access to the disk or a snapshot, attempt to recover the ETCD data directory. Typically this is the /var/lib/etcd directory. Build a new ETCD node with the recovered data and the same flags as the failed ETCD node, except this time set ETCD_INITIAL_CLUSTER_STATE=existing.
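A hedged sketch of the relevant settings on the rebuilt node (the path shown is the typical default, not necessarily yours):

    ETCD_DATA_DIR=/var/lib/etcd          # restored from the failed node's disk or snapshot
    ETCD_INITIAL_CLUSTER_STATE=existing  # rejoin with the recovered data rather than re-initialize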

Log in to the Kubernetes master and update the kube-apiserver component’s --etcd-servers= option to point to the new ETCD node.

A Kubernetes Master Node has Failed

How to tell?

The kubectl get nodes command will report master nodes that are unreachable as NotReady:
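Illustrative output (node names, ages, and versions are placeholders):

    kubectl get nodes

    NAME       STATUS     AGE       VERSION
    master-1   NotReady   90d       v1.7.2
    master-2   Ready      90d       v1.7.2
    node-1     Ready      90d       v1.7.2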

HA Recovery Steps

Create a new Kubernetes node and join it to the working cluster. Add the appropriate master labels and taints, either in the kubelet configuration or by using kubectl:
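For example, using the label and taint convention that kubeadm-provisioned clusters used at the time (new-master-1 is a placeholder node name):

    kubectl label nodes new-master-1 node-role.kubernetes.io/master=
    kubectl taint nodes new-master-1 node-role.kubernetes.io/master=:NoSchedule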

Log in to one of the functioning Kubernetes masters and copy the configuration to the new Kubernetes master. Usually this can be accomplished by copying the /etc/kubernetes directory from a working master to the new master.
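One way to do that, assuming SSH access between the masters (hostnames are placeholders):

    scp -r /etc/kubernetes root@new-master-1:/etc/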

Certificates, kube-apiserver, kube-controller-manager, and kube-scheduler configurations should be copied to the new master. If they are defined as system services, ensure that all services have started properly. If they are run using kubelet manifests in the /etc/kubernetes/manifests directory, restart the kubelet service and use the docker ps -a and docker logs commands to ensure that the services have started properly.
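For example, to spot-check a static-pod component (the container name and ID are placeholders):

    docker ps -a | grep kube-apiserver
    docker logs <container-id>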

Once the new master is working, remove the failed master:
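For example (failed-master-1 is a placeholder):

    kubectl delete node failed-master-1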

Update your Kubernetes API load balancer by removing the IP Address of the failed master node and adding the IP Address of the new master node.

Non-HA Recovery Steps

If Kubernetes was not running in HA mode and the only Kubernetes master node has failed, the cluster will be down.

If you still have access to the disk or a snapshot, attempt to recover the Kubernetes configuration directory containing the original Kubernetes master certificates. Often this is the /etc/kubernetes directory.

Assuming that ETCD is still intact, create a new Kubernetes master pointing to the existing ETCD cluster. Use the recovered certificates on the new master. Ensure that kube-apiserver, kube-controller-manager, and kube-scheduler are running on the new master.

Log in to each worker node and update the kubelet configuration to point to the new Kubernetes master. Often this file is found at /etc/kubernetes/kubelet.conf. Restart the kubelet service after making this change.
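A sketch of that change, assuming a systemd-managed kubelet (the address is a placeholder for your new master’s API endpoint):

    # In /etc/kubernetes/kubelet.conf, update the server: line, e.g.:
    #   server: https://10.0.0.10:6443
    systemctl restart kubelet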

A Kubernetes Worker Node has Failed

How to tell?

Just like detecting a Kubernetes master failure, the kubectl get nodes command will report worker nodes that are unreachable as NotReady:
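Illustrative output (names are placeholders):

    kubectl get nodes

    NAME       STATUS     AGE       VERSION
    master-1   Ready      90d       v1.7.2
    node-1     NotReady   90d       v1.7.2
    node-2     Ready      90d       v1.7.2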

Recovery Steps

Kubernetes will automatically reschedule failed pods onto other nodes in the cluster. Create a new worker to replace the failed node and join it to the Kubernetes cluster. Once the new worker is working, remove the failed worker:
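For example (failed-node-1 is a placeholder):

    kubectl delete node failed-node-1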

Wrapping Up: Deploying to Kubernetes

Fixing clusters is one issue, but how about actually getting your application deployed to Kubernetes? Codefresh makes it very easy to deploy applications to Kubernetes clusters. Codefresh supports any Kubernetes cluster, whether it’s on Google Container Engine, AWS, Azure, Rackspace, IBM Bluemix, or even your own datacenter.

Try it out.

