
Replacing a Failed Master Host on OCP 4.3.x

This procedure assumes that the cluster still has an etcd quorum.
If you have lost the majority of your master hosts, and with them etcd quorum, follow the disaster recovery procedure for lost master hosts instead of this one.
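
As a quick sanity check on the quorum assumption, the arithmetic can be sketched in shell. etcd needs floor(N/2) + 1 members for quorum; the helper names quorum_size and has_quorum below are made up for illustration and are not part of any OCP tooling.

```shell
#!/bin/sh
# Quorum for an etcd cluster of N members is floor(N/2) + 1.
# $1 = total number of members
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit 0) if the healthy members still form a quorum.
# $1 = total members, $2 = healthy members
has_quorum() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}

# Our case: 3 masters, 1 failed, so 2 healthy members remain.
if has_quorum 3 2; then
  echo "quorum intact: replace the failed member"
else
  echo "quorum lost: use the disaster recovery procedure"
fi
```

With three masters, losing one still leaves quorum; losing two does not, and this procedure no longer applies.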


To replace a Single Master Host:
– Remove the member from the etcd cluster
– Add the member back 

Here we have three master nodes, etcd-[0-2].ocp4.ocp.abip, and we are going to replace the etcd-2.ocp4.ocp.abip node.
Let’s assume this node has failed.

etcd-0.ocp4.ocp.abip   192.168.24.51
etcd-1.ocp4.ocp.abip   192.168.24.52
etcd-2.ocp4.ocp.abip   192.168.24.53

Removing a Failed Master Host from the etcd Cluster
Prerequisites:
– Access to the cluster as cluster-admin role
– SSH Access to an Active Master Host. We’ll perform the activities from etcd-1.ocp4.ocp.abip node.

Procedures:
1. Access an Active Master Host
2. View the list of Pods with etcd

[root@bastion ~]# ssh [email protected]

[core@etcd-1 ~]$ oc login -u admin #
The server uses a certificate signed by an unknown authority.
You can bypass the certificate check, but any data you send to the server could be intercepted by others.
Use insecure connections? (y/n): y

[core@etcd-1 ~]$ oc get pods -n openshift-etcd
NAME                               READY   STATUS    RESTARTS   AGE
etcd-member-etcd-0.ocp4.ocp.abip   2/2     Running   62         22d
etcd-member-etcd-1.ocp4.ocp.abip   2/2     Running   57         22d
etcd-member-etcd-2.ocp4.ocp.abip   2/2     Running   59         22d
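
When there are many pods, the listing can be filtered instead of read by eye. The following is a minimal sketch, assuming the default column layout shown above (NAME, READY, STATUS, ...); the function name unhealthy_pods is hypothetical.

```shell
#!/bin/sh
# Reads `oc get pods` table output on stdin and prints the name of every pod
# whose READY count is incomplete or whose STATUS is not Running.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")                     # READY looks like "2/2"
    if (ready[1] != ready[2] || $3 != "Running") print $1
  }'
}

# Usage on a master host (not executed here):
#   oc get pods -n openshift-etcd | unhealthy_pods
```

An empty result means every listed pod reports Running with all containers ready. In this walkthrough the failure is only simulated, so the pod for etcd-2 still shows as Running.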

3. Remove the Failed Master Host, etcd-2.ocp4.ocp.abip.
The problem in a restricted-network OCP cluster is that etcd-member-remove.sh tries to download etcdctl from the internet, which is not reachable. (Please refer to the link provided at the end of this blog.)
We need to modify the script, as we did when backing up the etcd data:
– Find the etcdctl binary
– Copy it somewhere, e.g. /root/etcdctl
– Modify the script to disable the dl_etcdctl function and point the ETCDCTL environment variable to /root/etcdctl (the modified copy used below is saved as etcd-member-remove-disconnected.sh)
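
The modification can be sketched roughly as follows. This is only an illustration under assumptions: that the script calls dl_etcdctl on a line of its own, and that it later uses an ETCDCTL variable for the binary path. The function name make_disconnected_copy is made up here; check the result against your actual script before running it.

```shell
#!/bin/sh
# Create a patched copy of an etcd helper script for disconnected use.
# $1 = original script, $2 = patched copy, $3 = path to a local etcdctl binary
# Assumption: the download is triggered by a line containing only "dl_etcdctl".
make_disconnected_copy() {
  # Replace the bare dl_etcdctl call with an ETCDCTL assignment; the function
  # definition line "dl_etcdctl() {" does not match and is left untouched.
  sed -E "s|^([[:space:]]*)dl_etcdctl[[:space:]]*\$|\1ETCDCTL=$3  # dl_etcdctl call disabled|" "$1" > "$2" &&
    chmod +x "$2"
}

# Usage on a master host (not executed here):
#   make_disconnected_copy /usr/local/bin/etcd-member-remove.sh \
#       /tmp/etcd-member-remove-disconnected.sh /root/etcdctl
```

Working on a copy keeps the original script untouched, which also makes it easier to clean up afterwards.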

[core@etcd-1 ~]$ which etcd-member-remove.sh
/usr/local/bin/etcd-member-remove.sh

[core@etcd-1 ~]$ sudo -E /usr/local/bin/etcd-member-remove-disconnected.sh etcd-member-etcd-2.ocp4.ocp.abip
Trying to backup etcd client certs..
etcd client certs already backed up and available ./assets/backup/
Member d4d8cf3147795936 removed from cluster 46efcf9423373cdf
etcd member etcd-member-etcd-2.ocp4.ocp.abip with d4d8cf3147795936 successfully removed..

4. Verify that the etcd member has been successfully removed from the cluster:

[core@etcd-1 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{print $1}')

[core@etcd-1 ~]$ sudo crictl exec -it $id /bin/sh
sh-4.2#

sh-4.2# export ETCDCTL_API=3
sh-4.2# export ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt
sh-4.2# export ETCDCTL_CERT=$(find /etc/ssl/ -name '*peer*crt')
sh-4.2# export ETCDCTL_KEY=$(find /etc/ssl/ -name '*peer*key')


sh-4.2# etcdctl member list -w table
+------------------+---------+----------------------------------+-----------------------------------+----------------------------+
|        ID        | STATUS  |               NAME               |            PEER ADDRS             |        CLIENT ADDRS        |
+------------------+---------+----------------------------------+-----------------------------------+----------------------------+
| 7122dcf57e681d7d | started | etcd-member-etcd-0.ocp4.ocp.abip | # | # |
| abcc869a529d85cb | started | etcd-member-etcd-1.ocp4.ocp.abip | # | # |
+------------------+---------+----------------------------------+-----------------------------------+----------------------------+
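
The same check can be scripted rather than read off the table. A minimal sketch, assuming the table layout shown above with NAME as the third column; the function name member_listed is hypothetical.

```shell
#!/bin/sh
# Reads `etcdctl member list -w table` output on stdin.
# Succeeds (exit 0) if a member whose NAME equals $1 is listed.
member_listed() {
  awk -F'|' -v name="$1" '
    { gsub(/^[ \t]+|[ \t]+$/, "", $4) }   # field 4 is NAME; trim the padding
    $4 == name { found = 1 }
    END { exit !found }
  '
}

# Usage inside the etcd-member container (not executed here):
#   etcdctl member list -w table | member_listed etcd-member-etcd-2.ocp4.ocp.abip \
#     && echo "still a member" || echo "removed"
```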

Adding a Master Host Back to the etcd Cluster
Prerequisites:
– Access to the cluster as cluster-admin role
– SSH Access to the Master Host to Add to the etcd Cluster (the one we removed, etcd-2.ocp4.ocp.abip)
– The IP Address of an Existing Active etcd Member
– In a restricted environment, the etcd-member-add.sh and etcd-snapshot-backup.sh scripts must be modified as we did before (please refer to the link provided at the end of this blog)

1. Access the Master Host to Add to the etcd Cluster

[root@bastion ~]# ssh [email protected]

2. Run the etcd-member-add.sh script and pass in two parameters:
– IP Address of an existing etcd member: 192.168.24.52
– The name of the etcd member to add: etcd-member-etcd-2.ocp4.ocp.abip

[core@etcd-2 ~]$ sudo -E /usr/local/bin/etcd-member-add-disconnected.sh 192.168.24.52 etcd-member-etcd-2.ocp4.ocp.abip
etcd-member.yaml found in ./assets/backup/
etcd.conf backup upready exists ./assets/backup/etcd.conf
Trying to backup etcd client certs..
etcd client certs already backed up and available ./assets/backup/
Stopping etcd..
etcd data-dir backup found ./assets/backup/etcd..
Updating etcd membership..
Member 7f77e67d2bf8334b added to cluster 46efcf9423373cdf

ETCD_NAME="etcd-member-etcd-2.ocp4.ocp.abip"
ETCD_INITIAL_CLUSTER="etcd-member-etcd-0.ocp4.ocp.abip=https://etcd-0.ocp4.ocp.abip:2380,etcd-member-etcd-2.ocp4.ocp.abip=https://etcd-2.ocp4.ocp.abip:2380,etcd-member-etcd-1.ocp4.ocp.abip=https://etcd-1.ocp4.ocp.abip:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://etcd-2.ocp4.ocp.abip:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
Starting etcd..
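
The ETCD_INITIAL_CLUSTER value printed above is a comma-separated list of name=peer-URL pairs, and it should now include the re-added member. A small sketch for listing the names it contains; the function name initial_cluster_members is made up for this example.

```shell
#!/bin/sh
# Print one member name per line from an ETCD_INITIAL_CLUSTER value,
# which has the form "name1=url1,name2=url2,...".
# $1 = the ETCD_INITIAL_CLUSTER string
initial_cluster_members() {
  printf '%s\n' "$1" | tr ',' '\n' | cut -d= -f1
}

# Usage (not executed here):
#   initial_cluster_members "$ETCD_INITIAL_CLUSTER" | grep -Fx etcd-member-etcd-2.ocp4.ocp.abip
```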

3. Verify that the new member is in the list of Pods associated with etcd and that its status is Running

[core@etcd-1 ~]$ oc get pods -n openshift-etcd
NAME                               READY   STATUS    RESTARTS   AGE
etcd-member-etcd-0.ocp4.ocp.abip   2/2     Running   62         22d
etcd-member-etcd-1.ocp4.ocp.abip   2/2     Running   57         22d
etcd-member-etcd-2.ocp4.ocp.abip   2/2     Running   0          69s

4. Verify that the etcd member has been successfully added to the etcd cluster, and the new member is healthy:

[core@etcd-1 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{print $1}')

[core@etcd-1 ~]$ sudo crictl exec -it $id /bin/sh
sh-4.2#

sh-4.2# export ETCDCTL_API=3
sh-4.2# export ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt
sh-4.2# export ETCDCTL_CERT=$(find /etc/ssl/ -name '*peer*crt')
sh-4.2# export ETCDCTL_KEY=$(find /etc/ssl/ -name '*peer*key')

sh-4.2# etcdctl member list -w table
+------------------+---------+----------------------------------+-----------------------------------+----------------------------+
|        ID        | STATUS  |               NAME               |            PEER ADDRS             |        CLIENT ADDRS        |
+------------------+---------+----------------------------------+-----------------------------------+----------------------------+
| 7122dcf57e681d7d | started | etcd-member-etcd-0.ocp4.ocp.abip | # | # |
| 7f77e67d2bf8334b | started | etcd-member-etcd-2.ocp4.ocp.abip | # | # |
| abcc869a529d85cb | started | etcd-member-etcd-1.ocp4.ocp.abip | # | # |
+------------------+---------+----------------------------------+-----------------------------------+----------------------------+

sh-4.2# etcdctl endpoint health --cluster
# is healthy: successfully committed proposal: took = 39.875839ms
# is healthy: successfully committed proposal: took = 51.685488ms
# is healthy: successfully committed proposal: took = 61.023569ms
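
To turn the health output into a pass/fail result, the lines can be counted. A minimal sketch, assuming one "is healthy" or "is unhealthy" line per endpoint as in the output above; the function name all_endpoints_healthy is hypothetical.

```shell
#!/bin/sh
# Reads `etcdctl endpoint health --cluster` output on stdin.
# Succeeds (exit 0) if at least $1 endpoints report healthy and none report unhealthy.
all_endpoints_healthy() {
  awk -v want="$1" '
    /is healthy/   { ok++ }
    /is unhealthy/ { bad++ }
    END { exit !(ok >= want && bad == 0) }
  '
}

# Usage inside the etcd-member container (not executed here):
#   etcdctl endpoint health --cluster 2>&1 | all_endpoints_healthy 3 && echo "cluster healthy"
```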

PS:
We need to revert the changes made to the etcd-* scripts; otherwise the machine-config operator goes into a DEGRADED state because of the file mismatch. To verify, run oc describe pods -n openshift-machine-config-operator machine-config-daemon-XXX for the nodes where we modified the scripts.
To fix the DEGRADED state, we need to delete the problematic pods.
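
As an additional check that is not part of the original steps, the machine config pools can also be inspected. The sketch below assumes that oc get machineconfigpool prints a DEGRADED column holding True or False; the function name degraded_pools is made up for illustration.

```shell
#!/bin/sh
# Reads `oc get machineconfigpool` table output on stdin and prints the name
# of every pool whose DEGRADED column is True. The column is located by its
# header, so extra columns in other OCP versions do not matter.
degraded_pools() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "DEGRADED") col = i; next }
    col && $col == "True" { print $1 }
  '
}

# Usage (not executed here):
#   oc get machineconfigpool | degraded_pools
```

An empty result suggests that no pool is degraded after the scripts have been reverted.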

Note:
– For OCP nodes connected through a proxy, we might need to add the HTTP(S)_PROXY environment variables to the script.
– For OCP 4.3.5 and later, you might not need to modify the backup script.
– Please refer to the link below to modify the scripts for a restricted environment.
Perform etcd Backup for Restricted Environment on OCP 4.3.x




Platform Consultant at Red Hat, Oracle Engineered Systems Specialist
