open wal error: wal: file not found in etcd

 

The pod was showing as crashed on https://test-cloud.test.com:8443/console/project/kube-system/browse/pods

etcd had crashed in OpenShift with the error "open wal error: wal: file not found".
This error is literal; the general causes are:

1.) on-disk data corruption
2.) an invalid restore process

Etcd is an open-source distributed key-value store created by the CoreOS team, now managed by the Cloud Native Computing Foundation. It is pronounced “et-cee-dee”, making reference to distributing the Unix “/etc” directory, where most global configuration files live, across multiple machines. It serves as the backbone of many distributed systems, providing a reliable way for storing data across a cluster of servers. It works on a variety of operating systems including Linux, BSD and OS X.
Etcd has the following properties:
  • Fully Replicated: The entire store is available on every node in the cluster
  • Highly Available: Etcd is designed to avoid single points of failure in case of hardware or network issues
  • Consistent: Every read returns the most recent write across multiple hosts
  • Simple: Includes a well-defined, user-facing API (gRPC)
  • Secure: Implements automatic TLS with optional client certificate authentication
  • Fast: Benchmarked at 10,000 writes per second
  • Reliable: The store is properly distributed using the Raft algorithm

How Does Etcd Work?

To understand how Etcd works, it is important to define three key concepts: leaders, elections, and terms. In a Raft-based system, the cluster holds an election to choose a leader for a given term.
Leaders handle all client requests which need cluster consensus. Requests not requiring consensus, like reads, can be processed by any cluster member. Leaders are responsible for accepting new changes, replicating the information to follower nodes, and then committing the changes once the followers verify receipt. Each cluster can only have one leader at any given time.
If a leader dies, or is no longer responsive, the rest of the nodes will begin a new election after a predetermined timeout to select a new leader. Each node maintains a randomized election timer that represents the amount of time the node will wait before calling for a new election and selecting itself as a candidate.
If the node does not hear from the leader before a timeout occurs, the node begins a new election by starting a new term, marking itself as a candidate, and asking for votes from the other nodes. Each node votes for the first candidate that requests its vote. If a candidate receives a vote from the majority of the nodes in the cluster, it becomes the new leader. Since the election timeout differs on each node, the first candidate often becomes the new leader. However, if multiple candidates exist and receive the same number of votes, the existing election term will end without a leader and a new term will begin with new randomized election timers.
As mentioned above, any changes must be directed to the leader node. Rather than accepting and committing the change immediately, Etcd uses the Raft algorithm to ensure that the majority of nodes all agree on the change. The leader sends the proposed new value to each node in the cluster. The nodes then send a message confirming receipt of the new value. If the majority of nodes confirm receipt, the leader commits the new value and messages each node that the value is committed to the log. This means that each change requires a quorum from the cluster nodes in order to be committed.

Etcd in Kubernetes

Since its adoption as part of Kubernetes in 2014, the Etcd community has grown exponentially. There are many contributing members including CoreOS, Google, Redhat, IBM, Cisco, Huawei and more. Etcd is used successfully in production environments by large cloud providers such as AWS, Google Cloud Platform, and Azure.
Etcd’s job within Kubernetes is to safely store critical data for distributed systems. It’s best known as Kubernetes’ primary datastore used to store its configuration data, state, and metadata. Since Kubernetes usually runs on a cluster of several machines, it is a distributed system that requires a distributed datastore like Etcd.
Etcd makes it easy to store data across a cluster and watch for changes, allowing any node in the Kubernetes cluster to read and write data. Etcd’s watch functionality is used by Kubernetes to monitor changes to either the actual or the desired state of its system. If they are different, Kubernetes makes changes to reconcile the two states. Every read by the kubectl command is retrieved from data stored in Etcd, any change made (kubectl apply) will create or update entries in Etcd, and every crash will trigger value changes in etcd.
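For a rough, hands-on illustration of that read/watch behaviour (not the exact calls the apiserver makes; by default Kubernetes stores its objects under the /registry prefix, and the endpoint and certificate paths below are assumptions):
# list a few Kubernetes keys stored in etcd
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/peer.crt --key=/etc/etcd/peer.key \
  get /registry --prefix --keys-only | head
# watch for changes under the same prefix, the primitive Kubernetes uses to react to state changes
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/peer.crt --key=/etc/etcd/peer.key \
  watch /registry --prefix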

Deployment and Hardware Recommendations

For testing or development purposes, Etcd can run on a laptop or a light cloud setup. However, when running Etcd clusters in production, we should take into consideration the guidelines offered by Etcd’s official documentation. That page offers a good starting point for a robust production deployment. Things to keep in mind:
  • Since Etcd writes data to disk, SSD is highly recommended
  • Always use an odd number of cluster members as quorum is needed to agree on updates to the cluster state
  • For performance reasons, clusters should usually not have more than seven nodes
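The odd-number recommendation follows directly from the quorum arithmetic (quorum = floor(n/2) + 1); adding an even member increases the quorum without increasing failure tolerance:
Cluster size    Quorum    Failure tolerance
1               1         0
3               2         1
5               3         2
7               4         3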

In general, this issue is caused by not properly restoring a failed member/cluster.   


etcd comes with support for incremental runtime reconfiguration, which allows users to update the membership of the cluster at run time.
Reconfiguration requests can only be processed when a majority of cluster members are functioning. It is highly recommended to always have a cluster size greater than two in production. It is unsafe to remove a member from a two member cluster. The majority of a two member cluster is also two. If there is a failure during the removal process, the cluster might not be able to make progress and need to restart from majority failure.
To better understand the design behind runtime reconfiguration, read on: runtime reconfiguration is one of the hardest and most error-prone features in a distributed system, especially in a consensus-based system like etcd. The rest of this section covers the design of etcd's runtime reconfiguration commands and how these problems were tackled.

Two phase config changes keep the cluster safe

In etcd, every runtime reconfiguration has to go through two phases for safety reasons. For example, to add a member, first inform the cluster of the new configuration and then start the new member.
Phase 1 - Inform cluster of new configuration
To add a member into an etcd cluster, make an API call to request a new member to be added to the cluster. This is the only way to add a new member into an existing cluster. The API call returns when the cluster agrees on the configuration change.
Phase 2 - Start new member
To join the new etcd member into the existing cluster, specify the correct initial-cluster and set initial-cluster-state to existing. When the member starts, it will contact the existing cluster first and verify the current cluster configuration matches the expected one specified in initial-cluster. When the new member successfully starts, the cluster has reached the expected configuration.
By splitting the process into two discrete phases users are forced to be explicit regarding cluster membership changes. This actually gives users more flexibility and makes things easier to reason about. For example, if there is an attempt to add a new member with the same ID as an existing member in an etcd cluster, the action will fail immediately during phase one without impacting the running cluster. Similar protection is provided to prevent adding new members by mistake. If a new etcd member attempts to join the cluster before the cluster has accepted the configuration change, it will not be accepted by the cluster.
Without the explicit workflow around cluster membership etcd would be vulnerable to unexpected cluster membership changes. For example, if etcd is running under an init system such as systemd, etcd would be restarted after being removed via the membership API, and attempt to rejoin the cluster on startup. This cycle would continue every time a member is removed via the API and systemd is set to restart etcd after failing, which is unexpected.
We expect runtime reconfiguration to be an infrequent operation. We decided to keep it explicit and user-driven to ensure configuration safety and keep the cluster always running smoothly under explicit control.

Permanent loss of quorum requires new cluster

If a cluster permanently loses a majority of its members, a new cluster will need to be started from an old data directory to recover the previous state.
It is entirely possible to force-remove the failed members from the existing cluster to recover. However, we decided not to support this method since it bypasses the normal consensus committing phase, which is unsafe. If the member to remove is not actually dead, or is force-removed through different members in the same cluster, etcd will end up with diverged clusters sharing the same clusterID. This is very dangerous and hard to debug/fix afterwards.
With a correct deployment, the possibility of permanent majority loss is very low. But it is a severe enough problem that is worth special care. We strongly suggest reading the disaster recovery documentation and preparing for permanent majority loss before putting etcd into production.

Do not use public discovery service for runtime reconfiguration

The public discovery service should only be used for bootstrapping a cluster. To join a member into an existing cluster, use the runtime reconfiguration API.
The discovery service is designed for bootstrapping an etcd cluster in a cloud environment, when the IP addresses of all the members are not known beforehand. After successfully bootstrapping a cluster, the IP addresses of all the members are known. Technically, the discovery service should no longer be needed.
It seems that using the public discovery service is a convenient way to do runtime reconfiguration; after all, the discovery service already has all the cluster configuration information. However, relying on the public discovery service brings problems:
  1. It introduces external dependencies for the entire life-cycle of the cluster, not just at bootstrap time. If there is a network issue between the cluster and the public discovery service, the cluster will suffer from it.
  2. The public discovery service must reflect the correct runtime configuration of the cluster during its life-cycle. It has to provide security mechanisms to avoid bad actions, and that is hard.
  3. The public discovery service has to keep tens of thousands of cluster configurations. Our public discovery service backend is not ready for that workload.
To have a discovery service that supports runtime reconfiguration, the best choice is to build a private one.

Reconfiguration use cases

This section will walk through some common reasons for reconfiguring a cluster. Most of these reasons just involve combinations of adding or removing a member, which are explained below under Cluster Reconfiguration Operations.
Cycle or upgrade multiple machines
If multiple cluster members need to move due to planned maintenance (hardware upgrades, network downtime, etc.), it is recommended to modify members one at a time.
It is safe to remove the leader, however there is a brief period of downtime while the election process takes place. If the cluster holds more than 50MB of v2 data, it is recommended to migrate the member's data directory.
Change the cluster size
Increasing the cluster size can enhance failure tolerance and provide better read performance. Since clients can read from any member, increasing the number of members increases the overall serialized read throughput.
Decreasing the cluster size can improve the write performance of a cluster, with a trade-off of decreased resilience. Writes into the cluster are replicated to a majority of members of the cluster before considered committed. Decreasing the cluster size lowers the majority, and each write is committed more quickly.
Replace a failed machine
If a machine fails due to hardware failure, data directory corruption, or some other fatal situation, it should be replaced as soon as possible. Machines that have failed but haven't been removed adversely affect the quorum and reduce the tolerance for an additional failure.
To replace the machine, follow the instructions for removing the member from the cluster, and then add a new member in its place. If the cluster holds more than 50MB, it is recommended to migrate the failed member's data directory if it is still accessible.
Restart cluster from majority failure
If the majority of the cluster is lost or all of the nodes have changed IP addresses, then manual action is necessary to recover safely. The basic steps in the recovery process include creating a new cluster using the old data, forcing a single member to act as the leader, and finally using runtime configuration to add new members to this new cluster one at a time.

Cluster reconfiguration operations

With these use cases in mind, the involved operations can be described for each.
Before making any change, a simple majority (quorum) of etcd members must be available. This is essentially the same requirement for any kind of write to etcd.
All changes to the cluster must be done sequentially:
  • To update a single member peerURLs, issue an update operation
  • To replace a healthy single member, remove the old member then add a new member
  • To increase from 3 to 5 members, issue two add operations
  • To decrease from 5 to 3, issue two remove operations
All of these examples use the etcdctl command line tool that ships with etcd. To change membership without etcdctl, use the v2 HTTP members API or the v3 gRPC members API.
Update a member

Update advertise client URLs
To update the advertise client URLs of a member, simply restart that member with the updated client URLs flag (--advertise-client-urls) or environment variable (ETCD_ADVERTISE_CLIENT_URLS). The restarted member will self-publish the updated URLs. A wrongly updated client URL will not affect the health of the etcd cluster.
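For illustration, a minimal sketch of such a restart (member name, URLs and data dir are placeholders; in practice keep all of the member's other existing flags and change only the client-URL ones):
# restart with the new client URLs; cluster-wide configuration is not touched
etcd --name node1 --data-dir /var/lib/etcd \
  --listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://10.0.1.10:2379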
Update advertise peer URLs
To update the advertise peer URLs of a member, first update it explicitly via member command and then restart the member. The additional action is required since updating peer URLs changes the cluster wide configuration and can affect the health of the etcd cluster.
To update the advertise peer URLs, first find the target member's ID. To list all members with etcdctl:
$ etcdctl member list
6e3bd23ae5f1eae0: name=node2 peerURLs=http://localhost:23802 clientURLs=http://127.0.0.1:23792
924e2e83e93f2560: name=node3 peerURLs=http://localhost:23803 clientURLs=http://127.0.0.1:23793
a8266ecf031671f3: name=node1 peerURLs=http://localhost:23801 clientURLs=http://127.0.0.1:23791
This example updates the member with ID a8266ecf031671f3, changing its peerURLs value to http://10.0.1.10:2380:
$ etcdctl member update a8266ecf031671f3 --peer-urls=http://10.0.1.10:2380
Updated member with ID a8266ecf031671f3 in cluster
Remove a member
Suppose the member ID to remove is a8266ecf031671f3. Use the remove command to perform the removal:
$ etcdctl member remove a8266ecf031671f3
Removed member a8266ecf031671f3 from cluster
The target member will stop itself at this point and print out the removal in the log:
etcd: this member has been permanently removed from the cluster. Exiting.
It is safe to remove the leader, however the cluster will be inactive while a new leader is elected. This duration is normally the period of election timeout plus the voting process.
Add a new member
Adding a member is a two step process:
  • Add the new member to the cluster via the HTTP members API, the gRPC members API, or the etcdctl member add command.
  • Start the new member with the new cluster configuration, including a list of the updated members (existing members + the new member).
etcdctl adds a new member to the cluster by specifying the member's name and advertised peer URLs:
$ etcdctl member add infra3 --peer-urls=http://10.0.1.13:2380
added member 9bf1b35fc7761a23 to cluster

ETCD_NAME="infra3"
ETCD_INITIAL_CLUSTER="infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380,infra3=http://10.0.1.13:2380"
ETCD_INITIAL_CLUSTER_STATE=existing
etcdctl has informed the cluster about the new member and printed out the environment variables needed to successfully start it. Now start the new etcd process with the relevant flags for the new member:
$ export ETCD_NAME="infra3"
$ export ETCD_INITIAL_CLUSTER="infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380,infra3=http://10.0.1.13:2380"
$ export ETCD_INITIAL_CLUSTER_STATE=existing
$ etcd --listen-client-urls http://10.0.1.13:2379 --advertise-client-urls http://10.0.1.13:2379 --listen-peer-urls http://10.0.1.13:2380 --initial-advertise-peer-urls http://10.0.1.13:2380 --data-dir %data_dir%
The new member will run as a part of the cluster and immediately begin catching up with the rest of the cluster.
If adding multiple members, the best practice is to configure a single member at a time and verify it starts correctly before adding more new members. If adding a new member to a 1-node cluster, the cluster cannot make progress before the new member starts, because it needs two members as a majority to agree on the consensus. This behavior only happens between the time etcdctl member add informs the cluster about the new member and the time the new member successfully establishes a connection to the existing one.
Add a new member as learner
In order to make the process of adding a new member safer, and to reduce cluster downtime when the new member is added, it is recommended that the new member is added to the cluster as a learner until it catches up. This can be described as a three step process:
  • Add the new member as learner via gRPC members API or the etcdctl member add --learner command.
  • Start the new member with the new cluster configuration, including a list of the updated members (existing members + the new member). This step is exactly the same as before.
  • Promote the newly added learner to a voting member via the gRPC members API or the etcdctl member promote command. The etcd server validates the promote request to ensure operational safety. Only after its raft log has caught up to the leader’s can the learner be promoted to a voting member. If a learner member has not caught up to the leader's raft log, the member promote request will fail (see the error cases when promoting a member section for more details). In this case, the user should wait and retry later.
In v3.4, the etcd server limits the number of learners a cluster can have to one. The main consideration is to limit the extra workload on the leader from propagating data from the leader to the learner.
Use etcdctl member add with the --learner flag to add the new member to the cluster as a learner.
$ etcdctl member add infra3 --peer-urls=http://10.0.1.13:2380 --learner
Member 9bf1b35fc7761a23 added to cluster a7ef944b95711739

ETCD_NAME="infra3"
ETCD_INITIAL_CLUSTER="infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380,infra3=http://10.0.1.13:2380"
ETCD_INITIAL_CLUSTER_STATE=existing
After the new etcd process is started for the newly added learner member, use etcdctl member promote to promote the learner to a voting member.
$ etcdctl member promote 9bf1b35fc7761a23
Member 9e29bbaa45d74461 promoted in cluster a7ef944b95711739
Error cases when adding members
In the following case a new host is not included in the list of enumerated nodes. If this is a new cluster, the node must be added to the list of initial cluster members.
$ etcd --name infra3 \
  --initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
  --initial-cluster-state existing
etcdserver: assign ids error: the member count is unequal
exit 1
In this case, give a different address (10.0.1.14:2380) from the one used to join the cluster (10.0.1.13:2380):
$ etcd --name infra4 \
  --initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380,infra4=http://10.0.1.14:2380 \
  --initial-cluster-state existing
etcdserver: assign ids error: unmatched member while checking PeerURLs
exit 1
If etcd starts using the data directory of a removed member, etcd automatically exits if it connects to any active member in the cluster:
$ etcd
etcd: this member has been permanently removed from the cluster. Exiting.
exit 1
Error cases when adding a learner member
Cannot add learner to cluster if the cluster already has 1 learner (v3.4).
$ etcdctl member add infra4 --peer-urls=http://10.0.1.14:2380 --learner
Error: etcdserver: too many learner members in cluster
Error cases when promoting a learner member
Learner can only be promoted to voting member if it is in sync with leader.
$ etcdctl member promote 9bf1b35fc7761a23
Error: etcdserver: can only promote a learner member which is in sync with leader
Promoting a member that is not a learner will fail.

$ etcdctl member promote 9bf1b35fc7761a23
Error: etcdserver: can only promote a learner member
Promoting a member that does not exist in cluster will fail.
$ etcdctl member promote 12345abcde
Error: etcdserver: member not found
Strict reconfiguration check mode (-strict-reconfig-check)
As described above, the best practice for adding new members is to configure a single member at a time and verify it starts correctly before adding more new members. This step-by-step approach is very important because if a newly added member is not configured correctly (for example, the peer URLs are incorrect), the cluster can lose quorum. The quorum loss happens because the newly added member is counted in the quorum even if that member is not reachable from the other existing members. Quorum loss might also happen if there is a connectivity issue or there are operational issues.
To avoid this problem, etcd provides the -strict-reconfig-check option. If this option is passed to etcd, etcd rejects reconfiguration requests if the number of started members would be less than a quorum of the reconfigured cluster.
It is enabled by default.
Back on the affected cluster, the etcd pod on the master was crash-looping in the console:

Name                         Status                 Containers Ready    Container Restarts    Age
master-etcd-test.test.com    Crash Loop Back-off    0/1                 11                    a minute


Alert: [FIRING:1] EtcdInsufficientPeers (true critical)
Labels:
  • alertname = EtcdInsufficientPeers
  • severity = critical




[root@test ~]# etcdctl2 cluster-health
Error response from daemon: Container 1ae9c6892b94ca47f4ac425fee5dfa14ab9f3e918fcf44260d7120ce500d120b is not running

2020-06-17 06:27:31.480335 I | etcdmain: etcd Version: 3.2.22
2020-06-17 06:27:31.480353 I | etcdmain: Git SHA: 1674e682f
2020-06-17 06:27:31.480358 I | etcdmain: Go Version: go1.8.7
2020-06-17 06:27:31.480363 I | etcdmain: Go OS/Arch: linux/amd64
2020-06-17 06:27:31.480373 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2020-06-17 06:27:31.480447 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-06-17 06:27:31.480487 I | embed: peerTLS: cert = /etc/etcd/peer.crt, key = /etc/etcd/peer.key, ca = , trusted-ca = /etc/etcd/ca.crt, client-cert-auth = true
2020-06-17 06:27:31.481710 I | embed: listening for peers on https://192.168.0.35:2380
2020-06-17 06:27:31.481808 I | embed: listening for client requests on 192.168.0.35:2379
2020-06-17 06:27:31.922164 W | snap: skipped unexpected non snapshot file tmp190891106
2020-06-17 06:27:31.922187 W | snap: skipped unexpected non snapshot file 00000000171f52d4.snap.db
2020-06-17 06:27:31.922192 W | snap: skipped unexpected non snapshot file tmp186061401
2020-06-17 06:27:31.927807 I | etcdserver: recovered store from snapshot at index 405731193
2020-06-17 06:27:31.979587 I | mvcc: restore compact to 356177965
2020-06-17 06:27:32.100048 I | etcdserver: name = test.test.com
2020-06-17 06:27:32.100082 I | etcdserver: data dir = /var/lib/etcd/
2020-06-17 06:27:32.100093 I | etcdserver: member dir = /var/lib/etcd/member
2020-06-17 06:27:32.100100 I | etcdserver: heartbeat = 500ms
2020-06-17 06:27:32.100109 I | etcdserver: election = 2500ms
2020-06-17 06:27:32.100119 I | etcdserver: snapshot count = 100000
2020-06-17 06:27:32.100187 I | etcdserver: advertise client URLs = https://192.168.0.35:2379
2020-06-17 06:27:32.100447 C | etcdserver: open wal error: wal: file not found









 

I had initially run the steps below and the etcd pod came up, but I then saw the error below because some parameters were not given properly, so kindly make sure you take the parameters from the etcd.conf file correctly.


etcd logged a "member has already been bootstrapped" error, then exited.

 

 




If you lost your data dir, you have to:
  1. remove that member via the dynamic configuration API
  2. start etcd with the previous configuration
You cannot simply use the previous configuration to restart the etcd process without removing the previous member.
Remember that if you lose a data-dir, then you lose that member. In practice (see the sketch below):
remove the old member via the dynamic configuration API
add the new member via the dynamic configuration API
remove all residual data: `rm -rf /var/lib/etcd2/*`
start etcd with --initial-cluster-state existing
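A hedged sketch of those steps using v3 etcdctl syntax (the member ID, names, URLs and data dir below are illustrative placeholders, not values from a real cluster; take the real ones from member list and the member's own configuration):
etcdctl member list                                            # find the ID of the failed member
etcdctl member remove 8211f1d0f64f3269                         # 1. remove the old member
etcdctl member add node2 --peer-urls=https://10.0.1.11:2380    # 2. add the new member
rm -rf /var/lib/etcd2/*                                        # 3. remove residual data
# 4. start etcd as a member of the existing cluster
etcd --name node2 --data-dir /var/lib/etcd2 \
  --initial-cluster "node1=https://10.0.1.10:2380,node2=https://10.0.1.11:2380,node3=https://10.0.1.12:2380" \
  --initial-cluster-state existing \
  --initial-advertise-peer-urls https://10.0.1.11:2380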
If you provision n nodes in an etcd cluster and later re-provision one of the nodes,
it will not be allowed to rejoin the others automatically, and etcd2 will show an error like "member xxx is already bootstrapped".
The old etcd node and the freshly installed new etcd node are different members. Either provision a fresh cluster or manually register the new node into the existing cluster. This is a data-safety aspect of running etcd clusters; it's not related to matchbox or etcd proxies.

 

 



etcd does not survive restarts

If etcd is restarted or if a node is rebooted, etcd cannot start:
Oct 14 20:45:12 ip-10-0-175-75.ec2.internal systemd[1]: Started etcd.
Oct 14 20:45:12 ip-10-0-175-75.ec2.internal docker[30348]: 2015/10/14 20:45:12 etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 4
Oct 14 20:45:12 ip-10-0-175-75.ec2.internal docker[30348]: 2015/10/14 20:45:12 etcdmain: no data-dir provided, using default data-dir ./resching-aws-worker-002.etcd
Oct 14 20:45:12 ip-10-0-175-75.ec2.internal docker[30348]: 2015/10/14 20:45:12 etcdmain: listening for peers on http://0.0.0.0:2380
Oct 14 20:45:12 ip-10-0-175-75.ec2.internal docker[30348]: 2015/10/14 20:45:12 etcdmain: listening for client requests on http://0.0.0.0:2379
Oct 14 20:45:12 ip-10-0-175-75.ec2.internal docker[30348]: 2015/10/14 20:45:12 etcdmain: stopping listening for client requests on http://0.0.0.0:2379
Oct 14 20:45:12 ip-10-0-175-75.ec2.internal docker[30348]: 2015/10/14 20:45:12 etcdmain: stopping listening for peers on http://0.0.0.0:2380
Oct 14 20:45:12 ip-10-0-175-75.ec2.internal docker[30348]: 2015/10/14 20:45:12 etcdmain: member 3086be9801215576 has already been bootstrapped
Oct 14 20:45:12 ip-10-0-175-75.ec2.internal systemd[1]: , code=exited, status=1/FAILURE
I confirm the issue is related to the cluster state --initial-cluster-state new. It should be changed to --initial-cluster-state existing when an etcd node is restarted in order to rejoin the cluster correctly.
A possible solution is as follows (see the sketch below):
  1. Bootstrap the etcd cluster with --initial-cluster-state new on each node.
  2. Update the systemd unit file with --initial-cluster-state existing.
  3. Reload the systemd daemon.
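One possible way to do steps 2 and 3, sketched as a systemd drop-in instead of editing the unit file itself (the unit name etcd and the drop-in path are assumptions about the deployment):
mkdir -p /etc/systemd/system/etcd.service.d
cat > /etc/systemd/system/etcd.service.d/40-cluster-state.conf <<'EOF'
[Service]
Environment=ETCD_INITIAL_CLUSTER_STATE=existing
EOF
systemctl daemon-reload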

 

Restoring an etcd v3 cluster when it fails to start with the "open wal error: wal: file not found" error.


We need the etcdctl3 tool. If it doesn't exist, please follow the steps below:
To work with a backup (a snapshot) of the current status of your cluster, first download the new version of etcdctl from the release page:
wget https://github.com/coreos/etcd/releases/download/v3.2.0/etcd-v3.2.0-linux-amd64.tar.gz
tar xvf etcd-v3.2.0-linux-amd64.tar.gz
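The etcdctl3 command used throughout this post appears to be a wrapper around etcdctl with ETCDCTL_API=3 and the cluster certificates. If it is not already present, a rough equivalent can be put in place like this (the endpoint and certificate paths are assumptions taken from the etcd configuration shown earlier):
cp etcd-v3.2.0-linux-amd64/etcdctl /usr/local/bin/etcdctl
alias etcdctl3='ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints=https://192.168.0.35:2379 --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/peer.crt --key=/etc/etcd/peer.key'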








Check status and logs
etcdctl3 endpoint status
docker ps -a | grep etcd #Check status and get CONTAINER ID
docker logs <CONTAINER ID>

 

 

 





Steps to Fix the Error

  
Stop working containers:
mkdir /etc/origin/node/pods-stopped<CURRENT DATE>
mv /etc/origin/node/pods/* /etc/origin/node/pods-stopped<CURRENT DATE>
docker stop <api and etcd containers if any>
docker rm <api and etcd containers if any>
systemctl stop origin-node

Backup etcd configuration and database:
mkdir /root/etcd-conf<CURRENT DATE>
cp -R /etc/etcd /root/etcd-conf<CURRENT DATE>/
cp -R /var/lib/etcd/member/snap/db /root/etcd-backup
Be sure that all the mentioned files have been copied correctly. Double-check the sizes in the source and destination folders.

Remove the current data-dir:
rm -rf /var/lib/etcd





Restore the snapshot. Take the --name, --data-dir, --initial-cluster, --initial-cluster-token and --initial-advertise-peer-urls values from the etcd.conf file; --skip-hash-check=true is needed because we restore from a copied database rather than an etcdctl snapshot:
etcdctl3 snapshot restore /root/etcd-backup/db \
  --skip-hash-check=true \
  --name master-1 \
  --data-dir /var/lib/etcd \
  --initial-cluster "master-1=https://192.168.0.13:2380" \
  --initial-cluster-token "etcd-cluster-1" \
  --initial-advertise-peer-urls https://192.168.0.13:2380












etcdctl3 snapshot restore /root/etcd-backup --skip-hash-check=true --name test.test.com --data-dir /var/lib/etcd --initial-cluster "test.test.com=https://10.168.0.32:2380,test.test.com=https://10.168.0.35:2380,testosh01ms03cn.test.com=https://10.168.0.31:2380" --initial-cluster-token "etcd-cluster-1" --initial-advertise-peer-urls https://10.168.0.35:2380
2020-06-17 09:05:55.913481 I | mvcc: restore compact to 356177965
2020-06-17 09:05:56.261725 I | etcdserver/membership: added member 7d3653e244f11cff https://10.168.0.32:2380 to cluster 9b5105a1991627f4
2020-06-17 09:05:56.261812 I | etcdserver/membership: added member 7f806b343f803275 https://10.168.0.35:2380 to cluster 9b5105a1991627f4
2020-06-17 09:05:56.261847 I | etcdserver/membership: added member b4f4d7535da1e566 https://10.168.0.31:2380 to cluster 9b5105a1991627f








mv /etc/origin/node/pods-stopped<CURRENT DATE>/etcd.yaml /etc/origin/node/pods/



Check to make sure that the required files are there in /etc/origin/node/pods when you start the origin-node:
  • etcd.yaml
  • controller.yaml
  • apiserver.yaml

systemctl start origin-node
#To ensure that etcd is started
docker ps
docker logs <CONTAINER ID>





If etcd is running correctly and we only need a one-node etcd cluster, then move the remaining pod definitions back:
mv /etc/origin/node/pods-stopped<CURRENT DATE>/* /etc/origin/node/pods/
Otherwise, continue with adding the remaining etcd nodes, as sketched below.
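A hedged sketch for re-adding the other masters, one at a time (the hostname, IP and commands below are illustrative, based on the values used earlier in this post; verify health before adding the next member):
etcdctl3 member add master2.test.com --peer-urls=https://10.168.0.32:2380
# On the added master: clean its /var/lib/etcd, set ETCD_INITIAL_CLUSTER to all current members
# plus itself and ETCD_INITIAL_CLUSTER_STATE=existing in etcd.conf, then start its etcd pod.
etcdctl3 endpoint health
etcdctl3 member list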



   
[root@test ~]# etcdctl3 endpoint status
https://test.test.com:2379, 7f806b343f803275, 3.2.22, 934 MB, false, 8721, 407301820


[root@test ~]# docker container ls -a
CONTAINER ID        IMAGE                                                                                                                                           COMMAND                  CREATED             STATUS              PORTS               NAMES
fc596071ae4b        3a9e1a01ad12                                                                                                                                    "/bin/prometheus -..."   About an hour ago   Up About an hour                        k8s_prometheus_prometheus-k8s-0_prom-test-monitoring_b21f0cc1-b09b-11ea-b039-e62e166799f0_1
99f0730e8ded        3129a2ca29d7                                                                                                                                    "/configmap-reload..."   About an hour ago   Up About an hour                        k8s_rules-configmap-reloader_prometheus-k8s-0_prom-test-monitoring_b21f0cc1-b09b-11ea-b039-e62e166799f0_0
7367329c043e        643e7e64829e                                                                                                                                    "/bin/prometheus-c..."   About an hour ago   Up About an hour                        k8s_prometheus-config-reloader_prometheus-k8s-0_prom-test-monitoring_b21f0cc1-b09b-11ea-b039-e62e166799f0_0
b1a4aaa41fba        ff5dd2137a4f                                                                                                                                    "/bin/sh -c '#!/bi..."   About an hour ago   Up About an hour                        k8s_etcd_master-etcd-test.test.com_kube-system_0e1bb132d87ff0f749d574b847a915b7_1
dae549150320        4b12372974a9                                                                                                                                    "/bin/bash -c '#!/..."   About an hour ago   Up About an hour                        k8s_api_master-api-test.test.com_kube-system_d400ab198a15ccbe597efa10e38d0719_1
62dfe877fecc        docker.io/openshift/origin-pod:v3.11.0                                                                                                          "/usr/bin/pod"           About an hour ago   Up About an hour                        k8s_POD_prometheus-k8s-0_prom-test-monitoring_b21f0cc1-b09b-11ea-b039-e62e166799f0_0
36a443b9b3a3        cc70e4dbe05b                                                                                                                                    "/usr/bin/openshif..."   About an hour ago   Up About an hour                        k8s_router_router-6-k7z7f_default_b3e6b055-b09a-11ea-b039-e62e166799f0_0
86897e5d7a28        docker.io/openshift/origin-pod:v3.11.0                                                                                                          "/usr/bin/pod"           About an hour ago   Up About an hour                        k8s_POD_router-6-k7z7f_default_b3e6b055-b09a-11ea-b039-e62e166799f0_0
1f7affe44912        4b12372974a9                                                                                                                                    "/bin/bash -c '#!/..."   About an hour ago   Up About an hour                        k8s_controllers_master-controllers-test.test.com_kube-system_04335d3709e8228377883a762ef7f830_0
53039dcdd135        docker.io/openshift/origin-pod:v3.11.0                                                                                                          "/usr/bin/pod"           About an hour ago   Up About an hour                        k8s_POD_master-controllers-test.test.com_kube-system_04335d3709e8228377883a762ef7f830_0
9c227a8aee68        docker.io/openshift/origin-pod:v3.11.0                                                                                                          "/usr/bin/pod"           About an hour ago   Up About an hour                        k8s_POD_master-api-test.test.com_kube-system_d400ab198a15ccbe597efa10e38d0719_0
78ef6027b248        docker.io/openshift/origin-pod:v3.11.0                                                                                                          "/usr/bin/pod"           About an hour ago   Up About an hour                        k8s_POD_master-etcd-test.test.com_kube-system_0e1bb132d87ff0f749d574b847a915b7_0
c3ef123bbe49        214de829baf6                                                                                                                                    "/usr/bin/template..."   About an hour ago   Up About an hour                        k8s_c_apiserver-9hx5c_openshift-template-service-broker_3ad02b0f-a255-11e9-955b-fa163ee7bbf7_26
fa3abeb08ace        eda829aae0a0                                                                                                                                    "/usr/bin/service-..."   About an hour ago   Up About an hour                        k8s_apiserver_apiserver-kd7lv_kube-service-catalog_ecea8c10-a254-11e9-955b-fa163ee7bbf7_23
c7c15030a530        docker.io/openshift/origin-pod:v3.11.0                                                                                                          "/usr/bin/pod"           About an hour ago   Up About an hour                        k8s_POD_apiserver-kd7lv_kube-service-catalog_ecea8c10-a254-11e9-955b-fa163ee7bbf7_2
4bfc912f764a        docker.io/openshift/origin-pod:v3.11.0                                                                                                          "/usr/bin/pod"           About an hour ago   Up About an hour                        k8s_POD_apiserver-9hx5c_openshift-template-service-broker_3ad02b0f-a255-11e9-955b-fa163ee7bbf7_2
769f1487a221        eda829aae0a0                                                                                                                                    "/usr/bin/service-..."   12 hours ago        Up 12 hours                             k8s_controller-manager_controller-manager-4t6zm_kube-service-catalog_f0f1a1fc-a254-11e9-955b-fa163ee7bbf7_126
e807a194a511        ddd58ba217b0                                                                                                                                    "/bin/bash -c '#!/..."   4 months ago        Up 4 months                             k8s_sync_sync-s8wxh_openshift-node_924f7d1d-a253-11e9-955b-fa163ee7bbf7_7
d17672ee8472        b3e7f67a1480                                                                                                                                    "/bin/node_exporte..."   4 months ago        Up 4 months                             k8s_node-exporter_node-exporter-knk27_prom-test-monitoring_011b9414-43e2-11ea-908a-fa163e93b720_0
ffa7dbfce783        docker.io/openshift/origin-pod:v3.11.0                                                                                                          "/usr/bin/pod"           4 months ago        Up 4 months                             k8s_POD_node-exporter-knk27_prom-test-monitoring_011b9414-43e2-11ea-908a-fa163e93b720_0
34e1a3a16985        04fcff75b160                                                                                                                                    "/usr/local/bin/do..."   6 months ago        Up 6 months                             k8s_filebeat_filebeat-gmwq8_logging-filebeat_515323ff-1b59-11ea-b6ec-fa163e87ca3f_1
3eca0aa2c817        docker.io/openshift/origin-pod:v3.11.0                                                                                                          "/usr/bin/pod"           6 months ago        Up 6 months                             k8s_POD_filebeat-gmwq8_logging-filebeat_515323ff-1b59-11ea-b6ec-fa163e87ca3f_0
baa83dbc49c1        docker.io/openshift/origin-pod:v3.11.0                                                                                                          "/usr/bin/pod"           10 months ago       Up 10 months                            k8s_POD_controller-manager-4t6zm_kube-service-catalog_f0f1a1fc-a254-11e9-955b-fa163ee7bbf7_1
564d1e2a78cf        ddd58ba217b0                                                                                                                                    "/bin/bash -c '#!/..."   10 months ago       Up 10 months                            k8s_sdn_sdn-48tsf_openshift-sdn_ab862230-a253-11e9-955b-fa163ee7bbf7_1
a0f4b2c07b21        ddd58ba217b0                                                                                                                                    "/bin/bash -c '#!/..."   10 months ago       Up 10 months                            k8s_openvswitch_ovs-6z7zv_openshift-sdn_ab1fc31c-a253-11e9-955b-fa163ee7bbf7_1
8e040e4f8888        artifactorycn.test.com:17011/lipatovp/collectd-docker-bobrik-57@sha256:10b0beec7e7e3790793f9c19978c3703e7cdd1ab734b079ed9e5630262f9321e   "/run.sh"                10 months ago       Up 10 months                            k8s_collectd_collectd-node-agent-h5cjv_collectd-ds_4b0771e7-a7a1-11e9-93c8-fa163e87ca3f_0
06277b1ca333        docker.io/openshift/origin-pod:v3.11.0                                                                                                          "/usr/bin/pod"           10 months ago       Up 10 months                            k8s_POD_collectd-node-agent-h5cjv_collectd-ds_4b0771e7-a7a1-11e9-93c8-fa163e87ca3f_0
0f56f2a140b3        docker.io/openshift/origin-pod:v3.11.0                                                                                                          "/usr/bin/pod"           10 months ago       Up 10 months                            k8s_POD_sync-s8wxh_openshift-node_924f7d1d-a253-11e9-955b-fa163ee7bbf7_0
0804fa6448bf        docker.io/openshift/origin-pod:v3.11.0                                                                                                          "/usr/bin/pod"           10 months ago       Up 10 months                            k8s_POD_sdn-48tsf_openshift-sdn_ab862230-a253-11e9-955b-fa163ee7bbf7_0
1e05d74d60fc        docker.io/openshift/origin-pod:v3.11.0                                                                                                          "/usr/bin/pod"           10 months ago       Up 10 months                            k8s_POD_ovs-6z7zv_openshift-sdn_ab1fc31c-a253-11e9-955b-fa163ee7bbf7_0







systemctl show etcd --property=ActiveState,SubState
ActiveState=failed
SubState=failed
