Proxmox cluster node crash handling

2022-06-29 20:05:00 【Full stack programmer webmaster】

Problem description

A physical node was added to an existing cluster, and a Ceph monitor and several OSDs were then created on it. Running ceph osd tree on the host showed the newly created OSDs in the normal (up) state, and the Proxmox management interface showed the same.

Then, for no apparent reason, the newly joined node dropped out of the cluster.

Checking the OSD state on the host again showed that up had become down. The new node held no data yet, so a restart was tried to see whether it would return to normal. After the restart the node still answered on the network, but SSH could not connect and the web management interface was inaccessible as well. The failed node therefore has to be removed from the cluster first and, once it has been recovered, joined to the cluster again.

Delete the failed node from the cluster

The work is done in two steps, in this order: remove the failed node's Ceph components from the cluster, then delete the physical node itself from the cluster.

Remove the failed Ceph components from the cluster

1. Log in to any healthy physical node of the cluster and run the following command to view the Ceph OSD state:

# ceph osd tree

ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF

-1         18.00357 root default                          

-3          4.91006     host pve48                        

 0     hdd  1.63669         osd.0      up    1.00000 1.00000

 1     hdd  1.63669         osd.1      up    1.00000 1.00000

 2     hdd  1.63669         osd.2      up    1.00000 1.00000

-5          4.91006     host pve49                        

 3     hdd  1.63669         osd.3      up    1.00000 1.00000

 4     hdd  1.63669         osd.4      up    1.00000 1.00000

 5     hdd  1.63669         osd.5      up    1.00000 1.00000

-7          4.91006     host pve50                        

 6     hdd  1.63669         osd.6      up    1.00000 1.00000

 7     hdd  1.63669         osd.7      up    1.00000 1.00000

 8     hdd  1.63669         osd.8      up    1.00000 1.00000

-9          3.27338     host pve51                        

 9     hdd  1.63669         osd.9    down          0 1.00000

10     hdd  1.63669         osd.10   down          0 1.00000

The output shows that the two OSDs on physical node pve51 (osd.9 and osd.10) are down and need to be removed.
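
On a larger cluster, picking the down OSDs out of the tree by eye is error-prone. As a minimal sketch, the filter below reads ceph osd tree text on standard input and prints the name of every OSD whose STATUS column is down. The function name list_down_osds is made up for this illustration, and the filter assumes the plain-text column layout shown above (on OSD rows, NAME is the fourth field and STATUS the fifth).

```shell
#!/bin/sh
# list_down_osds: hypothetical helper for illustration, not part of Ceph.
# Reads `ceph osd tree` text on stdin and prints the name of every OSD
# whose STATUS column is "down".
# Assumption: OSD rows look like "ID CLASS WEIGHT NAME STATUS REWEIGHT PRI-AFF".
# Root and host rows have no osd.N name in field 4, so they are skipped.
list_down_osds() {
    awk '$4 ~ /^osd\.[0-9]+$/ && $5 == "down" { print $4 }'
}

# Intended use on a healthy node:
#   ceph osd tree | list_down_osds
```

Reviewing this list before the next step is one way to avoid marking a healthy OSD out by mistake.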

2. Mark the faulty OSDs out of the cluster:

# ceph osd out osd.9

osd.9 is already out.

# ceph osd out osd.10

osd.10 is already out.

Be careful here: do not mark a healthy OSD out.

3. Delete the authentication keys of the OSDs that were marked out:

# ceph auth del osd.9

updated

# ceph auth del osd.10

updated

4. Remove the faulty OSDs from the cluster entirely:

# ceph osd rm 9

removed osd.9

# ceph osd rm 10

removed osd.10

Note: unlike the previous commands, the last argument here is given as the bare numeric ID (9, 10) rather than in the osd.N form.

5. Check the cluster's OSD state again:

# ceph osd tree

ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF

-1         18.00357 root default                             

-3          4.91006     host pve48                        

 0     hdd  1.63669         osd.0      up    1.00000 1.00000

 1     hdd  1.63669         osd.1      up    1.00000 1.00000

 2     hdd  1.63669         osd.2      up    1.00000 1.00000

-5          4.91006     host pve49                        

 3     hdd  1.63669         osd.3      up    1.00000 1.00000

 4     hdd  1.63669         osd.4      up    1.00000 1.00000

 5     hdd  1.63669         osd.5      up    1.00000 1.00000

-7          4.91006     host pve50                        

 6     hdd  1.63669         osd.6      up    1.00000 1.00000

 7     hdd  1.63669         osd.7      up    1.00000 1.00000

 8     hdd  1.63669         osd.8      up    1.00000 1.00000

-9          3.27338     host pve51                         

 9     hdd  1.63669         osd.9     DNE          0

10     hdd  1.63669         osd.10    DNE          0

After these operations, the status of the failed node's OSDs has changed from down to DNE (does not exist).

6. Remove the failed node's OSDs (its Ceph disks) from the CRUSH map:

# ceph osd crush rm osd.9

removed item id 9 name 'osd.9' from crush map

# ceph osd crush rm osd.10

removed item id 10 name 'osd.10' from crush map

7. Remove the physical node from the Ceph cluster (its host entry in the CRUSH map):

# ceph osd crush rm pve51

removed item id -9 name 'pve51' from crush map

8. Run ceph osd tree again and check that the failed node has been cleaned out of the Ceph cluster.
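
Steps 2 to 7 can be collected into one sequence. The sketch below only prints the commands, per OSD in the same order as this article, so that they can be reviewed before anything is actually run. print_osd_removal is a made-up helper name. It also shows the argument difference noted in step 4: ceph osd rm is given the bare number, obtained here by stripping the osd. prefix.

```shell
#!/bin/sh
# print_osd_removal: hypothetical helper for illustration only.
# Usage: print_osd_removal HOST OSD_NAME...
# Prints (does not execute) the removal commands used in this article.
print_osd_removal() {
    host=$1
    shift
    for name in "$@"; do
        id=${name#osd.}                  # osd.9 -> 9, the form passed to `ceph osd rm`
        echo "ceph osd out $name"        # step 2: mark the OSD out
        echo "ceph auth del $name"       # step 3: delete its auth key
        echo "ceph osd rm $id"           # step 4: remove the OSD (numeric ID)
        echo "ceph osd crush rm $name"   # step 6: remove it from the CRUSH map
    done
    echo "ceph osd crush rm $host"       # step 7: remove the host entry last
}

# Example with the node and OSDs from this article:
print_osd_removal pve51 osd.9 osd.10
```

On Ceph Luminous and later, ceph osd purge <id> --yes-i-really-mean-it combines the CRUSH, auth and OSD removal into one command. Whether it is available depends on the Ceph version shipped with the Proxmox release in use.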

Delete the failed node from the cluster

Operations on the cluster

# pvecm delnode pve51

Killing node 4
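
To confirm the deletion, one option is to check that the node no longer appears in the cluster's corosync configuration. The sketch below reads a corosync.conf on standard input and reports through its exit status whether a name: entry for the given node is present. node_in_conf is a made-up helper, and both the path /etc/pve/corosync.conf and the "name: <node>" entry format are assumptions based on a standard Proxmox VE installation.

```shell
#!/bin/sh
# node_in_conf: hypothetical helper for illustration, not part of Proxmox.
# Usage: node_in_conf NAME < /etc/pve/corosync.conf
# Exit status 0 if a "name: NAME" line is found on stdin, 1 otherwise.
# Assumption: each nodelist entry carries a line of the form "name: <node>".
# The totem section's "cluster_name:" line does not match, because the
# first field must be exactly "name:".
node_in_conf() {
    awk -v want="$1" '
        $1 == "name:" && $2 == want { found = 1 }
        END { exit !found }
    '
}

# Intended use on a remaining cluster node (path is an assumption):
#   if node_in_conf pve51 < /etc/pve/corosync.conf; then
#       echo "pve51 is still listed"
#   else
#       echo "pve51 has been removed"
#   fi
```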

Recovering the failed machine

It is best to wipe the machine completely, reinstall the system, and give it a new IP address before joining it to the cluster again.

Publisher: Full stack programmer webmaster. When reprinting, please credit the source: https://javaforall.cn/101292.html Link to the original text: https://javaforall.cn

