Proxmox cluster node crash handling
2022-06-29 20:05:00 【Full stack programmer webmaster】
Problem description
I added a physical node to an existing cluster, then created a Ceph monitor and several OSDs on it. Checking from the host shell with ceph osd tree, the new OSDs all showed as up, and the Proxmox web management interface reported the same.
Then, for reasons I could not determine, the newly joined node suddenly failed and dropped out of the cluster.
Checking the OSD status from the host shell again, the new OSDs had gone from up to down. Since the new node held no data yet, I tried rebooting it to see whether it would recover. After the reboot the node was still reachable on the network, but SSH could not connect and the web management interface was inaccessible. The plan was therefore to evict the failed node from the cluster first, repair it, and then join it to the cluster again.
Delete the failed node from the cluster
This is done in two steps, in this order: remove the failed node from Ceph, then delete the physical node from the Proxmox cluster.
Removing the failed node from Ceph
1. Log in to the shell of any healthy node in the cluster and run the following command to view the Ceph OSD status:
ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 18.00357 root default
-3 4.91006 host pve48
0 hdd 1.63669 osd.0 up 1.00000 1.00000
1 hdd 1.63669 osd.1 up 1.00000 1.00000
2 hdd 1.63669 osd.2 up 1.00000 1.00000
-5 4.91006 host pve49
3 hdd 1.63669 osd.3 up 1.00000 1.00000
4 hdd 1.63669 osd.4 up 1.00000 1.00000
5 hdd 1.63669 osd.5 up 1.00000 1.00000
-7 4.91006 host pve50
6 hdd 1.63669 osd.6 up 1.00000 1.00000
7 hdd 1.63669 osd.7 up 1.00000 1.00000
8 hdd 1.63669 osd.8 up 1.00000 1.00000
-9 3.27338 host pve51
9 hdd 1.63669 osd.9 down 0 1.00000
10 hdd 1.63669 osd.10 down 0 1.00000
The output shows that both OSDs on physical node pve51 (osd.9 and osd.10) are down and need to be removed.
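Picking the down OSDs out of a long ceph osd tree listing by eye is easy to get wrong. The bash sketch below uses a hypothetical helper, list_down_osds (not part of Ceph), that reads the plain-text tree output on stdin and prints the numeric ID of every OSD whose STATUS is down. It assumes the text layout shown above, where STATUS directly follows the osd.N name; if your Ceph version lays the columns out differently, ceph osd tree -f json is a sturdier source.

```bash
# Hypothetical helper: print the numeric IDs of OSDs reported as "down"
# in plain-text `ceph osd tree` output read from stdin.
# Assumption: the STATUS column directly follows the osd.N name column.
list_down_osds() {
  awk '{
    for (i = 1; i < NF; i++) {
      # An OSD row contains a field like "osd.9"; the next field is STATUS.
      if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") {
        print $1   # the first column of an OSD row is its numeric ID
      }
    }
  }'
}
```

Run as ceph osd tree | list_down_osds; for the tree above it prints 9 and 10, one per line.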
2. Take the failed OSDs out of the cluster:
ceph osd out osd.9
osd.9 is already out.
ceph osd out osd.10
osd.10 is already out.
Ceph reports both as already out, which matches their REWEIGHT of 0 in the tree above. Be careful here: do not take a healthy OSD out by mistake.
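One way to follow the warning about not taking a healthy OSD out is to check the OSD's status before acting. The sketch below wraps ceph osd out in a hypothetical function, out_if_down, that only proceeds when ceph osd tree currently reports the OSD as down. It assumes the same text layout as above (STATUS right after the osd.N name) and takes the numeric OSD ID.

```bash
# Hypothetical guard around `ceph osd out`: only mark an OSD out if
# `ceph osd tree` currently reports it as down, so a typo cannot take
# a healthy OSD out. Argument: numeric OSD ID (e.g. 9).
out_if_down() {
  local id=${1:-} status
  # Find the row for osd.<id> and print the field after the name (STATUS).
  status=$(ceph osd tree | awk -v name="osd.$id" '{
    for (i = 1; i < NF; i++) if ($i == name) print $(i + 1)
  }')
  if [ "$status" != "down" ]; then
    echo "refusing: osd.$id status is '${status:-missing}', not down" >&2
    return 1
  fi
  ceph osd out "osd.$id"
}
```

For example, out_if_down 9 runs ceph osd out osd.9, while out_if_down 8 on a healthy osd.8 refuses and returns a non-zero status.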
3. Delete the authentication keys of the removed OSDs:
ceph auth del osd.9
updated
ceph auth del osd.10
updated
4. Remove the failed OSDs completely:
ceph osd rm 9
removed osd.9
ceph osd rm 10
removed osd.10
Note: unlike the previous commands, the last argument to ceph osd rm is the bare numeric OSD ID, not osd.N!
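Steps 3 and 4 use two different argument forms, which is exactly where the note above warns of mistakes. The bash sketch below, built around a hypothetical helper named remove_osd_records, accepts only a numeric ID and derives both forms from it: osd.N for ceph auth del and the bare number for ceph osd rm, following the article's note.

```bash
# Hypothetical helper for steps 3 and 4: delete the OSD's auth key, then
# the OSD itself. `ceph auth del` gets the entity name "osd.<id>", while,
# per the note above, `ceph osd rm` gets the bare numeric ID.
remove_osd_records() {
  local id=${1:-}
  # Reject empty or "osd.9"-style input so the numeric form is explicit.
  case $id in
    ''|*[!0-9]*) echo "expected a numeric OSD id, got '$id'" >&2; return 1 ;;
  esac
  ceph auth del "osd.$id" || return 1
  ceph osd rm "$id"
}
```

Calling remove_osd_records 9 and then remove_osd_records 10 issues the same four commands shown above.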
5. Check the cluster OSD status again:
ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 18.00357 root default
-3 4.91006 host pve48
0 hdd 1.63669 osd.0 up 1.00000 1.00000
1 hdd 1.63669 osd.1 up 1.00000 1.00000
2 hdd 1.63669 osd.2 up 1.00000 1.00000
-5 4.91006 host pve49
3 hdd 1.63669 osd.3 up 1.00000 1.00000
4 hdd 1.63669 osd.4 up 1.00000 1.00000
5 hdd 1.63669 osd.5 up 1.00000 1.00000
-7 4.91006 host pve50
6 hdd 1.63669 osd.6 up 1.00000 1.00000
7 hdd 1.63669 osd.7 up 1.00000 1.00000
8 hdd 1.63669 osd.8 up 1.00000 1.00000
-9 3.27338 host pve51
9 hdd 1.63669 osd.9 DNE 0
10 hdd 1.63669 osd.10 DNE 0
Once this is done, the status of the failed node's OSDs changes from down to DNE (does not exist).
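Before moving on to the CRUSH cleanup, the DNE check in step 5 can be scripted rather than read by eye. The sketch below is a hypothetical function, all_dne, that reads ceph osd tree text on stdin and succeeds only if every given OSD ID appears with status DNE. It assumes the text layout shown above, where the status follows the osd.N name.

```bash
# Hypothetical check for step 5: succeed only if every given OSD ID
# appears in `ceph osd tree` output (read from stdin) with status DNE.
all_dne() {
  local tree id status
  tree=$(cat)
  for id in "$@"; do
    # Print the field after osd.<id>, i.e. its STATUS column.
    status=$(printf '%s\n' "$tree" | awk -v name="osd.$id" '{
      for (i = 1; i < NF; i++) if ($i == name) print $(i + 1)
    }')
    if [ "$status" != "DNE" ]; then
      echo "osd.$id: status '${status:-missing}'" >&2
      return 1
    fi
  done
}
```

Used as ceph osd tree | all_dne 9 10, it returns success for the tree above and prints the offending OSD otherwise.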
6. Delete the... Of the failed node ceph disk , The operation is as follows :
[email protected]:~# ceph osd crush rm osd.9
removed item id 9 name ‘osd.9’ from crush map
[email protected]:~# ceph osd crush rm osd.10
removed item id 10 name ‘osd.10’ from crush map7. from ceph Delete physical nodes in the cluster , The operation is as follows :
[email protected]:~# ceph osd crush rm pve51
removed item id -9 name ‘pve51’ from crush map8. Execution instruction ceph osd tree Check the status , See if the fault node is [email protected]:~# ceph osd crush rm pve51 removed item id -9 name ‘pve51’ from crush map from ceph Clean up the cluster .
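Steps 6 to 8 can be combined into one sketch. The hypothetical function crush_remove_host below removes each OSD from the CRUSH map first and the host bucket second, on the assumption that Ceph only removes a bucket once it is empty, and then checks that no "host <name>" row remains in ceph osd tree. The check assumes host rows end with the words host and the node name, as in the output above.

```bash
# Hypothetical helper for steps 6-8: remove each OSD from the CRUSH map,
# then the (now empty) host bucket, then confirm the host row is gone.
# Usage: crush_remove_host pve51 9 10
crush_remove_host() {
  local host=${1:-} id
  shift
  for id in "$@"; do
    ceph osd crush rm "osd.$id" || return 1
  done
  # Remove the host bucket only after its OSD items are gone.
  ceph osd crush rm "$host" || return 1
  # Step 8: a host row looks like "-9 3.27338 host pve51".
  if ceph osd tree | awk -v h="$host" \
      'NF >= 2 && $(NF - 1) == "host" && $NF == h { found = 1 } END { exit !found }'; then
    echo "host $host is still in the CRUSH tree" >&2
    return 1
  fi
}
```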
Removing the failed node from the Proxmox cluster
On the cluster
Log in to any healthy node in the cluster and run the following command to evict the failed node:
pvecm delnode pve51
Killing node 4
On the failed machine
The cleanest option is to wipe it completely: reinstall the operating system and rejoin the cluster with a new IP address.
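Because pvecm delnode is run from a surviving member, a simple guard against running it on the wrong machine can help. The sketch below is a hypothetical wrapper, evict_node, that refuses to evict the node it is running on and otherwise calls pvecm delnode. It assumes the Proxmox node name matches the short hostname, and that the failed node is already powered off, as the Proxmox documentation asks before removing a node.

```bash
# Hypothetical guard around `pvecm delnode`: refuse to evict the node you
# are logged in to; run the eviction from a healthy cluster member.
# Assumes the node name equals the short hostname and the failed node is off.
evict_node() {
  local node=${1:-}
  if [ -z "$node" ]; then
    echo "usage: evict_node <node-name>" >&2
    return 1
  fi
  if [ "$node" = "$(hostname -s)" ]; then
    echo "refusing: $node is the local node; run this on a healthy member" >&2
    return 1
  fi
  pvecm delnode "$node"
}
```

On pve48, evict_node pve51 runs pvecm delnode pve51, while evict_node pve48 is refused.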
Publisher: Full stack programmer webmaster. Please credit the source when reprinting: https://javaforall.cn/101292.html Original link: https://javaforall.cn