Greenplum Database Failure Analysis: Why gpstart -a Fails After a Version Upgrade
2022-08-05 01:47:00 【肥叔菌】
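Background
During a minor-version upgrade of a Greenplum database at a customer site, the upgrade script reported an error saying that the database had failed to start. Yet when we logged in to the database node from the jump host and started the cluster with gpstart in interactive mode, the cluster came up, although the standby master was unavailable. Why did gpstart -a fail while plain gpstart succeeded? As a junior developer two years into the team, and a believer that failure analysis is the fastest way into learning a database, I took the task on, which inevitably meant some overtime.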
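Analysis
We first noticed that gpstart and gpstart -a differ in more than the confirmation prompt. Both try to start the standby master, but interactive gpstart skips the standby if it cannot be started, whereas gpstart -a reported a failure. That was the direction to investigate. After reproducing the scenario, we brought up only the master node with gpstart -m, logged in to the master in utility mode, and ran select * from gp_segment_configuration where content = -1;
to find the records for the master and the standby master. The catalog marked the standby master as healthy, but the gpseg-1 directory on the standby master node was missing data files; for example, there was no postgresql.conf. We therefore concluded that gpinitstandby must have failed. Its log is shown below: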
gpinitstandby:xxx:gpadmin-[ERROR]:-Error initializing standby master: Standby master not configured
gpinitstandby:xxx:gpadmin-[ERROR]:-Request mode to remove warm master standby, but no standby located.
gpinitstandby:xxx:gpadmin-[ERROR]:-Error removing standby master: no standby configured
gpinitstandby:xxx:gpadmin-[INFO]:-Validating environment and parameters for standby initialization...
gpinitstandby:xxx:gpadmin-[INFO]:-------------------------------------------
gpinitstandby:xxx:gpadmin-[INFO]:Greenplum standby master initialization parameters
gpinitstandby:xxx:gpadmin-[INFO]:-------------------------------------------
gpinitstandby:xxx:gpadmin-[INFO]:-Greenplum master hostname = xxx
gpinitstandby:xxx:gpadmin-[INFO]:-Greenplum master data directory = /home/gpadmin/data/master/default/gpseg-1
gpinitstandby:xxx:gpadmin-[INFO]:-Greenplum master port = 5432
gpinitstandby:xxx:gpadmin-[INFO]:-Greenplum standby master hostname = xxx
gpinitstandby:xxx:gpadmin-[INFO]:-Greenplum standby master port = 5432
gpinitstandby:xxx:gpadmin-[INFO]:-Greenplum standby master data directory = /home/gpadmin/data/master/default/gpseg-1
gpinitstandby:xxx:gpadmin-[INFO]:-Greenplum update system catalog = On
gpinitstandby:xxx:gpadmin-[INFO]:-Syncing Greenplum Database extensions to standby
gpinitstandby:xxx:gpadmin-[INFO]:-The packages on xxx are consistent
gpinitstandby:xxx:gpadmin-[INFO]:-Adding standby master to catalog...
gpinitstandby:xxx:gpadmin-[INFO]:-Database catalog updated successfully.
gpinitstandby:xxx:gpadmin-[INFO]:-Updating pg_hba.conf file...
gpinitstandby:xxx:gpadmin-[INFO]:-pg_hba.conf files updated successfully.
gpinitstandby:xxx:gpadmin-[ERROR]:-Failed to copy data directory from master to standby.
gpinitstandby:xxx:gpadmin-[ERROR]:-Failed to create standby
gpinitstandby:xxx:gpadmin-[WARNING]-Trying to rollback changes that have been made...
gpinitstandby:xxx:gpadmin-[INFO]:-Rolling back catalog change...
gpinitstandby:xxx:gpadmin-[ERROR]:-Failed to remove standby from master catalog.
gpinitstandby:xxx:gpadmin-[INFO]:-Restoring pg_hba.conf file...
gpinitstandby:xxx:gpadmin-[INFO]:-Cleaning up pg_hba.conf backup files...
gpinitstandby:xxx:gpadmin-[INFO]:-Backup files of pg_hba.conf cleaned up successfully.
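The log shows that, before the upgrade, the HA component was repairing the standby master. gpinitstandby had reached the step of copying the data directory from the master to the standby master when the upgrade script shut down the Greenplum cluster, which caused the copy to fail. Because the cluster was already down, rolling back the standby record in gp_segment_configuration also failed. As a result, when gpstart -a ran, the script considered the standby master healthy and tried to start it, which naturally failed.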
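The failure sequence described here can be pulled out of a gpinitstandby log with a small filter. The following is a minimal sketch, not part of the original troubleshooting: the function names are hypothetical, the log path is supplied by the caller, and the matched strings are taken from the log excerpt above.

```shell
#!/usr/bin/env bash

# Print only the ERROR and WARNING lines of a log, prefixed with their
# line numbers, so the point where the run started to fail is easy to spot.
summarize_standby_log() {
    local logfile="$1"
    grep -nE '\[(ERROR|WARNING)\]' "$logfile"
}

# Return 0 if the log shows that the catalog rollback failed, i.e. the
# standby record may still be present in gp_segment_configuration.
standby_rollback_failed() {
    local logfile="$1"
    grep -qF 'Failed to remove standby from master catalog' "$logfile"
}
```

Calling summarize_standby_log on the log file lists the error lines in order, and standby_rollback_failed can gate a warning in an upgrade script before it attempts gpstart -a.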
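Root Cause
Before the upgrade, the HA component was repairing the standby master. The upgrade script shut down the Greenplum cluster while gpinitstandby was copying the data directory from the master to the standby master, so the copy failed. With the cluster down, the rollback of the standby record in gp_segment_configuration failed as well, leaving a stale standby entry in the catalog.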
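The resulting mismatch (a standby row in the catalog, an incomplete data directory on the standby host) can be checked for directly. The sketch below is illustrative only: the function names are hypothetical, the directory check has to run on the standby master host, and the catalog query reuses the utility-mode connection form and the SQL from this article, so it needs the master to be up, for example via gpstart -m.

```shell
#!/usr/bin/env bash

# Return 0 if the given data directory contains postgresql.conf, 1 otherwise.
# A missing postgresql.conf was the symptom observed in this incident.
standby_datadir_has_config() {
    local datadir="$1"
    [ -f "$datadir/postgresql.conf" ]
}

# Print the master and standby master rows from the catalog.
# Requires psql and a master that is running in utility (master-only) mode.
show_master_rows() {
    PGOPTIONS="-c gp_session_role=utility" psql -d postgres \
        -c "select * from gp_segment_configuration where content = -1;"
}
```

If show_master_rows lists a standby row while standby_datadir_has_config fails for the standby data directory, the cluster is in the state described in this section.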
Solution
Three options were considered:
- Start the master node with gpstart -am, then run
PGOPTIONS="-c gp_session_role=utility" psql -d postgres -c "select gp_remove_master_standby()"
and then run gpstop -ar.
- Start the master node with gpstart -am, then run
PGOPTIONS="-c gp_session_role=utility" psql -d postgres -c "set allow_system_table_mods=true; update gp_segment_configuration set status = 'd' where content = -1 and role = 'm';"
and then run gpstop -ar.
- Run gpstart -aS. The uppercase S option skips starting the standby master.
We chose the third option. After the upgrade, the HA component takes care of repairing and starting the standby master.
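In an upgrade script, the chosen option could look roughly like the sketch below. This is an assumption-laden illustration rather than the actual upgrade script: run_cmd, start_cluster_without_standby, DRY_RUN, and SKIP_STANDBY_FLAG are hypothetical names. The skip flag defaults to -S as used in this article; the upstream Greenplum documentation lists -y as the option that does not start the standby master, so it is worth confirming with gpstart --help on the target version.

```shell
#!/usr/bin/env bash

# Flag that tells gpstart not to start the standby master.
# The default "-S" follows this article; upstream documentation lists "-y".
SKIP_STANDBY_FLAG="${SKIP_STANDBY_FLAG:--S}"

# Print the command when DRY_RUN=1, otherwise execute it.
run_cmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start the cluster non-interactively without starting the standby master,
# leaving the standby repair to the HA component afterwards.
start_cluster_without_standby() {
    run_cmd gpstart -a "$SKIP_STANDBY_FLAG"
}
```

Setting DRY_RUN=1 before calling start_cluster_without_standby prints the command instead of running it, which makes it possible to review the exact invocation first.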