Greenplum Database Failure Analysis: Why gpstart -a Failed After a Version Upgrade
2022-08-05 01:47:00 【肥叔菌】
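Case background
During a minor-version upgrade of a Greenplum database at a customer site, the upgrade script reported that the database had failed to start. Yet when we logged in to the database node through the jump host and started the cluster with gpstart in interactive mode, the cluster came up, although the standby master was unavailable. Why did gpstart -a fail when plain gpstart succeeded? As a junior developer two years into the team, I find failure analysis the fastest way into learning a database, so I took the job, and inevitably put in some overtime on it.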
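Analysis
We first noted that, apart from the interactive prompts, gpstart and gpstart -a both try to start the standby master, and the standby is skipped if it cannot be started. That is where we focused the investigation. After reproducing the scenario, we brought up only the master node with gpstart -m, logged in to it in utility mode, and ran
select * from gp_segment_configuration where content = -1;
to find the records for the master and the standby master. The system catalog marked the standby master as healthy, but the gpseg-1 directory on the standby master node was missing data files; for example, there was no postgresql.conf. This pointed to a failed gpinitstandby run, and its log reads as follows: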
gpinitstandby:xxx:gpadmin-[ERROR]:-Error initializing standby master: Standby master not configured
gpinitstandby:xxx:gpadmin-[ERROR]:-Request mode to remove warm master standby, but no standby located.
gpinitstandby:xxx:gpadmin-[ERROR]:-Error removing standby master: no standby configured
gpinitstandby:xxx:gpadmin-[INFO]:-Validating environment and parameters for standby initialization...
gpinitstandby:xxx:gpadmin-[INFO]:-------------------------------------------
gpinitstandby:xxx:gpadmin-[INFO]:Greenplum standby master initialization parameters
gpinitstandby:xxx:gpadmin-[INFO]:-------------------------------------------
gpinitstandby:xxx:gpadmin-[INFO]:-Greenplum master hostname = xxx
gpinitstandby:xxx:gpadmin-[INFO]:-Greenplum master data directory = /home/gpadmin/data/master/default/gpseg-1
gpinitstandby:xxx:gpadmin-[INFO]:-Greenplum master port = 5432
gpinitstandby:xxx:gpadmin-[INFO]:-Greenplum standby master hostname = xxx
gpinitstandby:xxx:gpadmin-[INFO]:-Greenplum standby master port = 5432
gpinitstandby:xxx:gpadmin-[INFO]:-Greenplum standby master data directory = /home/gpadmin/data/master/default/gpseg-1
gpinitstandby:xxx:gpadmin-[INFO]:-Greenplum update system catalog = On
gpinitstandby:xxx:gpadmin-[INFO]:-Syncing Greenplum Database extensions to standby
gpinitstandby:xxx:gpadmin-[INFO]:-The packages on xxx are consistent
gpinitstandby:xxx:gpadmin-[INFO]:-Adding standby master to catalog...
gpinitstandby:xxx:gpadmin-[INFO]:-Database catalog updated successfully.
gpinitstandby:xxx:gpadmin-[INFO]:-Updating pg_hba.conf file...
gpinitstandby:xxx:gpadmin-[INFO]:-pg_hba.conf files updated successfully.
gpinitstandby:xxx:gpadmin-[ERROR]:-Failed to copy data directory from master to standby.
gpinitstandby:xxx:gpadmin-[ERROR]:-Failed to create standby
gpinitstandby:xxx:gpadmin-[WARNING]-Trying to rollback changes that have been made...
gpinitstandby:xxx:gpadmin-[INFO]:-Rolling back catalog change...
gpinitstandby:xxx:gpadmin-[ERROR]:-Failed to remove standby from master catalog.
gpinitstandby:xxx:gpadmin-[INFO]:-Restoring pg_hba.conf file...
gpinitstandby:xxx:gpadmin-[INFO]:-Cleaning up pg_hba.conf backup files...
gpinitstandby:xxx:gpadmin-[INFO]:-Backup files of pg_hba.conf cleaned up successfully.
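The log shows that the HA component was repairing the standby master just before the upgrade. While gpinitstandby was copying the master data directory to the standby master, the upgrade script shut down the Greenplum cluster, so the copy failed. Because the cluster was already down, the rollback of the standby record in gp_segment_configuration failed as well. As a result, when gpstart -a ran, it treated the standby master as healthy and tried to start it, which inevitably failed.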
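The two symptoms used in this diagnosis (an incomplete standby data directory and a failed catalog rollback in the gpinitstandby log) can be checked with a few lines of shell. This is a minimal sketch, not part of the original procedure: the function names are hypothetical, the list of required files contains only postgresql.conf because that is the only file the case mentions, and the log path is left as an argument because the post does not say where the log lives.

```shell
#!/bin/sh
# Sketch only: function names and the required-file list are illustrative assumptions.

# Return 0 if the data directory contains every file we expect, 1 otherwise.
# Only postgresql.conf comes from the case above; extend the list as needed.
datadir_looks_complete() {
    datadir=$1
    for f in postgresql.conf; do
        if [ ! -f "$datadir/$f" ]; then
            echo "missing: $datadir/$f"
            return 1
        fi
    done
    return 0
}

# Return 0 if the given gpinitstandby log contains the failed-rollback message
# seen in the log above, 1 otherwise.
rollback_failed() {
    grep -q 'Failed to remove standby from master catalog' "$1"
}
```

On the standby master host, `datadir_looks_complete /home/gpadmin/data/master/default/gpseg-1` would report the missing file, and `rollback_failed <path-to-gpinitstandby-log>` would confirm that the catalog still holds a stale standby record.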
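Root cause
Before the upgrade, the HA component was repairing the standby master. The upgrade script shut down the Greenplum cluster while gpinitstandby was still copying the master data directory to the standby master, so the copy failed. With the cluster down, the rollback of the standby record in gp_segment_configuration also failed, leaving a stale standby entry in the catalog.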
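Solution
Three options were considered:
- Start only the master node with gpstart -am, run
PGOPTIONS="-c gp_session_role=utility" psql -d postgres -c "select gp_remove_master_standby()"
and then restart the cluster with gpstop -ar.
- Start only the master node with gpstart -am, run
PGOPTIONS="-c gp_session_role=utility" psql -d postgres -c "set allow_system_table_mods=true; update gp_segment_configuration set status = 'd' where content = -1 and role = 'm';"
and then restart the cluster with gpstop -ar.
- Run gpstart -aS. The uppercase S option skips starting the standby master altogether.
We chose the third option and, after the upgrade, left it to the HA component to repair and start the standby master.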
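For comparison, the first option can be scripted as below. This is a sketch under assumptions rather than the script actually used: the `run` helper, the `DRY_RUN` variable, and the function name `remove_stale_standby` are hypothetical, and the three commands are taken as-is from the option list above. With DRY_RUN=1 the commands are only printed, which allows a review before anything touches the cluster.

```shell
#!/bin/sh
# Sketch of option 1 as a script.
# DRY_RUN, run, and remove_stale_standby are illustrative assumptions.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

remove_stale_standby() {
    # 1. Start only the master node, without prompts.
    run gpstart -am || return 1
    # 2. Remove the stale standby entry from the catalog in utility mode.
    run env PGOPTIONS="-c gp_session_role=utility" \
        psql -d postgres -c "select gp_remove_master_standby()" || return 1
    # 3. Restart the whole cluster.
    run gpstop -ar || return 1
}
```

Running `DRY_RUN=1; export DRY_RUN; remove_stale_standby` prints the three steps in order. Stopping at the first failing step is a deliberate choice here, so that the cluster is not restarted while the stale standby record is still in place.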