PostgreSQL高可用工具-repmgr-手动故障切换


集群安装配置:http://www.linuxmysql.com/14/2023/1194.htm

当主服务器发生宕机或服务异常不同用时,就需要让备服务器接管故障的主服务器,以确保整个repmgr集群对外可用,这时可以通过repmgr standby promotion完成备份服务器对主服务器的按管



查看现在集群的状态:
-bash-4.2$ repmgr cluster show
ID | Name   | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                             
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1  | node13 | primary | * running |          | default  | 3        | 1        | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2  | node14 | standby |   running | node13   | default  | 2        | 1        | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3  | node15 | standby |   running | node13   | default  | 1        | 1        | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4  | node16 | witness | * running | node13   | default  | 0        | n/a      | host=node16 port=5432 dbname=repmgr user=repmgr password=123456


现在模拟node13主库停机:
[root@node13 12]# systemctl stop postgresql-12


此时,集群处理部分禁用状态,集群不能对外提供写服务,需要手动恢复
在node14上查看集群的状态:
-bash-4.2$ repmgr cluster show
ID | Name   | Role    | Status        | Upstream | Location | Priority | Timeline | Connection string                                             
----+--------+---------+---------------+----------+----------+----------+----------+-----------------------------------------------------------------
1  | node13 | primary | ? unreachable | ?        | default  | 3        |          | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2  | node14 | standby |   running     | ? node13 | default  | 2        | 1        | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3  | node15 | standby |   running     | ? node13 | default  | 1        | 1        | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4  | node16 | witness | * running     | ? node13 | default  | 0        | n/a      | host=node16 port=5432 dbname=repmgr user=repmgr password=123456


WARNING: following issues were detected
  - unable to connect to node "node13" (ID: 1)
  - node "node13" (ID: 1) is registered as an active primary but is unreachable
  - unable to connect to node "node14" (ID: 2)'s upstream node "node13" (ID: 1)
  - unable to determine if node "node14" (ID: 2) is attached to its upstream node "node13" (ID: 1)
  - unable to connect to node "node15" (ID: 3)'s upstream node "node13" (ID: 1)
  - unable to determine if node "node15" (ID: 3) is attached to its upstream node "node13" (ID: 1)
  - unable to connect to node "node16" (ID: 4)'s upstream node "node13" (ID: 1)


HINT: execute with --verbose option to see connection error messages


现在使用以下方法在node14上升级第一个备用数据库:
-bash-4.2$ repmgr standby promote
WARNING: 2 sibling nodes found, but option "--siblings-follow" not specified
DETAIL: these nodes will remain attached to the current primary:
  node15 (node ID: 3)
  node16 (node ID: 4, witness server)
NOTICE: promoting standby to primary
DETAIL: promoting server "node14" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node14" (ID: 2) was successfully promoted to primary


查看当前集群的状态:
-bash-4.2$ repmgr cluster show
ID | Name   | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                             
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1  | node13 | primary | - failed  | ?        | default  | 3        |          | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2  | node14 | primary | * running |          | default  | 2        | 2        | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3  | node15 | standby |   running | ? node13 | default  | 1        | 1        | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4  | node16 | witness | * running | node14   | default  | 0        | n/a      | host=node16 port=5432 dbname=repmgr user=repmgr password=123456


WARNING: following issues were detected
  - unable to connect to node "node13" (ID: 1)
  - unable to connect to node "node15" (ID: 3)'s upstream node "node13" (ID: 1)
  - unable to determine if node "node15" (ID: 3) is attached to its upstream node "node13" (ID: 1)


HINT: execute with --verbose option to see connection error messages


通过查看集群状态可知,主库不可用的告警已自动清除,因为备库已接管成主库,但是仅仅是接管主库服务
目前node15节点的上级仍为node13,需要运行以下命令使未跟随新主服务器的备库重新加入新的集群中


在node15上,执行以下命令:
-bash-4.2$ repmgr standby follow
NOTICE: attempting to find and follow current primary
INFO: local node 3 can attach to follow target node 2
DETAIL: local node's recovery point: 0/5014D70; follow target node's fork point: 0/5014D70
NOTICE: setting node 3's upstream to node 2
NOTICE: stopping server using "/usr/bin/pg_ctl  -D '/var/lib/pgsql/12/data' -w -m fast stop"
NOTICE: starting server using "/usr/bin/pg_ctl  -w -D '/var/lib/pgsql/12/data' start"
NOTICE: STANDBY FOLLOW successful
DETAIL: standby attached to upstream node "node14" (ID: 2)


查看集群状态:
-bash-4.2$ repmgr cluster show
ID | Name   | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                             
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1  | node13 | primary | - failed  | ?        | default  | 3        |          | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2  | node14 | primary | * running |          | default  | 2        | 2        | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3  | node15 | standby |   running | node14   | default  | 1        | 1        | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4  | node16 | witness | * running | node14   | default  | 0        | n/a      | host=node16 port=5432 dbname=repmgr user=repmgr password=123456


WARNING: following issues were detected
  - unable to connect to node "node13" (ID: 1)


HINT: execute with --verbose option to see connection error messages




恢复node13服务器的服务,需要将备库重新加入集群中,在node13上执行
启动node13的服务:
systemctl start postgresql-12
此时的集群状态如下:
-bash-4.2$ repmgr cluster show
ID | Name   | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                             
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1  | node13 | primary | ! running |          | default  | 3        | 1        | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2  | node14 | primary | * running |          | default  | 2        | 2        | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3  | node15 | standby |   running | node14   | default  | 1        | 1        | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4  | node16 | witness | * running | node14   | default  | 0        | n/a      | host=node16 port=5432 dbname=repmgr user=repmgr password=123456


WARNING: following issues were detected
  - node "node13" (ID: 1) is running but the repmgr node record is inactive


将node13重新加入集群,这时只能以备库的身份加入:
在node13服务器上:
停止数据库:
[root@node13 12]# systemctl stop postgresql-12
-bash-4.2$ repmgr node rejoin -d 'host=172.16.1.14 dbname=repmgr user=repmgr password=123456' --force-rewind
        --force-rewind:如果测试环境主库没有数据变化,所以可以不加--force-rewind这个参数


-bash-4.2$ repmgr node rejoin -d 'host=172.16.1.14 dbname=repmgr user=repmgr password=123456'
NOTICE: rejoin target is node "node14" (ID: 2)
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 0/503BA38, rejoin target lsn is 0/503BE08
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=node13 port=5432 dbname=repmgr user=repmgr password=123456"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/usr/bin/pg_ctl  -w -D '/var/lib/pgsql/12/data' start"
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2


查看集群状态:
-bash-4.2$ repmgr cluster show
ID | Name   | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                             
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1  | node13 | standby |   running | node14   | default  | 3        | 2        | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2  | node14 | primary | * running |          | default  | 2        | 2        | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3  | node15 | standby |   running | node14   | default  | 1        | 2        | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4  | node16 | witness | * running | node14   | default  | 0        | n/a      | host=node16 port=5432 dbname=repmgr user=repmgr password=123456


以上说明集群恢复正常







分割线
感谢打赏
江西数库信息技术有限公司
YWSOS.COM 平台代运维解决方案
 评论
 发表评论
姓   名:

Powered by AKCMS