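In a Streaming Replication environment where the PostgreSQL primary is configured for synchronous replication, a backend on the primary hangs when it executes DML (or any other statement that has to commit, such as the DROP TABLE used below) while the standby is not running or cannot reach the primary because of a network problem. The rest of this article traces that hang.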
Latch
The Latch struct should be treated as opaque and accessed only through the public functions. It is defined in the header so that Latches can be embedded in larger structs.
// On most platforms a variable of type int is accessed atomically, so sig_atomic_t
// can be regarded as an int. Because access to such a variable must complete in a
// single instruction, sig_atomic_t cannot be a struct; it is always an integer type.
typedef int __sig_atomic_t;

/*
 * Latch structure should be treated as opaque and only accessed through
 * the public functions. It is defined here to allow embedding Latches as
 * part of bigger structs.
 */
typedef struct Latch
{
    sig_atomic_t is_set;
    bool        is_shared;
    int         owner_pid;
#ifdef WIN32
    HANDLE      event;
#endif
} Latch;
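To make the "opaque but embeddable" idea concrete, here is a minimal sketch in C. It is not PostgreSQL code: `DemoLatch`, `DemoProc` and the `demo_*` functions are hypothetical names invented for illustration, and the real Latch functions do considerably more (signalling the owner, memory barriers).

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-in for the non-Windows fields of Latch. */
typedef struct DemoLatch
{
    volatile sig_atomic_t is_set;   /* nonzero once the latch has been set */
    bool        is_shared;          /* true if other processes may set it */
    int         owner_pid;          /* the only process allowed to wait on it */
} DemoLatch;

/*
 * A larger struct that embeds the latch by value. This is why the layout
 * must be visible to callers even though they are not supposed to touch
 * the fields directly.
 */
typedef struct DemoProc
{
    int         backend_id;
    DemoLatch   latch;
} DemoProc;

/* Callers go through these accessors instead of reading the fields. */
static void demo_init_latch(DemoLatch *latch, int owner_pid, bool shared)
{
    latch->is_set = 0;
    latch->is_shared = shared;
    latch->owner_pid = owner_pid;
}

static void demo_set_latch(DemoLatch *latch)
{
    latch->is_set = 1;
}

static void demo_reset_latch(DemoLatch *latch)
{
    assert(latch->owner_pid != 0);  /* only an owned latch can be reset */
    latch->is_set = 0;
}

static bool demo_latch_is_set(const DemoLatch *latch)
{
    return latch->is_set != 0;
}
```

A caller would declare a `DemoProc`, initialise its latch with `demo_init_latch(&proc.latch, pid, true)`, and from then on use only the accessor functions.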
Start the master node without starting the standby node, connect to the database with psql, and execute a SQL statement. The session hangs:
testdb=# drop table t1;
Attach gdb to the hung process
[xdb@localhost ~]$ ps -ef|grep postgres
xdb       1318     1  0 12:14 pts/0    00:00:00 /appdb/xdb/pg11.2/bin/postgres
xdb       1319  1318  0 12:14 ?        00:00:00 postgres: logger
xdb       1321  1318  0 12:14 ?        00:00:00 postgres: checkpointer
xdb       1322  1318  0 12:14 ?        00:00:00 postgres: background writer
xdb       1323  1318  0 12:14 ?        00:00:00 postgres: walwriter
xdb       1324  1318  0 12:14 ?        00:00:00 postgres: autovacuum launcher
xdb       1325  1318  0 12:14 ?        00:00:00 postgres: archiver
xdb       1326  1318  0 12:14 ?        00:00:00 postgres: stats collector
xdb       1327  1318  0 12:14 ?        00:00:00 postgres: logical replication launcher
xdb       1331  1318  0 12:15 ?        00:00:00 postgres: xdb testdb [local] DROP TABLE waiting for 0/5D07B668
[xdb@localhost ~]$ gdb -p 1331
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7
...
View the call stack
(gdb) bt
#0  0x00007f4636d48903 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x000000000088e668 in WaitEventSetWaitBlock (set=0x21640e8, cur_timeout=-1, occurred_events=0x7ffc96572f40, nevents=1)
    at latch.c:1048
#2  0x000000000088e543 in WaitEventSetWait (set=0x21640e8, timeout=-1, occurred_events=0x7ffc96572f40, nevents=1,
    wait_event_info=134217761) at latch.c:1000
#3  0x000000000088dcec in WaitLatchOrSocket (latch=0x7f462d5b44d4, wakeEvents=17, sock=-1, timeout=-1,
    wait_event_info=134217761) at latch.c:385
#4  0x000000000088dbcd in WaitLatch (latch=0x7f462d5b44d4, wakeEvents=17, timeout=-1, wait_event_info=134217761)
    at latch.c:339
#5  0x0000000000863e2d in SyncRepWaitForLSN (lsn=1560786536, commit=true) at syncrep.c:286
#6  0x0000000000546279 in RecordTransactionCommit () at xact.c:1359
#7  0x0000000000546da3 in CommitTransaction () at xact.c:2074
#8  0x0000000000547a3f in CommitTransactionCommand () at xact.c:2817
#9  0x00000000008be250 in finish_xact_command () at postgres.c:2523
#10 0x00000000008bbf45 in exec_simple_query (query_string=0x20a1d78 "drop table t1;") at postgres.c:1170
#11 0x00000000008c0191 in PostgresMain (argc=1, argv=0x20cdcd8, dbname=0x20cdb40 "testdb", username=0x209ea98 "xdb")
    at postgres.c:4182
#12 0x000000000081e06c in BackendRun (port=0x20c3b10) at postmaster.c:4361
#13 0x000000000081d7df in BackendStartup (port=0x20c3b10) at postmaster.c:4033
#14 0x0000000000819bd9 in ServerLoop () at postmaster.c:1706
#15 0x000000000081948f in PostmasterMain (argc=1, argv=0x209ca50) at postmaster.c:1379
#16 0x0000000000742931 in main (argc=1, argv=0x209ca50) at main.c:228
(gdb)
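The integer arguments in the backtrace are easier to read once decoded. The sketch below is an illustration, not PostgreSQL code: the `WL_*` bit values and `PG_WAIT_IPC` are copied from the PostgreSQL 11 headers as I recall them, and the `demo_*` helpers are hypothetical. Under those assumptions, wakeEvents=17 is `WL_LATCH_SET | WL_POSTMASTER_DEATH`, wait_event_info=134217761 falls in the IPC wait class with event index 33, and lsn=1560786536 is the same `0/5D07B668` that appears in the process title.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Wake-event bits (assumed to match PostgreSQL 11's latch.h). */
#define WL_LATCH_SET         (1 << 0)
#define WL_SOCKET_READABLE   (1 << 1)
#define WL_SOCKET_WRITEABLE  (1 << 2)
#define WL_TIMEOUT           (1 << 3)
#define WL_POSTMASTER_DEATH  (1 << 4)

/* Wait class for IPC waits (assumed to match PostgreSQL 11's pgstat.h). */
#define PG_WAIT_IPC          0x08000000U

/* The wait class lives in the top byte of wait_event_info. */
static uint32_t demo_wait_class(uint32_t wait_event_info)
{
    return wait_event_info & 0xFF000000U;
}

/* The event index within the class lives in the low 16 bits. */
static uint32_t demo_wait_event_id(uint32_t wait_event_info)
{
    return wait_event_info & 0x0000FFFFU;
}

/* Format an LSN the way PostgreSQL prints it: high half / low half, in hex. */
static void demo_format_lsn(uint64_t lsn, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%X/%X",
             (unsigned int) (lsn >> 32), (unsigned int) lsn);
}
```

Calling `demo_format_lsn(1560786536ULL, buf, sizeof buf)` fills `buf` with `0/5D07B668`, the value the backend reports it is waiting for.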
Kill the process, reconnect, and set a breakpoint on WaitLatch to trace the wait
#########
[xdb@localhost ~]$ kill -9 1331
#########
testdb=# select pg_backend_pid();
 pg_backend_pid
----------------
           1377
(1 row)
#########
[xdb@localhost ~]$ gdb -p 1377
...
(gdb) b WaitLatch
Breakpoint 1 at 0x88dbac: file latch.c, line 339.
(gdb)
#########
testdb=# drop table t1;
ERROR:  table "t1" does not exist
testdb=# create table t1(id int);
Hit the breakpoint
(gdb) b WaitLatch
Breakpoint 1 at 0x88dbac: file latch.c, line 339.
(gdb) c
Continuing.

Breakpoint 1, WaitLatch (latch=0x7f462d5b44d4, wakeEvents=17, timeout=-1, wait_event_info=134217761) at latch.c:339
339         return WaitLatchOrSocket(latch, wakeEvents, PGINVALID_SOCKET, timeout,
(gdb)
Step into WaitLatchOrSocket
(gdb) step
WaitLatchOrSocket (latch=0x7f462d5b44d4, wakeEvents=17, sock=-1, timeout=-1, wait_event_info=134217761) at latch.c:359
359         int         ret = 0;
(gdb)
(gdb) p *latch
$1 = {is_set = 0, is_shared = true, owner_pid = 1377}
Build the wait event set
(gdb) n
362         WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
(gdb) n
364         if (wakeEvents & WL_TIMEOUT)
(gdb)
367             timeout = -1;
(gdb)
369         if (wakeEvents & WL_LATCH_SET)
(gdb) p *set
$2 = {nevents = 0, nevents_space = 3, events = 0x2181eb8, latch = 0x0, latch_pos = 0, epoll_fd = 37, epoll_ret_events = 0x2181f00}
(gdb) p *set->events
$3 = {pos = 0, events = 0, fd = 0, user_data = 0x0}
(gdb) p *set->epoll_ret_events
$4 = {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}}
(gdb)
$5 = {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}}
(gdb) n
370             AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET,
(gdb)
373         if (wakeEvents & WL_POSTMASTER_DEATH && IsUnderPostmaster)
(gdb)
374             AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET,
(gdb)
377         if (wakeEvents & WL_SOCKET_MASK)
(gdb)
385         rc = WaitEventSetWait(set, timeout, &event, 1, wait_event_info);
(gdb) p *set
$6 = {nevents = 2, nevents_space = 3, events = 0x2181eb8, latch = 0x7f462d5b44d4, latch_pos = 0, epoll_fd = 37, epoll_ret_events = 0x2181f00}
(gdb) p *set->events
$7 = {pos = 0, events = 1, fd = 11, user_data = 0x0}
(gdb) p *set->epoll_ret_events
$8 = {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}}
(gdb)
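The branch sequence in this trace (lines 364 to 377 of latch.c) can be modelled in a few lines of C. This is a simplified sketch rather than the real implementation: `DemoWaitEvent`, `DemoWaitEventSet` and the `demo_*` functions are hypothetical, the `WL_*` values are assumed to match PostgreSQL 11, and no epoll descriptor is involved. It shows why wakeEvents=17 produces a set with two entries and why the timeout is forced to -1 when `WL_TIMEOUT` is absent.

```c
#include <assert.h>
#include <stddef.h>

/* Wake-event bits (assumed to match PostgreSQL 11's latch.h). */
#define WL_LATCH_SET         (1 << 0)
#define WL_TIMEOUT           (1 << 3)
#define WL_POSTMASTER_DEATH  (1 << 4)

#define DEMO_MAX_EVENTS 3           /* same capacity the trace passes in */

typedef struct DemoWaitEvent
{
    int         pos;                /* position in the set */
    int         events;             /* WL_* bit this entry waits for */
    int         fd;                 /* descriptor that signals the event */
} DemoWaitEvent;

typedef struct DemoWaitEventSet
{
    int           nevents;
    int           nevents_space;
    long          timeout;          /* -1 means wait forever */
    DemoWaitEvent events[DEMO_MAX_EVENTS];
} DemoWaitEventSet;

/* Append one entry and return its position. */
static int demo_add_event(DemoWaitEventSet *set, int events, int fd)
{
    DemoWaitEvent *ev;

    assert(set->nevents < set->nevents_space);
    ev = &set->events[set->nevents];
    ev->pos = set->nevents;
    ev->events = events;
    ev->fd = fd;
    return set->nevents++;
}

/* Mirror the flag checks seen in the trace. */
static void demo_build_set(DemoWaitEventSet *set, int wakeEvents, long timeout,
                           int latch_fd, int postmaster_fd)
{
    set->nevents = 0;
    set->nevents_space = DEMO_MAX_EVENTS;

    /* Without WL_TIMEOUT the caller's timeout is ignored. */
    set->timeout = (wakeEvents & WL_TIMEOUT) ? timeout : -1;

    if (wakeEvents & WL_LATCH_SET)
        demo_add_event(set, WL_LATCH_SET, latch_fd);
    if (wakeEvents & WL_POSTMASTER_DEATH)
        demo_add_event(set, WL_POSTMASTER_DEATH, postmaster_fd);
}
```

With `wakeEvents = 17` the model ends up with `nevents = 2` and a first entry of `{pos = 0, events = 1, fd = 11}` when 11 is passed as `latch_fd`, matching `$6` and `$7` above. In PostgreSQL 11 that descriptor is, as far as I know, the read end of the backend's self-pipe. The `postmaster_fd` value is not shown in the trace, so any value used for it is a placeholder.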
Step into WaitEventSetWait
(gdb) step
WaitEventSetWait (set=0x2181e90, timeout=-1, occurred_events=0x7ffc96572f40, nevents=1, wait_event_info=134217761)
    at latch.c:925
925         int         returned_events = 0;
(gdb)
Input parameters
(gdb) n
928         long        cur_timeout = -1;
(gdb) p *set
$9 = {nevents = 2, nevents_space = 3, events = 0x2181eb8, latch = 0x7f462d5b44d4, latch_pos = 0, epoll_fd = 37, epoll_ret_events = 0x2181f00}
(gdb) p *occurred_events
$10 = {pos = 35135068, events = 0, fd = -1772664741, user_data = 0x7ffc96572fa0}
(gdb)
Run the sanity checks and set up the wait state
(gdb) n
930         Assert(nevents > 0);
(gdb)
936         if (timeout >= 0)
(gdb)
943         pgstat_report_wait_start(wait_event_info);
(gdb)
946         waiting = true;
(gdb)
No event has occurred yet, so enter the loop
951         while (returned_events == 0)
(gdb)
set->latch->is_set is not true, so the check fails and the loop carries on
982             if (set->latch && set->latch->is_set)
(gdb) p *set->latch
$11 = {is_set = 0, is_shared = true, owner_pid = 1377}
(gdb)
Step into WaitEventSetWaitBlock
(gdb) n
1000            rc = WaitEventSetWaitBlock(set, cur_timeout,
(gdb) step
WaitEventSetWaitBlock (set=0x2181e90, cur_timeout=-1, occurred_events=0x7ffc96572f40, nevents=1) at latch.c:1042
1042        int         returned_events = 0;
(gdb)
epoll_wait is called and the process blocks
(gdb) n
1048        rc = epoll_wait(set->epoll_fd, set->epoll_ret_events,
(gdb) p *set
$12 = {nevents = 2, nevents_space = 3, events = 0x2181eb8, latch = 0x7f462d5b44d4, latch_pos = 0, epoll_fd = 37, epoll_ret_events = 0x2181f00}
(gdb)
(gdb) n
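The reason the backend never returns here is the timeout of -1: `epoll_wait` with a negative timeout blocks until one of the registered descriptors becomes ready or a signal interrupts it. The Linux-only sketch below shows that behaviour on an ordinary pipe. `demo_wait_readable` is a hypothetical helper written for this article, not a PostgreSQL function.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Wait until fd is readable.
 * Returns the number of ready descriptors (0 means the timeout expired),
 * or -1 on error. A timeout_ms of -1 waits indefinitely.
 */
static int demo_wait_readable(int fd, int timeout_ms)
{
    struct epoll_event ev = {0};
    struct epoll_event ready;
    int         epfd;
    int         rc;

    epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0)
        return -1;

    /* Register interest in "fd has data to read". */
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
    {
        close(epfd);
        return -1;
    }

    /* Blocks here until fd is readable or the timeout expires. */
    rc = epoll_wait(epfd, &ready, 1, timeout_ms);

    close(epfd);
    return rc;
}
```

On an empty pipe, `demo_wait_readable(read_end, 50)` returns 0 after about 50 ms. With -1 it would block until someone writes to the pipe, which is the position the backend is in while no standby acknowledges the commit.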
Start the standby node
####
[xdb@localhost ~]$ pg_ctl start
pg_ctl: another server might be running; trying to start server anyway
...
The backend receives a signal
Program received signal SIGUSR1, User defined signal 1.
0x00007f4636d48903 in __epoll_wait_nocancel () from /lib64/libc.so.6
(gdb)
(gdb) n
Single stepping until exit from function __epoll_wait_nocancel,
which has no line number information.
procsignal_sigusr1_handler (postgres_signal_arg=-1) at procsignal.c:262
262     {
(gdb)
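What most likely happens next is the classic self-pipe wakeup. Once the standby connects and confirms the WAL, another process sets the waiting backend's latch and sends it SIGUSR1; the signal handler writes a byte to a pipe that the epoll set is watching, so the wait loop wakes up and sees the latch set. The Linux-only sketch below shows the pattern in a single process. All `demo_*` names are hypothetical, and the handler here sets the flag itself, which simplifies how the shared-memory latch is really set.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/epoll.h>
#include <unistd.h>

static volatile sig_atomic_t demo_latch_set = 0;
static int  demo_selfpipe[2] = {-1, -1};

/* Signal handler: mark the latch set and poke the self-pipe. */
static void demo_sigusr1_handler(int signo)
{
    int         save_errno = errno;
    char        dummy = 0;
    ssize_t     rc;

    (void) signo;
    demo_latch_set = 1;
    rc = write(demo_selfpipe[1], &dummy, 1);    /* a full pipe is harmless */
    (void) rc;
    errno = save_errno;
}

/* Create the non-blocking self-pipe and install the handler. */
static int demo_latch_init(void)
{
    struct sigaction sa;

    if (pipe(demo_selfpipe) < 0)
        return -1;
    if (fcntl(demo_selfpipe[0], F_SETFL, O_NONBLOCK) < 0 ||
        fcntl(demo_selfpipe[1], F_SETFL, O_NONBLOCK) < 0)
        return -1;
    sa.sa_handler = demo_sigusr1_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGUSR1, &sa, NULL);
}

static void demo_reset_latch(void)
{
    demo_latch_set = 0;
}

/* Returns 1 if the latch is set, 0 on timeout, -1 on error. */
static int demo_wait_latch(int timeout_ms)
{
    struct epoll_event ev = {0};
    struct epoll_event ready;
    char        buf[16];
    int         result;
    int         epfd = epoll_create1(EPOLL_CLOEXEC);

    if (epfd < 0)
        return -1;
    ev.events = EPOLLIN;
    ev.data.fd = demo_selfpipe[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, demo_selfpipe[0], &ev) < 0)
    {
        close(epfd);
        return -1;
    }

    for (;;)
    {
        int         rc;

        /* Check the flag before blocking, as the trace does at line 982. */
        if (demo_latch_set)
        {
            result = 1;
            break;
        }
        rc = epoll_wait(epfd, &ready, 1, timeout_ms);
        if (rc < 0 && errno == EINTR)
            continue;           /* interrupted by the signal: re-check flag */
        if (rc <= 0)
        {
            result = rc;        /* 0 = timeout, -1 = error */
            break;
        }
        /* Drain the wakeup bytes, then loop to re-check the flag. */
        while (read(demo_selfpipe[0], buf, sizeof buf) > 0)
            ;
    }
    close(epfd);
    return result;
}
```

After `demo_latch_init()`, a call such as `kill(getpid(), SIGUSR1)` followed by `demo_wait_latch(-1)` returns 1 instead of blocking. In the traced session the signal comes from another PostgreSQL process, and the backend's commit can complete once the latch is seen to be set and the standby has acknowledged the LSN.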