某天,應(yīng)用程序突然掛了,程序中報(bào)錯(cuò)連不上數(shù)據(jù)庫。登錄數(shù)據(jù)庫主機(jī)后,發(fā)現(xiàn)sqlplus也登錄不了,登錄時(shí),sqlplus長時(shí)間沒有響應(yīng),檢查alertlog只是發(fā)現(xiàn)半小時(shí)前檢測(cè)到一個(gè)deadlock,沒有其他的報(bào)錯(cuò)信息。心想oracle檢測(cè)到deadlock能自動(dòng)解鎖,因此一開始沒懷疑到是因?yàn)閐eadlock的原因,由于登錄不上數(shù)據(jù)庫,為了盡快的回復(fù)業(yè)務(wù),停掉應(yīng)用程序,殺掉ps -ef |grep LOCAL=N的進(jìn)程后,再次能登錄數(shù)據(jù)庫,重啟了數(shù)據(jù)庫,再次啟動(dòng)應(yīng)用,回復(fù)正常。
追溯原因時(shí),一開始是懷疑系統(tǒng)的內(nèi)核參數(shù)設(shè)置不正確,導(dǎo)致主機(jī)資源耗盡,因此無法sqlplus登錄。但是檢查了一遍主機(jī)的內(nèi)核參數(shù),發(fā)現(xiàn)均正常。主機(jī)的syslog同樣也沒發(fā)現(xiàn)異常。奇怪,為什么會(huì)出現(xiàn)掛死的現(xiàn)象?請(qǐng)教了老網(wǎng)蟲白鱔后,才明白了某些版本的oracle檢查到deadlock后會(huì)觸發(fā)一個(gè)關(guān)于deadlock的bug,導(dǎo)致enqueue hash chain的parent latch被hold住,然后在做SYSTEM DUMP的時(shí)候可能產(chǎn)生LATCH死鎖,這個(gè)時(shí)候可能hang住,時(shí)間可能是在幾秒鐘,也有可能是幾小時(shí)。
ARC0: Evaluating archive log 3 thread 1 sequence 36918
ARC0: Beginning to archive log 3 thread 1 sequence 36918
Creating archive destination LOG_ARCHIVE_DEST_1: '/arch/arch_1_36918.arc'
Thu Jan 31 15:59:13 2008
ARC0: Completed archiving log 3 thread 1 sequence 36918
Thu Jan 31 16:05:30 2008
Thread 1 advanced to log sequence 36920
Current log# 1 seq# 36920 mem# 0: /dev/vg_ora1/rredo_256m_01
Current log# 1 seq# 36920 mem# 1: /dev/vg_ora2/rredo_256m_11
Current log# 1 seq# 36920 mem# 2: /dev/vg_ora3/rredo_256m_21
Thu Jan 31 16:05:30 2008
ARC1: Evaluating archive log 4 thread 1 sequence 36919
ARC1: Beginning to archive log 4 thread 1 sequence 36919
Creating archive destination LOG_ARCHIVE_DEST_1: '/arch/arch_1_36919.arc'
Thu Jan 31 16:05:32 2008
ORA-000060: Deadlock detected. More info in file /oracle/app/oracle/admin/sd168/udump/sd168_ora_14652.trc.---- 注意此處
Thu Jan 31 17:00:11 2008
ARC1: Completed archiving log 4 thread 1 sequence 36919
Thu Jan 31 17:00:14 2008
SMON: Parallel transaction recovery tried
Thu Jan 31 17:02:34 2008
Thread 1 advanced to log sequence 36921
Current log# 2 seq# 36921 mem# 0: /dev/vg_ora1/rredo_256m_02
Current log# 2 seq# 36921 mem# 1: /dev/vg_ora2/rredo_256m_12
Current log# 2 seq# 36921 mem# 2: /dev/vg_ora3/rredo_256m_22
Thu Jan 31 17:02:34 2008
ARC0: Evaluating archive log 1 thread 1 sequence 36920
ARC0: Beginning to archive log 1 thread 1 sequence 36920
Creating archive destination LOG_ARCHIVE_DEST_1: '/arch/arch_1_36920.arc'
ARC0: Completed archiving log 1 thread 1 sequence 36920
……
進(jìn)一步檢查這個(gè)deadlock的trace文件:
Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.6.0 - Production
ORACLE_HOME = /oracle/app/oracle/product/9.2.0
System name: HP-UX
Node name: sd_db01
Release: B.11.11
Version: U
Machine: 9000/800
Instance name: sd168
Redo thread mounted by this instance: 1
Oracle process number: 590
Unix process pid: 14652, image: oracle@sd_db01 (TNS V1-V3)
*** 2008-01-31 16:05:32.480
*** SESSION ID:(465.37272) 2008-01-31 16:05:32.403
DEADLOCK DETECTED
Current SQL statement for this session:
delete eips_corporation corp where corp.corpid=:1
The following deadlock is not an ORACLE error. It is a
deadlock due to user error in the design of an application
or from issuing incorrect ad-hoc SQL. The following
information may aid in determining the deadlock:
Deadlock graph:
---------Blocker(s)-------- ---------Waiter(s)---------
Resource Name process session holds waits process session holds waits
TM-0000872c-00000000 590 465 SX SSX 581 281 SX SSX -------注意此處
TM-0000872c-00000000 581 281 SX SSX 590 465 SX SSX
session 465: DID 0001-024E-0000003F session 281: DID 0001-0245-0000002E
session 281: DID 0001-0245-0000002E session 465: DID 0001-024E-0000003F
Rows waited on:
Session 281: obj - rowid = 00008762 - AAAIdiACGAAAAAAAAA
(dictionary objn - 34658, file - 134, block - 0, slot - 0)
Session 465: obj - rowid = 00008762 - AAAIdiACGAAAAAAAAA
(dictionary objn - 34658, file - 134, block - 0, slot - 0)
Information on the OTHER waiting sessions:
Session 281:
pid=581 serial=53477 audsid=7366710 user: 54/EICP
O/S info: user: eicp, term: unknown, ospid: , machine: sd-oam01
program: JDBC Thin Client
application name: JDBC Thin Client, hash value=0
Current SQL Statement:
delete eips_corporation corp where corp.corpid=:1
End of information on OTHER waiting sessions.
===================================================
PROCESS STATE
-------------
Process global information:
process: c0000001991a1720, call: c0000001a57bcfd8, xact: c00000019bfbc5b0, curses: c0000001993bfd88, usrses: c0000001993bfd88
----------------------------------------
SO: c0000001991a1720, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=590, calls cur/top: c0000001a57bcfd8/c0000001a57bcfd8, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 197 0 4
last post received-location: kslpsr
last process to post me: c0000001990e5be0 1 6
last post sent: 0 0 16
last post sent-location: ksasnd
last process posted by me: c0000001990e5be0 1 6
(latch info) wait_event=0 bits=10
holding c0000001c1a4d458 Parent+children enqueue hash chains level=4 -------注意此處
Location from where latch is held: ksqcmi: kslgpl:
Context saved from call: 0
state=busy
recovery area:
Dump of memory from 0xC0000001990E4628 to 0xC0000001990E4638
……
檢查addr為c0000001c1a4d458的latch:
NAME
-------------------------------------------------------------- --
enqueue hash chains
確實(shí)我們看到了enqueue hash chain。enqueue hash chain是用于保護(hù)每一個(gè)library cache中的hash bucket。
(這里借用一下eygle的圖說明一下library cache的結(jié)構(gòu))
由于parent latch被HOLD了,所有對(duì)數(shù)據(jù)庫library cache加鎖的操作都無法進(jìn)行,而sqlplus登錄的時(shí)候要申請(qǐng)AUDSES$的SEQUENCE,因此sqlplus無法登錄,同理所有的應(yīng)用程序也無法登錄——應(yīng)用程序掛死。
在metalink上也確認(rèn)了這個(gè)bug:2530125
可是,在第二天下午,再次出現(xiàn)的由于deadlock導(dǎo)致數(shù)據(jù)庫掛死的問題,檢查alertlog發(fā)現(xiàn)在很早之前已經(jīng)有deadlock的現(xiàn)象,為何系統(tǒng)會(huì)頻繁的出現(xiàn)deadlock?
根據(jù)deadlock的trace文件,根據(jù)deadlock graph的wait的模式是SSX,這就很有可能是外鍵沒加索引造成。檢查相關(guān)語句中的外鍵,靠!確實(shí)是外鍵沒加索引。
對(duì)外鍵增加索引后,系統(tǒng)正常運(yùn)行一周,不再發(fā)生死鎖,也不再發(fā)生掛死現(xiàn)象。至此問題解決。回顧問題:由于外鍵沒加索引–引起死鎖–觸發(fā)bug2530125–enqueue hash latch–無法登錄–程序掛死。
from:http://www.oracleblog.cn/working-case/deadlock-cause-db-hang/
更多文章、技術(shù)交流、商務(wù)合作、聯(lián)系博主
微信掃碼或搜索:z360901061

微信掃一掃加我為好友
QQ號(hào)聯(lián)系: 360901061
您的支持是博主寫作最大的動(dòng)力,如果您喜歡我的文章,感覺我的文章對(duì)您有幫助,請(qǐng)用微信掃描下面二維碼支持博主2元、5元、10元、20元等您想捐的金額吧,狠狠點(diǎn)擊下面給點(diǎn)支持吧,站長非常感激您!手機(jī)微信長按不能支付解決辦法:請(qǐng)將微信支付二維碼保存到相冊(cè),切換到微信,然后點(diǎn)擊微信右上角掃一掃功能,選擇支付二維碼完成支付。
【本文對(duì)您有幫助就好】元
