We use a simple concept to check for any standby gap or logs apply problems in all the high availability (Dataguard) servers in our environment. Use the below query in a script and make it run from cron periodically so we don’t have to go and check manually all the time.
spool /home/oracle/dba_scripts/logs/standby_db_sync_err.txt select PRIM-SEC from (select MAX(SEQUENCE#) PRIM FROM V$ARCHIVED_LOG WHERE DEST_ID=1 AND ARCHIVED='YES'), (select MAX(SEQUENCE#) SEC FROM V$ARCHIVED_LOG WHERE DEST_ID=2 AND APPLIED='YES'); spool off
Today I received log applied gaps error alert mail stating we have a log applied gap of 24 logs. My script runs every hour and sends the alert mail only when the log applied gap crosses 10 logs since I have delay parameter set in the database.
Note: I used the term “log applied gaps” as I’m meaning that the term is used to differentiate the logs applied on primary with logs applied on standby database.
I first thought that there might be some activity going on in the database and the gap will catch up soon as this happens most of the time, but now the gaps increased to 35, the next hour. So something is wrong , and I should check it.
I used the following query on the standby database which gave the below.
select sequence#, process, status from v$managed_standby; SEQUENCE# PROCESS STATUS ---------- --------- ------------ 0 ARCH CONNECTED 0 ARCH CONNECTED 0 ARCH CONNECTED 0 ARCH CONNECTED 0 ARCH CONNECTED 65999 MRP0 WAIT_FOR_LOG 66007 RFS IDLE 66008 RFS IDLE 66009 RFS IDLE
This is obvious, RFS gets the logs as and when generated at primary and MRP waits until the “delay” time reaches before it applies. The gap is not much but only 10 logs. Then I thought why the script gives wrong information?
I ran the script manually to check and still receive the same alert mail of log applied gap 35. So I ran the query separately in primary database to check/troubleshoot what is going on. I got the below response from the primary database.
select MAX(SEQUENCE#) PRIM FROM V$ARCHIVED_LOG WHERE DEST_ID=1 AND ARCHIVED='YES'; PRIM ---------- 66010 select MAX(SEQUENCE#) SEC FROM V$ARCHIVED_LOG WHERE DEST_ID=2 AND APPLIED='YES'; SEC ---------- 65975
Ok, I just wait for some time and check the same again but to see the same result. (Many a times, in standby environment I get the results delayed. J)
Now, it’s time to dig Oracle support documents to find what is really going on. I thought it might be some sort of bug as I haven’t faced such an issue before. As my prediction it turns to be a bug.
The MOS article “APPLIED - Column not updated if Heartbeat-ARCH hangs [ID 1369630.1]” states the issue we are dealing with is due to the bug “Bug 6113783 - Arch processes can hang indefinitely on network [ID 6113783.8]”.
The bug is fixed in 11.2.0.1 base release and I’m currently running my database in 10.2.0.4.
There are 2 options provided.
- Upgrade DB to 11.2.0.1 or higher
- Use the work around of killing the hung ARCH process in the primary database.
So how to find the hung ARCH process?
MOS says Look for Messages like
"ARCn: Becoming the heartbeat ARCH"
in the ALERT.LOG of the Primary Database to determine the current Heartbeat ARCn-Process.
So I searched my alert log file to find the below.
Fri Aug 17 05:04:08 2012 ARC2: Becoming the heartbeat ARCH ARC3 started with pid=103, OS id=28106 Fri Aug 17 05:04:08 2012
Check for the OS process id for ARC2
[oracle@****** bdump]$ ps -ef|grep ora_arc oracle 32224 1 0 2012 ? 02:43:53 ora_arc0_edisipd oracle 32226 1 0 2012 ? 02:21:49 ora_arc1_edisipd oracle 32232 1 0 2012 ? 02:41:07 ora_arc2_edisipd oracle 32234 1 0 2012 ? 02:26:41 ora_arc3_edisipd oracle 23429 16642 0 04:51 pts/1 00:00:00 grep ora_arc
Kill the Hung ARCH process
[oracle@****** bdump]$ kill -9 32232
Now check the alert log file again to find the below.
ARCH: Detected ARCH process failure ARCH: STARTING ARCH PROCESSES ARC2: Archival started ARCH: STARTING ARCH PROCESSES COMPLETE ARC2 started with pid=67, OS id=23425 ARC2: Becoming the heartbeat ARCH Wed Jan 16 04:53:47 2013
Ok, now the new process has been generated. Hence check with the primary database whether the APPLIED column is updated.
select MAX(SEQUENCE#) SEC FROM V$ARCHIVED_LOG WHERE DEST_ID=2 AND APPLIED='YES'; SEC ---------- 66003Now it has been updated successfully. J
It was a great article, helped me in resolving the most frequently occured issue.
ReplyDelete~Sandy
That is good to hear. Thank you!!
DeleteThis is excellent post. Appreciated for sharing such real time issue and solution
ReplyDeleteThank you!!
Delete