Looking for something? Try here..

Thursday, August 26, 2021

RMAN restore, DB_CREATE_FILE_DEST and mix of OMF and non OMF files

 We were trying to restore a production database on to a test environment which we do as a usual refresh when application needs. We followed the same procedure which was followed for other databases for long time but this time it failed for this specific database.


The init file of the target database consists the following parameters. 

*.db_create_file_dest='+ABD_DATA_DG01'
*.db_create_online_log_dest_1='+ABD_LOG_DG01'

Once the control file is placed in the respective location and the target database is in mount state, we did a restore which resulted in following error. 

Restore command used:

run {
set until time = "to_date ('22-Aug-2021 18:00', 'dd-Mon-yyyy hh24:mi')";
restore database;
recover database;
}

Error received: 

...
...
...
using channel ORA_DISK_7
using channel ORA_DISK_8

channel ORA_SBT_TAPE_1: starting datafile backup set restore
channel ORA_SBT_TAPE_1: specifying datafile(s) to restore from backup set
channel ORA_SBT_TAPE_1: restoring datafile 00031 to +ABP_DATA_DG01/ABP/sr3_28/sr3.data28
channel ORA_SBT_TAPE_1: restoring datafile 00103 to +ABP_DATA_DG01/ABP/sr3_89/sr3.data89
channel ORA_SBT_TAPE_1: restoring datafile 00129 to +ABP_DATA_DG01/ABP/DATAFILE/psapsr3usr.390.1049882411
channel ORA_SBT_TAPE_1: restoring datafile 00143 to +ABP_DATA_DG01/ABP/DATAFILE/psapsr3.402.1074438749
...
...
...
channel ORA_SBT_TAPE_8: restoring datafile 00091 to +ABP_DATA_DG01/ABP/sr3_79/sr3.data79
channel ORA_SBT_TAPE_8: restoring datafile 00104 to +ABP_DATA_DG01/ABP/sr3usr_3/sr3usr.data3
channel ORA_SBT_TAPE_8: reading from backup piece ./server1234/ABP/20210820/set628874_piece1_4a070fpn_628874_1_1
channel ORA_SBT_TAPE_1: ORA-19870: error while restoring backup piece ./server1234/ABP/20210820/set628870_piece1_46070fpm_628870_1_1
ORA-19504: failed to create file "+ABP_DATA_DG01/ABP/sr3_28/sr3.data28"
ORA-17502: ksfdcre:3 Failed to create file +ABP_DATA_DG01/ABP/sr3_28/sr3.data28
ORA-15001: diskgroup "ABP_DATA_DG01" does not exist or is not mounted
ORA-15018: diskgroup cannot be created
channel ORA_SBT_TAPE_1: starting datafile backup set restore
channel ORA_SBT_TAPE_1: specifying datafile(s) to restore from backup set
channel ORA_SBT_TAPE_1: restoring datafile 00009 to +ABP_DATA_DG01/ABP/sr3_6/sr3.data6
channel ORA_SBT_TAPE_1: restoring datafile 00037 to +ABP_DATA_DG01/ABP/sr3_34/sr3.data34
...
...
...
ORA-15018: diskgroup cannot be created
failover to previous backup

creating datafile file number=1 name=+ABP_DATA_DG01/ABP/system_1/system.data1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 08/23/2021 21:30:39
ORA-01180: can not create datafile 1
ORA-01110: data file 1: '+ABP_DATA_DG01/ABP/system_1/system.data1'

RMAN>

As per Oracle support note "RMAN is Not Restoring OMF Datafiles in Their Original Location (Doc ID 882555.1)", the order in which RMAN restores OMF files is as below
  1. If "SET NEWNAME" is specified, RMAN will use that name for restore.
  2. If the original file exists, RMAN will use the original filename for restore.
  3. If the DB_CREATE_FILE_DEST is set, RMAN will use the disk group name specified and create another name for that file in this disk group.
  4. If no DB_CREATE_FILE_DEST is set and the original file does not exist, then RMAN will create another name for that file in the original disk group.
Let's analyze where the restore went wrong. 
  • We have not set "SET NEWNAME", so point 1 is invalid here.
  • We dont have a database and overwriting the same, so point 2 is invalid
  • We have set DB_CREATE_FILE_DEST, so point 4 is invalid. 
Point 3 says RMAN should use the disk group name specified but if you see the error details closely, RMAN is checking for production database's disk group ABP_DATA_DG01 instead of the one provided to the parameter DB_CREATE_FILE_DEST = ABD_DATA_DG01. 
To top it up, there are 2 files already restored to the proper disk group before the error is thrown out. 

So what's happening here? 

After further analysis on why 2 files got created properly as expected and not others, we can see that the files which are being thrown error are not OMF and custom names provided by user. In this case, the files are created with these names by the application's space management software.

From one of the backup set: 

+ABP_DATA_DG01/ABP/sr3_28/sr3.data28 -- custom name
+ABP_DATA_DG01/ABP/sr3_89/sr3.data89 -- custom name
+ABP_DATA_DG01/ABP/DATAFILE/psapsr3usr.390.1049882411 -- Oracle Managed File (OMF)
+ABP_DATA_DG01/ABP/DATAFILE/psapsr3.402.1074438749 -- Oracle Managed File (OMF)

Since we have a mix of OMF and custom files, RMAN restored the OMF to proper disk group but it is looking for the original disk group for the non OMF files. 

How do we resolve this? 

Since we have a mix of OMF and non OMF files, we can make use of "SET NEWNAME" (point 1) for the restore to work properly. 

Revised restore command: 

run {
set until time = "to_date ('22-Aug-2021 18:00', 'dd-Mon-yyyy hh24:mi')";
SET NEWNAME FOR DATABASE TO '+ABD_DATA_DG01';
restore database;
switch datafile all;
switch tempfile all;
recover database;
}

After using SET NEWNAME command, the restore completed without any issues. 

References: 


Happy restoring...!!!

Wednesday, August 25, 2021

Database connections and controlling them - DCD

 In my previous post, we saw how to control the user connection by setting limits in database profiles. As a continuation to the post, we will see how to control user connections by using the below method

  • Dead Connection Detection


Unlike database profile limits which is applicable to both bequeath and network mode connections, this method of session control will only be applicable to network mode of connection as sqlnet.ora file will be used only when the oracle session is established using network via listener. 

Many of us would have seen the below error in the alert log file..

***********************************************************************

Fatal NI connect error 12170.

  VERSION INFORMATION:
        TNS for Linux: Version 12.2.0.1.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 12.2.0.1.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 12.2.0.1.0 - Production
  Time: 24-AUG-2021 23:04:56
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535

TNS-12535: TNS:operation timed out
    ns secondary err code: 12560
    nt main err code: 505

TNS-00505: Operation timed out
    nt secondary err code: 110
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.181)(PORT=14500))

We will see how this error is getting generated by a simple network failure simulation.
  1. Connect to database using network from a client server to DB server
  2. Simulate a network failure.
  3. Wait for sometime. (> 10 min)
Let's assume the following: 
Client machine        : linux-8 (192.168.56.181)
Database server       : linux75-2 (192.168.56.151) running 12c database
Connect identifier    : orcl 
(DESCRIPTION = (ADDRESS_LIST = (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.56.151)(PORT = 1522))) (CONNECT_DATA = (SERVICE_NAME = orcl)))

[oracle@linux-8 ~]$ sqlplus test/test@orcl

SQL*Plus: Release 19.0.0.0.0 - Production on Sun Aug 22 19:55:59 2021
Version 19.10.0.0.0

Copyright (c) 1982, 2019, Oracle.  All rights reserved.

Last Successful login time: Sun Aug 22 2021 17:59:07 +05:30

Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production

SQL> 

Directly from within the linux-8 (client) server, I'm now simulating a network failure by shutting down the network adapter. So, now there is no network connection to communicate to the database server from my client machine. 
[oracle@linux-8 ~]$ nmcli connection down c59af835-6f37-449b-836a-47714f9fe29f
Connection 'enp0s8' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/5)
[oracle@linux-8 ~]$ 

Even after long time (> 20 min), I'm still able to see the connection inside the database (see logon_time and sysdate column) though the client server is having a network issue. (I still didn't bring up network adapter in linux-8)

SQL> !hostname
linux75-2.selvapc.com

SQL> @session_info.sql

       SID     SERIAL Username   STATUS   Machine              OS User         Status   Module          LOGON_TIME           SYSDATE
---------- ---------- ---------- -------- -------------------- --------------- -------- --------------- -------------------- --------------------
       769      43069 TEST       INACTIVE linux-8.selvapc.com  oracle          inactive SQL*Plus        23-AUG-2021 23:28:41 23-AUG-2021 23:55:29

SQL>

So in this case, even though the session is not active or the source of session is dead long back, still the connection is available inside database with status INACTIVE. These types of connections will fill up database and if the processes and/or sessions parameter is set low, they become an issue leading to ORA-00020 "maximum number of processes (%s) exceeded" errors. 

So in this case, how to control connections from becoming "zombie" processes and kill them as and when needed? 

The solution is actually called as Dead Connection Detection (DCD) 

Let's see how to implement it and what happens when we do it. 
But before that even after I shut down my linux-8 machine, the connection is still available inside the database taking 1 precious session doing nothing. In order to get rid of it, we just need to kill the session. 

SQL> @session_info.sql

       SID     SERIAL Username   STATUS   Machine              OS User         Status   Module          LOGON_TIME           SYSDATE
---------- ---------- ---------- -------- -------------------- --------------- -------- --------------- -------------------- --------------------
       769      43069 TEST       INACTIVE linux-8.selvapc.com  oracle          inactive SQL*Plus        23-AUG-2021 23:28:41 24-AUG-2021 00:07:34

SQL> alter system kill session '769,43069' immediate;

System altered.

SQL> @session_info.sql

no rows selected

SQL>

Dead Connection Detection: 

So in this section, we are going to do the following. 
  1. Set sqlnet.expire_time=2 in sqlnet.ora of database server
  2. Connect to database using network from a client server to DB server
  3. Simulate a network failure.
  4. Wait for sometime. (~ 2 min)
  5. Monitor alert log
[oracle@linux75-2 admin]$ pwd
/u01/orcl/122/network/admin
[oracle@linux75-2 admin]$ more sqlnet.ora
# sqlnet.ora Network Configuration File: /u01/orcl/122/network/admin/sqlnet.ora
# Generated by Oracle configuration tools.

NAMES.DIRECTORY_PATH= (TNSNAMES)
SQLNET.EXPIRE_TIME=2

Now, I have added the sqlnet.expire_time parameter to 2 minutes. 
I'll now connect to database, simulate network failure and provide you with the alert log entries with timestamp..

SQL> @session_info.sql

       SID     SERIAL Username   STATUS   Machine              OS User         Status   Module          LOGON_TIME           SYSDATE
---------- ---------- ---------- -------- -------------------- --------------- -------- --------------- -------------------- --------------------
       783      12666 TEST       INACTIVE linux-8.selvapc.com  oracle          inactive SQL*Plus        24-AUG-2021 23:30:54 24-AUG-2021 23:32:07

SQL> @find_spid.sql

SPID
------------------------
7199

SQL>

You can see the database connection got created at 24-AUG-2021 23:01:55

[oracle@linux-8 ~]$ date
Tue Aug 24 23:03:17 IST 2021
[oracle@linux-8 ~]$ sudo nmcli connection down c59af835-6f37-449b-836a-47714f9fe29f
[sudo] password for oracle: 
Connection 'enp0s8' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/2)
[oracle@linux-8 ~]$ 

I have simulated network interruption at 24-AUG-2021 23:03:17

After around 2 minutes, we got the operation time out error which I mentioned in the beginning of the post. 

***********************************************************************

Fatal NI connect error 12170.

  VERSION INFORMATION:
        TNS for Linux: Version 12.2.0.1.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 12.2.0.1.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 12.2.0.1.0 - Production
  Time: 24-AUG-2021 23:04:56
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535

TNS-12535: TNS:operation timed out
    ns secondary err code: 12560
    nt main err code: 505

TNS-00505: Operation timed out
    nt secondary err code: 110
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.181)(PORT=14500))

So what is happening here? How does DCD work? 

DCD works different with database version 12.1 and above when compared to lower versions. Since we now already have 21c released for on-premise deployments, we will see the mechanism of verison 12.1 and above only. 

When SQLNET.EXPIRE_TIME is set, there are 3 parameters associated with Operating system's tcp keepalive is set. Example is of Linux and the parameters may vary according to OS.
  • TCP_KEEPALIVE_TIME (the amount of time until the first keepalive packet is sent)
    • /proc/sys/net/ipv4/tcp_keepalive_time
  • TCP_KEEPCNT(the number of probes to send)
    • /proc/sys/net/ipv4/tcp_keepalive_probes
  • TCP_KEEPINTVL (the interval between keepalive packets)
    • /proc/sys/net/ipv4/tcp_keepalive_intvl
The default values of these parameters are 7200, 9 and 75 which means by default the TCP keepalive will probe 9 times with interval of 75 seconds after 7200 seconds. 

Now, when we set the sqlnet.expire_time parameter to 2 minutes, the TCP_KEEPALIVE_TIME will be set to 120 seconds with interval and probes adjusted accordingly with a lesser value. 
If we trace the oracle process connected with the help of strace utility, we can see there a poll() call every 120 seconds as below...

[oracle@linux75-2 ~]$ sudo strace -p 7199
strace: Process 7199 attached
read(17, 0x7fc9362c8f1e, 8208)          = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigprocmask(SIG_BLOCK, [], NULL, 8)  = 0
poll([{fd=17, events=POLLIN|POLLRDNORM}], 1, 0) = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=120, tv_usec=0}}, NULL) = 0
rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=119, tv_usec=990000}}, NULL) = 0
rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=120, tv_usec=20000}}, NULL) = 0
rt_sigprocmask(SIG_UNBLOCK, [], NULL, 8) = 0
rt_sigreturn({mask=[]})                 = 0
read(17, 0x7fc9362c8f1e, 8208)          = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigprocmask(SIG_BLOCK, [], NULL, 8)  = 0
poll([{fd=17, events=POLLIN|POLLRDNORM}], 1, 0) = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=120, tv_usec=0}}, NULL) = 0
rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=120, tv_usec=0}}, NULL) = 0
rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
...
...
...

Once the time out has occurred, the connection will be kicked out from the database freeing up resources. 

SQL> @session_info.sql

no rows selected

SQL>

Hope this post helps understanding Dead Connection Detection. 

Queries used: 

session_info.sql

col Username for a10
col Machine for a20
col Module for a15
col "OS User" for a15
SELECT
    sid,
    serial#           serial,
    username          "Username",
    status,
    machine           "Machine",
    osuser            "OS User",
    lower(status)     "Status",
    module            "Module",
    logon_time,
    sysdate
FROM
    v$session
WHERE
    username = 'TEST'
ORDER BY 1, 2, 3;

find_spid.sql

select SPID from v$process where ADDR in
(select PADDR from v$session where username='TEST');


References: 


Happy controlling...!!!