Looking for something? Try here..

Wednesday, August 25, 2021

Database connections and controlling them - DCD

 In my previous post, we saw how to control the user connection by setting limits in database profiles. As a continuation to the post, we will see how to control user connections by using the below method

  • Dead Connection Detection


Unlike database profile limits which is applicable to both bequeath and network mode connections, this method of session control will only be applicable to network mode of connection as sqlnet.ora file will be used only when the oracle session is established using network via listener. 

Many of us would have seen the below error in the alert log file..

***********************************************************************

Fatal NI connect error 12170.

  VERSION INFORMATION:
        TNS for Linux: Version 12.2.0.1.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 12.2.0.1.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 12.2.0.1.0 - Production
  Time: 24-AUG-2021 23:04:56
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535

TNS-12535: TNS:operation timed out
    ns secondary err code: 12560
    nt main err code: 505

TNS-00505: Operation timed out
    nt secondary err code: 110
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.181)(PORT=14500))

We will see how this error is getting generated by a simple network failure simulation.
  1. Connect to database using network from a client server to DB server
  2. Simulate a network failure.
  3. Wait for sometime. (> 10 min)
Let's assume the following: 
Client machine        : linux-8 (192.168.56.181)
Database server       : linux75-2 (192.168.56.151) running 12c database
Connect identifier    : orcl 
(DESCRIPTION = (ADDRESS_LIST = (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.56.151)(PORT = 1522))) (CONNECT_DATA = (SERVICE_NAME = orcl)))

[oracle@linux-8 ~]$ sqlplus test/test@orcl

SQL*Plus: Release 19.0.0.0.0 - Production on Sun Aug 22 19:55:59 2021
Version 19.10.0.0.0

Copyright (c) 1982, 2019, Oracle.  All rights reserved.

Last Successful login time: Sun Aug 22 2021 17:59:07 +05:30

Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production

SQL> 

Directly from within the linux-8 (client) server, I'm now simulating a network failure by shutting down the network adapter. So, now there is no network connection to communicate to the database server from my client machine. 
[oracle@linux-8 ~]$ nmcli connection down c59af835-6f37-449b-836a-47714f9fe29f
Connection 'enp0s8' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/5)
[oracle@linux-8 ~]$ 

Even after long time (> 20 min), I'm still able to see the connection inside the database (see logon_time and sysdate column) though the client server is having a network issue. (I still didn't bring up network adapter in linux-8)

SQL> !hostname
linux75-2.selvapc.com

SQL> @session_info.sql

       SID     SERIAL Username   STATUS   Machine              OS User         Status   Module          LOGON_TIME           SYSDATE
---------- ---------- ---------- -------- -------------------- --------------- -------- --------------- -------------------- --------------------
       769      43069 TEST       INACTIVE linux-8.selvapc.com  oracle          inactive SQL*Plus        23-AUG-2021 23:28:41 23-AUG-2021 23:55:29

SQL>

So in this case, even though the session is not active or the source of session is dead long back, still the connection is available inside database with status INACTIVE. These types of connections will fill up database and if the processes and/or sessions parameter is set low, they become an issue leading to ORA-00020 "maximum number of processes (%s) exceeded" errors. 

So in this case, how to control connections from becoming "zombie" processes and kill them as and when needed? 

The solution is actually called as Dead Connection Detection (DCD) 

Let's see how to implement it and what happens when we do it. 
But before that even after I shut down my linux-8 machine, the connection is still available inside the database taking 1 precious session doing nothing. In order to get rid of it, we just need to kill the session. 

SQL> @session_info.sql

       SID     SERIAL Username   STATUS   Machine              OS User         Status   Module          LOGON_TIME           SYSDATE
---------- ---------- ---------- -------- -------------------- --------------- -------- --------------- -------------------- --------------------
       769      43069 TEST       INACTIVE linux-8.selvapc.com  oracle          inactive SQL*Plus        23-AUG-2021 23:28:41 24-AUG-2021 00:07:34

SQL> alter system kill session '769,43069' immediate;

System altered.

SQL> @session_info.sql

no rows selected

SQL>

Dead Connection Detection: 

So in this section, we are going to do the following. 
  1. Set sqlnet.expire_time=2 in sqlnet.ora of database server
  2. Connect to database using network from a client server to DB server
  3. Simulate a network failure.
  4. Wait for sometime. (~ 2 min)
  5. Monitor alert log
[oracle@linux75-2 admin]$ pwd
/u01/orcl/122/network/admin
[oracle@linux75-2 admin]$ more sqlnet.ora
# sqlnet.ora Network Configuration File: /u01/orcl/122/network/admin/sqlnet.ora
# Generated by Oracle configuration tools.

NAMES.DIRECTORY_PATH= (TNSNAMES)
SQLNET.EXPIRE_TIME=2

Now, I have added the sqlnet.expire_time parameter to 2 minutes. 
I'll now connect to database, simulate network failure and provide you with the alert log entries with timestamp..

SQL> @session_info.sql

       SID     SERIAL Username   STATUS   Machine              OS User         Status   Module          LOGON_TIME           SYSDATE
---------- ---------- ---------- -------- -------------------- --------------- -------- --------------- -------------------- --------------------
       783      12666 TEST       INACTIVE linux-8.selvapc.com  oracle          inactive SQL*Plus        24-AUG-2021 23:30:54 24-AUG-2021 23:32:07

SQL> @find_spid.sql

SPID
------------------------
7199

SQL>

You can see the database connection got created at 24-AUG-2021 23:01:55

[oracle@linux-8 ~]$ date
Tue Aug 24 23:03:17 IST 2021
[oracle@linux-8 ~]$ sudo nmcli connection down c59af835-6f37-449b-836a-47714f9fe29f
[sudo] password for oracle: 
Connection 'enp0s8' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/2)
[oracle@linux-8 ~]$ 

I have simulated network interruption at 24-AUG-2021 23:03:17

After around 2 minutes, we got the operation time out error which I mentioned in the beginning of the post. 

***********************************************************************

Fatal NI connect error 12170.

  VERSION INFORMATION:
        TNS for Linux: Version 12.2.0.1.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 12.2.0.1.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 12.2.0.1.0 - Production
  Time: 24-AUG-2021 23:04:56
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535

TNS-12535: TNS:operation timed out
    ns secondary err code: 12560
    nt main err code: 505

TNS-00505: Operation timed out
    nt secondary err code: 110
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.181)(PORT=14500))

So what is happening here? How does DCD work? 

DCD works different with database version 12.1 and above when compared to lower versions. Since we now already have 21c released for on-premise deployments, we will see the mechanism of verison 12.1 and above only. 

When SQLNET.EXPIRE_TIME is set, there are 3 parameters associated with Operating system's tcp keepalive is set. Example is of Linux and the parameters may vary according to OS.
  • TCP_KEEPALIVE_TIME (the amount of time until the first keepalive packet is sent)
    • /proc/sys/net/ipv4/tcp_keepalive_time
  • TCP_KEEPCNT(the number of probes to send)
    • /proc/sys/net/ipv4/tcp_keepalive_probes
  • TCP_KEEPINTVL (the interval between keepalive packets)
    • /proc/sys/net/ipv4/tcp_keepalive_intvl
The default values of these parameters are 7200, 9 and 75 which means by default the TCP keepalive will probe 9 times with interval of 75 seconds after 7200 seconds. 

Now, when we set the sqlnet.expire_time parameter to 2 minutes, the TCP_KEEPALIVE_TIME will be set to 120 seconds with interval and probes adjusted accordingly with a lesser value. 
If we trace the oracle process connected with the help of strace utility, we can see there a poll() call every 120 seconds as below...

[oracle@linux75-2 ~]$ sudo strace -p 7199
strace: Process 7199 attached
read(17, 0x7fc9362c8f1e, 8208)          = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigprocmask(SIG_BLOCK, [], NULL, 8)  = 0
poll([{fd=17, events=POLLIN|POLLRDNORM}], 1, 0) = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=120, tv_usec=0}}, NULL) = 0
rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=119, tv_usec=990000}}, NULL) = 0
rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=120, tv_usec=20000}}, NULL) = 0
rt_sigprocmask(SIG_UNBLOCK, [], NULL, 8) = 0
rt_sigreturn({mask=[]})                 = 0
read(17, 0x7fc9362c8f1e, 8208)          = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigprocmask(SIG_BLOCK, [], NULL, 8)  = 0
poll([{fd=17, events=POLLIN|POLLRDNORM}], 1, 0) = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=120, tv_usec=0}}, NULL) = 0
rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=120, tv_usec=0}}, NULL) = 0
rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
...
...
...

Once the time out has occurred, the connection will be kicked out from the database freeing up resources. 

SQL> @session_info.sql

no rows selected

SQL>

Hope this post helps understanding Dead Connection Detection. 

Queries used: 

session_info.sql

col Username for a10
col Machine for a20
col Module for a15
col "OS User" for a15
SELECT
    sid,
    serial#           serial,
    username          "Username",
    status,
    machine           "Machine",
    osuser            "OS User",
    lower(status)     "Status",
    module            "Module",
    logon_time,
    sysdate
FROM
    v$session
WHERE
    username = 'TEST'
ORDER BY 1, 2, 3;

find_spid.sql

select SPID from v$process where ADDR in
(select PADDR from v$session where username='TEST');


References: 


Happy controlling...!!!

No comments:

Post a Comment