Looking for something? Try here..

Friday, April 7, 2023

Oracle Transparent Application Continuity (TAC) - Part 2

I have started a transaction in my bank account and money got deducted but the screen doesn't get me the confirmation of the transaction and ended with a time out error. There could be multiple reason and one of the reason being one of the node on which the session was working crashed.

How many of you encountered this situation?

Now, if the bank/application once recovered and connected to the database (on a surviving node) and there is no details of the previous transaction available to the new session just got connected. What happens to the data? Would it get committed?

If the application tries to run the same previous transaction, money might be deducted twice. If the application doesn't retry, your transaction might not be carried out and you might miss some important last date of bill payment

So, how does Oracle handles this? 

In my last post, I explained about how Transparent Application Failover works and how it would be a great advantage for applications that mostly reads the database. 

As a continuation of that post, we will see about the Transparent Application Continuity on what is it and how does it benefit applications utilizing this feature of Oracle. 

Okay! We should know about Transaction Guard and Application Continuity before we dive into Transparent Application Continuity. Let's see them in brief to understand what they are.. 

Transaction Guard:

Transaction Guard provides a generic tool for applications to use for at-most-once execution in case of planned and unplanned outages. Applications use the logical transaction ID to determine the outcome of the last transaction open in a database session following an outage.
If you take the bank example, if the commit already happened and if application tries to replay the execution without knowing the status of the last transaction, this might lead to logical corruption of the data by committing duplicate transaction.

Benefits of Transaction Guard includes preserving the commit outcome, ensuring known outcome of every transaction and it can be used to provide at-most-once transaction execution

Let's see the illustration below



DB node crashed when: 
Session 1 committed the transaction but acknowledgement is not yet returned to application. Application now don't know status of the previous transaction
Session 2 is idle, hence no impact to the session even this session goes off
Session 3 issues the commit but before the commit is received in the database

So without transaction guard, session 1 and 3 won't be able to proceed further since they don't know the status of the previous transaction. Client is left unknown whether transaction is committed. Logical corruption is possible if application try to rerun the transaction and missing data is possible if the transaction is not retried. This is why it is difficult to failover DML when a failure occurs on a specific node as the application fails to recognize that the last submission has committed, or shall commit sometime soon, or has not run to completion. 

How do we resolve this mystery? What if we store the commit outcome in the database itself before letting the application/client know? This is what was done by introducing LTXID in Oracle database version 12.1.0.1. 

Durability of commit is saved as a Logical Transaction Identifier (LTXID) at commit for all supported transaction types and is unique for each transaction. Applications can use this LTXID to determine the commit status. LTXID is persisted on commit and reused following a rollback. The transaction history table maintains the mapping of LTXIDs to database transaction. LTXID_TRANS table is created to store the LTXIDs and is not accessible by non-DBA user. 
Hope it is clear what exactly is transaction guard provides us. 

Transaction Guard is enabled using proper database service configuration and the requirements are as follows

1. Use a non-default service
2. Set the property COMMIT_OUTCOME=TRUE in the service to determine COMMIT is accessible and to get the commit outcome of the last transaction. RETENTION_TIMEOUT can be increased if necessary (default is 24 hrs)
3. Ensure that FAN is configured with ONS to talk with 12c database clients
 
In depth understanding of Transaction Guard is provided in this whitepaper

If I had to modify the orabliss_taf service that we created in our previous post to include transaction guard, we need to modify the service as below
srvctl modify service -db cdbrac -service orabliss_taf -commit_outcome TRUE
[oracle@ol819-rac2 bin]$ srvctl modify service -db cdbrac -service orabliss_taf -commit_outcome TRUE
[oracle@ol819-rac2 bin]$ srvctl config service -db cdbrac -service orabliss_taf |egrep "Commit|Retention"
Commit Outcome: true
Retention: 86400 seconds
[oracle@ol819-rac2 bin]$

Connection Pools and Application Continuity: 

Application Continuity was first introduced for JDBC thin applications in Oracle 12.1 which uses an Oracle Connection pool such as Universal Connection Pool (UCP) for JDBC clients and the support for OCI was enhanced in Oracle 12.2. Connection pools can also be configured for third party applications as well. 

A connection pool is a cache of database connection objects. The objects represent physical database connections that can be used by an application to connect to a database. At run time, the application requests a connection from the pool which reuses the connection objects and reduces new connection object creation. The pool also leverages many high availability and performance features available through an Oracle RAC database. These features include Fast Connection Failover (FCF), Run-time connection Load Balancing (RLB), and Connection Affinity.

An LTXID is associated with an application request at check-out from the connection pool, and is disassociated from the application request at check-in back to the pool. Between check-out and check-in, the LTXID on the session is exclusively held by that application request meaning the LTXID is associated with the next connection session. This allows to detect duplication and failover and also basic replay by third party containers replaying the last request after a non recoverable outage. 

Application Continuity makes use of Transaction Guard (LTXID) and replays database requests exactly as they were originally submitted, to ensure a result consistent with the request. To do this, it first restores and validates the session on surviving instance and replays as necessary. 
Mutables such as NEXTVAL of sequences and calls to sysdate are taken care by using the same values when they are replayed. This is accomplished by calls to Transaction Guard. 

So KEEP privileges for mutables are required to be provided to users.

To configure Application Continuity the following parameters are needed in the service configuration. 


So for Application Continuity to be enabled in the Oracle RAC, I have to modify my service as below. 
srvctl modify service -db cdbrac -service orabliss_taf -failover_restore LEVEL1 -failoverdelay 30 -failoverretry 5 -failovertype TRANSACTION -replay_init_time 1800 -notification TRUE -drain_timeout 300 -stopoption IMMEDIATE
As a Database Administrator, Application Continuity has a major blocker for implementation since there needs to be code changes in the application to make use of both connection pools and get LTXID return codes and program as necessary. In this post, we are not going to test how Application Continuity works as that requires client coding. 

With Oracle 18c, Oracle introduced Transparent Application Continuity which eliminates the use of connection pools (still is recommended) and in Oracle 19c, it is further enhanced and extends support to applications that do not use connection pools. 

Transparent Application Continuity (TAC):

Transparent Application Continuity is a mode of Application Continuity that transparently tracks and records session and transactional state so that a DB session can be recovered following an outage. There are no code changes involved in the application and DBA doesn't even needs to know of application code to implement TAC. TAC uses LTXID and a state tracking infrastructure 

To use TAC, the following configuration/steps can be followed

  1. Use a supported client. SQLPLUS 19c or later is supported and we will be using this in our demo. Few other clients are Oracle JDBC Replay Driver 18c or later, UCP 18c or later, OCI Session Pool 19c or later, etc. 
  2. If using connection pool, connections should be returned to the connection pool
  3. Set FAILOVER_RESTORE to AUTO
  4. Enable mutables use in application by providing KEEP grants to Sequences and date time to users. Oracle 19c automatically KEEPs the mutables for sql. 
  5. SIDE EFFECTS (like an extenal action such as sending emails, transfer files, etc) are disabled.
With this change, if I need to enable TAC in my database I have to modify my service with below parameters (compare this with Application Continuity)



You may notice failover_restore and failovertype both defined as AUTO to enable TAC. My service modification would be as below. 
srvctl modify service -db cdbrac -service orabliss_taf -failover_restore AUTO -failoverdelay 30 -failoverretry 5 -failovertype AUTO -replay_init_time 1800 -notification TRUE -drain_timeout 300 -stopoption IMMEDIATE

Demo of Transparent Application Continuity (TAC): 

Let's check how TAC gives a seemless application failover with DML statements by doing the following. 
  1. Connect to database using SQLPLUS with oracle_taf without Transaction Guard enabled (without -commit_outcome=TRUE)
  2. Perform a DML and check ltxid_trans table from a different SYS session
  3. Shut down instance running DML and observe error
  4. Modify oracle_taf to enable Transaction Guard and connect to database using the service
  5. Perform DML and check ltxid_trans table from a different session
  6. Modify oracle_taf to enable Transparent Application Continuity 
  7. Connect to database using the TAC enabled service
  8. Perform DML and shut down instance running DML and observe the results
[oracle@ol819-rac1 ~]$ sqlplus dummy/dummy@orabliss_taf

<<< Removed Banner >>> 

dummy@node1> select count(*) from test_table;

  COUNT(*)
----------
   8199536

dummy@node1> delete from test_table where object_id=57732;

112 rows deleted.

dummy@node1> commit;

Commit complete.

dummy@node1>

Check ltxid_trans table
[oracle@ol819-rac1 ~]$ sql / as sysdba

<<< Removed Banner >>>

SQL> @/home/oracle/ccon
Enter value for container: orabliss
old:alter session set container=&container
new:alter session set container=orabliss

Session altered.

SQL>
SQL> select * from ltxid_trans;

no rows selected

SQL> 
I'll now trigger a delete on the table and while the delete is running midway, I'll shut the OS (node 2) abruptly to create a crash scenario. 
sys@node1> @/home/oracle/sessions.sql

   INST_ID    SID    SERIAL    SQL_ID    Username    Seconds in Wait    Command                   Machine    OS User      Status                                        Module
__________ ______ _________ _________ ___________ __________________ __________ _________________________ __________ ___________ _____________________________________________
         2    424      1281           DUMMY                                     ol819-rac1.selvapc.com    oracle     inactive    SQL*Plus

sys@node1> 

dummy@node1> select sys_context('userenv','sid') from dual;

SYS_CONTEXT('USERENV','SID')
--------------------------------------------------------------------------------
424

dummy@node1> delete from test_table where rownum<500000; 

499999 rows deleted.

dummy@node1>

<<< Shut down node 2 abruptly to create crash scenario >>>

dummy@node1> commit;
commit
     *
ERROR at line 1:
ORA-25405: transaction status unknown


dummy@node1> select program,sid,failover_type,failover_method,failed_over from v$session where sid=sys_context('userenv', 'sid');

PROGRAM                                                 SID FAILOVER_TYPE FAILOVER_M FAI
------------------------------------------------ ---------- ------------- ---------- ---
sqlplus@ol819-rac1.selvapc.com (TNS V1-V3)              460 SELECT        BASIC      YES

dummy@node1> select count(*) from test_table;

  COUNT(*)
----------
   8199424

dummy@node1>

We can see the transaction didn't commit even though the session got failed over to the other instance. 
What if there is a select involved with the DML statement? Let's check that out..
dummy@node1> select sys_context('userenv','sid') from dual;

SYS_CONTEXT('USERENV','SID')
--------------------------------------------------------------------------------
430

dummy@node1> insert into test_table select * from test_table;

 
Session 430 from node 1 got connected to node 2 and started an insert statement. While the statement is under process, I shutdown the node 2 abruptly. 
sys@node1> @/home/oracle/sessions.sql

   INST_ID    SID    SERIAL    SQL_ID    Username    Seconds in Wait    Command                   Machine    OS User      Status      Module
__________ ______ _________ _________ ___________ __________________ __________ _________________________ __________ ___________ ___________
         2    430     10274           DUMMY                                     ol819-rac1.selvapc.com    oracle     inactive    SQL*Plus

<<< Shut down node 2 here... >>>

sys@node1> @/home/oracle/sessions.sql

no rows selected

 
After sometime, when I cancelled the session I got the below error
dummy@node1> insert into test_table select * from test_table;

^Cinsert into test_table select * from test_table
*
ERROR at line 1:
ORA-25408: can not safely replay call


dummy@node1> 
This is because the fact that Transparent Application Failover (TAF) cannot handle transactions that include data manipulation language (DML). In this case, we need to log out of application, reconnect and rerun the transaction. We can see the session got failedover to instance 1

dummy@node1> select program,sid,failover_type,failover_method,failed_over from v$session where sid=sys_context('userenv', 'sid');

PROGRAM                                                 SID FAILOVER_TYPE FAILOVER_M FAI
------------------------------------------------ ---------- ------------- ---------- ---
sqlplus@ol819-rac1.selvapc.com (TNS V1-V3)              175 SELECT        BASIC      YES

dummy@node1>

We will now modify the service to enable Transaction guard and check the ltxid_trans table to check commit data details. Note: We are not going to write code to get the commit_outcome and test but will just see the contents of the ltxid_trans table. We will proceed with TAC once this check is done.
[oracle@ol819-rac1 ~]$ srvctl modify service -db cdbrac -service orabliss_taf -commit_outcome TRUE
[oracle@ol819-rac1 ~]$ srvctl config service -db cdbrac -service orabliss_taf |egrep "Commit|Retention"
Commit Outcome: true
Retention: 86400 seconds
[oracle@ol819-rac1 ~]$ sqlplus dummy/dummy@orabliss_taf

dummy@node1> select sys_context('userenv','sid') from dual;

SYS_CONTEXT('USERENV','SID')
--------------------------------------------------------------------------------
423

dummy@node1> delete from test_table where rownum<500000;

499999 rows deleted.

dummy@node1> commit;

Commit complete.

dummy@node1>
SYS@node2> select * from ltxid_trans;

   MAJ_VERSION    MIN_VERSION    INST_ID         DB_ID                        SESSION_GUID    TXN_UID    COMMIT_NO                                START_DATE    SERVICE_ID    STATE    FLAGS    REQ_FLAGS    ERROR_CODE
______________ ______________ __________ _____________ ___________________________________ __________ ____________ _________________________________________ _____________ ________ ________ ____________ _____________
             1              2          2    2960387100 F8B741BDAFB62A60E053CA38A8C0014E           111            0 07-APR-23 08.33.32.185503000 AM +05:30                1        1        0            0             0

SYS@node2>
Now, as we have enabled transaction guard once the transaction is committed, we can see the ltxid_trans table is updated with the session and transaction information. This information will be used for replaying the transaction during failover

As we have enabled only Transaction Guard as of now and if the node crashed before committing the data, we would be getting the below error message if we try to commit the data after the session failover to the surviving instance. 
dummy@node1> insert into test_table select * from test_table;

7199426 rows created.

<<< Shutdown OS abruptly >>>

dummy@node1> commit;
commit
     *
ERROR at line 1:
ORA-25402: transaction must roll back

dummy@node1> 
Now, as the final phase of our testing we will enhance the orabliss_taf to enable Transparent Application Continuity (TAC) and observe the results 

[oracle@ol819-rac1 ~]$ srvctl modify service -db cdbrac -service orabliss_taf -failover_restore AUTO -failovertype AUTO -failoverdelay 30 -failoverretry 5 -replay_init_time 1800 -notification TRUE -drain_timeout 300 -stopoption IMMEDIATE
[oracle@ol819-rac1 ~]$
[oracle@ol819-rac1 ~]$ sqlplus dummy/dummy@orabliss_taf

...
...

dummy@node1> select sys_context('userenv','sid') from dual;

SYS_CONTEXT('USERENV','SID')
--------------------------------------------------------------------------------
232

dummy@node1> insert into test_table select * from test_table;

<<< Shut down node 2 while insert is running >>>

14898851 rows created.

dummy@node1> dummy@node1> commit;

Commit complete.

dummy@node1> select sys_context('userenv','sid') from dual;

SYS_CONTEXT('USERENV','SID')
--------------------------------------------------------------------------------
42

dummy@node1> select program,sid,failover_type,failover_method,failed_over from v$session where sid=sys_context('userenv', 'sid');

PROGRAM                                                 SID FAILOVER_TYPE FAILOVER_M FAI
------------------------------------------------ ---------- ------------- ---------- ---
sqlplus@ol819-rac1.selvapc.com (TNS V1-V3)               42 AUTO          BASIC      YES

dummy@node1>
Though I have introduced the crash in between the DML statement, the session failed over to instance 1 and completed the insert statement without any issues. 
Session state from node 1 was observed as below.
SYS@node1> @/home/oracle/sessions.sql

   INST_ID    SID    SERIAL           SQL_ID    Username    Seconds in Wait    Command                   Machine    OS User      Status      Module
__________ ______ _________ ________________ ___________ __________________ __________ _________________________ __________ ___________ ___________
         1     24      7850                  DUMMY                                     ol819-rac1.selvapc.com    oracle     inactive    SQL*Plus
         2    232     62564 19uk9dbxy0zbr    DUMMY                        3 INSERT     ol819-rac1.selvapc.com    oracle     active      SQL*Plus

SYS@node1> @/home/oracle/sessions.sql

   INST_ID    SID    SERIAL           SQL_ID    Username    Seconds in Wait    Command                   Machine    OS User      Status      Module
__________ ______ _________ ________________ ___________ __________________ __________ _________________________ __________ ___________ ___________
         1     24      7850                  DUMMY                                     ol819-rac1.selvapc.com    oracle     inactive    SQL*Plus
         1     42     53960 19uk9dbxy0zbr    DUMMY                        2 INSERT     ol819-rac1.selvapc.com    oracle     active      SQL*Plus

SYS@node1>
We can see the session 232 was running the INSERT statement against INST_ID 2 and then failed over to INST_ID 1 with SID 42. 

Conclusion: 

So the answer to my initial question "Whether DMLs will be failed over to surviving instance if one of the instance in a Oracle RAC is crashed?" is now clear. 
Oracle introduced Transaction Guard which is effectively enhanced to be utilized by Transparent Application Continuity (TAC) that can be implemented in the database without any code changes to application to have a downtime free database experience.. !

References: 



Happy TACing...!!!



1 comment: