I was asked a question a few days ago whether DMLs will be failed over to surviving instance if one of the instance in a Oracle RAC is crashed.
I took sometime to get a proper answer to this question as I was not sure how DMLs were handled in RAC. All I know until that time was "Select" queries will be failed over to another instance. This mechanism is called Transparent Application Failover or TAF which is in Oracle from a long time since Oracle Parallel Server (OPS) introduced in Oracle 8i.
Transparent Application Failover
When an instance fails, TAF creates a new session on the surviving node or when the SELECT mode is configured, the sessions are replayed on the surviving node. For example, let us assume we are querying a table with a million records and when the query is in midway and node fails, the session will be created on the other surviving node to the state where the query is started and the SELECT is replayed in the other node to provide output to the end user/application. This new session creation on the other node is transparent to the end user/application.
FAN and ONS
Fast Application Notification (FAN) is a critical component in all of the High Availability solutions where continuous service and continuous connections built into RAC are extended to applications and application servers. When a state of database service changes such as down or unresponsive, the new status is posted to interested subscribers via FAN events. More on FAN can be found here
Oracle Notification Services (ONS) is installed as part of installation of Oracle Grid Infrastructure or Oracle dataguard installation. ONS is responsible for propagating FAN events to all other ONS daemons it is registered with. There is one ONS daemon started per cluster node. ONS facilitates FAN and also the load balancing feature of Oracle RAC for workload management.
Session Failover without TAF
Let's see an example of a session's behavior when we connect via the default service. This service is basic service created to connect to the database.
My TNS entry is as below
orabliss =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = ol819-scan)(PORT = 1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = orabliss.selvapc.com)
)
)
SQL> select * from test_table; === Suppressed query output === 440,813 rows selected. Elapsed: 00:00:50.744
... ... ... OWNER -------------------------------------------------------------------------------------------------------------------------------- OBJECT_NAME -------------------------------------------------------------------------------------------------------------------------------- SUBOBJECT_NAME OBJECT_ID DATA_OBJECT_ID OBJECT_TYPE CREATED LAST_DDL_ -------------------------------------------------------------------------------------------------------------------------------- ---------- -------------- ----------------------- --------- --------- TIMESTAMP STATUS T G S NAMESPACE EDITION_NAME SHARING E O A ------------------- ------- - - - ---------- -------------------------------------------------------------------------------------------------------------------------------- ------------------ - - - DEFAULT_COLLATION D S CREATED_APPID CREATED_VSNID MODIFIED_APPID MODIFIED_VSNID ---------------------------------------------------------------------------------------------------- - - ------------- ------------- -------------- -------------- 2019-04-17:01:42:19 VALID N N N 1 METADATA LINK Y N N N ERROR: ORA-03135: connection lost contact Process ID: 55738 Session ID: 599 Serial number: 54452 29370 rows selected. SQL>
TAF is configured by creating a service on RAC and assigning failover parameters as shown below. Do not use the default service that is created while creating a container database or PDB as this service will be running on all the instances and running always on a RAC system. You should not alter or modify this service. This default service cannot be disabled, relocated or restricted so it doesn't provide high availability. Instead create a specific service with unique identifier to connect to the intended database with the intended properties.
From version 11.2 onwards we can create services using srvctl command.
srvctl add service -db cdbrac -pdb orabliss -service orabliss_taf -tafpolicy BASIC -failovertype SELECT -clbgoal LONG -preferred cdbrac1,cdbrac2
[oracle@ol819-rac1 ~]$ srvctl add service -db cdbrac -pdb orabliss -service orabliss_taf -tafpolicy BASIC -failovertype SELECT -clbgoal LONG -preferred cdbrac1,cdbrac2 [oracle@ol819-rac1 ~]$ srvctl start service -db cdbrac -service orabliss_taf [oracle@ol819-rac1 ~]$ srvctl status service -db cdbrac -service orabliss_taf Service orabliss_taf is running on instance(s) cdbrac1,cdbrac2 [oracle@ol819-rac1 ~]$
We now see the details of the services from the database as below
-tafpolicy accepts only BASIC from version 11.2 meaning the failover occurs only at the time of failure and there is no pre-connect option available.
-failovertype of SELECT enables the read on other surviving node once there is a crash on the session's existing node. Other option for TAF is SESSION where upon node failure the session is connected on the surviving instance but does not continue the SELECT.
Now, my TNS entry is as below using the newly created service
orabliss_taf =I'm connecting to the DB using the orabliss_taf service and simulate the crash as before.
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = ol819-scan)(PORT = 1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = orabliss_taf.selvapc.com)
)
)
SQL> select program,sid,failover_type,failover_method,failed_over from v$session where sid=sys_context('userenv', 'sid'); PROGRAM SID FAILOVER_TYPE FAILOVER_M FAI ------------------------------------------------ ---------- ------------- ---------- --- sqlplus@ol819-rac1.selvapc.com (TNS V1-V3) 606 SELECT BASIC NO Elapsed: 00:00:00.00 SQL> select * from test_table; === Suppressed query output === === Introduced the crash while query is running midway === 440813 rows selected. Elapsed: 00:08:01.75 SQL> SQL> select program,sid,failover_type,failover_method,failed_over from v$session where sid=sys_context('userenv', 'sid'); PROGRAM SID FAILOVER_TYPE FAILOVER_M FAI ------------------------------------------------ ---------- ------------- ---------- --- sqlplus@ol819-rac1.selvapc.com (TNS V1-V3) 26 SELECT BASIC YES Elapsed: 00:00:00.06 SQL>While the original session was created with SID=606, the session failed over to node 1 and have a new SID=26. We can see the FAILED_OVER column also changed to YES.
Reconfiguration started (old inc 4, new inc 6) List of instances (total 1) : 1 Dead instances (total 1) : 2 My inst 1 publish big name space - dead or down/up instance detected, invalidate domain 0 Global Resource Directory frozen * dead instance detected - domain 0 invalid = TRUE * dead instance detected - domain 2 invalid = TRUE, need cdb-level instance recovery * dead instance detected - domain 3 invalid = TRUE, need cdb-level instance recovery Communication channels reestablished Master broadcasted resource hash value bitmaps ... ... Post SMON to start 1st pass IR 2023-03-18T10:11:45.794262+05:30 CDB IR excluding pdb 2 which was cleanly closed. 2023-03-18T10:11:45.794383+05:30 Instance recovery: looking for dead threads 2023-03-18T10:11:45.812848+05:30 Beginning instance recovery of 1 threads ... ... ... Started redo application at Thread 2: logseq 83, block 157204, offset 0 ... ... ... 2023-03-18T10:11:46.025646+05:30 Redo thread 2 internally disabled at seq 84 (SMON) CDB instance recovery complete: pdb 2 valid 1 (flags x10, pdb flags x40080) CDB instance recovery complete: pdb 3 valid 1 (flags x10, pdb flags x40080) CDB instance recovery complete: pdb 0 valid 1 (flags x10, pdb flags x80) ... ...So, the Transparent Appliation Failover (TAF) option is great for applications which reads a lot as they don't have to worry about DB getting down as the services will get the work done transparent to the application.