
Saturday, May 1, 2021

Broken 18c GI upgrade with ORA-29702 and fixes

Ok, I was confident the upgrade would succeed once it started running after I had fixed all the necessary prerequisites, and it was indeed running without issues. I also took a screenshot of the progress (below) after starting rootupgrade.sh; note that the software version already shows 18.0.0.0.0.
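For reference, the same information can be cross-checked from the command line at any point during the upgrade; a minimal sketch, assuming the 18c grid home path used in this environment:

# As the grid software owner; softwareversion reflects the binaries
# installed on the node, activeversion what the cluster is actually running.
export ORACLE_HOME=/oradir1/grid/18
$ORACLE_HOME/bin/crsctl query crs softwareversion
$ORACLE_HOME/bin/crsctl query crs activeversion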


Then, after a few minutes of waiting, the upgrade failed on node 1. While the rootupgrade.sh script was executing upgrade step 17 of 19, 'UpgradeCluster', the below error was encountered.

CRS-2672: Attempting to start 'ora.asm' on '12r1-rac1'
CRS-5017: The resource action "ora.asm start" encountered the following error:
ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
Linux-x86_64 Error: 2: No such file or directory
Additional information: 4150
Additional information: 67028943
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107:)" in "/oradir1/grid/base/diag/crs/12r1-rac1/crs/trace/ohasd_oraagent_oracle.trc".
CRS-2674: Start of 'ora.asm' on '12r1-rac1' failed
CRS-2679: Attempting to clean 'ora.asm' on '12r1-rac1'
CRS-2681: Clean of 'ora.asm' on '12r1-rac1' succeeded
CRS-2672: Attempting to start 'ora.storage' on '12r1-rac1'
CRS-2883: Resource 'ora.storage' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-4000: Command Start failed, or completed with errors.
2021/04/25 17:38:06 CLSRSC-117: Failed to start Oracle Clusterware stack
Died at /oradir1/grid/18/crs/install/crsupgrade.pm line 1462.

The alert log of the +ASM1 instance from the 18c home has the following errors.

MMNL started with pid=29, OS id=26058
2021-04-25T17:27:01.603535+05:30
Unable to start the instance due to software incompatiblity
2021-04-25T17:27:01.604426+05:30
ORA-15156: cluster in rolling upgrade from version [12.1.0.1.0] to [18.0.0.0.0]

2021-04-25T17:27:01.853249+05:30
Error: Shutdown in progress. Error: 29702.
USER (ospid: 25746): terminating the instance due to ORA error 29702
2021-04-25T17:27:02.925386+05:30
Instance terminated by USER, pid = 25746

Detailed information is available in the /oradir1/grid/base/diag/crs/12r1-rac1/crs/trace/ohasd_oraagent_oracle.trc trace file, which reports the errors below.

...
...
...
...
2021-04-25 17:27:14.156 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstConnection:connectInt 200 connected
2021-04-25 17:27:14.156 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 040 resname:ora.asm startup
2021-04-25 17:27:17.534 :CLSFRAME:2796190720:  TM [MultiThread] is changing desired thread # to 5. Current # is 4
2021-04-25 17:27:17.534 :CLSDYNAM:2539628288: [ora.evmd]{0:0:2} [check] DaemonAgent::check returned 0
2021-04-25 17:27:17.534 :CLSDYNAM:2539628288: [ora.evmd]{0:0:2} [check] Deep check returned 0
2021-04-25 17:27:22.679 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] ORA-29702: error occurred in Cluster Group Service operation

2021-04-25 17:27:22.680 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 050 errcode 29702
2021-04-25 17:27:22.680 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 160 ORA-29701 or ORA-29702 instance dumping m_instanceType:2 m_lastOCIError:29702
2021-04-25 17:27:22.680 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 170 ORA-29701 or ORA-29702 shutdown abort
2021-04-25 17:27:22.680 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstConnection::shutdown mode 4
2021-04-25 17:27:32.681 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 260 disConnect
2021-04-25 17:27:32.681 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 020 connect logmode:8008
2021-04-25 17:27:32.681 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstConnection::connectInt 020 server not attached
2021-04-25 17:27:33.726 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstConnection:connectInt 200 connected
2021-04-25 17:27:33.727 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 040 resname:ora.asm startup
2021-04-25 17:27:42.263 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] ORA-29702: error occurred in Cluster Group Service operation

2021-04-25 17:27:42.263 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 050 errcode 29702
2021-04-25 17:27:42.263 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 160 ORA-29701 or ORA-29702 instance dumping m_instanceType:2 m_lastOCIError:29702
2021-04-25 17:27:42.263 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 170 ORA-29701 or ORA-29702 shutdown abort
2021-04-25 17:27:42.263 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstConnection::shutdown mode 4
...
...
...

This is a weird error, right? The message says the instance could not be started due to software incompatibility, yet a search on Oracle Support for "ORA 29702 during GI upgrade" returns any number of issues, with references to a few bugs as well. Nothing seemed to fit the issue I had encountered.
One more interesting thing: if I set the environment variables to the 18c home and start the ASM instance manually, the instance comes up without issue. So I shut the ASM instance down again and restarted the rootupgrade.sh script. This time too, the same error occurred when Oracle tried to start ASM and the upgrade failed. After a long search and several different attempts, I was not able to fix the issue, so I cancelled the upgrade session and decided to look into it the next day.
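For reference, the manual ASM start mentioned above was along these lines; a minimal sketch, assuming the grid home path and ASM SID of this environment (run as the grid software owner on node 1):

# Point the environment at the new 18c home and the local ASM instance.
export ORACLE_HOME=/oradir1/grid/18
export ORACLE_SID=+ASM1
export PATH=$ORACLE_HOME/bin:$PATH

# Start (and later stop) the instance directly with SQL*Plus.
sqlplus / as sysasm
SQL> startup
SQL> shutdown immediate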

The next day, when I started both nodes, I saw that the ASM instance on the first node was up from the 18c home and no services had started on the second node. I decided to continue the upgrade from where I had left off. Since the software setup was already complete and the error had occurred during execution of the rootupgrade.sh script, I started from the same step after shutting down the entire cluster stack running on node 1.
This time rootupgrade.sh completed on node 1. It seems the shutdown and startup of the entire server did the trick.
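For anyone retrying from the same point, the retry is simply the failed root script run again after making sure nothing from either home is up on the node; a rough sketch, assuming the 18c grid home path of this environment (both commands run as root on node 1):

# Stop whatever part of the stack is still running, then rerun the root script.
/oradir1/grid/18/bin/crsctl stop crs -f
/oradir1/grid/18/rootupgrade.sh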

So I continued with the upgrade by running rootupgrade.sh on node 2, which also succeeded without any issues. Finally, I ran the below command to complete the upgrade.

$GRID_HOME/gridSetup.sh -executeConfigTools -all -silent -responseFile responsefile.rsp

Well, after everything had been done, the command ended with the below error.

You can find the logs of this session at:
/u01/oraInventory/logs/GridSetupActions2021-04-29_01-25-13AM

Configuration failed.
[WARNING] [INS-43080] Some of the configuration assistants failed, were cancelled or skipped.
   ACTION: Refer to the logs or contact Oracle Support Services.

$ 

We can also see the below response when querying the active version:

[oracle@12r1-rac1 ~]$ crsctl query crs activeversion
Oracle Clusterware active version on the cluster is [18.0.0.0.0]
[oracle@12r1-rac1 ~]$ crsctl query crs activeversion -f
Oracle Clusterware active version on the cluster is [18.0.0.0.0]. The cluster upgrade state is [UPGRADE FINAL]. The cluster active patch level is [70732493].
[oracle@12r1-rac1 ~]$

This is a new behaviour in GI starting with 18c: cluvfy performs a post-upgrade check, reports the status, and does not change the cluster upgrade state from [UPGRADE FINAL] to [NORMAL] if any of the post-upgrade checks fail.

In my case, opening the detailed log /u01/oraInventory/logs/GridSetupActions2021-04-29_01-25-13AM/gridSetupActions2021-04-29_01-25-13AM.log shows the below errors (trimmed output)

...
...
...
INFO:  [Apr 29, 2021 1:42:07 AM] Skipping line: Verifying Single Client Access Name (SCAN) ...PASSED
INFO:  [Apr 29, 2021 1:42:11 AM] Verifying OLR Integrity ...PASSED
INFO:  [Apr 29, 2021 1:42:11 AM] Skipping line: Verifying OLR Integrity ...PASSED
INFO:  [Apr 29, 2021 1:42:17 AM] Verifying Voting Disk ...FAILED (PRVF-5431, PRVF-5449, PRVG-2032)
INFO:  [Apr 29, 2021 1:42:17 AM] Skipping line: Verifying Voting Disk ...FAILED (PRVF-5431, PRVF-5449, PRVG-2032)
INFO:  [Apr 29, 2021 1:42:17 AM] Verifying ASM Integrity ...
INFO:  [Apr 29, 2021 1:42:17 AM] Skipping line: Verifying ASM Integrity ...
INFO:  [Apr 29, 2021 1:42:17 AM]   Verifying Node Connectivity ...
...
...
...
INFO:  [Apr 29, 2021 1:42:22 AM] Skipping line: Verifying Device Checks for ASM ...
INFO:  [Apr 29, 2021 1:42:23 AM]   Verifying Access Control List check ...FAILED (PRVF-9992)
INFO:  [Apr 29, 2021 1:42:23 AM] Skipping line:   Verifying Access Control List check ...FAILED (PRVF-9992)
INFO:  [Apr 29, 2021 1:42:28 AM]   Verifying I/O scheduler ...PASSED
INFO:  [Apr 29, 2021 1:42:28 AM] Skipping line:   Verifying I/O scheduler ...PASSED
INFO:  [Apr 29, 2021 1:42:28 AM] Verifying Device Checks for ASM ...FAILED (PRVF-9992)
INFO:  [Apr 29, 2021 1:42:28 AM] Skipping line: Verifying Device Checks for ASM ...FAILED (PRVF-9992)
INFO:  [Apr 29, 2021 1:42:30 AM] Verifying ASM disk group free space ...PASSED
INFO:  [Apr 29, 2021 1:42:30 AM] Skipping line: Verifying ASM disk group free space ...PASSED
INFO:  [Apr 29, 2021 1:42:30 AM] Verifying User Not In Group "root": oracle ...PASSED
...
...
...

According to support note Doc ID 2583141.1, the fix is to resolve the critical issues reported by the cluvfy post-upgrade check and then rerun cluvfy with the -collect cluster option to update the state to [NORMAL].

[oracle@12r1-rac1 ~]$ $ORACLE_HOME/bin/cluvfy stage -post crsinst -gi_upgrade -n all

Verifying Node Connectivity ...
  Verifying Hosts File ...PASSED
  Verifying Check that maximum (MTU) size packet goes through subnet ...PASSED
  Verifying subnet mask consistency for subnet "192.168.1.0" ...PASSED
  Verifying subnet mask consistency for subnet "192.168.56.0" ...PASSED
Verifying Node Connectivity ...PASSED
...
...
...
Verifying ASM Integrity ...PASSED
Verifying Device Checks for ASM ...
  Verifying Access Control List check ...FAILED (PRVF-9992)
  Verifying I/O scheduler ...PASSED
Verifying Device Checks for ASM ...FAILED (PRVF-9992)
...
...
...
Verifying Access control attributes for %OCRCONFIGDIR%/maps ...PASSED

Post-check for cluster services setup was unsuccessful.
Checks did not pass for the following nodes:
        12r1-rac2,12r1-rac1


Failures were encountered during execution of CVU verification request "stage -post crsinst".

Verifying Voting Disk ...FAILED
PRVF-5431 : Oracle Cluster Voting Disk configuration check failed

12r1-rac2: PRVF-5449 : Check of Voting Disk location "/dev/oracleasm/disk1"
           failed on the following nodes:
12r1-rac2: PRVG-2032 : Group of file "/dev/oracleasm/disk1" did not match the
           expected value on node "12r1-rac2". [Expected = "oinstall" ; Found =
           "dba"]

12r1-rac1: PRVF-5449 : Check of Voting Disk location "/dev/oracleasm/disk1"
           failed on the following nodes:
12r1-rac1: PRVG-2032 : Group of file "/dev/oracleasm/disk1" did not match the
           expected value on node "12r1-rac1". [Expected = "oinstall" ; Found =
           "dba"]

Verifying Device Checks for ASM ...FAILED
  Verifying Access Control List check ...FAILED
  12r1-rac2: PRVF-9992 : Group of device "/dev/oracleasm/disk5" did not match
             the expected group. [Expected = "oinstall"; Found = "dba"] on
             nodes: [12r1-rac2, 12r1-rac1]
  12r1-rac2: PRVF-9992 : Group of device "/dev/oracleasm/disk4" did not match
             the expected group. [Expected = "oinstall"; Found = "dba"] on
...
...
...
  12r1-rac1: PRVF-9992 : Group of device "/dev/oracleasm/disk8" did not match
             the expected group. [Expected = "oinstall"; Found = "dba"] on
             nodes: [12r1-rac2, 12r1-rac1]


CVU operation performed:      stage -post crsinst
Date:                         May 1, 2021 7:37:41 PM
CVU home:                     /oradir1/grid/18/
User:                         oracle
[oracle@12r1-rac1 ~]$

We can see that most of the errors are due to the expected group ownership of the ASM disks. I changed the disk ownership from oracle:dba to oracle:oinstall by editing the udev rules and reloading them, then reran the cluvfy post-upgrade check, which succeeded without issues.
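For reference, the change was along these lines; a sketch only, since the rules file name and the device matching below are assumptions and depend on how the disks were presented to this lab:

# /etc/udev/rules.d/99-oracle-asmdevices.rules -- example entry, GROUP changed from "dba" to "oinstall"
KERNEL=="sd?1", SUBSYSTEM=="block", ENV{ID_SERIAL}=="<disk-serial>", SYMLINK+="oracleasm/disk1", OWNER="oracle", GROUP="oinstall", MODE="0660"

# Reload the rules and re-trigger the block devices on both nodes (as root).
udevadm control --reload-rules
udevadm trigger --type=devices --action=change

After the trigger, the group on /dev/oracleasm/disk1 (and the other disks) should show oinstall, which is what cluvfy expects.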
The next step is to run cluvfy with -collect cluster to update the state of the GI upgrade (trimmed output):

[oracle@12r1-rac1 ~]$ $ORACLE_HOME/bin/cluvfy stage -post crsinst -collect cluster -gi_upgrade -n all

Verifying Node Connectivity ...
  Verifying Hosts File ...PASSED
  Verifying Check that maximum (MTU) size packet goes through subnet ...PASSED
  Verifying subnet mask consistency for subnet "192.168.1.0" ...PASSED
  Verifying subnet mask consistency for subnet "192.168.56.0" ...PASSED
Verifying Node Connectivity ...PASSED
Verifying Multicast or broadcast check ...PASSED
Verifying ASM filter driver configuration consistency ...PASSED
...
...
Verifying ASM Integrity ...PASSED
Verifying Device Checks for ASM ...
  Verifying Access Control List check ...PASSED
  Verifying I/O scheduler ...PASSED

Collecting OS mandatory requirements baseline

Collecting Physical Memory ...collected
Collecting Available Physical Memory ...collected
...
...
...
Collecting Package: ksh ...collected
Collecting Package: make-3.81 ...collected
...
...
...
Collecting HugePages Existence ...collected
Collecting Hardware Clock synchronization at shutdown ...collected
Collecting availability of port 8888 ...collected
Collecting Ethernet Jumbo Frames ...collected
...
...
...
Collecting CSS reboottime parameter ...collected
Collecting CSS disktimeout parameter ...collected


Verifying ASM mandatory requirements

Collecting ASM Initialization Parameters ...collected
Collecting ASM Configuration Collection ...collected
Collecting ASM Disk Group Configuration Collection ...collected


Verifying ASM best practices

Collecting ASM_POWER_LIMIT ...collected
...
...
...
Collecting ASM disk rebalance operations in WAIT status ...collected

Baseline collected.
Collection report for this execution is saved in file "/oradir1/grid/base/crsdata/@global/cvu/baseline/install/grid_install_18.0.0.0.0.zip".

Post-check for cluster services setup was successful.

CVU operation performed:      stage -post crsinst
Date:                         May 1, 2021 8:22:45 PM
CVU home:                     /oradir1/grid/18/
User:                         oracle
[oracle@12r1-rac1 ~]$ crsctl query crs activeversion -f
Oracle Clusterware active version on the cluster is [18.0.0.0.0]. The cluster upgrade state is [NORMAL]. The cluster active patch level is [70732493].
[oracle@12r1-rac1 ~]$

We can now see that the cluster upgrade state is [NORMAL], meaning the GI upgrade is a success!

Footnote:
The blog post referenced below (pfierens.blogspot.com) is the closest match I found to the error we encountered; the author there points to Bug 21484367, but I doubt that applies here, as I have not configured that many services on this system, it being my home lab. Also, the ASM instance came up without issues after starting the node the next day (that author says the GI stack and evmd came up after he killed the ohasd and oraagent.bin processes). In any case, I didn't get a chance to try that.

References: 
  • How to resolve the cluster upgrade state of [UPGRADE FINAL] after successfully upgrading Grid Infrastructure (GI) to 18c or higher (Doc ID 2583141.1)
  • How to Install / Upgrade/ Clone 12.2 Grid Infrastructure in Silent Mode Using gridSetup.sh (Doc ID 2327772.1)
  • evmd not starting in oracle restart (pfierens.blogspot.com)
  • Various Oracle support notes popping up with search term "ORA 29702 during GI upgrade"

Happy upgrading...!!!