OK, I was very confident the upgrade would succeed once it started running after I had fixed all the necessary prerequisites, and it was indeed running without issues. I also took a screenshot of the progress after starting rootupgrade.sh, which already showed the software version as 18.0.0.0.0.
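For reference, rootupgrade.sh is run as root from the new grid home on each node; on node 1 that was essentially the following (the 18c home path is inferred from the trace and CVU output quoted later in this post, so adjust it for your environment):

[root@12r1-rac1 ~]# /oradir1/grid/18/rootupgrade.sh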
Then, after a few minutes of waiting, the upgrade failed on node 1. While the rootupgrade.sh script was executing upgrade step 17 of 19, 'UpgradeCluster', the below error was encountered.
CRS-2672: Attempting to start 'ora.asm' on '12r1-rac1'
CRS-5017: The resource action "ora.asm start" encountered the following error:
ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
Linux-x86_64 Error: 2: No such file or directory
Additional information: 4150
Additional information: 67028943
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107:)" in "/oradir1/grid/base/diag/crs/12r1-rac1/crs/trace/ohasd_oraagent_oracle.trc".
CRS-2674: Start of 'ora.asm' on '12r1-rac1' failed
CRS-2679: Attempting to clean 'ora.asm' on '12r1-rac1'
CRS-2681: Clean of 'ora.asm' on '12r1-rac1' succeeded
CRS-2672: Attempting to start 'ora.storage' on '12r1-rac1'
CRS-2883: Resource 'ora.storage' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-4000: Command Start failed, or completed with errors.
2021/04/25 17:38:06 CLSRSC-117: Failed to start Oracle Clusterware stack
Died at /oradir1/grid/18/crs/install/crsupgrade.pm line 1462.
The alert log of the +ASM1 instance from the 18c home shows the following errors.
MMNL started with pid=29, OS id=26058
2021-04-25T17:27:01.603535+05:30
Unable to start the instance due to software incompatiblity
2021-04-25T17:27:01.604426+05:30
ORA-15156: cluster in rolling upgrade from version [12.1.0.1.0] to [18.0.0.0.0]
2021-04-25T17:27:01.853249+05:30
Error: Shutdown in progress. Error: 29702.
USER (ospid: 25746): terminating the instance due to ORA error 29702
2021-04-25T17:27:02.925386+05:30
Instance terminated by USER, pid = 25746
Detailed information is available in the /oradir1/grid/base/diag/crs/12r1-rac1/crs/trace/ohasd_oraagent_oracle.trc file, which reported the below errors.
...
2021-04-25 17:27:14.156 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstConnection:connectInt 200 connected
2021-04-25 17:27:14.156 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 040 resname:ora.asm startup
2021-04-25 17:27:17.534 :CLSFRAME:2796190720: TM [MultiThread] is changing desired thread # to 5. Current # is 4
2021-04-25 17:27:17.534 :CLSDYNAM:2539628288: [ora.evmd]{0:0:2} [check] DaemonAgent::check returned 0
2021-04-25 17:27:17.534 :CLSDYNAM:2539628288: [ora.evmd]{0:0:2} [check] Deep check returned 0
2021-04-25 17:27:22.679 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] ORA-29702: error occurred in Cluster Group Service operation
2021-04-25 17:27:22.680 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 050 errcode 29702
2021-04-25 17:27:22.680 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 160 ORA-29701 or ORA-29702 instance dumping m_instanceType:2 m_lastOCIError:29702
2021-04-25 17:27:22.680 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 170 ORA-29701 or ORA-29702 shutdown abort
2021-04-25 17:27:22.680 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstConnection::shutdown mode 4
2021-04-25 17:27:32.681 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 260 disConnect
2021-04-25 17:27:32.681 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 020 connect logmode:8008
2021-04-25 17:27:32.681 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstConnection::connectInt 020 server not attached
2021-04-25 17:27:33.726 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstConnection:connectInt 200 connected
2021-04-25 17:27:33.727 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 040 resname:ora.asm startup
2021-04-25 17:27:42.263 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] ORA-29702: error occurred in Cluster Group Service operation
2021-04-25 17:27:42.263 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 050 errcode 29702
2021-04-25 17:27:42.263 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 160 ORA-29701 or ORA-29702 instance dumping m_instanceType:2 m_lastOCIError:29702
2021-04-25 17:27:42.263 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstAgent::startInstance 170 ORA-29701 or ORA-29702 shutdown abort
2021-04-25 17:27:42.263 :CLSDYNAM:2754123520: [ ora.asm]{0:5:3} [start] InstConnection::shutdown mode 4
...
This is a weird error, right? The message says the instance could not be started due to software incompatibility, and if you search Oracle Support for "ORA 29702 during GI upgrade", it returns any number of issues, with references to a few bugs as well. Nothing seemed to fit the issue I had encountered.
One more interesting thing: if I set the environment variables to the 18c home and started the ASM instance manually, the instance came up without any issue. So I shut the ASM instance back down and restarted the rootupgrade.sh script. This time too, the same error occurred when Oracle tried to start ASM and the upgrade failed. After a long search and several attempts, I was not able to fix the issue, so I cancelled the upgrade session and decided to look into it the next day.
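The manual test was essentially the following (a minimal sketch; the 18c home path is taken from the trace file locations above, and the instance name is assumed to be +ASM1 on node 1):

[oracle@12r1-rac1 ~]$ export ORACLE_HOME=/oradir1/grid/18
[oracle@12r1-rac1 ~]$ export ORACLE_SID=+ASM1
[oracle@12r1-rac1 ~]$ export PATH=$ORACLE_HOME/bin:$PATH
[oracle@12r1-rac1 ~]$ sqlplus / as sysasm
SQL> startup
SQL> shutdown immediate

Started this way, the instance came up cleanly, which made the failure through the agent during the upgrade all the more puzzling.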
The next day, when I started both nodes, I saw that the ASM instance on the first node was up from the 18c home and no services had started on the second node. I decided to continue the upgrade from where I had left off. Since the software setup was already complete and the error had occurred during the execution of the rootupgrade.sh script, I started from that same step after shutting down the entire cluster stack running on node 1, as sketched below.
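The stop-and-rerun on node 1 was roughly the following (a sketch only; crsctl must be run as root from whichever grid home currently owns the running stack, and -f can be added to the stop if a clean shutdown hangs):

[root@12r1-rac1 ~]# /oradir1/grid/18/bin/crsctl stop crs
[root@12r1-rac1 ~]# /oradir1/grid/18/rootupgrade.sh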
This time rootupgrade.sh completed on node 1. It seems the full shutdown and restart of the server did the trick.
So I continued with the upgrade by running rootupgrade.sh on node 2, which also succeeded without any issues. Finally, I ran the below command to complete the upgrade.
$GRID_HOME/gridSetup.sh -executeConfigTools -all -silent -responseFile responsefile.rsp
Well, after everything had been done, the command ended with the below error.
You can find the logs of this session at:
/u01/oraInventory/logs/GridSetupActions2021-04-29_01-25-13AM
Configuration failed.
[WARNING] [INS-43080] Some of the configuration assistants failed, were cancelled or skipped.
ACTION: Refer to the logs or contact Oracle Support Services.
$
[oracle@12r1-rac1 ~]$ crsctl query crs activeversion
Oracle Clusterware active version on the cluster is [18.0.0.0.0]
[oracle@12r1-rac1 ~]$ crsctl query crs activeversion -f
Oracle Clusterware active version on the cluster is [18.0.0.0.0]. The cluster upgrade state is [UPGRADE FINAL]. The cluster active patch level is [70732493].
[oracle@12r1-rac1 ~]$
This is a change in GI starting with 18c: cluvfy performs a post-upgrade check, reports the status, and does not change the cluster upgrade state from [UPGRADE FINAL] to [NORMAL] if any of the post-upgrade checks fail.
In my case, opening the detailed log /u01/oraInventory/logs/GridSetupActions2021-04-29_01-25-13AM/gridSetupActions2021-04-29_01-25-13AM.log showed the below errors (trimmed output):
...
INFO: [Apr 29, 2021 1:42:07 AM] Skipping line: Verifying Single Client Access Name (SCAN) ...PASSED
INFO: [Apr 29, 2021 1:42:11 AM] Verifying OLR Integrity ...PASSED
INFO: [Apr 29, 2021 1:42:11 AM] Skipping line: Verifying OLR Integrity ...PASSED
INFO: [Apr 29, 2021 1:42:17 AM] Verifying Voting Disk ...FAILED (PRVF-5431, PRVF-5449, PRVG-2032)
INFO: [Apr 29, 2021 1:42:17 AM] Skipping line: Verifying Voting Disk ...FAILED (PRVF-5431, PRVF-5449, PRVG-2032)
INFO: [Apr 29, 2021 1:42:17 AM] Verifying ASM Integrity ...
INFO: [Apr 29, 2021 1:42:17 AM] Skipping line: Verifying ASM Integrity ...
INFO: [Apr 29, 2021 1:42:17 AM] Verifying Node Connectivity ...
...
INFO: [Apr 29, 2021 1:42:22 AM] Skipping line: Verifying Device Checks for ASM ...
INFO: [Apr 29, 2021 1:42:23 AM] Verifying Access Control List check ...FAILED (PRVF-9992)
INFO: [Apr 29, 2021 1:42:23 AM] Skipping line: Verifying Access Control List check ...FAILED (PRVF-9992)
INFO: [Apr 29, 2021 1:42:28 AM] Verifying I/O scheduler ...PASSED
INFO: [Apr 29, 2021 1:42:28 AM] Skipping line: Verifying I/O scheduler ...PASSED
INFO: [Apr 29, 2021 1:42:28 AM] Verifying Device Checks for ASM ...FAILED (PRVF-9992)
INFO: [Apr 29, 2021 1:42:28 AM] Skipping line: Verifying Device Checks for ASM ...FAILED (PRVF-9992)
INFO: [Apr 29, 2021 1:42:30 AM] Verifying ASM disk group free space ...PASSED
INFO: [Apr 29, 2021 1:42:30 AM] Skipping line: Verifying ASM disk group free space ...PASSED
INFO: [Apr 29, 2021 1:42:30 AM] Verifying User Not In Group "root": oracle ...PASSED
...
According to support note Doc ID 2583141.1, you fix the critical issues that the cluvfy post-upgrade check reports and then rerun cluvfy with -collect cluster to update the state to [NORMAL].
[oracle@12r1-rac1 ~]$ $ORACLE_HOME/bin/cluvfy stage -post crsinst -gi_upgrade -n all
Verifying Node Connectivity ...
Verifying Hosts File ...PASSED
Verifying Check that maximum (MTU) size packet goes through subnet ...PASSED
Verifying subnet mask consistency for subnet "192.168.1.0" ...PASSED
Verifying subnet mask consistency for subnet "192.168.56.0" ...PASSED
Verifying Node Connectivity ...PASSED
...
Verifying ASM Integrity ...PASSED
Verifying Device Checks for ASM ...
Verifying Access Control List check ...FAILED (PRVF-9992)
Verifying I/O scheduler ...PASSED
Verifying Device Checks for ASM ...FAILED (PRVF-9992)
...
Verifying Access control attributes for %OCRCONFIGDIR%/maps ...PASSED
Post-check for cluster services setup was unsuccessful.
Checks did not pass for the following nodes: 12r1-rac2,12r1-rac1
Failures were encountered during execution of CVU verification request "stage -post crsinst".
Verifying Voting Disk ...FAILED
PRVF-5431 : Oracle Cluster Voting Disk configuration check failed
12r1-rac2: PRVF-5449 : Check of Voting Disk location "/dev/oracleasm/disk1" failed on the following nodes:
12r1-rac2: PRVG-2032 : Group of file "/dev/oracleasm/disk1" did not match the expected value on node "12r1-rac2". [Expected = "oinstall" ; Found = "dba"]
12r1-rac1: PRVF-5449 : Check of Voting Disk location "/dev/oracleasm/disk1" failed on the following nodes:
12r1-rac1: PRVG-2032 : Group of file "/dev/oracleasm/disk1" did not match the expected value on node "12r1-rac1". [Expected = "oinstall" ; Found = "dba"]
Verifying Device Checks for ASM ...FAILED
Verifying Access Control List check ...FAILED
12r1-rac2: PRVF-9992 : Group of device "/dev/oracleasm/disk5" did not match the expected group. [Expected = "oinstall"; Found = "dba"] on nodes: [12r1-rac2, 12r1-rac1]
12r1-rac2: PRVF-9992 : Group of device "/dev/oracleasm/disk4" did not match the expected group. [Expected = "oinstall"; Found = "dba"] on
...
12r1-rac1: PRVF-9992 : Group of device "/dev/oracleasm/disk8" did not match the expected group. [Expected = "oinstall"; Found = "dba"] on nodes: [12r1-rac2, 12r1-rac1]
CVU operation performed: stage -post crsinst
Date: May 1, 2021 7:37:41 PM
CVU home: /oradir1/grid/18/
User: oracle
[oracle@12r1-rac1 ~]$
We can see that most of the errors are due to the group ownership expected on the ASM disks. I changed the disk ownership to oracle:oinstall instead of oracle:dba by editing the udev rules and reloading them, and then reran the cluvfy post-upgrade check, which succeeded without issues.
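As a rough sketch of that change (the rules file name, the kernel device name, and the match keys below are purely illustrative assumptions; the actual rules for these ASM devices will look different), the GROUP attribute is switched to oinstall and the rules are reloaded on both nodes:

[root@12r1-rac1 ~]# cat /etc/udev/rules.d/99-oracleasm.rules
# hypothetical rule: present the partition under /dev/oracleasm with oracle:oinstall ownership
KERNEL=="sdb1", SYMLINK+="oracleasm/disk1", OWNER="oracle", GROUP="oinstall", MODE="0660"
[root@12r1-rac1 ~]# udevadm control --reload-rules
[root@12r1-rac1 ~]# udevadm trigger

After the reload, ls -lL /dev/oracleasm/ should show the block devices owned by oracle:oinstall on both nodes.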
The next step is to run cluvfy with -collect cluster to update the status of the GI upgrade (trimmed output):
[oracle@12r1-rac1 ~]$ $ORACLE_HOME/bin/cluvfy stage -post crsinst -collect cluster -gi_upgrade -n all
Verifying Node Connectivity ...
Verifying Hosts File ...PASSED
Verifying Check that maximum (MTU) size packet goes through subnet ...PASSED
Verifying subnet mask consistency for subnet "192.168.1.0" ...PASSED
Verifying subnet mask consistency for subnet "192.168.56.0" ...PASSED
Verifying Node Connectivity ...PASSED
Verifying Multicast or broadcast check ...PASSED
Verifying ASM filter driver configuration consistency ...PASSED
...
Verifying ASM Integrity ...PASSED
Verifying Device Checks for ASM ...
Verifying Access Control List check ...PASSED
Verifying I/O scheduler ...PASSED
Collecting OS mandatory requirements baseline
Collecting Physical Memory ...collected
Collecting Available Physical Memory ...collected
...
Collecting Package: ksh ...collected
Collecting Package: make-3.81 ...collected
...
Collecting HugePages Existence ...collected
Collecting Hardware Clock synchronization at shutdown ...collected
Collecting availability of port 8888 ...collected
Collecting Ethernet Jumbo Frames ...collected
...
Collecting CSS reboottime parameter ...collected
Collecting CSS disktimeout parameter ...collected
Verifying ASM mandatory requirements
Collecting ASM Initialization Parameters ...collected
Collecting ASM Configuration Collection ...collected
Collecting ASM Disk Group Configuration Collection ...collected
Verifying ASM best practices
Collecting ASM_POWER_LIMIT ...collected
...
Collecting ASM disk rebalance operations in WAIT status ...collected
Baseline collected.
Collection report for this execution is saved in file "/oradir1/grid/base/crsdata/@global/cvu/baseline/install/grid_install_18.0.0.0.0.zip".
Post-check for cluster services setup was successful.
CVU operation performed: stage -post crsinst
Date: May 1, 2021 8:22:45 PM
CVU home: /oradir1/grid/18/
User: oracle
[oracle@12r1-rac1 ~]$ crsctl query crs activeversion -f
Oracle Clusterware active version on the cluster is [18.0.0.0.0]. The cluster upgrade state is [NORMAL]. The cluster active patch level is [70732493].
[oracle@12r1-rac1 ~]$
We can now see the cluster upgrade state is [NORMAL], meaning the GI upgrade is a success!
Footnote:
The blog post referenced below is the closest match to the error we encountered; its author points to Bug 21484367. However, I doubt that bug applies here, as I have not configured that many services on this system, it being my home lab. Also, my ASM instance came up without issues after starting the node the next day (the author says GI and evmd came up only after he killed the ohasd and oraagent.bin processes). In any case, I didn't get a chance to try that.
References:
- How to resolve the cluster upgrade state of [UPGRADE FINAL] after successfully upgrading Grid Infrastructure (GI) to 18c or higher (Doc ID 2583141.1)
- How to Install / Upgrade/ Clone 12.2 Grid Infrastructure in Silent Mode Using gridSetup.sh (Doc ID 2327772.1)
- evmd not starting in oracle restart (pfierens.blogspot.com)
- Various Oracle Support notes returned by the search term "ORA 29702 during GI upgrade"
Happy upgrading...!!!