

Monday, December 12, 2022

Performance slowness due to Spatial component

We had a complaint from the application team that their database was running slow after the quarterly patching. We had patched many databases, and only this one database was complaining of slowness. The environment is ExaCC, and the database is a container (running 19.16) with multiple PDBs, so the initial thought was that patching couldn't be the reason. We stepped in to investigate what was causing the slowness.

As a first step, we pulled an AWR report for the time frame in which the application ran its DB operations.



We can see from the AWR report that most of the waits are Cluster-related, topped by gc cr multi block mixed, gc cr block 2-way and gc buffer busy acquire.

This means some query is running very frequently, continuously accessing one or two table blocks. We need to figure out which query is running, which table/index it accesses, how frequently it runs, and what its elapsed time is. That information will get us moving forward.

Since the waits are Cluster-related, let's look at the SQL ordered by Cluster Wait Time (Global) section under SQL Statistics.



We see 2 queries contributing most of the cluster waits. Both are related to the Spatial component; the queries are below, and each had ~53k executions in the report.
SELECT /* chk_journal_reg */ rowid FROM "MDSYS"."SDO_TXN_JOURNAL_REG" WHERE sid = :1 AND sdo_txn_idx_id = :2 AND indexptniden = :3 AND operation = :4 AND rid = :5

SELECT /* sel_dist_upd */ ROWID, ....., ins_no, del_no FROM "MDSYS"."SDO_DIST_METADATA_TABLE" WHERE sdo_index_owner = :own AND sdo_index_name = :nam AND sdo_partition_name = :ptn FOR UPDATE

OK, these 2 queries look simple, not complicated at all, so what's causing the buffer busy issues? The investigation went in the following directions:

Check the size of the tables: they are very small, so table size is not the culprit.

Checking the execution plan of the first query with high cluster waits gave us the clue. Though we have a primary key defined on all five columns (SID, SDO_TXN_IDX_ID, INDEXPTNIDEN, OPERATION, RID), the query took the INDEX STORAGE FAST FULL SCAN path, which traverses all the index blocks unnecessarily. The query took around 4 seconds per execution.

I hinted the query to use the index already available (/*+ index(SDO_TXN_JOURNAL_REG SYS_C0014253) */); this resulted in a much better plan, and the execution time dropped to milliseconds. Remember, this query executed around 53k times, so that makes a big difference in application performance.
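For reference, the hinted statement is just the original query with the index hint added:

SELECT /*+ index(SDO_TXN_JOURNAL_REG SYS_C0014253) */ rowid FROM "MDSYS"."SDO_TXN_JOURNAL_REG" WHERE sid = :1 AND sdo_txn_idx_id = :2 AND indexptniden = :3 AND operation = :4 AND rid = :5

While checking for any plan change, I found that the index path was in use until the patch was applied, after which the plan flipped: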

SQL> select distinct sql_id, plan_hash_value, id, operation, options
  2* from dba_hist_sql_plan where sql_id='fw4q2s9j3bckt' order by plan_hash_value, id;

          SQL_ID    PLAN_HASH_VALUE    ID           OPERATION                   OPTIONS
________________ __________________ _____ ___________________ _________________________
fw4q2s9j3bckt            1765869626     0 SELECT STATEMENT
fw4q2s9j3bckt            1765869626     1 INDEX               STORAGE FAST FULL SCAN
fw4q2s9j3bckt            2134037127     0 SELECT STATEMENT
fw4q2s9j3bckt            2134037127     1 INDEX               UNIQUE SCAN

SQL> 
So we decided to create a SQL profile so that the plan sticks to plan_hash_value 2134037127, using the note Encouraging CBO to Pickup a Better Execution Plan Using the COE XFR SQL Profile Script (Doc ID 1955195.1).
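For anyone who hasn't used it, the flow with the script from that note is roughly as follows (prompts and generated file names may vary slightly by script version):

-- download coe_xfr_sql_profile.sql from Doc ID 1955195.1, then run it with the sql_id and the desired plan hash value
SQL> @coe_xfr_sql_profile.sql fw4q2s9j3bckt 2134037127
-- it generates coe_xfr_sql_profile_fw4q2s9j3bckt_2134037127.sql;
-- running that generated script creates the SQL profile that pins the INDEX UNIQUE SCAN plan
SQL> @coe_xfr_sql_profile_fw4q2s9j3bckt_2134037127.sql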

In the meantime, the Oracle support note After Applying the 19.16 DBRU, Scheduler Jobs on Spatial Tables Take a Long Time to Execute (Doc ID 2907800.1) turned up, describing a bug related to this very table and query.

The reason for the slowness:

Running a MV refresh using the DBMS_SCHEDULER package causes spatial to use the SDO_TXN_JOURNAL_REG transaction journal table, instead of the SDO_TXN_JOURNAL_GTT transaction journal table.

So the solution now is to revert the SQL profile we created and apply the Spatial bundle bug fix Patch 34725493, or apply RU 19.17.

Since we had already applied another one-off patch for a previous Spatial issue, we couldn't apply this bug fix patch, as it conflicted with the one-off patch. We raised an SR with Oracle for a merge patch of the two conflicting patches, but Oracle couldn't commit to when the merge patch would be ready.

As we needed to get the production database back to normal as soon as possible, we had to roll back the entire 19.16 RU to return to the pre-patch state (RU 19.15). The application then verified that all jobs ran without issues, and we no longer saw any cluster-related waits after the rollback. We rolled back because 19.17 had yet to be tested in our lower environments before promotion to production.

Interestingly, Oracle maintains a note listing the patches that need to be applied on top of each RU: What Is The Latest Spatial Patch Bundle for 12c and Higher Databases? (Doc ID 2514624.1). So every time we apply an RU to a database, we need to refer to this note and apply the Spatial bundle for that RU as well.

This investigation was not only a troubleshooting experience but also a knowledge-gathering one: the bugs related to the Oracle Spatial component, the separate Spatial patch bundle, and so on.


Happy Troubleshooting and Patching...!!!


Tuesday, November 22, 2022

Oracle Database and Antivirus Software

If you search Oracle Support for antivirus-related issues, you will hit many notes about AVs causing problems for the Oracle database, and this is not specific to any one AV but applies to AV software in general.

Taking a look at the Oracle support note How To Configure Anti-Virus On Oracle Database Server (Doc ID 782354.1), Oracle mentions the below:

we recommend the following files to be excluded from online anti-virus scanning.

  • Oracle datafiles
  • Control files
  • Redo-log files
  • Archived redo-log files if database is in archive log mode
  • Files with extension '.ora'
  • Password file
  • File in ADR directory
  • Oracle Software directories (including Oracle_HOME and Oracle base)
  • /dev/shm on Linux platform

So, is that it? Are we good if we just exclude all the files mentioned above? 

I'll discuss 2 scenarios where AV software caused outages to the database (I'm not naming the AV products here).

Scenario 1: 

We had installed one of the leading AV products on all our DB servers. During the testing phase everything went smoothly. A few files were excluded as suggested by Oracle, and after rigorous testing in the lower environments we finally agreed to deploy to the production servers.

Soon after the production deployment, we started having issues with ASM disk writes: the ASM disks became invisible to Oracle. The following errors popped up frequently on busy, high-load databases.

ORA-00202: control file: '+DATA/ABC/cntrlabc.dbf'
ORA-15081: failed to submit an I/O operation to a disk
ORA-00312: online log 5 thread 1: '+LOG/ABC/ONLINELOG/group_5.261.981173379'
ORA-15081: failed to submit an I/O operation to a disk
So what's really happening here? 

We use udev rules to define the ASM disks in all our environments (old school, but it still works effectively). Our assumption is that the AV software locks, or prevents other processes from reading, the udev rules files or the ASM disk header exactly while Oracle tries to read/write them. The exact cause is unknown to this day, as the incident is not reproducible at will.

To get out of this situation, we have to reload the udev rules, and things get back to normal. This is no joke: a busy database might lose data, or operations will stall, when the control file/data file goes missing from where it has to be.
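For context, an ASM disk udev rule and the reload we run look roughly like this (the rule values are illustrative, not our actual ones):

# /etc/udev/rules.d/99-oracle-asmdevices.rules (illustrative values)
KERNEL=="sd?1", SUBSYSTEM=="block", PROGRAM=="/usr/lib/udev/scsi_id -g -u -d /dev/$parent", RESULT=="<disk_wwid>", SYMLINK+="oracleasm/asmdisk01", OWNER="oracle", GROUP="asmadmin", MODE="0660"

# reload the rules and re-trigger block device events
udevadm control --reload-rules
udevadm trigger --type=devices --action=change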
 
We worked with Oracle Support, who couldn't figure out anything specific from the logs provided, and we also worked closely with the AV vendor for close to 8 months trying to trap the error/incident: different tracing approaches, re-running the same workload at the times the issue previously occurred, etc. But as I mentioned, we couldn't trap the issue, and tracing can't be left enabled on a production database server for long periods due to the enormous logs it produces, the performance impact, and so on.

In the meantime, multiple version updates with minor-to-major file handling changes w.r.t. the Oracle database were introduced into the AV software, but none of them prevented the issue from recurring unexpectedly, which made the business people unhappy.
Finally, business agreed to move to another AV vendor 😜

The horror story continues... 

Scenario 2: 

During the 8 months of R&D with the previous vendor, along comes another AV vendor with all the super cool features one system could want to get rid of every virus (just like the complete course of COVID vaccine, with a booster dose on top).
Just like the previous one, the testing passed with flying colors. No outages such as invisible disks were seen, no performance slowness was reported in the lower environments, and a pilot was also done on a low-risk production database server.

Finally, it was decided to take the AV software to production along with the other changes we usually do, such as quarterly OS and DB patching. The deployment went fine on all the production DB servers, with no issues reported during the maintenance window or the post-maintenance testing window. All were happy; everyone went off to celebrate and sleep peacefully.

Next day, business began as usual.. Time passed slowly.. The application team noticed a slight delay in batch processing for one job or another.. They kept monitoring, thinking it was a one-off case, and waited for some time.. Then performance took a huge hit on one of the most critical OLTP databases and the entire database slowed down.. Tickets bombarded us, a high-priority ticket was eventually raised, and members from different groups (application team, OS admins, infra admins, DBAs) jumped onto the bridge to check from their end.

Initial checks on the DB showed almost all queries running slow. That is not a typical situation: when every query gets hit, the issue is likely something outside the database. Checks on the OS showed the CPU run queue was far too high and that time spent in the kernel was taking a high percentage of CPU. This is definitely unusual.


The load average was near 400 on this 72-core system, which translates to a load of ~5.5 per core: far too high. Interestingly, unlike on other days, kernel processing was taking > 50% of CPU (ideally it should be < 10%), a hint that something was messed up at the OS/kernel level.

So the first thing we did was check the AV activity and stop it right there, as it was the only alien on the system. The OS admin disabled the AV software, but CPU usage still didn't come down. So we guessed something else was causing the issue. Remember, we did OS and database patching as well, so the plan was to roll back both patches one after another to isolate the cause.

Meanwhile, support from vendors such as Oracle, Redhat and SAP (the application is SAP) was engaged via high-priority tickets. Oracle and SAP both indicated the cause was likely at the OS level, as kernel processing was unusually high. So we first rolled back the kernel to the previous version and rebooted the entire system; the issue persisted. The database patch was then rolled back, and CPU usage still didn't return to normal levels.

We then installed the perf tool, which is not a standard package shipped with Linux, as Redhat needed finer details to pinpoint the cause. Take a look here to learn more about perf.
With the OS team providing the required details to Redhat support (I have limited knowledge of exactly which traces were provided, but I guess a perf stat trace), Redhat came back pointing at event_collector_2_3_xxx, a third-party module loaded into the kernel that was making the kernel eat all the CPU. This module was from the AV software. We had to completely uninstall the AV software and take a clean reboot to get the module out of the kernel.
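I don't have the exact trace commands Redhat used, but the kind of checks that expose a rogue kernel module are along these lines:

# where is the CPU going? (kernel symbols will dominate if a module misbehaves)
perf top
perf stat -a sleep 10

# list loaded kernel modules and inspect suspicious, non-distro ones
lsmod
modinfo event_collector_2_3_xxx    # module name taken from Redhat's finding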

One thing to understand: though we disabled the AV software as the first step of our troubleshooting, the same AV software was identified as the culprit only after a very long investigation. This could seriously have been avoided had we known how the AV behaves when it is merely disabled: its kernel module stays loaded.

So, what's the solution for this? 

Oracle doesn't certify any AV software to work properly with Oracle databases. We can't leave our production databases/servers unprotected, so we can't get rid of antivirus software altogether. At the same time, we can't risk unexpected production downtime, as we saw in the above 2 cases, out of the multiple other possibilities that might occur.

Rather than relying on AVs alone, there are alternatives we can try, such as the below:
  1. The DB server should be restricted to DBA logins only, specifically a senior DBA who does admin work (such as startup and shutdown using a sys account), not management work like adding/extending tablespaces; those other activities can be done remotely using the system account or a customized DBA account. This ensures very few logins exist directly on the server, unlike the usual single common account (in most cases oracle) that all team members share, which poses a very serious security risk.
  2. Place all DB servers in a subnet with tight firewalls: auto log-off in x seconds/minutes, trace all activity, and audit frequently what commands have been run. Only SAs and DBAs should have direct login accounts.
  3. OS login accounts should be coupled with AD and use MFA for authentication. The configuration seems complicated, but it is a one-time setup: once the initial effort is done, we are done.
  4. To limit direct access to the DB, use Kerberos/RADIUS authentication or Centrally Managed Users (available from version 18c), which couples enterprise users with AD accounts and groups and lets users connect with their enterprise logins. This provides double protection: once a member leaves the organization, his/her account is wiped out centrally, keeping the server/DB secure.
  5. Use the TCP.VALIDNODE_CHECKING = YES, TCP.EXCLUDED_NODES and TCP.INVITED_NODES sqlnet parameters to accept connections only from specific IP addresses or subnets, making application access to the database tightly controlled (see the sample fragment after this list).
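A sample sqlnet.ora fragment for item 5 (the addresses are placeholders; the listener must be restarted/reloaded to pick up the change):

TCP.VALIDNODE_CHECKING = YES
TCP.INVITED_NODES = (appserver1.example.com, 192.0.2.10)
TCP.EXCLUDED_NODES = (198.51.100.25)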
These are a few things I can think of to secure the database/server from exploitation, which would eliminate or minimize the need for antivirus software.

Let me know in the comments about other implementations you have come across to secure the databases/servers in your organization, so we can help each other learn and keep our databases secure.

References:

How To Configure Anti-Virus On Oracle Database Server (Doc ID 782354.1)
What is TCP_VALIDNODE_CHECKING and How to Use It (Doc ID 462933.1)

Happy Anti Virusing...!!!

Tuesday, August 23, 2022

Oracle Centrally Managed Users (CMU) configuration

Centrally managed users (CMU) provides a simpler integration with Microsoft Active Directory to allow centralized authentication and authorization of users. 

Quote from official documentation
"In order for the Oracle Database CMU with Active Directory integration to work, the Oracle database must be able to login to a service account specifically created for the database in Active Directory. The database uses this service account to query Active Directory for user and group information when a user logs into the database. This Active Directory service account must have all the privileges required to query the user and group information as well as being able to write updates related to the password policies in Active Directory (for example, failed login attempts, clear failed login attempts). Users can authenticate using passwords, Kerberos, or PKI and either be assigned to an exclusive schema or a shared schema. Mapping of an Active Directory user to a shared schema is determined by the association of the user to an Active Directory group that is mapped to the shared schema. Active Directory groups can also be mapped to database global roles. An Active Directory security administrator can assign a user to groups that are mapped to shared database global users (schemas) and/or database global roles, and hence update privileges and roles that are assigned to the Active Directory user in a database."

In this post, I'll be explaining how to configure Oracle Centrally Managed Users using Windows server 2019 Active Directory and Oracle database version 19c. 


Since the configuration is a bit long, with multiple screenshots, I have divided the steps into 4 major parts:

  • Part A: AD Configuration
  • Part B: Database Server Configuration
  • Part C: Enhancing to CMU
  • Part D: Enhancing CMU with Shared users and roles

All the steps are explained in the downloadable document. Click below link to download the file.

>>> Centrally Managed Users - orabliss.com <<<

Please make use of the document and share your feedback or any questions you have; I'll try to help resolve them.

I had to set up my own AD on VirtualBox; I have not covered those steps in the document, but the first 2 reference links will help you set one up if you need to.

References: 


Happy CMing Users!!! :)

Thursday, June 30, 2022

RMAN Merged incremental backups

We are changing our backup solution vendor, and the new vendor utilizes RMAN merged incremental backups. So let's see what this is and what the advantages of this RMAN capability are.

Merged incremental creates a level 0 image copy backup of all of the database's datafiles in a disk location; on subsequent runs, level 1 incremental backups are merged into those copies. All datafiles are backed up using the same tag name, and Block Change Tracking can be used to make the level 1 incrementals fast.


Let's see a quick demo on how to perform a Merged Incremental backup
run {
allocate channel c1 device type disk format '/media/sf_Oracle/RMAN-training/%U';
recover copy of database with tag 'Demo_merged_l0';
backup incremental level 1 copies=1 for recover of copy with tag 'Demo_merged_l0' database;
}
This is the default strategy OEM uses. If we schedule the backup every day at 12 pm, the datafile copies will be 24 hrs behind.

First day: Run 1
[oracle@linux-8 RMAN-training]$ rman target sys/oracle@odb1

Recovery Manager: Release 19.0.0.0.0 - Production on Thu Jun 30 01:10:09 2022
Version 19.10.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

connected to target database: CDB19:ODB1 (DBID=223390358)

RMAN> run {
allocate channel c1 device type disk format '/media/sf_Oracle/RMAN-training/%U';
recover copy of database with tag 'Demo_merged_l0';
backup incremental level 1 copies=1 for recover of copy with tag 'Demo_merged_l0' database;
}2> 3> 4> 5>

using target database control file instead of recovery catalog
allocated channel: c1
channel c1: SID=414 device type=DISK

Starting recover at 30-JUN-22
no copy of datafile 16 found to recover
no copy of datafile 17 found to recover
no copy of datafile 18 found to recover
no copy of datafile 19 found to recover
Finished recover at 30-JUN-22

Starting backup at 30-JUN-22
no parent backup or copy of datafile 18 found
no parent backup or copy of datafile 16 found
no parent backup or copy of datafile 17 found
no parent backup or copy of datafile 19 found
channel c1: starting datafile copy
input datafile file number=00018 name=+DATA/ODB1/DATAFILE/undotbs1.273.1084998671
output file name=/media/sf_Oracle/RMAN-training/data_D-CDB19_I-3846007102_TS-UNDOTBS1_FNO-18_kr11ag49 tag=DEMO_MERGED_L0 RECID=95 STAMP=1108689035
channel c1: datafile copy complete, elapsed time: 00:00:03
channel c1: starting datafile copy
input datafile file number=00016 name=+DATA/ODB1/DATAFILE/system.272.1084998669
output file name=/media/sf_Oracle/RMAN-training/data_D-CDB19_I-3846007102_TS-SYSTEM_FNO-16_ks11ag4c tag=DEMO_MERGED_L0 RECID=96 STAMP=1108689038
channel c1: datafile copy complete, elapsed time: 00:00:03
channel c1: starting datafile copy
input datafile file number=00017 name=+DATA/ODB1/DATAFILE/sysaux.274.1084998671
output file name=/media/sf_Oracle/RMAN-training/data_D-CDB19_I-3846007102_TS-SYSAUX_FNO-17_kt11ag4f tag=DEMO_MERGED_L0 RECID=97 STAMP=1108689040
channel c1: datafile copy complete, elapsed time: 00:00:01
channel c1: starting datafile copy
input datafile file number=00019 name=+DATA/CDB19/CD7591F2458BB5B4E053B538A8C08E48/DATAFILE/users.275.1108164853
output file name=/media/sf_Oracle/RMAN-training/data_D-CDB19_I-3846007102_TS-USERS_FNO-19_ku11ag4g tag=DEMO_MERGED_L0 RECID=98 STAMP=1108689040
channel c1: datafile copy complete, elapsed time: 00:00:01
Finished backup at 30-JUN-22
released channel: c1

RMAN> exit


Recovery Manager complete.
[oracle@linux-8 RMAN-training]$ ls -lrt
...
-rwxrwx---. 1 root vboxsf 1447043072 Jun 30 01:10 data_D-CDB19_I-3846007102_TS-UNDOTBS1_FNO-18_kr11ag49
-rwxrwx---. 1 root vboxsf 1258299392 Jun 30 01:10 data_D-CDB19_I-3846007102_TS-SYSTEM_FNO-16_ks11ag4c
-rwxrwx---. 1 root vboxsf  650125312 Jun 30 01:10 data_D-CDB19_I-3846007102_TS-SYSAUX_FNO-17_kt11ag4f
-rwxrwx---. 1 root vboxsf   41951232 Jun 30 01:10 data_D-CDB19_I-3846007102_TS-USERS_FNO-19_ku11ag4g
 
We can see we get "no copy of datafile # found to recover", as this is the first run and no copy exists yet to recover.
The backup command also throws "no parent backup or copy of datafile # found", since we are triggering an incremental level 1 backup without a prior level 0 backup. In this case, Oracle creates a fresh level 0 copy of the datafiles. This can be identified by the "channel c1: starting datafile copy" lines, with no mention of incremental level 1 (which we will see in the next run).

Second day: Run 2

RMAN> run {
allocate channel c1 device type disk format '/media/sf_Oracle/RMAN-training/%U';
recover copy of database with tag 'Demo_merged_l0';
backup incremental level 1 copies=1 for recover of copy with tag 'Demo_merged_l0' database;
}2> 3> 4> 5>

allocated channel: c1
channel c1: SID=414 device type=DISK

Starting recover at 30-JUN-22
no copy of datafile 16 found to recover
no copy of datafile 17 found to recover
no copy of datafile 18 found to recover
no copy of datafile 19 found to recover
Finished recover at 30-JUN-22

Starting backup at 30-JUN-22
channel c1: starting compressed incremental level 1 datafile backup set
channel c1: specifying datafile(s) in backup set
input datafile file number=00018 name=+DATA/ODB1/DATAFILE/undotbs1.273.1084998671
input datafile file number=00016 name=+DATA/ODB1/DATAFILE/system.272.1084998669
input datafile file number=00017 name=+DATA/ODB1/DATAFILE/sysaux.274.1084998671
input datafile file number=00019 name=+DATA/CDB19/CD7591F2458BB5B4E053B538A8C08E48/DATAFILE/users.275.1108164853
channel c1: starting piece 1 at 30-JUN-22
channel c1: finished piece 1 at 30-JUN-22
piece handle=/media/sf_Oracle/RMAN-training/kv11agra_671_1_1 tag=DEMO_MERGED_L0 comment=NONE
channel c1: backup set complete, elapsed time: 00:00:03
Finished backup at 30-JUN-22
released channel: c1

RMAN> exit
[oracle@linux-8 RMAN-training]$ ls -lrt
...
-rwxrwx---. 1 root vboxsf 1447043072 Jun 30 01:10 data_D-CDB19_I-3846007102_TS-UNDOTBS1_FNO-18_kr11ag49
-rwxrwx---. 1 root vboxsf 1258299392 Jun 30 01:10 data_D-CDB19_I-3846007102_TS-SYSTEM_FNO-16_ks11ag4c
-rwxrwx---. 1 root vboxsf  650125312 Jun 30 01:10 data_D-CDB19_I-3846007102_TS-SYSAUX_FNO-17_kt11ag4f
-rwxrwx---. 1 root vboxsf   41951232 Jun 30 01:10 data_D-CDB19_I-3846007102_TS-USERS_FNO-19_ku11ag4g
-rwxrwx---. 1 root vboxsf    2482176 Jun 30 01:22 kv11agra_671_1_1

We still see "no copy of datafile # found to recover", as we don't yet have a level 1 backup to apply; remember, run 1 created the level 0 copies. This second run created an incremental level 1 backup, which you can see from the "starting compressed incremental level 1 datafile backup set" line and from the new file kv11agra_671_1_1 in the listing.

Third day: Run 3 and above
RMAN> run {
allocate channel c1 device type disk format '/media/sf_Oracle/RMAN-training/%U';
recover copy of database with tag 'Demo_merged_l0';
backup incremental level 1 copies=1 for recover of copy with tag 'Demo_merged_l0' database;
}2> 3> 4> 5>

using target database control file instead of recovery catalog
allocated channel: c1
channel c1: SID=414 device type=DISK

Starting recover at 30-JUN-22
channel c1: starting incremental datafile backup set restore
channel c1: specifying datafile copies to recover
recovering datafile copy file number=00016 name=/media/sf_Oracle/RMAN-training/data_D-CDB19_I-3846007102_TS-SYSTEM_FNO-16_ks11ag4c
recovering datafile copy file number=00017 name=/media/sf_Oracle/RMAN-training/data_D-CDB19_I-3846007102_TS-SYSAUX_FNO-17_kt11ag4f
recovering datafile copy file number=00018 name=/media/sf_Oracle/RMAN-training/data_D-CDB19_I-3846007102_TS-UNDOTBS1_FNO-18_kr11ag49
recovering datafile copy file number=00019 name=/media/sf_Oracle/RMAN-training/data_D-CDB19_I-3846007102_TS-USERS_FNO-19_ku11ag4g
channel c1: reading from backup piece /media/sf_Oracle/RMAN-training/kv11agra_671_1_1
channel c1: piece handle=/media/sf_Oracle/RMAN-training/kv11agra_671_1_1 tag=DEMO_MERGED_L0
channel c1: restored backup piece 1
channel c1: restore complete, elapsed time: 00:00:01
Finished recover at 30-JUN-22

Starting backup at 30-JUN-22
channel c1: starting compressed incremental level 1 datafile backup set
channel c1: specifying datafile(s) in backup set
input datafile file number=00018 name=+DATA/ODB1/DATAFILE/undotbs1.273.1084998671
input datafile file number=00016 name=+DATA/ODB1/DATAFILE/system.272.1084998669
input datafile file number=00017 name=+DATA/ODB1/DATAFILE/sysaux.274.1084998671
input datafile file number=00019 name=+DATA/CDB19/CD7591F2458BB5B4E053B538A8C08E48/DATAFILE/users.275.1108164853
channel c1: starting piece 1 at 30-JUN-22
channel c1: finished piece 1 at 30-JUN-22
piece handle=/media/sf_Oracle/RMAN-training/l011ahbn_672_1_1 tag=DEMO_MERGED_L0 comment=NONE
channel c1: backup set complete, elapsed time: 00:00:03
Finished backup at 30-JUN-22
released channel: c1

RMAN> exit
[oracle@linux-8 RMAN-training]$ ls -lrt
...
-rwxrwx---. 1 root vboxsf    2482176 Jun 30 01:22 kv11agra_671_1_1
-rwxrwx---. 1 root vboxsf   41951232 Jun 30 01:31 data_D-CDB19_I-3846007102_TS-USERS_FNO-19_ku11ag4g
-rwxrwx---. 1 root vboxsf  650125312 Jun 30 01:31 data_D-CDB19_I-3846007102_TS-SYSAUX_FNO-17_kt11ag4f
-rwxrwx---. 1 root vboxsf 1447043072 Jun 30 01:31 data_D-CDB19_I-3846007102_TS-UNDOTBS1_FNO-18_kr11ag49
-rwxrwx---. 1 root vboxsf 1258299392 Jun 30 01:31 data_D-CDB19_I-3846007102_TS-SYSTEM_FNO-16_ks11ag4c
-rwxrwx---. 1 root vboxsf    2580480 Jun 30 01:31 l011ahbn_672_1_1

Now we can see the previous level 1 backup has been applied to the datafile copies, and another incremental level 1 backup has been taken.
Note that the copies are recovered only on the next day's run, hence the copies will always be 24 hrs behind.

We can change the strategy to keep the copies current by flipping the backup and recover commands, like below.
run {
allocate channel c1 device type disk format '/media/sf_Oracle/RMAN-training/%U';
backup incremental level 1 copies=1 for recover of copy with tag 'Demo_merged_l0' database;
recover copy of database with tag 'Demo_merged_l0';
}
Coming to retention...

The above strategy works if we only ever need to restore from the point in time the datafile copy represents onward. If we need a longer retention, say 30 days, we can lag the recovery of the copy by 30 days using the command below:
RECOVER COPY OF DATABASE WITH TAG 'Demo_merged_l0' UNTIL TIME "SYSDATE-31" FROM TAG 'Demo_merged_l0';
Block change tracking can also be used for the incremental backups. To track 30 days of incremental level 1 changes, the hidden parameter _bct_bitmaps_per_file should be increased from its default of 8 (8 days) to 30.
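As a sketch (the BCT file location is an assumption, and underscore parameters should only be changed with Oracle Support's blessing):

SQL> ALTER DATABASE ENABLE BLOCK CHANGE TRACKING USING FILE '+DATA';
SQL> ALTER SYSTEM SET "_bct_bitmaps_per_file"=30 SCOPE=SPFILE SID='*';
-- an instance restart is needed for the underscore parameter to take effect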

Advantages: 

For VLDBs, this solution provides quicker backups: only on the first day do we take a full backup; after that, it is a continuous, incremental-forever backup.


Happy Backups...!!!

Thursday, May 26, 2022

Script to gather details while using Datapump for DB migration

Here is a script that can be used to gather source database details when we are upgrading from one DB version to another or migrating an Oracle DB from one server to another.

The purpose is to gather as many details as possible so that the target DB can be prepared in such a way that we don't have much work to do after importing the required schemas.

/* -----------------------------------------------------------------------------------------
Script name 	: export_details.sql
Usage			: sqlplus / as sysdba @export_details.sql

Instructions	: 
Replace all SCHEMA1 and SCHEMA2 with the schemas you intend to export
and save the file as export_details.sql; use as many users as needed

Important Note	: 
Once the details are gathered, export NLS_LANG as specified below to set the env variable
and start the export using the options mentioned at the end of this file

export NLS_LANG=AMERICAN_AMERICA.AL32UTF8 - check which NLS_LANG the application needs

This script will create 3 o/p files: 1 html and 2 sql files.
------------------------------------------------------------------------------------------ */

set echo on pages 999 long 90000
col OWNER for a30
col USERNAME for a30
col PROFILE for a30
col GRANTED_ROLE for a30
col GRANTEE for a30
col PROPERTY_NAME for a40
col PROPERTY_VALUE for a40
col DISPLAY_VALUE for a50
col DEFAULT_VALUE for a50
set markup html on spool on pre off
spool source_details.html

Prompt Parameter details:
------ ------------------
select name, DISPLAY_VALUE from v$parameter where name in ('service_names', 'compatible','sga_max_size','sga_target','log_buffer','db_cache_size','pga_aggregate_target', 'pga_aggregate_limit','cpu_count','session_cached_cursors','open_cursors','processes') order by name;

Prompt User details:
------ -------------
select count(*) "Total Schemas" from dba_users;
select sum(bytes)/1024/1024/1024 Total_DB_used_Size_In_GB from dba_segments;
select sum(bytes)/1024/1024/1024 Total_DB_allocated_Size_In_GB from dba_data_files;
SELECT owner,sum(bytes)/1024/1024/1024 Size_In_GB from dba_segments WHERE owner IN ('SCHEMA1','SCHEMA2') group by owner order by owner;

Prompt Info about Schemas:
------ -------------------
select USERNAME,ACCOUNT_STATUS,LOCK_DATE,EXPIRY_DATE,DEFAULT_TABLESPACE,TEMPORARY_TABLESPACE,PROFILE from dba_users where USERNAME in ('SCHEMA1','SCHEMA2') order by 1;

set markup html on spool on pre on
Prompt Profile details for user:
------ -------------------------
SELECT dbms_metadata.get_ddl('PROFILE', p.profile) || ';' from dba_profiles p where profile not in ('DEFAULT') and resource_name='COMPOSITE_LIMIT';

set markup html on spool on pre off
Prompt Tablespace Size for particular schema:
------ --------------------------------------
select owner, TABLESPACE_NAME, sum(BYTES)/1024/1024/1024 Size_In_GB from dba_segments where OWNER in ('SCHEMA1','SCHEMA2') group by owner,TABLESPACE_NAME order by owner,TABLESPACE_NAME;

set markup html on spool on pre on
Prompt DDL for Schema, Non default Roles and Tablespace:
------ -------------------------------------------------
SELECT dbms_metadata.get_ddl('USER',u.username) || ';' as USERS from dba_users u where username in ('SCHEMA1','SCHEMA2');
select dbms_metadata.get_ddl('TABLESPACE',tb.tablespace_name) || ';' as Tablespaces from dba_tablespaces tb where tablespace_name not in ('SYSTEM','SYSAUX','UNDOTBS','TEMP');
select dbms_metadata.get_ddl('ROLE', r.role) || ';' as Roles from dba_roles r where ORACLE_MAINTAINED<>'Y';

set markup html on spool on pre off
Prompt Temp tablespace size:
------ ---------------------
select tablespace_name, sum(BYTES)/1024/1024/1024 from dba_temp_files group by tablespace_name;

Prompt Objects counts:
------ ---------------
select owner, object_type, status, count(*) from dba_objects where owner in ('SCHEMA1','SCHEMA2') group by object_type,owner, status order by 1,2,3;

Prompt Roles:
------ ------
select GRANTEE, GRANTED_ROLE, ADMIN_OPTION from dba_role_privs where grantee in ('SCHEMA1','SCHEMA2') order by 1,2;
select GRANTEE, PRIVILEGE, ADMIN_OPTION from dba_sys_privs where grantee in ('SCHEMA1','SCHEMA2') order by 1,2;


Prompt System privileges having a property value = 1
Prompt Use specific DBMS packages such as DBMS_RESOURCE_MANAGER_PRIVS to provide grants to these objects. 
------ --------------------------------------------------------------------------------------------------
select p.grantee, m.name from system_privilege_map m, dba_sys_privs p where m.name=p.privilege and m.property=1 and p.grantee in ('SCHEMA1','SCHEMA2') order by p.grantee, m.name;

Prompt DB Time Zone Details:
------ ---------------------
select sysdate,systimestamp,current_timestamp,sessiontimezone,dbtimezone from dual;

Prompt DB Characterset:
------ ----------------
select PROPERTY_NAME,PROPERTY_VALUE from database_properties where PROPERTY_NAME in ('NLS_CHARACTERSET','NLS_NCHAR_CHARACTERSET');

spool off

Prompt Generating grants script:
------ -------------------------
set markup html off
set lines 200 pages 0 head off feed off
spool source_all_privs.sql

select 'grant ' || privilege ||' on '|| owner || '.' || table_name ||' to '||grantee||';' from dba_tab_privs where grantee in ('SCHEMA1','SCHEMA2') and table_name NOT like 'BIN%';
spool off

spool source_all_synonyms.sql

SELECT 'CREATE OR REPLACE PUBLIC SYNONYM '||SYNONYM_NAME||' FOR '||TABLE_OWNER||'.'||TABLE_NAME||';' FROM ALL_SYNONYMS WHERE OWNER='PUBLIC' AND TABLE_OWNER IN ('SCHEMA1','SCHEMA2');

spool off

exit
Once the details are gathered, we can use the generated html file to prepare the target DB: creating the required tablespaces, roles, etc.
Once the import completes on the target DB, both sql scripts can be run against the target to make sure all grants are provided and the public synonyms are created.
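The export options referenced in the script header aren't reproduced in this post; as a rough sketch, an invocation for the schemas above might look like this (directory name and parallelism are assumptions, adjust for your environment):

export NLS_LANG=AMERICAN_AMERICA.AL32UTF8
expdp \"/ as sysdba\" schemas=SCHEMA1,SCHEMA2 directory=DATA_PUMP_DIR dumpfile=exp_schemas_%U.dmp logfile=exp_schemas.log parallel=4 flashback_time=systimestamp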

Let me know in the comments if there are further details worth collecting from the source DB.

Also check Datapump & system privileges with property value = 1 for details on that part.

Happy Datapumping...!!!

Saturday, May 7, 2022

Datapump & system privileges with property value = 1

Many automation tools like AutoUpgrade (a sample demo is provided in this link), Zero Downtime Migration (ZDM), etc. have been released by Oracle to make a DBA's life easier. Though we have these tools, we still run towards the most preferred and simple method, Datapump, whether to jump DBs from one server to another or from one version to another, whenever it can be done within the allowed downtime. It is such a powerful utility, used by almost every DBA in their work life.

In one of my previous posts, I explained a few tips to make a Datapump job perform better. They still hold true for the current version of Oracle, with many advancements since.


Today we look at the privileges with property value = 1 and their effect on Datapump.
In my recent migration of a DB from version 11.2.0.4 to 19c using Datapump, the application complained post-migration that a few privileges were missing in the 19c DB. That shouldn't be possible, as I had taken a complete schema export and imported it into the target database with no errors reported in the impdp logfile.
Upon investigating, the ADMINISTER RESOURCE MANAGER privilege had not been granted in the 19c DB, though there were no errors in either the expdp or the impdp logfile.
Taking a look at Doc ID 1163383.1: a few privileges are not granted via normal grant statements; they have to be granted through specific PL/SQL packages.

How do we find those privileges? The query below provides them, and the result differs with the DB version. Here is an example from my 11g source system.
SQL> select name from system_privilege_map where property=1 order by name;

NAME
----------------------------------------
ADMINISTER RESOURCE MANAGER
ALTER ANY EVALUATION CONTEXT
ALTER ANY RULE
ALTER ANY RULE SET
CREATE ANY EVALUATION CONTEXT
CREATE ANY RULE
CREATE ANY RULE SET
CREATE EVALUATION CONTEXT
CREATE RULE
CREATE RULE SET
DEQUEUE ANY QUEUE

NAME
----------------------------------------
DROP ANY EVALUATION CONTEXT
DROP ANY RULE
DROP ANY RULE SET
ENQUEUE ANY QUEUE
EXECUTE ANY EVALUATION CONTEXT
EXECUTE ANY RULE
EXECUTE ANY RULE SET
MANAGE ANY FILE GROUP
MANAGE ANY QUEUE
MANAGE FILE GROUP
READ ANY FILE GROUP

22 rows selected.

SQL>
All the above are granted via specific packages or via explicit grants on the system. 

How do we find which privileges from the above list our source schema has, i.e., which ones need special handling on the target? The query below is an example.
SQL> select p.grantee, m.name from system_privilege_map m, dba_sys_privs p
  2  where m.name=p.privilege and m.property=1 and p.grantee in ('DUMMY')
  3  order by p.grantee, m.name;

GRANTEE                        NAME
------------------------------ ----------------------------------------
DUMMY                          ADMINISTER RESOURCE MANAGER
DUMMY                          READ ANY FILE GROUP

SQL>
I got 2 such privileges that need explicit handling on the target DB.

Now, on the 19c DB, we need to grant the privileges as below. A few require an explicit grant (like READ ANY FILE GROUP), and a few require grants via packages (like ADMINISTER RESOURCE MANAGER).
SQL> @check_user_privs.sql

GRANTEE                        TYP PRIVILEGE OR ROLE
------------------------------ --- ---------------------------------------------------------------------------
DUMMY                          PRV ALTER SESSION
DUMMY                              CREATE ANY DIRECTORY
DUMMY                              CREATE DATABASE LINK
DUMMY                              CREATE JOB
DUMMY                              CREATE PROCEDURE
DUMMY                              CREATE SEQUENCE
DUMMY                              CREATE SESSION
DUMMY                              CREATE SYNONYM
DUMMY                              CREATE TABLE
DUMMY                              CREATE TYPE
DUMMY                              CREATE VIEW
DUMMY                              SELECT ANY DICTIONARY
DUMMY                              UNLIMITED TABLESPACE


13 rows selected.

SQL> BEGIN
  DBMS_RESOURCE_MANAGER_PRIVS.GRANT_SYSTEM_PRIVILEGE(
   GRANTEE_NAME   => 'DUMMY',
   PRIVILEGE_NAME => 'ADMINISTER_RESOURCE_MANAGER',
   ADMIN_OPTION   => FALSE);
END;
/  2    3    4    5    6    7

PL/SQL procedure successfully completed.

SQL> grant READ ANY FILE GROUP to DUMMY;

Grant succeeded.

SQL> @check_user_privs.sql

GRANTEE                        TYP PRIVILEGE OR ROLE
------------------------------ --- ---------------------------------------------------------------------------
DUMMY                          PRV ADMINISTER RESOURCE MANAGER
DUMMY                              ALTER SESSION
DUMMY                              CREATE ANY DIRECTORY
DUMMY                              CREATE DATABASE LINK
DUMMY                              CREATE JOB
DUMMY                              CREATE PROCEDURE
DUMMY                              CREATE SEQUENCE
DUMMY                              CREATE SESSION
DUMMY                              CREATE SYNONYM
DUMMY                              CREATE TABLE
DUMMY                              CREATE TYPE
DUMMY                              CREATE VIEW
DUMMY                              READ ANY FILE GROUP
DUMMY                              SELECT ANY DICTIONARY
DUMMY                              UNLIMITED TABLESPACE


15 rows selected.

SQL>
Now we are all set. While performing export and import, we need to make sure we take care of these privilege grants, without fail, for the application to run without any issues.

Note: Similarly, grants on SYS-owned objects will not be transferred to the target database via Datapump either. They have to be granted explicitly on the target database.
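For example, a grant like the hypothetical one below, made on the source, has to be re-executed on the target after the import. The generator pattern from my gather script can be reused, filtered to SYS:

SQL> GRANT SELECT ON sys.v_$session TO dummy;

SQL> select 'grant '||privilege||' on '||owner||'.'||table_name||' to '||grantee||';'
  2  from dba_tab_privs where owner='SYS' and grantee in ('DUMMY');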

Query used: 

check_user_privs.sql
select grantee, 'PRV' type, privilege pv 
from dba_sys_privs where grantee = 'DUMMY' union
select username grantee, '---' type, 'empty user ---' pv from dba_users 
where not username in (select distinct grantee from dba_role_privs) and
not username in (select distinct grantee from dba_sys_privs) and 
not username in (select distinct grantee from dba_tab_privs) and username like 'DUMMY'
group by username
order by grantee, type, pv;

References: 
Primary Note For Privileges And Roles (Doc ID 1347470.1)

Monday, March 21, 2022

PDB Cloning random time out issue

This post discusses an issue similar to the one I encountered in the previous blog post.

When I tried to clone the pluggable database from one cluster to another, I got the following error.

SQL> CREATE PLUGGABLE DATABASE PRD01 from QA01@QA01 keystore identified by xxxxx no data;
CREATE PLUGGABLE DATABASE PRD01 from QA01@QA01 keystore identified by xxxxx no data
*
ERROR at line 1:
ORA-65169: error encountered while attempting to copy file
+DATAC7/XXX04/D84A38E95B599A60E053E1A1C30AC20D/DATAFILE/undo_14.847.1096976301
ORA-17627: ORA-12170: TNS:Connect timeout occurred
ORA-17629: Cannot connect to the remote database server

SQL> 

It seems something is timing out, and we need to figure out where; the alert log will help us here. The contents of the alert log are below.

***********************************************************************

Fatal NI connect error 12170, connecting to:
 (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(Host=dev2-scan.xxx.com)(Port=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=QA01_SVC.xxx.com)(CID=(PROGRAM=oracle)(HOST=s23)(USER=oracle))))

  VERSION INFORMATION:
        TNS for Linux: Version 19.0.0.0.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 19.0.0.0.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 19.0.0.0.0 - Production
  Version 19.12.0.0.0
  Time: 18-MAR-2022 18:07:48
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535

TNS-12535: TNS:operation timed out
    ns secondary err code: 12560
    nt main err code: 505

TNS-00505: Operation timed out
    nt secondary err code: 0
    nt OS err code: 0
Errors in file /u02/app/oracle/diag/rdbms/xxx04/XXX041/trace/XXX041_ora_295605.trc:
ORA-17627: ORA-12170: TNS:Connect timeout occurred
ORA-17629: Cannot connect to the remote database server
********************************************************
Undo Create of Pluggable Database PRD01 with pdb id - 3.
********************************************************
Sometimes I also got the error below, for which I don't know the exact reason, as I mentioned in my previous post:
TNS-00505: Operation timed out
    nt secondary err code: 0
    nt OS err code: 0
2022-03-18T18:14:25.706649+00:00
Errors in file /u02/app/oracle/diag/rdbms/xxx04/XXX041/trace/XXX041_ora_269420.trc:
ORA-17627: ORA-01017: invalid username/password; logon denied
ORA-17629: Cannot connect to the remote database server
2022-03-18T18:14:25.706956+00:00
********************************************************
Undo Create of Pluggable Database PRD01 with pdb id - 3.
********************************************************
We can see nt secondary err code: 0, meaning the timeout is caused by the listener or the database, not by something external such as server or firewall settings.
The connect string uses the SCAN listener, as in all our RAC tnsnames.ora settings, so we checked whether the SCAN IPs resolve in round robin and whether the VIPs are working fine. As expected, all were working fine.
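That check is quick to do (the hostname is from the alert log excerpt above):

$ nslookup dev2-scan.xxx.com     # run a few times; the SCAN IPs should rotate order
$ srvctl status scan_listener    # as the grid user, confirm the SCAN listeners are up

Even so, a simple select from dual over the DB link to the source threw a timeout error at random frequency: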
SQL> select sysdate from dual@QA01;

select sysdate from dual@QA01
                         *
ERROR at line 1:
ORA-12170: TNS:Connect timeout occurred

SQL> /

SYSDATE
---------
18-MAR-22
So where are we getting the time out from? 

Upon further checking the local_listener parameter in the source database, I found an odd setting on one of the nodes, shown below. The other node had the proper setting.
SQL> sho parameter local

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
local_listener                       string      (ADDRESS=(PROTOCOL=TCP)(HOST=1
                                                 0.195.161.228)(PORT=1521)), (A
                                                 DDRESS=(PROTOCOL=TCP)(HOST=10.
                                                 195.161.228)(PORT=1522)),(ADDR
                                                 ESS=(PROTOCOL=TCPS)(HOST=10.19
                                                 5.161.228)(PORT=2484))
I then checked the listener status to see which ports it is configured to listen on:
[oracle@s11 ~]$ lsnrctl status|more

LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 20-MAR-2022 19:08:12

Copyright (c) 1991, 2021, Oracle.  All rights reserved.

Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Linux: Version 19.0.0.0.0 - Production
Start Date                13-FEB-2022 09:20:10
Uptime                    35 days 9 hr. 48 min. 1 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/19.0.0.0/grid/network/admin/listener.ora
Listener Log File         /u01/app/grid/diag/tnslsnr/s11/listener/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcps)(HOST=10.195.161.226)(PORT=2484)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.195.161.225)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.195.161.226)(PORT=1521)))
Services Summary...
Service "+APX" has 1 instance(s).
  Instance "+APX1", status READY, has 1 handler(s) for this service...
Service "+ASM" has 1 instance(s).
  Instance "+ASM1", status READY, has 1 handler(s) for this service...
  ...
  ...
  ...
As you can clearly see here, the listener is listening only on ports 1521 and 2484.
So I reset the local_listener parameter to remove the entry with port 1522, which the listener does not listen on. This second port was a leftover from when we were adding another listener for a PoC; the cleanup after removing that configuration was incomplete.
SQL> alter system set local_listener="(ADDRESS=(PROTOCOL=TCP)(HOST=10.195.161.226)(PORT=1521)),(ADDRESS=(PROTOCOL=TCPS)(HOST=10.195.161.226)(PORT=2484))" sid='XXX041';

System altered.

SQL> sho parameter local

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
local_listener                       string      (ADDRESS=(PROTOCOL=TCP)(HOST=1
                                                 0.195.161.226)(PORT=1521)),(AD
                                                 DRESS=(PROTOCOL=TCPS)(HOST=10.
                                                 195.161.226)(PORT=2484))

After resetting local_listener in the source database, I was able to clone the PDB without any issues.
SQL> CREATE PLUGGABLE DATABASE PRD01 from QA01@QA01 keystore identified by xxxxx no data;

Pluggable database created.

The SCAN listener hands the connection over to the VIP of a node in the cluster, which in turn passes it to the local listener. Since the local listener here was also configured with port 1522, which didn't exist, we got random timeouts whenever the connection handoff went to 1522. Utmost care has to be taken while configuring parameters to avoid such issues.

Happy troubleshooting...!!!

Monday, January 17, 2022

Wrong ORA error during PDB cloning!

Last week, while trying to clone a PDB under a CDB on one (non-prod) cluster to a PDB under a different CDB on another (prod) cluster, I encountered a series of errors. The first 2 are what someone might expect, but the last is an error I couldn't figure out the origin of. Let's take a look at it in detail.


Set up: 

  • Updated the tnsnames.ora file on all nodes of the cluster
  • Created a DB link from the prod CDB to the non-prod CDB to connect to the PDB
  • Tested the connection

QA01=
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(Host = scanname.example.com)(Port = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = QA01_SVC.example.com)
      )
    )

So far, so good. We are able to connect to the PDB via the DB link, and hence we can proceed to clone the PDB from the source CDB to the target CDB.
SQL> CREATE PLUGGABLE DATABASE PROD01 from QA01@QA01 keystore identified by xxx;
CREATE PLUGGABLE DATABASE PROD01 from QA01@QA01 keystore identified by xxx
*
ERROR at line 1:
ORA-17628: Oracle error 46659 returned by remote Oracle server
ORA-46659: master keys for the given PDB not found
 
With reference to MOS Doc ID 2778618.1, to resolve the above error, add "including shared key" at the end of the CREATE PLUGGABLE DATABASE statement.
Here is the series of errors that took place..
SQL> CREATE PLUGGABLE DATABASE PROD01 from QA01@QA01 keystore identified by xxx including shared key;
CREATE PLUGGABLE DATABASE PROD01 from QA01@QA01 keystore identified by xxx including shared key
*
ERROR at line 1:
ORA-65169: error encountered while attempting to copy file
+DATAC3/CDB01/CE3C246E805F6F3AE053D8A1C30A498C/DATAFILE/libraryd.2661.1085857629
ORA-17627: ORA-12170: TNS:Connect timeout occurred
ORA-17629: Cannot connect to the remote database server
  
Cool, now we at least have a different error than the one above. From the alert log..
Fatal NI connect error 12170, connecting to:
 (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(Host=scanname.example.com)(Port=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=QA01_SVC.example.com)(CID=(PROGRAM=oracle)(HOST=pd07)(USER=oracle))))

  VERSION INFORMATION:
        TNS for Linux: Version 19.0.0.0.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 19.0.0.0.0 - Production
  Version 19.12.0.0.0
  Time: 12-JAN-2022 18:21:25
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535

TNS-12535: TNS:operation timed out
    ns secondary err code: 12560
    nt main err code: 505

TNS-00505: Operation timed out
    nt secondary err code: 0
    nt OS err code: 0
2022-01-12T18:21:25.948274+00:00
Errors in file /u02/app/oracle/diag/rdbms/***/****/trace/****_p000_82225.trc:
ORA-17627: ORA-12170: TNS:Connect timeout occurred
ORA-17629: Cannot connect to the remote database server
 
There is no nt secondary err code, which means this is purely a timeout from the database. We had not set any timeout value, so the default INBOUND_CONNECT_TIMEOUT of 60 seconds kicked in and the connection timed out. No issues with this. All cool.
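Incidentally, if the file copies genuinely need more than the default 60 seconds to establish a connection, the timeout can be raised on the side being connected to (the source here); a sketch with arbitrary values:

# sqlnet.ora on the source database server
SQLNET.INBOUND_CONNECT_TIMEOUT = 300

# listener.ora on the source (the suffix matches the listener alias)
INBOUND_CONNECT_TIMEOUT_LISTENER = 300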
Just after receiving the timeout, I made the second attempt. 
SQL> CREATE PLUGGABLE DATABASE PROD01 from QA01@QA01 keystore identified by xxx including shared key;
CREATE PLUGGABLE DATABASE PROD01 from QA01@QA01 keystore identified by xxx including shared key
*
ERROR at line 1:
ORA-65169: error encountered while attempting to copy file
+DATAC3/CDB01/CE3C246E805F6F3AE053D8A1C30A498C/DATAFILE/ddtbsd.2636.1085857609
ORA-17627: ORA-01017: invalid username/password; logon denied
ORA-17629: Cannot connect to the remote database server
 
This attempt turned out to be weird, as the same command just a minute later threw a totally different error: ORA-01017. Had we gotten this error on the first attempt, I might have suspected the password provided and rechecked it, but the password that worked properly during testing is now giving us ORA-01017.
Alert log says.. 
2022-01-12T18:23:27.300138+00:00
Errors in file /u02/app/oracle/diag/rdbms/***/***/trace/***_p004_82257.trc:
ORA-17627: ORA-01017: invalid username/password; logon denied
ORA-17629: Cannot connect to the remote database server
2022-01-12T18:24:26.097107+00:00
Thread 7 advanced to log sequence 94 (LGWR switch),  current SCN: 37359630199069
  Current log# 26 seq# 94 mem# 0: +DATAC4/CDB0101/ONLINELOG/group_26.2116.1092461287
2022-01-12T18:24:29.260711+00:00
Deleted file +DATAC4/CDB01/D566C47DE983511DE05396A1C30A5D5E/DATAFILE/libraryd.2124.1093803079.
2022-01-12T18:24:29.287000+00:00
Deleted file +DATAC4/CDB01/D566C47DE983511DE05396A1C30A5D5E/DATAFILE/users.2133.1093803079.
2022-01-12T18:24:29.408203+00:00
Deleted file +DATAC4/CDB01/D566C47DE983511DE05396A1C30A5D5E/DATAFILE/pltbsd.2046.1093803079.
2022-01-12T18:24:29.514934+00:00
Deleted file +DATAC4/CDB01/D566C47DE983511DE05396A1C30A5D5E/DATAFILE/pptbsx.2050.1093803079.
2022-01-12T18:24:29.547221+00:00
Deleted file +DATAC4/CDB01/D566C47DE983511DE05396A1C30A5D5E/DATAFILE/cltbsd.2126.1093803079.
2022-01-12T18:24:29.558701+00:00
Deleted file +DATAC4/CDB01/D566C47DE983511DE05396A1C30A5D5E/DATAFILE/dmtbsx.2131.1093803079.
2022-01-12T18:24:29.610720+00:00
**************************************************************
Undo Create of Pluggable Database PROD01 with pdb id - 3.
**************************************************************
Checking the trace file associated with the error.. There are multiple session IDs with the same process ID, so a deeper look reveals the below.
...
... trimmed o/p ...

*** 2022-01-10T16:28:38.119788+00:00 (CDB$ROOT(1))
*** SESSION ID:(1188.31010) 2022-01-10T16:28:38.119813+00:00
*** CLIENT ID:() 2022-01-10T16:28:38.119816+00:00
*** SERVICE NAME:(pdb01) 2022-01-10T16:28:38.119819+00:00
*** MODULE NAME:(sqlplus@pd07 (TNS V1-V3)) 2022-01-10T16:28:38.119823+00:00
*** ACTION NAME:() 2022-01-10T16:28:38.119826+00:00
*** CLIENT DRIVER:() 2022-01-10T16:28:38.119829+00:00
*** CONTAINER ID:(1) 2022-01-10T16:28:38.119832+00:00

OSSIPC:
IPCDAT DESTROY QP for qp 0x18b96e48 ep 0x18b96cd0 drain start time 146888533
OSSIPC:IPCDAT DESTROY QP for qp 0x18b96e48 ep 0x18b96cd0 drain complete num reqs drained 0 drain end time 146888534 drain time 1 msec time slept 0

... trimmed o/p ...

*** 2022-01-12T17:47:46.797793+00:00 ((6))
*** SESSION ID:(1191.49246) 2022-01-12T17:47:46.799247+00:00
*** SERVICE NAME:() 2022-01-12T17:47:46.799261+00:00
*** MODULE NAME:(sqlplus@pd07 (TNS V1-V3)) 2022-01-12T17:47:46.799264+00:00
*** ACTION NAME:() 2022-01-12T17:47:46.799268+00:00
*** CONTAINER ID:(6) 2022-01-12T17:47:46.799274+00:00

Using key:
 kcbtek structure 0x7ffd3b4751e0
 utsn: 0x10030000001e (4099/30), alg: 0 keystate: 0 inv?: 0 usesalt?: 0 enctbs?: 0 obf?: 0 keyver: 4294967295 fbkey?: 0 fullenc?: 0 frn?: 0 rcv?: 0 skipchk?: 0 use_for_dec?: 0
 encrypted key: 0000000000000000000000000000000000000000000000000000000000000000
 mklocact 0 mkloc 0, mkid: 00000000000000000000000000000000 kcl: [0x7f344f3a15b8,0x7f3446627b30] kverl: [0x7f3446627af0,0x7f3446627af0]
kpdbfRekeyFileBlock: decrypting block file 100 block 56737 (<6/30, 0xdda1) failed with error 3.  Continue with operation..
kcbtse_encdec_tbsblk: WARNING cannot decrypt encrypted block since a valie key is not found. <4099/30, 0xdda2 (afn 100 block 56738)>

... trimmed o/p ...

 encrypted key: 0000000000000000000000000000000000000000000000000000000000000000
 mklocact 0 mkloc 0, mkid: 00000000000000000000000000000000 kcl: [0x7f344f3a15b8,0x7f3446627b30] kverl: [0x7f3446627af0,0x7f3446627af0]
kpdbfRekeyFileBlock: decrypting block file 100 block 109895 (<6/30, 0x1ad47) failed with error 3.  Continue with operation..
OCI error val is 1017 and errmsg is 'ORA-01017: invalid username/password; logon denied

*** 2022-01-12T18:23:27.299715+00:00 (CDB$ROOT(1))
*** SESSION ID:(1191.47408) 2022-01-12T18:23:27.299745+00:00
*** SERVICE NAME:(SYS$USERS) 2022-01-12T18:23:27.299749+00:00
*** MODULE NAME:(sqlplus@pd07 (TNS V1-V3)) 2022-01-12T18:23:27.299753+00:00
*** ACTION NAME:() 2022-01-12T18:23:27.299757+00:00
*** CONTAINER ID:(1) 2022-01-12T18:23:27.299760+00:00

'
ORA-17627: ORA-01017: invalid username/password; logon denied
<error barrier> at 0x7ffd3b473998 placed ksrpc.c@5166
ORA-17629: Cannot connect to the remote database server
<error barrier> at 0x7ffd3b47cf80 placed kpoodr.c@237
OSSIPC:
IPCDAT DESTROY QP for qp 0x18bc3ce8 ep 0x18b50e90 drain start time 472374487

*** 2022-01-14T13:03:35.743519+00:00 (CDB$ROOT(1))
*** SESSION ID:(1195.64486) 2022-01-14T13:03:35.743542+00:00
*** SERVICE NAME:() 2022-01-14T13:03:35.743548+00:00
*** MODULE NAME:() 2022-01-14T13:03:35.743552+00:00
*** ACTION NAME:() 2022-01-14T13:03:35.743557+00:00

OSSIPC:IPCDAT DESTROY QP for qp 0x18bc3ce8 ep 0x18b50e90 drain complete num reqs drained 0 drain end time 472374494 drain time 7 msec time slept 0
OSSIPC:
IPCDAT DESTROY QP for qp 0x18b5f038 ep 0x18b5ee10 drain start time 472374537
OSSIPC:IPCDAT DESTROY QP for qp 0x18b5f038 ep 0x18b5ee10 drain complete num reqs drained 0 drain end time 472374537 drain time 0 msec time slept 0
Process termination requested for pid 82257 [source = rdbms], [info = 2] [request issued by pid: 199810, uid: 1001]
The trace file gets a new timestamp header whenever a new SESSION ID is logged against the same process ID. We can see a session had already been logged at 2022-01-12T17:47:46 with SESSION ID (1191.49246), while our actual session of interest is logged at 2022-01-12T18:23:27 with SESSION ID (1191.47408). Match these timestamps against the alert log entries from when we issued the clone command, and note that the serial# differs between the two sessions.
Interestingly, the single quote (') that opens after "OCI error val is 1017 and errmsg is" to quote the error message is not closed on that line; it closes only after our clone session (SESSION ID (1191.47408)) is logged, which effectively throws the previous session's error onto the current session. This might be a bug in the Oracle software, but I didn't raise a case with Oracle.
We can also see the clone was halfway done before the error was encountered, from the line "Undo Create of Pluggable Database PROD01 with pdb id - 3". All the files already cloned get deleted once the error is encountered, so I think this is definitely not an issue with an invalid username/password.

I didn't do anything after the error; I left the session as it was, since it was late at night for me. The next day, in the same session, I issued the exact same command.
SQL> CREATE PLUGGABLE DATABASE PROD01 from QA01@QA01 keystore identified by xxx including shared key;

Pluggable database created.
This time, on the 3rd attempt, the pluggable database was created without any issues, with the same command issued in the previous 2 attempts.
So when Oracle throws a weird error, it's wise to investigate the alert log and the associated trace file to see whether the error thrown is genuine. This doesn't happen frequently, but in my case it did.

So, is this a bug? I'm not sure, as I didn't raise a case with Oracle; I could get past the error and complete my cloning task.


Happy troubleshooting...!!!