Thursday, December 29, 2022

Troubleshooting Kerberos and Oracle CMU configuration

 In one of my previous posts, I have shared complete steps to configure Oracle Centrally Managed Users (CMU) which was completely done in my personal lab environment. Though I have encountered few hiccups, I was able to set it up without big issues. 

Now I'm trying to perform a proof of concept of the same set up in my work environment where I was stuck with several different issues and crossing them one after the other. As I was working with Oracle support and trying to iron out all issues, I learnt a few things related to AD, LDAP, etc.,.


In this post, I'll share the troubleshooting steps for the issues encountered so that it could help fellow DBAs to set up in their environment. 

The following support note will be of much help but its very limited in identifying the exact issue. We can use this note as starting point to dig deeper. 

Kerberos Troubleshooting Guide (Doc ID 185897.1)

Ok, let's start with the different troubleshooting steps. This is not step by step approach and any step can be used any time as per need basis

1. Tracing:

We might just get random errors like ORA-12638, ORA-12631, etc., but it lead you to nowhere. So need to trace the session in order to work with Oracle support and provide them with what they are looking for. So how do you trace the session? 

We need to set the below parameters in sqlnet.ora in both database server (if server trace is needed) and client machine (usually this will be mandatory to have). 

TRACE_UNIQUE_SERVER = ON
TRACE_LEVEL_SERVER = 16
TRACE_DIRECTORY_SERVER = /oracle/ABC/19.0.0/network/log
TRACE_FILE_SERVER = Server_Trace
TRACE_TIMESTAMP_SERVER = ON
DIAG_ADR_ENABLED=OFF
TRACE_UNIQUE_CLIENT = ON
TRACE_LEVEL_CLIENT = 16
TRACE_DIRECTORY_CLIENT = /oracle/ABC/19.0.0/network/log
TRACE_FILE_CLIENT = Client_Trace
TRACE_TIMESTAMP_CLIENT = ON
DIAG_ADR_ENABLED=OFF
Once tracing is completed, we can switch the parameter TRACE_LEVEL_SERVER or TRACE_LEVEL_CLIENT to OFF which will stop tracing of sessions. Make sure you just turn on and turn off quickly once you test your connection as this is instance wide and all the sessions will be traced filling up your log location/disk quickly. 

You can identify the trace file by looking into the sqlnet.log file, just giving a sample log content from the sqlnet.log
***********************************************************************
Fatal NI connect error 12631, connecting to:
 (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=xxxxx.com)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ABC)(CID=(PROGRAM=sqlplus)(HOST=xxxxx)(USER=oracle))))

  VERSION INFORMATION:
        TNS for Linux: Version 19.0.0.0.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 19.0.0.0.0 - Production
  Version 19.16.0.0.0
  Time: 23-DEC-2022 19:34:37
  Tracing to file: /oracle/ABC/19.0.0/network/log/client_trace_3655062.trc
  Tns error struct:
    ns main err code: 12631
    TNS-12631: Username retrieval failed
    ns secondary err code: 0
	
***********************************************************************
The logfile is indicated for the error that we have received. We can provide this log file to Oracle support for further review

2. Generating krb5 trace:

Below commands would help generate krb5 trace where we can trace actions related to okinit 

a) Set krb5_trace parameter

export KRB5_TRACE=<path>/krb5trace.trc
echo $KRB5_TRACE

b) On the database server run

$ okdstry
$ okinit <xxx>
$ oklist
$ sqlplus /@<connect_identifier>

This will generate krb5trace.trc file which can be uploaded to support for their analysis

3. Use of same case for principal name:

While creating SPN for database and generating keytab file, make sure you are using small case for creation and configuration 

I have used 
ktpass -princ ORACLE/linux-8.selvapc.com@ORABLISS.com -pass Qwerty123# -mapuser oserv@orabliss.com -crypto ALL -ptype KRB5_NT_PRINCIPAL -out Z:\linux-8-database.keytab

This resulted in error in my PoC environment. 

Instead use small case for database SPN
ktpass -princ oracle/linux-8.selvapc.com@ORABLISS.com -pass Qwerty123# -mapuser oserv@orabliss.com -crypto ALL -ptype KRB5_NT_PRINCIPAL -out Z:\linux-8-database.keytab  

sqlnet.ora should also have principal name in small case. 

SQLNET.AUTHENTICATION_KERBEROS5_SERVICE=oracle

4. ldapbind:

The ldapbind command-line tool enables you to see whether you can authenticate a client to a server. When we set up CMU and if we are getting error, we can check whether the client machine is getting authenticated with the server by using the command as below.

ldapbind -h <AD Host> -p 389 -D <value of ORACLE.SECURITY.DN> -w <value of ORACLE.SECURITY.PASSWORD>
ldapbind -h <AD Host> -p 389 -D <value of ORACLE.SECURITY.USERNAME> -w <value of ORACLE.SECURITY.PASSWORD>
ldapbind -h <AD Host> -p 636 -U 2 -W "file:<wallet location>" -P <wallet password> -D <value of ORACLE.SECURITY.DN> -w <value of ORACLE.SECURITY.PASSWORD>
ldapbind -h <AD Host> -p 636 -U 2 -W "file:<wallet location>" -P <wallet password> -D <value of ORACLE.SECURITY.USERNAME> -w <value of ORACLE.SECURITY.PASSWORD>

The first 2 test will test authentication with TCP and the next 2 will test authentication with SSL which we would have already set up in CMU. 

Example given below: 
$ ldapbind -h xxx.xxx.example.com -p 389 -D "CN=SV-oraservice,OU=Service,OU=Accounts,DC=xxx,DC=xxx,DC=example,DC=com" -w '%*9&nnh!!GFulh!G'
bind successful
$ ldapbind -h xxx.xxx.example.com -p 389 -D SV-oraservice -w '%*9&nnh!!GFulh!G'
bind successful
$ ldapbind -h xxx.xxx.example.com -p 636 -U 2 -W "file:/oracle/grid/19.0.0/network/admin/cmu_wallet" -P W9&nnhYT# -D "CN=SV-oraservice,OU=Service,OU=Accounts,DC=xxx,DC=xxx,DC=example,DC=com" -w '%*9&nnh!!GFulh!G'
bind successful
$ ldapbind -h xxx.xxx.example.com -p 636 -U 2 -W "file:/oracle/grid/19.0.0/network/admin/cmu_wallet" -P W9&nnhYT# -D SV-oraservice -w '%*9&nnh!!GFulh!G'
bind successful
$
If there is any issue with the connectivity, we will be getting a different error. For eg, let me try with a wrong password for the service account..
$ ldapbind -h xxx.xxx.example.com -p 389 -D "CN=SV-oraservice,OU=Service,OU=Accounts,DC=xxx,DC=xxx,DC=example,DC=com" -w 'wrongpass'
ldap_bind: Invalid credentials
ldap_bind: additional info: 80090308: LdapErr: DSID-0C090439, comment: AcceptSecurityContext error, data 52e, v4563
$
You can see when I provided a wrong password, I got the Invalid credentials error as expected. 

5. LDAP force sync

When there is change in the AD side such as password change for the service account, Oracle sometimes won't sync up with the changes and will throw error. We need to force sync Oracle with the LDAP services using the below parameters. 
Note: This is needed only when you doubt that Oracle is not synced up with AD/LDAP 
SQL> alter system set "_ldap_config_force_sync_up"=TRUE;

System altered.

SQL> alter system set "_ldap_config_force_sync_up"=FALSE;

System altered.

SQL> exit

$ sqlplus /@<connect string> 
 
We just need to set _ldap_config_force_sync_up TRUE and FALSE and retry the connection to the database. 

6. Viewing available certificates in AD server

Usually certificates are provided by issuing authority or if it's internal as in this case, AD team will provide you with the certificates. In my case, if I need to check what are the certificates I might need from AD team for the configuration to work, we can check using the openssl -showcert command. 

The command is as follows 
openssl s_client -connect <AD host>:636 -showcerts
$ orapki wallet display -wallet .
Oracle PKI Tool Release 19.0.0.0.0 - Production
Version 19.4.0.0.0
Copyright (c) 2004, 2021, Oracle and/or its affiliates. All rights reserved.

Requested Certificates:
User Certificates:
Oracle Secret Store entries:
ORACLE.SECURITY.DN
ORACLE.SECURITY.PASSWORD
ORACLE.SECURITY.USERNAME
Trusted Certificates:
Subject: CN=EXAMPLE MSPKI XXX Root CA
$
$ openssl s_client -connect abcdef12.xxx.xxx.example.com:636 -showcerts
CONNECTED(00000003)
depth=1 DC = com, DC = example, DC = xxx, DC = xxx, CN = qasmime95
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = ABCDEF12.xxx.xxx.example.com
verify return:1
---
Certificate chain
 0 s:CN = ABCDEF12.xxx.xxx.example.com
   i:DC = com, DC = example, DC = xxx, DC = xxx, CN = qasmime95
-----BEGIN CERTIFICATE-----
MIIFuDCCBKCgAwIBAgITbwAABIUbCx91qS38VQAAAAAEhTANBgkqhkiG9w0BAQsF
ADB2MRMwEQYKCZImiZPyLGQBGRYDY29tMRkwFwYKCZImiZPyLGQBGRYJaG9uZXl3
...
...
...
ZA8ptBPzUXSgSNI+5wfh3F58Tfoi8Ig07ryyR4HD3WKGDE2DDSmMdnZ8oB2cBnuJ
T1RQK8+16GKI1bQFoNRfbwZZXmOjlIJiESZvUg==
-----END CERTIFICATE-----
 1 s:DC = com, DC = example, DC = xxx, DC = xxx, CN = qasmime95
   i:CN = EXAMPLE QMSPKI XXX Root CA
-----BEGIN CERTIFICATE-----
MIIGJDCCBAygAwIBAgITfQAAAAvG9D//4L986AAAAAAACzANBgkqhkiG9w0BAQsF
ADAsMSowKAYDVQQDEyFIb25leXdlbGwgUU1TUEtJIEludGVybmFsIFJvb3QgQ0Ew
...
...
...
ZNaJmBGIBZsLM6IrUw/WGAdNdPQUbcXMBdlSGY5DaGYMGNjnoOqPHQ==
-----END CERTIFICATE-----
---
Server certificate
subject=CN = ABCDEF12.xxx.xxx.example.com

issuer=DC = com, DC = example, DC = xxx, DC = xxx, CN = qasmime95
...
...
 
$
This check is required if you suspect that the provider team has given a prod certificate in place of non prod certificate or vice versa. You can see from above that the wallet has MSPKI XXX Root CA whereas the AD server has QMSPKI XXX Root CA. So we have an incorrect certificate in the wallet and hence it needs to be removed and proper certificate needs to be imported. 
Note: O/p of the command has been trimmed to include only needed content. 

7. Check your account availability

I know this can be an odd situation as to check our account availability but its better to check whether the account that you are trying to access is created/available in the AD and is binded to the service account. We can use ldapsearch command to check for the account existence with all the account details 

Command is as below 
ldapsearch -R -h <AD host> -p 389 -b "DC=xxxx,DC=xx,DC=com" -D <ad super user> -w "<ad user password>" "sAMAccountName=<AD user>"
$ ldapsearch -R -h abcdef12.xxxxx.xx.example.com -p 389 -b "DC=xxxxx,DC=xx,DC=example,DC=com" -D SV-oraservice -w 'password' "sAMAccountName=h368225"
CN=Selvakumar Nagulan,OU=Accounts,OU=Tier1,OU=Admin,DC=xxxxx,DC=xx,DC=example,DC=com
objectClass=top
objectClass=person
objectClass=organizationalPerson
objectClass=user
cn=Selvakumar Nagulan
...
...
 
$
Note that we use our service account as super user to check whether our DB user is properly bind to super user or not. You will get a blank response if the user is not bind or not configured. 



So that's all for now as I was able to successfully configure CMU after working with Oracle support with 2 different support tickets and internal AD team spanning over a month. You might connect with me on any issues regarding setting up CMU in your organization and I would be happy to help with my experiences troubleshooting the same. 

References: 

Oracle Centrally Managed Users (CMU) configuration

Happy Troubleshooting!!! 
 
 
 
 
 
 
 
 

Monday, December 12, 2022

Performance slowness due to Spatial component

 We had a complaint from application team that their database is running slow after the quarterly patching. We had patched many databases and only one database is complaining on slowness. The environment is Exacc and the database is a container (running version 19.16) which has multiple PDBs. So patching couldn't be a reason is the initial thought. So we stepped into investigate what is causing the slowness. 

As a first step, we pulled an AWR report for the time frame where they did some DB operations. 



We can see from above that most of the waits are related to Cluster and topping them is gc cr multi block mixedgc cr block 2-way and gc buffer busy acquire

So this means some query is running very frequently accessing 1 or 2 table blocks continuously. We need to figure out which query is running, accessing what table/index and how frequent and what's the elapsed time of the query. These information will get us to move forward. 

Since the waits are related to Cluster, let's see SQL ordered by Cluster Wait Time (Global) section under SQL Statistics



We see 2 queries that contribute to the most cluster waits. Both these are related to spatial component as the queries are as below and they had ~ 53k executions in the report
SELECT /* chk_journal_reg */ rowid FROM "MDSYS"."SDO_TXN_JOURNAL_REG" WHERE sid = :1 AND sdo_txn_idx_id = :2 AND indexptniden = :3 AND operation = :4 AND rid = :5

SELECT /* sel_dist_upd */ ROWID, ....., ins_no, del_no FROM "MDSYS"."SDO_DIST_METADATA_TABLE" WHERE sdo_index_owner = :own AND sdo_index_name = :nam AND sdo_partition_name = :ptn FOR UPDATE

Ok, these 2 queries look simple and not complicated, so what's causing this buffer busy issues? Investigation went in following direction

Check size of tables: They are too small. 

Check the execution plan of the first code with high cluster waits gave us the clue. Though we have primary key defined on all five columns (SID, SDO_TXN_IDX_ID, INDEXPTNIDEN, OPERATION, RID), the query took the path of INDEX STORAGE FAST FULL SCAN which would traverse through all the blocks unnecessarily. The query took around 4 second per execution. 

I hint the query to use the index already available (/*+ index(SDO_TXN_JOURNAL_REG SYS_C0014253) */) - this resulted in a much better plan and execution time. Execution time reduced to milliseconds. Remember this query executed around 53k times which would make a big difference in application performance. While checking for any plan change, I found that the index path was used until the patch was applied and the plan flipped

SQL> select distinct sql_id, plan_hash_value, id, operation, options
  2* from dba_hist_sql_plan where sql_id='fw4q2s9j3bckt' order by plan_hash_value, id;

          SQL_ID    PLAN_HASH_VALUE    ID           OPERATION                   OPTIONS
________________ __________________ _____ ___________________ _________________________
fw4q2s9j3bckt            1765869626     0 SELECT STATEMENT
fw4q2s9j3bckt            1765869626     1 INDEX               STORAGE FAST FULL SCAN
fw4q2s9j3bckt            2134037127     0 SELECT STATEMENT
fw4q2s9j3bckt            2134037127     1 INDEX               UNIQUE SCAN

SQL> 
So, we decided to create sql profile so that the plan stick to plan_hash_value 2134037127 using note Encouraging CBO to Pickup a Better Execution Plan Using the COE XFR SQL Profile Script (Doc ID 1955195.1)

In the mean time, Oracle support note (After Applying the 19.16 DBRU, Scheduler Jobs on Spatial Tables Take a Long Time to Execute (Doc ID 2907800.1)) mentioned a bug related to this same table and query. 

Reason for slowness is 

Running a MV refresh using the DBMS_SCHEDULER package causes spatial to use the SDO_TXN_JOURNAL_REG transaction journal table, instead of the SDO_TXN_JOURNAL_GTT transaction journal table.

So, now the solution is to revert the sql profile created and apply spatial bundle bug fix Patch 34725493 or to apply RU 19.17

Since we already had applied another one off patch for a previous issue with Spatial, we couldn't apply this bug fix patch as it was conflicting with the one off patch. Raised an SR with Oracle to get the merge patch for both the conflicting patches but Oracle couldn't provide any commitment on when the merge patch will be ready.

As we need to get the production database back to normal as soon as we can, we had to roll back the entire 19.16 RU to bring it back to pre patch state (RU 19.15) and then the application verified all the jobs are running without issues. We could also not see any cluster related wait post the patch rollback. We rolled back as 19.17 is yet to be tested in our lower environments before we promote to production.

Interestingly, Oracle maintains a note on all the patches that needs to be applied post RU patch apply under What Is The Latest Spatial Patch Bundle for 12c and Higher Databases? (Doc ID 2514624.1) So every time we apply RU to the database, we need to refer this note and apply the patch that is mentioned for the related RU as well. 

This investigation is not only a trouble shooting experience but also a knowledge gathering experience with Oracle Spatial component regarding the bugs related to Spatial and separate Spatial patch bundle, etc.,. 


Happy Trouble shooting and Patching...!!!