Looking for something? Try here..

Tuesday, October 4, 2016

IGNOREDELETES with Oracle GoldenGate SQLEXEC

This is one of the several issues I underwent during my current project involving Oracle GoldenGate.
I'm using the below mapping statement with a lookup query on another table using SQLEXEC.
SQLEXEC (ID LOOKUP_MOI, QUERY "select instanceid from DATASTORE.MANAGEDOBJECTINSTANCE &
where TROUBLEREPORTID=:TR_IN and INSTANCENAME=:I_IN", PARAMS (TR_IN = C911001001, I_IN = C911001102)),& 
COLMAP ( 
ATTRIBUTEID=@COLSTAT (MISSING), 
TROUBLEREPORTID=@IF (@STRFIND (C911001001,'ID',1)=1,@COLSTAT (NULL),C911001001), 
NAME=C911003014, 
VALUE=C911005010, 
TYPE=C911003018, 
PARENTNAME=C911005001, 
INSTANCEID=@IF (@COLTEST (@GETVAL(LOOKUP_MOI.instanceid),NULL,INVALID),@COLSTAT (NULL),@GETVAL(LOOKUP_MOI.instanceid)), 
ENTRYID=C1);
Here when the lookup query executes, for the data it can find, we don't have any issues, but when a record doesn't have any value, for eg. let's say there is no matching where clause values to find, then the query results in ORA-01403 no data found.

Few data were never replaced by the new instanceid value or NULL during update operations. This is not an expected behaviour and hence in order to find which record's data the query is not able to find or to confirm that the query is actually looked up to result in ORA-01403, I had to turn on reporting of the error records as instanceid is a NOT NULL column. Reporting of the records erring can be turned on by using ERROR REPORT in the SQLEXEC statement as below.
SQLEXEC (ID LOOKUP_MOI, QUERY "select instanceid from DATASTORE.MANAGEDOBJECTINSTANCE &
where TROUBLEREPORTID=:TR_IN and INSTANCENAME=:I_IN", PARAMS (TR_IN = C911001001, I_IN = C911001102), ERROR REPORT) & 
This makes the GoldenGate process to register the details on the discard file with the records erred. I'm getting the following "VALUE NOT MAPPED" in the discard file for few records.
Stored procedure LOOKUP_MOI failed at seqno 1 rba 144430729 
Missing required param: TR_IN 

Input parameter values... 

TR_IN: * VALUE NOT MAPPED * 
I_IN: stbordel 
TR_IN: * VALUE NOT MAPPED * 
I_IN: * VALUE NOT MAPPED * 
* 
* 
This shouldn't be the case as I'm forcing the source table to log both C911001001 and C911001102 columns. C911001102 is not mapped to any of target table column but used in the lookup query.
GGSCI (xxx00450 as ggs_dba@XXXDB) 5> info trandata aradmin.t539 

Logging of supplemental redo log data is enabled for table ARADMIN.T539. 

Columns supplementally logged for table ARADMIN.T539: C1, C911001001, C911001102. 

Prepared CSN for table ARADMIN.T539: 13678637504852 
Now to check whether the columns are actually logged by the extract process, I use logdump utility to dig the trail file at that particular rba which reveals the values are actually logged.
Logdump 68 >pos 144430729 
Reading forward from RBA 144430729 
Logdump 69 >n 
___________________________________________________________________ 
Hdr-Ind : E (x45) Partition : . (x0c) 
UndoFlag : . (x00) BeforeAfter: A (x41) 
RecLength : 171 (x00ab) IO Time : 2016/09/12 03:44:44.000.000 
IOType : 134 (x86) OrigNode : 255 (xff) 
TransInd : . (x00) FormatType : R (x52) 
SyskeyLen : 0 (x00) Incomplete : . (x00) 
AuditRBA : 50785 AuditPos : 51572240 
Continued : N (x00) RecCount : 1 (x01) 

2016/09/12 03:44:44.000.000 GGSUnifiedUpdate Len 171 RBA 144430729 
Name: ARADMIN.T539 (TDR Index: 2) 
After Image: Partition 12 G b 
0000 0068 0000 0013 0000 000f 3030 3030 3030 3034 | ...h........00000004 
3236 3731 3136 3200 0400 0a00 0000 0674 6d75 7365 | 2671162........tmuse 
7200 0500 0a00 0000 0000 0057 d65c 4a00 0600 0a00 | r..........W.\J..... 
0000 0000 0000 0000 0200 0a00 1300 0000 0f54 5230 | .................TR0 
3030 3030 3137 3137 3838 3537 0017 000c 0000 0008 | 000017178857........ 
7374 626f 7264 656c 0000 0013 0000 000f 3030 3030 | stbordel........0000 
3030 3034 3236 3731 3136 3200 0400 0800 0000 0474 | 00042671162........t 
Before Image Len 108 (x0000006c) 
BeforeColumnLen 104 (x00000068) 
Column 0 (x0000), Len 19 (x0013) 
Column 4 (x0004), Len 10 (x000a) 
Column 5 (x0005), Len 10 (x000a) 
Column 6 (x0006), Len 10 (x000a) 
Column 10 (x000a), Len 19 (x0013) 
Column 23 (x0017), Len 12 (x000c) 

After Image Len 63 (x0000003f) 
Column 0 (x0000), Len 19 (x0013) 
Column 4 (x0004), Len 8 (x0008) 
Column 5 (x0005), Len 10 (x000a) 
Column 6 (x0006), Len 10 (x000a) 
Now, I got confused on why is that the discard file shows or GoldenGate throws that the values are not mapped. Raised an SR to get it clarified but unfortunately Oracle support was dead slow in responding and with no proper updates after uploading all the required param files, report files, discard files, trail log files, etc.

After careful consideration and investigation and comparing the data which were discarded and which were correctly getting the instanceid values, I can get a common pattern on the data as only the records deleted on the source database are having this VALUE NOT MAPPED error. I now got the clue on why this is happening. As per the application requirement, we are not propagating deletes from the source DB to the target DB, target DB being a form of warehousing for the closed tickets. I am using IGNOREDELETES in my replicat parameter file. The records are deleted and hence the fields in the trail has no value on it. :)

Coming to the actual issue on how to resolve the SQLEXEC not to reject the record, according to Oracle GoldenGate documents, when a SQLEXEC encounters an issue/ error, this column should be a "column missing" condition

IGNORE
Causes Oracle GoldenGate to ignore all errors associated with the stored procedure or query and continue processing. Any resulting parameter extraction results in "column missing" conditions. This is the default.


Find details here
So in my above mapping statement, by making INSTANCEID column to check for MISSING instead of NULL or INVALID should fix the issue. Isn't it? 
INSTANCEID=@IF (@COLTEST (@GETVAL(LOOKUP_MOI.instanceid),MISSING),@COLSTAT (NULL),@GETVAL(LOOKUP_MOI.instanceid)),
If you have answered Yes, then we have another twist here..
As there is actually no value mapped for the query to execute, this record is rejected and is not returning NULL as desired.
From Oracle documentation,

PARAMS [OPTIONAL | REQUIRED] parameter_name = {source_column | OGG_function} |NOPARAMS


For Oracle, whether or not a parameter is optional is automatically determined when retrieving the stored procedure definition.
REQUIRED indicates that a parameter value must be present. If the parameter value is not present, the SQL will not be executed.

We can also create a procedure to return NULL if there is an ORA-01403 returned only in case of providing both the required parameters. So this effort will again result in rejection of the record (doing nothing on the record)

The solution is to take the value from the lookup query itself by using the case statement of the select query which I have used in my parameter file as below.
select case when exists (select instanceid from DATASTORE.MANAGEDOBJECTINSTANCE where TROUBLEREPORTID=:TR_IN and INSTANCENAME=:I_IN)   
then (select instanceid from DATASTORE.MANAGEDOBJECTINSTANCE where TROUBLEREPORTID=:TR_IN and INSTANCENAME=:I_IN)   
else NULL end instanceid from dual;
This actually returns either a valid result or a NULL so we don't have a missing column here.. The suggestion is not to leave the record unhandled.
Thanks K.Gan for letting me know this via OTN community discussion here.

The solution seems simple but the effort it took to debug the issue and to derive the actual result is really a painful task. The result is a bliss..

Few things..
The entire investigation including SR update and deriving at the solution/desired result took around a week's time..
When ignoring part of transaction (IGNOREDELETES in our case), be very careful on what might happen to the data and how to deal with it..
Tech forums such as OTN community is your best friend when seeking technical help. Make friends/ Network as much as you can..

Happy Ignoring transactions !! :)

Sunday, October 2, 2016

Oracle GoldenGate Replicat erroring with OCI Error ORA-24369

This post is to talk about an interesting issue that I troubleshooted after the project is moved on to pre-production and in soak period.

All is done with the UAT testing and all the defects were fixed and got the UAT pass signed off. Deployed the code in the production environment. Initial loading of data went on for 2 days without any issues. Now its time for real time sync. 

I started the real time replicat and monitoring the system thinking nothing should go wrong as this was a 3 months non stop, sort of no good sleep hard work project involving huge amount of learning as self and as a team. Periodically checking the status of the process and bang!!! The process got ABENDED.. 

Now started the nightmare. I started digging the issue. First, report file to see what was the issue? Ok, I have issue with some data in a particular table. Now discard file to see which record got discarded? Ok, got it. Check whether it is a valid data to the database as I'm ignoring few orphan records (records without parent records) originating from source system. Well, this is a valid data allowed to the database.

The error I received is as below.
2016-09-26 17:43:54 INFO OGG-01021 Command received from GGSCI: STATS. 

2016-09-26 17:46:02 INFO OGG-01971 The previous message, 'INFO OGG-01021', repeated 1 times. 

2016-09-26 17:46:02 WARNING OGG-00869 OCI Error ORA-24369: required callbacks not registered for one or more bind handles (status = 24369), SQL <UPDATE "DATASTORE"."WORKLOG" x SET x."DETAILS" = EMPTY_CLOB() WHERE x."ENTRYID" = :b0 RETURNING x."DETAILS" INTO :dl0>. 

2016-09-26 17:46:02 WARNING OGG-01004 Aborted grouped transaction on 'DATASTORE.WORKLOG', Database error 24369 (OCI Error ORA-24369: required callbacks not registered for one or more bind handles (status = 24369), SQL <UPDATE "DATASTORE"."WORKLOG" x SET x."DETAILS" = EMPTY_CLOB() WHERE x."ENTRYID" = :b0 RETURNING x."DETAILS" INTO :dl0>). 

2016-09-26 17:46:02 WARNING OGG-01003 Repositioning to rba 1597961 in seqno 0. 

2016-09-26 17:46:04 WARNING OGG-01154 SQL error 24369 mapping ARADMIN.T433 to DATASTORE.WORKLOG OCI Error ORA-24369: required callbacks not registered for one or more bind handles (status = 24369), SQL <UPDATE "DATASTORE"."WORKLOG" x SET x."DETAILS" = EMPTY_CLOB() WHERE x."ENTRYID" = :b0 RETURNING x."DETAILS" INTO :dl0>. 


The below statement is what was registered as abended from the discard file.
UPDATE "DATASTORE"."WORKLOG" x SET x."DETAILS" = EMPTY_CLOB() WHERE x."ENTRYID" = :b0 RETURNING x."DETAILS" INTO :dl0 
Oracle support site didn't help much at my first search and hit only 2 articles related to this error.. So as this is a production issue, opened a severity ticket and provided all the details, report files, parameter files and even the trail files as they requested for investigation.
EMPTY_CLOB() is an Oracle internal package to initialize a CLOB variable and my guess was whether this function is being called by application or by database internal call.
As the ticket is for 24/7 support, the engineers are rotated and from one of the engineer got a clue on how to make the replicat to display the SQL statements as they were applied. As per advice, included the following parameters in the replicat param file and started the replicat from GoldenGate home.

showsyntax
grouptransops 1
maxtransops 1

started the replicat as below from GoldenGate home
$ ./replicat paramfile ./dirprm/<replicat_name>.prm

The showsyntax is used to display the statements that the replicat is going to run against the database and from this method it is found that the actual run varies against the erred query.

This is the exact update statement that the showsyntax produced..
UPDATE "DATASTORE"."WORKLOG" x SET x."DETAILS" = '<LOB Data>' WHERE x."ENTRYID"='000000431066777' 
There is a query transformation as below when the error occurs.
UPDATE "DATASTORE"."WORKLOG" x SET x."DETAILS" = EMPTY_CLOB() WHERE x."ENTRYID" = :b0 RETURNING x."DETAILS" INTO :dl0 
By this time I came across another bug note "Bug 1555887" from Oracle Support site in which the problem statement had the below lines..
In OCI, as long as there's only one row to update, the OCI happy,however, if 
the test case is run once (several iterations), then run again without 
truncating the clob table, the following errors occur if more than one row 
matches the update statement where clause:
 
1st time with 2 rows - onexecute
 
ORA-24369: required callbacks not registered for one or more bindhandles

So now with this clue, I started to look for duplicate data in the table and yes, I have got the duplicate data produced in the table. So this is the root cause that the process got abended.
Now I can't just say this is a duplicate record (source might contain 2 records as well) but to figure out how this data got duplicated, I started the investigation again from application side which is the source for all the data. The application sql logs stated as below.. (Example shown of a different data as logs turned on to see the sequence of operation on the table)
....
....
....
INSERT INTO T433 (C2,C7,C8,C112,C911001009,C911001089,C911001302,C911002100,C911018012,C911018013,C911018014,C911018016,C911018045,C911018046,C911018047,C911018056,C911018057,C911018058,C911018201,C911018254,C911203021,C911203022,C911203023,C3,C5,C6,C1) 
VALUES ('TXGUser',0,'Task # 1.1.1 Assigned to Group N/A',';0;',NULL,'A00','1.1.TR0000019131158',' 1.1.1',EMPTY_CLOB(),'TR0000019131158','Notification','BRE','BRE','Ball','Notification','TXGUser','911SMC','808-827-0515','TSK000076213737','Ball','No','No','No',1475015152,'TXGUser',1475015152,'000000431352674') 
...
SELECT C911018012 FROM T433 WHERE C1 IN ('000000431352674') FOR UPDATE 
...
Setting LOB into the above row ... 
....
....
So this is just a single update happening on a single row. So duplicates from application is ruled out.
Checking in the source database where the application connects, shows no sign of duplicates as the main column (C1) has a unique constraint to avoid duplicates.
Checking on the target system reveals many duplicate data sitting happily occupying space..
SQL> select entryid, count(*) from worklog group by entryid having count(*) >1;

ENTRYID        COUNT(*)
-------------------- ----------
000000431066591               2
000000431066615               2
000000431066617               2
...
...
...
The count of those records was around 2000. So how was this possible?
I used HANDLECOLLISIONS when turning on the real time replicat which has KEYCOLS as ENTRYID for the WORKLOG table.
So is this duplicates still possible? Shouldn't it eliminate the records that were already loaded via initial loading as we have mentioned KEYCOLS in the table mapping?
The answers is in the Oracle support note Doc ID 1350948.1 OGG Can Insert Duplicate Rows When There Is No Unique Index Primary Key UI PK Defined.
Cool, I don't have an unique key defined on the target table which allowed all the data (delta data) to be inserted one more time creating duplicates making the bug to kick in here as well.
So I just removed all the duplicates by running the following block
create table del as select entryid from (select entryid, count(*) from worklog group by entryid having count(*) >1);
DECLARE
   wld        NUMBER;
   lv_stmt2   VARCHAR2 (1000);
BEGIN
   FOR eid IN (SELECT entryid
                 FROM del)
   LOOP
      lv_Stmt2 :=
            'select worklistid from (select * from worklog where entryid= '
         || eid.entryid
         || ' order by 1 desc) where rownum=1';

      EXECUTE IMMEDIATE lv_stmt2 INTO wld;

      DELETE FROM worklog
            WHERE worklistid = wld;
   END LOOP;
END;
/
drop table del purge;
Now to make sure I don't get duplicates again, I just created a unique index on ENTRYID column which is again my key column to do the table operations.
The data then replicated happily..

Moral of the story: Duplication is the worst enemy of the well built system :)

Few things:
To ignore orphan records, I used REPERROR (2291, ignore) in my parameter file.
The bug note mentioned in this blog is said to be fixed in product version 8.1.6.1.4 in the same note whereas my current database version this issue is on 11.2.0.4 with GoldenGate version 12.2.0.1
The whole troubleshooting exercise took around 3 days to fix the issue

Happy troubleshooting!!