
Tuesday, November 22, 2022

Oracle Database and Anti-Virus Software

If you search Oracle Support for antivirus-related issues, you will hit many notes describing AV software causing problems with the Oracle database, and this is not specific to any one AV product but applies to AV software in general.

Taking a look at the Oracle Support note How To Configure Anti-Virus On Oracle Database Server (Doc ID 782354.1), Oracle states the following:

we recommend the following files to be excluded from online anti-virus scanning.

  • Oracle datafiles
  • Control files
  • Redo-log files
  • Archived redo-log files if the database is in archive log mode
  • Files with extension '.ora'
  • Password file
  • Files in the ADR directory
  • Oracle software directories (including ORACLE_HOME and ORACLE_BASE)
  • /dev/shm on Linux platform
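
To make that a little more concrete, on a Linux database server the exclusion list handed to the AV team might look something like the below. This is purely an illustrative sketch; the paths are hypothetical and need to be adapted to your own layout.

    /u01/app/oracle       # ORACLE_BASE and ORACLE_HOME: binaries, ADR/diag, password files, *.ora files
    /u02/oradata          # datafiles, control files, online redo logs
    /u03/fra              # archived redo logs / fast recovery area
    /dev/shm              # shared memory used by the instance on Linux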

So, is that it? Are we good if we just exclude all the files mentioned above? 

I'll discuss two scenarios where AV software caused outages to the database (I'm not naming the AV products here).

Scenario 1: 

We had installed one of the leading AV products on all our DB servers. During the testing phase everything went smoothly. A few files were excluded as suggested by Oracle, and after rigorous testing in the lower environments we finally agreed to deploy it on the production servers.

Soon after the production deployment, we started having issues with ASM disk writes because the ASM disks became invisible to Oracle. The following errors popped up frequently on busy, high-load databases.

ORA-00202: control file: '+DATA/ABC/cntrlabc.dbf'
ORA-15081: failed to submit an I/O operation to a disk
ORA-00312: online log 5 thread 1: '+LOG/ABC/ONLINELOG/group_5.261.981173379'
ORA-15081: failed to submit an I/O operation to a disk
So what's really happening here? 

We use udev rules to define the ASM disks in all our environments (old school, but it still works effectively). Our assumption is that the AV software locks, or otherwise prevents other processes from reading, the udev rules files or the ASM disk headers exactly when Oracle tries to read or write them. The exact cause is still not known, as the incident is not reproducible at will.
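
For reference, an ASM disk defined through udev typically looks something like the rule below; the rules file name, WWID and disk alias here are made-up examples, not our actual configuration.

    # /etc/udev/rules.d/99-oracle-asmdevices.rules (illustrative sketch only)
    KERNEL=="sd?", SUBSYSTEM=="block", ENV{ID_SERIAL}=="36000c29xxxxxxxxxxxxxxxxxxxxxxxxx", SYMLINK+="oracleasm/asm-data01", OWNER="grid", GROUP="asmadmin", MODE="0660"

ASM then sees the disk through the /dev/oracleasm/asm-data01 symlink, which is why anything that interferes with udev maintaining those links can make the disks "disappear" from the database's point of view.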

To get out of this situation, we have to reload the udev rules, after which things get back to normal. This is no joke: a busy database might lose data, or operations will stall, while a control file or data file is missing from where it is supposed to be.
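
The recovery itself is just the standard udev reload, something along these lines:

    # re-read the rules files and re-trigger block device events so the ASM device links are re-created
    udevadm control --reload-rules
    udevadm trigger --type=devices --action=change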
 
We worked with Oracle Support, who couldn't figure out anything specific from the logs provided, and we also worked closely with the AV vendor for close to 8 months to trap the error/incident, trying different tracing options and replaying the same workload around the times the previous incidents occurred. But as I mentioned, we never managed to trap the issue, and tracing can't be left enabled on a production database server for long because of the enormous logs it produces, the performance impact, and so on.

In the meantime, the AV software received multiple version updates with minor to major changes in how it handles Oracle database files, but none of them prevented the issue from recurring out of the blue every now and then, which made the business people unhappy.
Finally, business agreed to move to another AV vendor 😜

The horror story continues... 

Scenario 2: 

During those 8 months of R&D with the previous vendor, along came another AV vendor with all the super cool features a system could possibly have to get rid of every virus (just like the complete course of a COVID vaccine, topped off with a booster dose).
Just like with the previous one, the testing passed with flying colors. No outages such as invisible disks were seen, no performance slowness was reported in the lower environments, and a pilot was also run on one of the low-risk production database servers.

Finally, it was decided to take the AV software to production along with the other changes we usually make, such as the quarterly OS and DB patching. The deployment went fine on all the production DB servers, with no issues reported during the maintenance window or the post-maintenance testing window. Everyone was happy, celebrated, and went to sleep peacefully.

Next day, business began as usual.. Time passed slowly.. The application team noticed a slight delay in batch processing for one job or another.. They kept monitoring, thinking it was a one-off, and waited for some time.. Then performance took a huge hit on one of our very critical OLTP databases and the entire database slowed down.. Tickets came bombarding in, a high-priority ticket was eventually raised, and members from different groups (application team, OS admins, infra admins, DBAs) jumped onto a bridge call to check from their end.

Initial checks on the DB showed that almost all queries were running slow. It is very unusual for every single query to be hit at once, so the issue seemed to be something outside the database. Checks on the OS showed that the CPU run queue was far too high and that time spent in the kernel was taking a high percentage of CPU. This is definitely unusual.


You can see the load average is near 400 on a 72-core system, which translates to a load of ~5.5 per core, far too high. Interestingly, unlike on all other days, kernel processing was taking > 50% of CPU (ideally it should be < 10%), a hint that something was messed up at the OS/kernel level.
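
Nothing exotic was needed to get those numbers; the standard tools along these lines are enough:

    uptime                  # load average vs. number of cores
    vmstat 5 5              # 'r' column = run queue, 'sy' column = % CPU spent in the kernel
    mpstat -P ALL 5 1       # per-CPU breakdown of %usr vs %sys (sysstat package)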

So the first thing we did was check the AV activity and stop it right there, since it was the only alien on the system. The OS admin disabled the AV software, but the CPU usage still didn't come down. So we guessed something else was causing the issue. Remember, we had done OS and database patching as well, so the plan now was to roll back both patches one after the other to figure out the cause.

Meanwhile, support from vendors such as Oracle, Red Hat and SAP (the application is SAP) was also engaged via high-priority tickets to each of them. Oracle and SAP both pointed to the OS as the likely cause, since kernel processing was unusually high. So we first rolled the kernel back to the previous version and rebooted the entire system; the issue persisted. The database patch was then rolled back, and the CPU usage still didn't reduce to normal levels.

We then installed the perf tool, which is not a standard package that comes with Linux out of the box, as Red Hat needed some finer details to pinpoint the cause. Take a look here to learn more about perf.
With the OS team providing the required details to Red Hat support (I have limited knowledge of exactly which traces were provided, but I guess a perf stat trace), Red Hat support came back pointing at event_collector_2_3_xxx, a third-party module loaded into the kernel that was causing the kernel to suck up all the CPU. The module belonged to the AV software. We had to uninstall the AV software completely and take a clean reboot to get rid of the module from the kernel.
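
I don't have the exact traces that were shared with Red Hat, but the investigation would boil down to commands along these lines (the module name below is simply the one reported back to us):

    perf top                         # live view of which kernel/user symbols are burning CPU
    perf record -a -g -- sleep 30    # system-wide samples with call graphs for 30 seconds
    perf report                      # browse the recorded samples
    lsmod | grep event_collector     # confirm the third-party AV module is loaded in the kernel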

One thing to understand is that even though disabling the AV software was the first step in our troubleshooting, the same AV software was identified as the culprit only after a very long investigation. This could seriously have been avoided had we known how the AV behaves when it is merely disabled, namely that its kernel module stays loaded.

So, what's the solution for this? 

Oracle doesn't certify any AV software to work with Oracle databases. At the same time, we can't leave our production databases/servers unprotected, so we can't totally get rid of anti-virus software. Nor can we risk unexpected production downtime, as we saw in the two cases above, among the many other failure modes that might occur.

What we can try are alternatives to relying solely on AV software, such as the below:
  1. DB server logins should be restricted to DB admins only, ideally a senior DBA who does admin work (such as startup and shutdown using a sys account) and not management tasks like adding or extending tablespaces. Those other activities can be done remotely using the system account or a customized DBA account. This ensures only a very limited set of logins exists directly on the server, unlike the common setup where one shared account (in most cases oracle) is used by every member of the team, which poses a very serious security risk.
  2. Place all DB servers in a subnet with tight network and access controls, such as automatic logoff after x seconds/minutes, tracing of all activity, and frequent audits of which commands were run. Only SAs and DBAs should have direct login accounts.
  3. OS login accounts should be coupled with AD and should use MFA for authentication. The configuration looks complicated, but it is a one-time setup: put in the initial effort and we are done.
  4. To limit direct access to the DB, use Kerberos/RADIUS authentication or Centrally Managed Users (available from version 18c), which couples database users with AD accounts and groups so users can connect with their enterprise logins. This also provides an extra layer of protection: once a member leaves the organization, his/her account is wiped out centrally, keeping the server/DB secure.
  5. Use the TCP.VALIDNODE_CHECKING = YES, TCP.EXCLUDED_NODES and TCP.INVITED_NODES sqlnet.ora parameters to accept connections only from specific IP addresses or subnets, tightening application access to the database (see the sqlnet.ora sketch below).
These are a few things I can think of to secure the database/server from exploitation, which would eliminate or at least minimize the need for anti-virus software.
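
For item 5, a minimal sqlnet.ora sketch on the database server would look like the below; the addresses are placeholders, and the listener has to be restarted (or reloaded) for the changes to take effect.

    # sqlnet.ora on the DB server (placeholder addresses)
    TCP.VALIDNODE_CHECKING = YES
    TCP.INVITED_NODES = (192.0.2.10, 192.0.2.11, appserver1.example.com)
    TCP.EXCLUDED_NODES = (192.0.2.99)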

Let me know in the comments about other implementations you have come across to secure the databases/servers in your organization, so that we can help each other learn and secure our databases.

References:

How To Configure Anti-Virus On Oracle Database Server (Doc ID 782354.1)
What is TCP_VALIDNODE_CHECKING and How to Use It (Doc ID 462933.1)

Happy Anti Virusing...!!!