Friday, November 20, 2020

ORA-27301, Oracle Database 12.2 and later running on Linux 7 and later...

There is always something to learn, isn’t there? A few fun notes to share with you from some support related adventures we have had this week….

RHEL 7 and the MTU size of the lo Interface. Database Outages…

Ran into an issue this week that caused an outage of one node of a 2-node Oracle Database 12.2 RAC cluster. The database started to show signs of memory exhaustion. Processes could not be started and other processes were failing. Error messages were appearing in the alert logs faster than hotdogs being consumed in an eating contest.

This database had been experiencing heavy loads, as they had been doing significant load testing on it over a period of several weeks. They had also performed some high intensity operations such as mass data deletes and highly parallelized index rebuild operations. Looking at OEM, we found periods where they had clearly been bounded by CPU – but there was no indication that they were near to memory starvation. 

I kept hearing Margo say, "... and why is the carpet all wet Todd?". 

"I don't KNOW Margo!"

Researching the problem, I discovered that the answer lay not in memory exhaustion, but instead in memory fragmentation. What the MTU?

What the MTU?

Upon some more investigation, we discovered an existing problem involving RHEL Version 7 and Oracle database that has to do with severely fragmented memory. This problem impacts both new processes and existing processes.

  • New processes fail initial requests for memory when starting because there were was not a contiguous chunk of memory to satisfy the allocation request.
  • Existing processes would fail when they tried to allocate additional memory and a contiguous chunk of memory was not available to satisfy the additional allocation request.

It turns out that this problem has to do with the MTU size of the lo adapter. By default the MTU size of this interface is set to 65536. If you look in Bugzilla, it appears that this problem has been around for some time in the Linux kernel, but it appears that it’s not impacted Oracle database until RHEL 7. I can’t find any information that indicates what has changed, but I’d guess something in the Linux kernel changed in RHEL 7 that’s now impacting Oracle.

Reducing the MTU?

The solution to this problem is to reduce the MTU size on the interface from the default 65536 to a value of 16384. This new value of 16384 is the recommended value for RHEL 7 running Oracle database.

You can see the current MTU size for lo by running the netstat command as seen here:

root@localhost ~]# netstat -i

Kernel Interface table
Iface             MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0             1500    50796      0      0 0         35754      0      0      0 BMRU
lo              65536   199576      0      0 0        199576      0      0      0 LRU
virbr0           1500        0      0      0 0             0      0      0      0 BMU

Use the ifconfig command to change the MTU size of the lo interface as seen here:

[root@localhost ~]# ifconfig lo mtu 16384

And we see the MTU size has changed:

[root@localhost ~]# netstat -i
Kernel Interface table
Iface             MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0             1500    50796      0      0 0         35754      0      0      0 BMRU
lo              16384   199590      0      0 0        199590      0      0      0 LRU
virbr0           1500        0      0      0 0             0      0      0      0 BMU

A couple of notes. You will need root privileges to execute the ifconfig command to change the adapter setting. The netstat -i setting will work from most Linux accounts, such as the grid or oracle software owning accounts.

How to Tell if there is Fragmentation of Memory

So, you might wonder…. “How do I tell if there is memory fragmentation? Can I monitor for it?”. I’m glad you asked! It actually leads to my next topic – the Magic SysRQ Key in Linux!

The Magic SysRQ Key in Linux – Do you know about it?

So, how do you figure out how to determine the extent of memory fragmentation in Linux 7. Well, the Magic SysRq key functionality in Linux came to the rescue! Not aware of this functionality? Well, now you are! 

You can read about the SysRq functionality here! You can read how I used it to look at memory fragmentation in Linux here

Here is an example of dumping the current memory information, including fragmentation. The results will be in /var/log/messages:

#echo m > /proc/sysrq-trigger
#See the results…
#dmesg

I won’t fill this email with the length output from dmesg it will provide you with a the memory distribution on a line that looks like this:

[2732874.679539] Node 0 Normal: 15149*4kB (UEM) 10237*8kB (UEM) 3244*16kB (MR) 1*32kB (R) 2*64kB (R) 2*128kB (R) 1*256kB (R) 0*512kB 0*1024kB 1*2048kB (R) 0*4096kB = 197116kB

Note the large number of 4kb pages here, and the lack of bigger chunk sizes…. This might indicate memory fragmentation in the system. Essentially the problem is that you have a lot of small chunks of memory available, but there are few, if any, larger chunks available to service larger memory allocations. 

One possible solution is to force some memory compaction:

#force memory compaction
# echo 2>/proc/sys/vm/compact_memory

(see https://www.kernel.org/doc/Documentation/sysctl/vm.txt for more on using this command to force memory compaction).

Note that once you get to the point that you have started generating ORA_27301 errors, memory compacting probably won't solve the issue. Your probably going to have to restart the impacted database servers to get back in business quickly. Thats right, the server. It's probably not going to be enough to just stop and restart the cluster services. 

Anyway…. Hope you find this useful!