Trouble in Exadata City
If you have applied an Exadata patch or two, you know that there are a few gotchas that can lurk and stab you in the back. I thought I'd provide my top-n list of Exadata patching gotchas in the hope that you won't suffer from them.
- Didn't run/address Exacheck findings
- Root FS full or nearly full
- Didn't back up the compute node boot partition.
- Lost passwords
- Node misconfigurations
- Didn't RTFM
- Backups only to Exadata cell disks
- ILOM not working
- Don't know how to use ILOM
- "This is the way we do things" itis.
- The patches are cumulative - Really?
- Forgetting to patch GHomes.
- Forgetting to patch all the OHomes.
- Forgetting to relink Oracle properly (this is a major performance problem).
- Opening SRs without using your Oracle Exadata CSI.
In the next several posts I will address a few of these items in order. As I move along, I might add a few here if they come to me, or are suggested by others.
Note that in my stories, the names have been changed to protect the innocent. Nothing I say will have any relationship to Wikileaks at all, nor will I be revealing any government secrets. I'd tell you that the UFOs are real and that the aliens have landed, but then I'd have to kill you.
So, let's start with the first few:
Didn't run/address Exacheck findings
I've done a few posts on Exacheck. It is, perhaps, one of the most underused tools that the DBA or Exadata Machine administrator has available. Exacheck gets better and better and, in my opinion, really should be part of your daily administrative reports.
Not running an Exacheck report before you start patching (both in the planning stages and again just before the actual application), and not *reading* it (yes - there are those who run reports and then don't read them! - I KNOW!!), is asking for trouble. You should clearly understand the meaning of every item that is listed on that Exacheck report and why it's showing up on your system. You should document the recurring items that you already know are OK (but always check the details in case something small and significant has changed) and deal with those errors or warnings that are new and not on your "safe to ignore" list.
I will not start a patch set application until I'm happy that the Exacheck is clean. That does not mean a score of 100%. It means that you understand, clearly, the reasons for all of the errors and warnings, and that you could describe each of those reasons to me so that I'd understand. If you can't verbalize the reasons for them (short of being mute), then in most cases you don't understand them.
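One way to keep a "safe to ignore" list honest is to diff each run's findings against it. The Python below is only a minimal sketch of that idea. It does not parse Exacheck output; it assumes you have already copied the findings into plain lists or text files, one finding per line. The function name and the sample finding strings are hypothetical.

```python
def triage_findings(current, known_ok):
    """Compare this run's findings against the documented 'safe to ignore' list.

    Both arguments are iterables of strings, one finding per entry
    (an assumed hand-prepared format, not Exacheck's own report format).
    Returns three sorted lists: (new, reviewed, stale).
    """
    current_set = {line.strip() for line in current if line.strip()}
    known_set = {line.strip() for line in known_ok if line.strip()}

    new = sorted(current_set - known_set)       # needs investigation before patching
    reviewed = sorted(current_set & known_set)  # known items; still re-check the details
    stale = sorted(known_set - current_set)     # no longer reported; prune from the list
    return new, reviewed, stale


if __name__ == "__main__":
    # Hypothetical finding text, purely for illustration.
    this_run = ["WARNING: example check A", "FAIL: example check B"]
    safe_list = ["WARNING: example check A", "WARNING: example check C"]
    new, reviewed, stale = triage_findings(this_run, safe_list)
    print("New:", new)
    print("Previously reviewed:", reviewed)
    print("No longer reported:", stale)
```

Anything that lands in the "new" bucket is what you need to be able to explain before patching starts. Exact string matching is a deliberate simplification, so a finding whose wording changes will surface as new, which errs on the side of making you look at it again.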
I can't point my finger at any Exadata upgrades that I've been involved in where checking the Exacheck report would have solved a major problem that we ran into. I can point to more than one or two where checking the Exacheck report likely saved us a problem.
Root FS full or nearly full
I've actually run into this more than a few times. I'll find that on one of the compute nodes the root file system is dangerously close to filling up. In fact, I just had a case where the root FS did fill up, which caused a system panic and rebooted the node.
Monitoring the root file system is, frankly, a basic production monitoring responsibility. If you have Oracle Platinum services and the Platinum gateway - understand that they are not monitoring the size of the root file system for you. They will notice when it fills up and the node crashes, but by then it's a bit too late. You should configure monitoring for the space in file systems, including root, on any Oracle Database Machine, including Exadata. OEM is the proper place to be doing this monitoring.
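If OEM monitoring isn't in place yet, even a tiny scheduled check beats finding out from a node crash. The following Python is a minimal sketch of such a stopgap, not a replacement for OEM. The function names and the 80%/90% thresholds are my own assumptions, so pick values that suit your environment.

```python
import shutil


def fs_usage_percent(path="/"):
    """Return the percentage of space used on the file system containing path.

    Computed as used / total, so it can differ slightly from the Use% column
    of df, which accounts for reserved blocks differently.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def check_filesystems(paths, warn_at=80.0, crit_at=90.0):
    """Return (path, percent_used, level) for each path at or above warn_at."""
    alerts = []
    for path in paths:
        pct = fs_usage_percent(path)
        if pct >= crit_at:
            alerts.append((path, pct, "CRITICAL"))
        elif pct >= warn_at:
            alerts.append((path, pct, "WARNING"))
    return alerts


if __name__ == "__main__":
    # Assumed list of mount points; extend it with whatever else you care about.
    for path, pct, level in check_filesystems(["/"]):
        print(f"{level}: {path} is {pct:.1f}% full")
```

Run from cron on each compute node, with the output mailed or logged somewhere that someone actually reads, this gives you at least some warning before the root file system fills.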
Often I find that the Root FS is filled with the following:
1. Files transferred into /tmp - typically Oracle Data Pump dump files or sometimes files related to data feeds that were sftp'd to or from the Exadata Machine.
2. Core dump files that never got cleaned up.
3. Oracle-related files, such as output files accidentally created in the root file system, and the like.
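To track down which files are eating the space, it helps to list the largest files that actually live on the root file system, without wandering into other mounts. The Python below is a rough sketch of that search. The function names and the default of 20 results are assumptions of mine, and it reports apparent file size (st_size), which can overstate the disk space used by sparse files.

```python
import heapq
import os
import stat


def _same_device(path, device):
    """True if path can be stat'ed and sits on the given device number."""
    try:
        return os.lstat(path).st_dev == device
    except OSError:
        return False


def largest_files(root="/", top_n=20):
    """Return up to top_n (size_bytes, path) pairs, largest first.

    Only regular files on the same file system as root are considered.
    """
    root_dev = os.lstat(root).st_dev
    sizes = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune subdirectories on other file systems (other mounts, /proc, NFS)
        # so the walk never descends into them.
        dirnames[:] = [
            d for d in dirnames if _same_device(os.path.join(dirpath, d), root_dev)
        ]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or cannot be read; skip it
            if stat.S_ISREG(st.st_mode):
                sizes.append((st.st_size, path))
    return heapq.nlargest(top_n, sizes)
```

Calling `largest_files("/")` as root and printing the pairs usually makes forgotten dump files and core dumps obvious. Run it as an unprivileged user and unreadable directories are silently skipped, so the list may be incomplete.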
You should regularly scour your / file system and clean it up. Also, note that Oracle provides a script that you can use to back up the boot partition (/dev/VGExaDb/LVDbSys1) to a backup partition (/dev/VGExaDb/LVDbSys2). You should use that script anytime something major changes on the root file system (for example, you add a new OH).
The script name is dbserver_backup.sh. You can find more information on this script in MOS note 1473002.1. In my mind, I'd also schedule a regular backup of the root file system to some non-Exadata media such as an attached ZFS storage device.
Next time we will cover another 2-3 items on my list. Stay tuned!