Sunday, March 25, 2012

Node Hang on Reboot During Exadata Patching

If you have patched Exadata a few times, you probably know the dreaded situation in which nodes, especially the cell nodes, refuse to come back up during patching.

You wait and wait, and learn that all the cells have restarted successfully except one, or sometimes two. The patching completes, the imageinfo command shows that the Active image version has been updated on the cells that restarted, and now if only that down cell would come up...

Eventually you restart the cell through ILOM, or you or your SA hard reboot it. It comes back, and you find that its Active image version still points to the older version. You sift through the logs, check the USB version, and all that stuff.

This situation is likely caused by a lock on udev. So it's a very good idea to check for this kind of locking before cell patching with the following command:

/opt/oracle.cellos/validations/init.d/checkdeveachboot

If you find any locks, reboot the cell and then proceed with the patching. If it happens in the middle of patching, and you find that a cell brought up through a hard reboot still has the older image version, check for these locks, reboot the cell, and then apply the patch.
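To run that pre-patch check on every cell in one go, something like the following Python sketch may help. It assumes passwordless ssh as root to each cell, and the function names are my own for illustration. Since the output format of the check script is not covered here, the sketch does not try to interpret it; it only collects the raw output and exit status per cell for you to review.

```python
import subprocess

# Path of the check script, as given in the post above.
CHECK_SCRIPT = "/opt/oracle.cellos/validations/init.d/checkdeveachboot"


def ssh_runner(cell, command):
    """Run a command on a cell over ssh as root (assumes passwordless ssh).

    Returns a tuple of (exit_code, combined stdout and stderr).
    """
    result = subprocess.run(
        ["ssh", f"root@{cell}", command],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout + result.stderr


def run_precheck(cells, runner=ssh_runner):
    """Run the check script on every cell and collect the raw results.

    The results are not interpreted; a person still has to read them
    and decide whether a cell needs a reboot before patching.
    """
    results = {}
    for cell in cells:
        results[cell] = runner(cell, CHECK_SCRIPT)
    return results
```

Call run_precheck with the list of cell host names from your own cell group file and print each cell's exit code and output before starting the patch. The runner argument is there only so the loop can be tried out without touching a real cell.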

Friday, March 9, 2012

CheckHWnFWProfile: FATAL is not always that FATAL

From the MOS document ID 1274318.1:

To verify the hardware and firmware configuration for a storage server, execute the following "cellcli" command as the "cellmonitor" userid:

CellCLI> alter cell validate configuration

The output will be similar to:

Cell RanDomcel08 successfully altered

If any result other than "successfully altered" is returned, investigate and correct the condition.

OK. After a disk replacement in a cell, I ran the above command from cellcli on that cell and, to my horror, got the following error:

[WARNING] All drives are not identical
[FATAL] Can not continue. See exceptions above

But it was a relief to learn from the SR that, because the firmware of the newly replaced disk was the latest version, the error could be ignored.
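If you capture the output of the validate command in a script, the rule from the MOS note (anything other than "successfully altered" needs a look) can be sketched roughly as below. This is only an illustration in Python: the function name is hypothetical, and the pattern is based solely on the two sample lines shown above, so other message formats may not be caught.

```python
import re

# Text the MOS note says to expect when validation passes.
SUCCESS_MARKER = "successfully altered"

# Assumption: problem lines start with [WARNING] or [FATAL], as in the sample above.
LEVEL_LINE = re.compile(r"^\s*\[(WARNING|FATAL)\]\s*(.*)$")


def classify_validation(output):
    """Classify text printed by 'alter cell validate configuration'.

    Returns (ok, findings), where findings is a list of (level, message)
    tuples. ok is True only when the success marker is present and no
    WARNING or FATAL line was found.
    """
    findings = []
    for line in output.splitlines():
        match = LEVEL_LINE.match(line)
        if match:
            findings.append((match.group(1), match.group(2).strip()))
    ok = SUCCESS_MARKER in output and not findings
    return ok, findings
```

A result that is not ok still needs a human decision. As in the case above, a FATAL caused by a replacement disk with newer firmware may turn out to be harmless, but that is for the SR to confirm.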