wiki:UGI-2008-001
Last modified on 06/13/08 13:33:33

Power Loss Event Notice for 2008-06-12

The storm that passed through Birmingham yesterday around 5:00pm knocked out power to Lister Hill Library, where the equipment for UABgrid, @lab, openidp.org, and myvocs.org is housed. These systems began shutting down by 5:10pm.

Power was restored around 7:40pm according to the logs of the servers as they restarted. The physical systems are configured to automatically return to their power-on status.

By 9:00pm all systems and services were restored to normal operation.

Some services required manual intervention to restore. The following list of exceptions identifies issues that need to be addressed:

Forced File System Checks

Servers with EXT3-based file systems whose uptime exceeded the maximum time allowed between file system integrity checks were forced into a full file system check during their boot sequence. All affected systems reported successful completion of the checks and resumed normal operation.

This affected the NAS systems that provide the user and group storage space for the @lab VO. These are the systems labeled disk2-disk4, disk6-disk7, and disk9 in rack1 (disk5 and disk8 have not been in use and remain in their power off state). It also affected vmserver2, which provides VMware Server services to the @lab.

There is nothing technically wrong with a forced file system check, but it can lead to unexpected problems, especially after unclean shutdowns due to power loss events. The NAS systems are older, slower equipment with large data partitions (134GB), which can significantly delay the time before they successfully boot. If a dependent service is waiting for data on these drives, that service can be affected as well.

The main impact here was for the vmserver2 system that runs virtual machines housed on the disk4 and disk9 NAS devices. It is not clear if that proved to be an issue. See the following discussion on VMs.

This situation needs to be resolved either by a policy of forced file system checks during scheduled maintenance windows or by using an alternate file system. Scheduled maintenance windows would be easier to implement but require support schedules not typical of a pre-production environment. ReiserFS is an alternate file system that does not force periodic checks, but the current NAS builds don't support it.
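If we go the maintenance-window route, the periodic-check policy can be adjusted directly on the ext3 volumes with tune2fs. A rough sketch, assuming a placeholder device name for a NAS data partition (substitute the real device):

```shell
# Show the current forced-check policy (mount count and time interval)
# on an ext3 volume. /dev/sdb1 is a placeholder device name.
tune2fs -l /dev/sdb1 | grep -Ei 'mount count|check'

# Disable the count- and time-based forced checks so fsck can instead
# be run deliberately during a scheduled maintenance window:
tune2fs -c 0 -i 0 /dev/sdb1
```

Disabling the automatic checks only makes sense if the maintenance-window fsck policy is actually put in place; otherwise the volumes go unchecked indefinitely.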

Unintended Server Operation

The older physical system that originally served as metric.it.uab.edu but was decommissioned in July 2007 had been left running off a rescue CD and returned to power-on status. During the unattended reboot, the OpenSUSE 10.1 CD boot menu timed out and automatically booted the original OS image on the underlying system.

This brought up a server with an IP address conflicting with its replacement server (meter.lab.ac.uab.edu). One of the services at this IP accessed by @lab clients (desktops and servers) is the LDAP account database. This caused confusion for these systems. The conflict was resolved when the old system was powered off.

This system will remain in the powered off state to avoid this unintended side effect in the future. We need to verify that all services/data originally on this box have been relocated and then fully decommission it.
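When that box (or any rebuilt host) is next brought onto the network, this class of conflict can be caught up front by probing for a duplicate address before the interface goes live. A sketch using iputils arping; the interface name and address below are placeholders, not the actual @lab values:

```shell
# Duplicate Address Detection: with -D, arping exits non-zero if any
# other host answers ARP for the address. eth0 and 192.168.1.10 are
# placeholder values -- substitute the real interface and IP.
arping -D -c 2 -I eth0 192.168.1.10 \
  && echo "address free" \
  || echo "conflict: another host is already using this address"
```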

LDAP Inconsistent Database State

Due to the unclean shutdown, the production LDAP service on meter.lab.ac.uab.edu for account information had database files in an inconsistent state. The error wasn't immediately apparent because slapd was running and not reporting any errors; it simply was not listening on the LDAP service port (noticed when clients stopped recognizing users, and confirmed when a getent passwd on meter showed none of the normal user accounts). The /var/log/messages file didn't indicate any error condition either.

OpenLDAP uses BerkeleyDB backends by default, and these are sensitive to corruption. The server was shut down, the database inconsistency was repaired with db_recover, and the server was restarted. This brought LDAP service for @lab back into operation:

service ldap stop    # stop slapd before touching the database files
cd /var/lib/ldap
db_recover -v        # run BerkeleyDB recovery on the backend databases
service ldap start
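After the restart, the recovery can be sanity-checked from the server itself. A hedged sketch (the host name is from above; port 389 is assumed to be the standard LDAP service port in use here):

```shell
# Confirm slapd is actually listening on the LDAP service port
netstat -tln | grep ':389'

# Confirm account lookups work again through NSS
getent passwd | head

# Anonymous base search directly against the server
ldapsearch -x -h meter.lab.ac.uab.edu -b '' -s base '(objectclass=*)'
```

This would have caught the "slapd running but not listening" failure mode above much faster than waiting on client reports.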

VMware Server Restart Failures

UABgrid Pilot Platform

All of the core services except clusters and data storage for the UABgrid pilot are run as virtual machines. This includes the identity and collaboration services on vo.uabgrid.uab.edu, the trac instances for projects.uabgrid and dev.uabgrid, ca.uabgrid, wiki.uabgrid, and apps.uabgrid.

The VMware Server platform on gridhost1 was not able to restart successfully. Clients could not connect to port 904, and attempts to restart vmware on the host met with errors from vmmon.

The first problem was an unused VM whose directory had been removed. Its entry was taken out of the default /srv/vm directory to avoid the error, but that didn't resolve the problem; errors from vmmon were still reported.

After forcing the load of vmmon.ko, an error was finally reported that VMware Server had not been configured yet. Running vmware-config stepped through the basic configuration again with all the defaults left in place (no rebuilding of modules or reconfiguring of the network was needed). The reconfiguration fixed the vmmon problem and started the vmware services successfully.
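For the record, the recovery roughly corresponds to the following sequence. The command names below are my best reconstruction for VMware Server 1.x (vmware-config.pl is its configuration script, and the init script path is an assumption for this host's distribution):

```shell
# Approximate recovery sequence on gridhost1 (VMware Server 1.x era;
# exact script names and paths may differ on other builds):
/etc/init.d/vmware stop
vmware-config.pl          # re-run configuration; defaults were accepted
/etc/init.d/vmware start
```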

The above VMs were restarted manually via vmware-server-console. All appeared to start successfully. The VM projects.uabgrid reported some non-critical file system messages and was rebooted a second time to ensure proper startup.

The VM wiki.uabgrid hosts the test Confluence instance, and it required manual startup of the Tomcat container:

ssh wiki.uabgrid
sudo su - jelai                  # Tomcat runs under the jelai account
$CATALINA_HOME/bin/startup.sh    # start the Tomcat container
exit
exit

Logins to all services were tested and appear to be operating normally. We need to make sure these VMs are started automatically in the future.

@lab Development Platform

The vmserver2.lab host wouldn't display on the rack console after power on. It had been having trouble earlier in the day, so its state was unknown. It was hard power cycled and recovered successfully. The in-use development VMs were started manually; DrupalRH7.3 and WinXP appear to be operating normally.

Comments

While frustrating, this event helped highlight where infrastructure work is needed and the granularity of service required across a variety of components. It also clarifies some of the dependencies that need resolution and helps outline the framework of UABgrid infrastructure services that can be supplied to Virtual Organizations (VOs).

We need to convert these issues into tickets and feature improvements.