Changes between Initial Version and Version 1 of UGI-2008-001


Timestamp:
06/13/08 13:33:33 (9 years ago)
Author:
jpr@…
Comment:

Details of power loss event

  • UGI-2008-001

    v1 v1  
     1[[TOC]] 
     2= Power Loss Event Notice for 2008-06-12 = 
     3 
     4The storm that passed through Birmingham yesterday at 5:00pm knocked out power to Lister Hill Library.  This is where equipment for UABgrid, @lab, openidp.org, and myvocs.org is housed.  These systems started shutting down by 5:10. 
     5 
     6Power was restored around 7:40pm according to the logs of the servers as they restarted.  The physical systems are configured to automatically return to their power-on status.  
     7 
     8By 9:00pm all systems and services were restored to normal operation. 
     9 
     10Some services required manual intervention to restore. The following list of exceptions identifies issues that need to be addressed: 
     11 
     12== Forced File System Checks == 
     13 
     14Servers with EXT3-based file systems which had up times exceeding the maximum time allowed between file system integrity checks were forced into file system checks during their boot sequence.   All affected systems reported successful completion of the file system checks and resumed normal operation. 
     15 
     16This affected the NAS systems that provide the user and group storage space for the @lab VO. These are the systems labeled disk2-disk4, disk6-disk7, and disk9 in rack1 (disk5 and disk8 have not been in use and remain in their power off state).  It also affected vmserver2, which provides VMware Server services to the @lab. 
     17 
     18There is nothing technically wrong with the forced file system check but it can lead to unexpected problems, especially after unclean shutdowns due to power loss events.  The NAS systems are older, slower equipment and have large data partitions (134Gb).  This can delay the time before they successfully boot.  If there is a dependent service waiting for data on these drives, this service can be affected as well. 
     19 
     20The main impact here was for the vmserver2 system that runs virtual machines housed on the disk4 and disk9 NAS devices.  It is not clear if that proved to be an issue. See the following discussion on VMs. 
     21 
     22This situation needs to be resolved either by a policy of forced reboots during scheduled maintenance windows or by using alternate file systems.  Scheduled maintenance windows would be easier to implement but require support schedules not typical of pre-production.  ReiserFS is an alternate file system that does not force this check, but the current NAS builds don't support it. 
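
The settings that trigger a forced check can be reviewed before a reboot. The following is a minimal sketch rather than a procedure used during this event: it assumes the standard `tune2fs -l` listing from e2fsprogs, and `fsck_settings` is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# Sketch: summarize the ext3 check-forcing settings from a `tune2fs -l` listing.
# The listing is read on stdin, so the helper can be tried without root access.
# fsck_settings is a hypothetical name, not an existing tool.
fsck_settings() {
    awk '
        # Strip the "Label:   " prefix and return the remaining value.
        function value(line) { sub(/^[^:]*: */, "", line); return line }
        /^Mount count:/         { mounts = value($0) }
        /^Maximum mount count:/ { max = value($0) }
        /^Check interval:/      { interval = value($0) }
        /^Next check after:/    { next_check = value($0) }
        END {
            printf "mount count: %s of %s\n", mounts, max
            printf "check interval: %s\n", interval
            if (next_check != "") printf "next timed check: %s\n", next_check
        }'
}
```

A possible invocation is `tune2fs -l /dev/sda1 | fsck_settings`, where `/dev/sda1` is a placeholder for the partition in question. The thresholds themselves can be changed with `tune2fs -c` (maximum mount count) and `tune2fs -i` (check interval), though whether to relax them is a policy decision.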
     23 
     24== Unintended Server Operation == 
     25 
     26The older physical system that originally served as metric.it.uab.edu but was decommissioned in July 2007 was running off a rescue CD and returned to power on status.  During this unattended reboot the OpenSUSE 10.1 CD boot menu timed out and automatically booted the original OS image on the underlying system.   
     27 
     28This brought up a server with an IP address conflicting with its replacement server (meter.lab.ac.uab.edu).  One of the services at this IP accessed by @lab clients (desktops and servers) is the LDAP account database.  This caused confusion for these systems.  The conflict was resolved when this system was powered off. 
     29 
     30This system will remain in the powered off state to avoid this unintended side effect in the future.  We need to verify that all services and data originally on this box have been relocated and then fully decommission it. 
     31 
     32 
     33== LDAP Inconsistent Database State == 
     34 
     35Due to the unclean shutdown, the production LDAP service on meter.lab.ac.uab.edu for account information had database files in an inconsistent state.  This error wasn't apparent because slapd was running and not reporting any errors, except that it was not listening on the LDAP service port (noticed when clients stopped recognizing users and confirmed with a `getent passwd` on meter that showed none of the normal user accounts).  The /var/log/messages didn't indicate any error condition, though. 
     36 
     37OpenLDAP uses BerkeleyDB backends by default and they are sensitive to corruption.  The server was shut down, the database environment was recovered with db_recover, and the server was restarted. This brought LDAP service for @lab back into operation: 
     38{{{ 
     39service ldap stop 
     40cd /var/lib/ldap 
     41db_recover -v 
     42service ldap start 
     43}}} 
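
Because slapd can be running without answering queries, a check after boot may catch this state sooner. The following is a sketch under stated assumptions, not the check used during this event: `LDAP_URI`, `ldap_answers`, and `account_resolves` are hypothetical names, and the `ldapsearch` call assumes the OpenLDAP client tools are installed and that anonymous reads of the root DSE are permitted.

```shell
#!/bin/sh
# Sketch: detect the "slapd running but not answering" state described above.
# LDAP_URI is an assumed default; adjust it to the real directory server.
LDAP_URI="${LDAP_URI:-ldap://localhost}"

# Succeeds only if the directory answers an anonymous root DSE query.
ldap_answers() {
    ldapsearch -x -H "$LDAP_URI" -s base -b "" >/dev/null 2>&1
}

# Succeeds only if the name service can resolve the given account name.
account_resolves() {
    getent passwd "$1" >/dev/null 2>&1
}
```

For example, `ldap_answers || echo "LDAP is not answering"` could run on the server itself, and `account_resolves someuser` on a client, where `someuser` stands for an account known to live only in LDAP.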
     44 
     45== VMware Server Restart Failures == 
     46 
     47=== UABgrid Pilot Platform === 
     48 
     49All of the core services except clusters and data storage for the UABgrid pilot are run as virtual machines.  This includes the identity and collaboration services on vo.uabgrid.uab.edu, the trac instances for projects.uabgrid and dev.uabgrid, ca.uabgrid, wiki.uabgrid, and apps.uabgrid. 
     50 
     51The VMware Server platform on gridhost1 was not able to restart successfully.  Client connections to port 904 were not possible, and attempts to restart VMware on the host met with errors from vmmon.   
     52 
     53The first problem was an unused VM whose directory had been removed. This was taken out of the default /srv/vm directory to avoid the error, but that didn't resolve the problem. Errors with vmmon were still reported. 
     54 
     55After forcing the load of vmmon.ko, an error was finally reported that vmware server had not been configured yet.  Running vmware-config stepped through the basic steps again but left all the defaults in place (no rebuilding of modules or reconfiguring of network was needed).  The reconfig fixed the vmmon problem and started vmware services successfully.  
     56 
     57The above VMs were restarted manually via vmware-server-console.  All appeared to start successfully.  The VM projects.uabgrid reported some non-critical file system messages and was rebooted a second time to ensure proper startup.  
     58 
     59The VM wiki.uabgrid hosts the test Confluence instance and it required manual startup of the Tomcat container: 
     60{{{ 
     61ssh wiki.uabgrid 
     62sudo su - jelai 
     63$CATALINA_HOME/bin/startup.sh 
     64exit 
     65exit 
     66}}} 
     67 
     68Login to all services was tested and appears to be operating normally. We need to make sure these VMs are started automatically in the future. 
     69 
     70=== @lab Development Platform === 
     71 
     72The vmserver2.lab host wouldn't display on the rack console after power on.  It had been having trouble earlier in the day, so its state was unknown. It was hard power cycled and recovered successfully. The in-use development VMs were started manually.  DrupalRH7.3 and WinXP appear to be operating normally. 
     73 
     74== Comments == 
     75 
     76While frustrating, this event helped highlight where infrastructure work is needed and the granularity of service that is needed for a variety of components.  It also clarifies some of the dependencies that need resolution and helps highlight the framework for UABgrid infrastructure services that can be supplied to Virtual Organizations (VO). 
     77 
     78We need to convert these issues into tickets and feature improvements.