
DevOps Weekly Meeting | December 4 2012

Time & Location: 11am-12:30pm in LHL164

Attending

mhanby, tanthony, jpr, billb, pavgi, dls

2012-12-04 Agenda

  • Agenda bash
  • Updates
  • Lustre status
    • Lustre is online in read-only mode
    • dedicated interactive queue to access data: qlogin -l h_rt=4:00:00 -l vf=256M -q lustre-recovery.q (see the example session after the agenda)
    • continuing to work with users to recover data
    • some users have recovered a lot of data
  • Research storage
    • 6 nodes configured with Gluster providing 167TB in a simple distributed configuration on RAID6 volumes
    • will test with select users
  • Cluster stats
    • queue wait time analysis
    • impact of new compute acquisitions
  • Info Security
    • comments on language of draft
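
For users recovering data, a minimal sketch of a session on the dedicated queue follows; the read-only source path and the destination path are placeholders, not the actual locations:

    # Request an interactive session on the recovery queue
    # (4 hours of runtime, 256M virtual memory, per the agenda item above)
    qlogin -l h_rt=4:00:00 -l vf=256M -q lustre-recovery.q

    # Copy data off the read-only Lustre filesystem; both paths below are
    # placeholders for the actual recovery source and destination.
    cp -a /lustre/scratch/$USER/project /home/$USER/recovered/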

Summary

Discussed cluster statistics and queue wait times, and the impact new hardware could have on these numbers. Good discussion on the current status of research storage. Also discussed further comments on the info sec policy and where research fits in. In the context of the ECAR report, UAB is in the top third of public research universities and so should echo its peers in this space.

Discussion

On the cluster stats:

Getting 8 cores on one die is harder than getting 32 or 64 cores spread across the cluster. Our plots of queue wait time show that 8-core and 128-core requests dominate the wait times, while 32- and 64-core requests are satisfied quickly. What we know intuitively is that the 8-core requests are likely SMP jobs and the 32- and 64-core requests come largely from the MPI community. The 8-core request is our most popular multi-core target, followed by 32 and 64. The 8-core size is likely popular with NGS and other SMP users, while 32 and 64 cores are popular with the modeling community; they may well want 128 cores or more, but that is hard to get on our cluster. We may also want to configure 8 of our gen2 8-core nodes as SMP-only targets to reduce the wait time for full-machine slots. The downside is that these nodes have only 16GB of RAM, so we may want to use 8 gen3 systems instead. What's not clear at this time is whether our researchers' science is bounded by the upper limit of MPI availability (>=128 cores) or is inherently bound to <128 cores by their applications.
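
As a rough illustration of the queue wait time analysis (a sketch, not the exact method behind our plots), average wait per slot count can be pulled from the Grid Engine accounting file along these lines; the field positions assume the standard accounting(5) layout and the path assumes a default cell, so both should be verified first:

    # Average queue wait grouped by granted slots, from the SGE accounting file.
    # Assumes colon-separated accounting(5) fields: 9 = submission_time,
    # 10 = start_time, 35 = slots. Only jobs that actually started are counted.
    awk -F: '$10 > 0 { wait[$35] += $10 - $9; jobs[$35]++ }
             END { for (s in jobs)
                       printf "%5d slots  %10.0f s avg wait  %6d jobs\n",
                              s, wait[s]/jobs[s], jobs[s] }' \
        "$SGE_ROOT/default/common/accounting" | sort -n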

If we acquire new hardware, we would likely retain the gen3 and Lustre fabrics to co-locate with it.

On the research storage:

We have half of the new servers in a Gluster configuration totaling 167TB. We will work with some of the users recovering data to exercise the configuration by making part of the space available for recovery. We will also be able to extend the recovery time window into 2013 to assist those needing more time. This footprint of 6 servers is half the server count planned for research storage, so we should still be able to develop research storage with the second half, especially if we don't need to move the servers to Huntsville right away.
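
As a point of reference, a simple distributed Gluster volume across the 6 servers might be created roughly as follows; the hostnames, brick path, and volume name are placeholders rather than the actual configuration:

    # Hedged sketch: one RAID6-backed brick per server, distributed (no replication).
    # Hostnames, brick path, and volume name are placeholders.
    gluster volume create research transport tcp \
        gstore1:/bricks/raid6 gstore2:/bricks/raid6 gstore3:/bricks/raid6 \
        gstore4:/bricks/raid6 gstore5:/bricks/raid6 gstore6:/bricks/raid6
    gluster volume start research

    # Mount on a test client for the select users exercising the space:
    mount -t glusterfs gstore1:/research /mnt/research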
