wiki:DevOps-2013-01-22
Last modified on 01/22/13 14:38:33

DevOps Weekly Meeting | January 22, 2013

Time & Location: 11am-12:30pm in LHL164

Attending

mhanby, tanthony, jpr, pavgi

2013-01-22 Agenda

  • Agenda bash
  • Updates
    • HPC shutdown and updates successful
    • UABgrid InCommon 2048-bit key rollover complete
    • VM build process verified ticket:212
  • Research Cloud
    • Jumpstart config from Dell for Openstack+Ceph+Crowbar
    • Shrinking glusterfs to make room for ceph
  • Lustre updates and issues
    • Space issues: one OST is down to 88GB
  • Data recovery process
    • Development needed
  • ScaleMP and large memory nodes
  • Community growth & updates
    • New users
      • Philosophy
      • Neurology
      • SRI

Summary

HPC shutdown and software updates completed successfully. Lustre is back in production operation. A process for moving recovered files into user containers needs to be developed, with an implementation target of April 2013. A new /scratch namespace has been configured with all temporary storage rooted in that tree (e.g. Lustre, Gluster, and local disk).

The Lustre OSS has been running under a high load since Sunday night. The cause is not yet known.

ScaleMP has had some challenges installing on the new large-memory compute nodes from SSG. We may move to a pre-release of the next version to get Broadcom support. It is not clear where we stand on support status.

Discussion

Data and Storage

The theme for the year is "getting to know your data": learning about data creation patterns so researchers can keep a cleaner data footprint and maximize available storage. One example is a workflow for an experiment that creates thousands of files, each containing a single 8-byte number. It is perfectly fine to develop a workflow like this, but it is a very bad idea to store such files on a network share. It is best to keep the collection in a tarball and extract it into /scratch/local/<dir> for the duration of the job, as in the sketch below.
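
A minimal Python sketch of this pattern. The archive path, the results.tar.gz name, and the directory layout under /scratch/local are illustrative assumptions, not an existing convention:

  #!/usr/bin/env python
  # Sketch: keep the thousands of tiny files bundled as one tarball on the
  # network share, and only unpack them on node-local scratch for the job.
  import os
  import tarfile

  ARCHIVE = "/home/username/experiment/results.tar.gz"  # one object on the share
  WORKDIR = "/scratch/local/%s/results" % os.environ["USER"]

  os.makedirs(WORKDIR)

  # Extract the small files onto local disk for the duration of the job
  archive = tarfile.open(ARCHIVE, "r:gz")
  archive.extractall(WORKDIR)
  archive.close()

  # ... the job reads and writes the small files under WORKDIR ...

  # Re-bundle the results so only a single file goes back to the network share
  out = tarfile.open(ARCHIVE, "w:gz")
  out.add(WORKDIR, arcname="results")
  out.close()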

We need to move toward having the SGE TMP space automatically staged to /scratch/local. We will also need to create SGE resources for this space so jobs can be scheduled appropriately; a sketch of how a job might use that space follows.
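
A hedged sketch of a job using SGE's per-job TMPDIR once it is rooted under /scratch/local. The fallback path and cleanup behavior are assumptions about how we might set this up, and any requestable resource name (e.g. a "localscratch" complex passed via qsub -l) is hypothetical until we actually define it in SGE:

  #!/usr/bin/env python
  # Sketch: a job using SGE's per-job TMPDIR, assumed to live under
  # /scratch/local once the staging change is made.
  import os
  import shutil
  import tempfile

  # SGE exports TMPDIR for each job; the fallback here is only for running
  # the script outside the scheduler (the path is an assumption, not policy).
  scratch_root = os.environ.get("TMPDIR",
                                "/scratch/local/%s" % os.environ.get("USER", "nobody"))

  workdir = tempfile.mkdtemp(prefix="job-", dir=scratch_root)
  try:
      # ... stage inputs into workdir, run the computation, and copy
      #     results back to permanent storage ...
      pass
  finally:
      shutil.rmtree(workdir)  # leave node-local scratch clean for the next job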

How much storage goes on a compute node is a balance between compute-intensive and data-intensive computing. Our "cloud" compute nodes purchased last year were explicitly configured for these data-intensive workflows, i.e. lots of storage on the nodes, 36TB to be exact.

Summary of Openstack+Ceph+Crowbar install from Dell

This is a platform that we hope to implement with Dell's Jumpstart package. It would give us a complete platform for cloud-like services to start building on, rather than having to implement one ourselves first. Our new hardware, the 12 720xd's with 36TB disk, 96GB RAM, and 16 cores, is certified and the right target. Inktank also confirmed that installing Ceph doesn't give the whole system over to Ceph, so folks can (and some are) deploying things like Hadoop on the Ceph nodes. We hope to do the same with OpenStack on the Ceph nodes.