Last modified on 12/18/12 13:02:57

DevOps Weekly Meeting | December 18 2012

Time & Location: 11am-12:30pm in LHL164

Attending

mhanby, tanthony, jpr, pavgi, billb

2012-12-18 Agenda

  • Agenda bash
  • Updates
  • HPC shutdown and updates
  • Galaxy FTP and Lustre testing
  • Community updates

Summary

Meeting focused on the updates for the service window, with side discussions on key features that need to be brought online in the near future. Additional updates covered the Galaxy FTP tests and the user community.

Discussion

Service Window

The fibre channel on the DDN in RUST has 2x2Gb connections and 2x8Gb connections. The RUST to BEC data center links for the OSSs are 4Gb/s. So we have an imbalance on the DDN switch interface that should be resolved, and could potentially be resolved with the new SAN and fibre switch going into RUST. We have data from the OST scans that indicates an imbalance in performance between these links.

Since there is no Lustre upgrade in this window, our main Lustre issue is to resolve the mount mapping. We need to build our /scratch/[user|shared] name space and would like to retain read-only access to the lustre-recovered space for continued recovery operations. There is still potential to do an upgrade if we need it to resolve the sub-volume mounting. We also need to test bind mounts as a potential solution.

Assuming no bind mounts and no sub-volume mounting, we need to figure out how to build our scratch space. The goal is /scratch/[user|shared|local|custom|etc]. An open question is how to avoid links that users could discover, which would lead to the use of non-canonical names.
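One way to reason about the non-canonical name concern is to check whether a path still resolves to itself once symbolic links are followed. The following is a minimal Python sketch, not part of the cluster configuration. The function names is_canonical and canonical_name are hypothetical and were introduced only for illustration.

```python
import os


def is_canonical(path):
    """Return True if no component of path is a symbolic link.

    os.path.realpath follows every symlink, so a path that differs from
    its resolved form is a non-canonical name for the same location.
    """
    absolute = os.path.abspath(path)
    return os.path.realpath(absolute) == absolute


def canonical_name(path):
    """Return the symlink-free name for path (hypothetical helper)."""
    return os.path.realpath(path)
```

Under the assumption that the scratch name space were built from symlinks into the Lustre mount, is_canonical would return False for any path that passes through one of those links, and canonical_name would return the underlying name a user could discover. Bind mounts do not show up this way, since they are not links.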

Other updates: compute nodes and servers will all get firmware updates, and OS updates if needed. We will debug the Cheaha 10G link issue, and the SGE RPM updates will be applied to the compute nodes so we can run a newer version that fixes the max RAM report and the occasional crashes.

The large memory nodes have a preliminary boot, but the Broadcom drivers are incompatible. The driver is tg3, but the specific version needed is not available in the current initrd. Mike will try putting the new driver in the existing initrd for the PXE boot of the large memory nodes.

A side project is moving head node services to VMs, e.g. SGE, DHCP, and ROCKS admin. We need to move to ROCKS 6 (or 7) at some point. Also, the EL6 kernel has more control over process resources, so we can balance load a little more via the security limits conf file. We can also go the route of dynamically provisioning extra login nodes within ROCKS.
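To illustrate the kind of per-process control involved: assuming the security limits conf file refers to /etc/security/limits.conf, the values there are applied by pam_limits as per-process resource limits (rlimits). The Python sketch below adjusts one such limit for the current process only. The function name set_soft_limit is hypothetical, and the choice of RLIMIT_CORE in the usage note is just an example, not a limit the notes call for.

```python
import resource


def set_soft_limit(limit, value):
    """Set the soft value of an rlimit, clamped to the hard limit.

    The hard limit is left unchanged, so no special privilege is needed.
    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(limit)
    if hard == resource.RLIM_INFINITY:
        new_soft = value
    else:
        new_soft = min(value, hard)
    resource.setrlimit(limit, (new_soft, hard))
    return resource.getrlimit(limit)
```

For example, set_soft_limit(resource.RLIMIT_CORE, 0) disables core dumps for the calling process and its children. A limits file entry would apply the same kind of limit to every login session instead.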

Upcoming is completing the management network between BEC and RUST. Bill will create a Kintana ticket to rack the switch and connect the iDRAC interfaces from RUST to the BEC/RCS management network.

We are also working to update the InCommon key configured for the web services login to a 2048-bit key pair. The current vo.uabgrid system uses Shib 1.3 and a 1024-bit key pair. It may be necessary to update the entire stack of software supporting vo.uabgrid, since it was originally built as a purpose-built device (embedded VM) in Fall 2006 as part of the myVocs project.

Lustre + Galaxy FTP tests

The Galaxy FTP upload code has a habit of moving data from the import directory (/scratch/importfs/$USER), processing the data, and then moving it back, closing with a chmod. This works fine for binary data, but for text it copies the data from the importfs to /scratch/shared/galaxy/temp, processes it, and moves it back to the importfs, concluding with the chmod. This would ordinarily not be expected to work. A test environment for Lustre has been set up and is being updated to match our existing Lustre environment. Will test with the re-export of
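For reproducing the behavior in the test environment, the move, process, move back, chmod sequence described above can be sketched in Python. This is not Galaxy's actual upload code. The function name round_trip, the process callback, and the default mode are all assumptions made for illustration.

```python
import os
import shutil


def round_trip(import_path, temp_dir, process, mode=0o644):
    """Move a file to temp_dir, process it, move it back, then chmod it.

    process is any callable that rewrites the file at the given path
    in place. The mode default is an assumption, not Galaxy's value.
    """
    temp_path = os.path.join(temp_dir, os.path.basename(import_path))
    # shutil.move renames within one filesystem and falls back to
    # copy-then-delete when source and destination are on different ones.
    shutil.move(import_path, temp_path)
    process(temp_path)
    shutil.move(temp_path, import_path)
    os.chmod(import_path, mode)
    return import_path
```

Running this with import_path under the importfs and temp_dir on the Lustre test file system would exercise the same cross-filesystem move and the final chmod, which are the steps most likely to be sensitive to how the import directory is exported.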

User Community Updates

COINS-2 study support: data has been staged and documentation is in the online docs. Data processing will follow the service window, with QA analysis in early January. Potential paper submission to the IEEE Southeast Conference in Jacksonville, FL, April 4-7.