Last modified on 11/27/12 13:22:26

DevOps Weekly Meeting | November 27, 2012

Time & Location: 11am-12:30pm in LHL164

Attending

mhanby, tanthony, jpr

2012-11-27 Agenda

  • Agenda bash
  • Updates
  • Lustre status
    • lustre is online in read-only
    • dedicated interactive queue to access data: qlogin -l h_rt=4:00:00 -l vf=256M -q lustre-recovery.q
    • working with users to recover data
    • reviewed tanthony's recovered scratch and collaboration group files during the meeting
  • Large mem nodes
    • IB is here
    • should go online this week
    • kickstart issues with drivers for network card
  • Research storage
    • nodes in RUST on new rack
    • new nodes will be wired to force10
    • need to monitor UPS load before remaining 4 go online
    • exchange multimode uplink with singlemode
  • Collaboration group support
    • alternate username with -groupname extension to allow login with a different default group and to inherit the alternate group across the scheduler (qsub/qlogin)
    • dedicated head node for collaboration group

Summary

The meeting focused on stepping through the lustre-recovery steps we've been sharing with users, using tanthony's files as the working case. There was a side discussion on collaboration group support. The rest of the items were simple status updates.

Discussion

For users who participate in multiple groups, it is desirable to work on each project with the group membership of that project. Linux has the newgrp command, which lets you switch to an alternate group that you are a member of, but it only creates a sub-shell with the new settings and, more significantly, the settings don't persist across machine boundaries. This means that a job submitted from the sub-shell will still run on the cluster with your primary group membership.
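As a rough illustration, the mismatch can be checked from inside a shell or a job with a few lines of Python. This is a sketch only: the function names are made up for this note, and it simply compares the group the current process is running with against the primary group recorded in the password database.

```python
import grp
import os
import pwd


def group_name(gid: int) -> str:
    """Return the group name for gid, or the numeric id if it has no entry."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


def group_report() -> dict:
    """Compare this process's group with the account's primary group."""
    # Real group id of this process; inside a newgrp sub-shell this is the new group.
    current_gid = os.getgid()
    try:
        # Primary group as recorded in the password database.
        passwd_gid = pwd.getpwuid(os.getuid()).pw_gid
    except KeyError:
        # No passwd entry for this uid; fall back to the current group.
        passwd_gid = current_gid
    return {
        "current_group": group_name(current_gid),
        "passwd_group": group_name(passwd_gid),
        "switched": current_gid != passwd_gid,
    }


if __name__ == "__main__":
    print(group_report())
```

Run in a newgrp sub-shell on the head node, "switched" would be expected to be true; run inside a job submitted from that sub-shell, it would be expected to be false, which is the behavior described above.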

One solution is to use account aliases (alternate usernames) that are tied to a different primary group. A user could then ssh to the head node with the aliased account, and that session would use the alternate primary group membership for their collaboration activities. This has the advantage of also working (naturally) across host boundaries, so long as the aliased account is used. Also, with the collaboration group's id as the primary group, SGE accounts for the cluster usage under that group name. This may be our best near-term solution, and we will try it out with tanthony.
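A rough sketch of how the alias names could be built and recognized, assuming the username-plus-group-name suffix listed in the agenda. The helper names and the "collab" group are hypothetical; the real aliases would be created in the account database, not by this code.

```python
from typing import Iterable, Optional, Tuple


def make_alias(username: str, group: str) -> str:
    """Build the alias login for a user acting under a collaboration group."""
    return f"{username}-{group}"


def split_alias(login: str, known_groups: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Split a login into (base username, group), or (login, None) if it is not an alias."""
    # Try longer group names first so a group like "my-lab" wins over "lab".
    for group in sorted(known_groups, key=len, reverse=True):
        suffix = "-" + group
        if login.endswith(suffix) and len(login) > len(suffix):
            return login[: -len(suffix)], group
    return login, None


if __name__ == "__main__":
    alias = make_alias("tanthony", "collab")
    print(alias)
    print(split_alias(alias, {"collab"}))
```

Matching against a list of known groups, rather than splitting on the last hyphen, keeps ordinary usernames that happen to contain a hyphen from being mistaken for aliases.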

Another solution is to create dedicated head/login nodes for collaboration groups. This has the advantage of retaining the user's single login name and only changing the group membership for interactions with that node. It also allows for a lot of customization of the environment for the collaboration group, i.e., a file system namespace tuned for the group with custom apps and even highly tuned file system access (restricted dirs or lab storage come to mind). The problem would come when leaving that space via SGE to access the HPC resources: SGE would base its process setup on the global password file, so jobs would again run with the user's primary group.

Finally, this may be a limitation of SGE only, and there may be better solutions with Slurm or Condor, our eventual scheduling targets.