wiki:DevOps-2013-07-23

DevOps Weekly Meeting | July 23, 2013

Time & Location: 10:00am-1:00pm in LHL164

Attending

tanthony, jpr, pavgi, billb

2013-07-23 Agenda

Summary

  • Debug rcsfw issues
    • Fix Galaxy default route setting for rcsfw fabric
  • OpenStackPlusCeph
    • Fix rcs-srv-02 NAT rules
    • ai: need to create a UAB-public to floating-public IP translation table (see the sketch after this list)
    • ai: need to embed that table in DNS
    • ai: need an Ubuntu desktop image in Glance; may require a workaround: launch a VM from the ISO, or get the ISO into Glance, then install onto a volume and launch a subsequent instance from that volume
    • ai: jpr: apply changes to Ceph read caching (requires a nova-compute restart); pending a better understanding of Crowbar and Chef
    • todo: we need access to the admin node interface via the controller
    • todo: we need to engage with Dell on the Crowbar limitation on storage use; it is unclear whether we will see storage improvements
  • NGS & clcgenomics trial
    • Invitations sent
    • discussing RAM upgrades for gen3 nodes to support NGS SMP codes
  • DSpace report
    • ai: jpr: need to complete draft
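
As a starting point for the translation-table action items above, here is a minimal Python sketch of a UAB-public to floating-IP table and of rendering it as BIND-style A records for embedding in DNS. All hostnames and addresses below are hypothetical placeholders (documentation ranges), not actual assignments:

{{{#!python
# Hypothetical UAB public IP -> OpenStack floating IP translation table.
TRANSLATION_TABLE = {
    "198.51.100.10": "192.168.100.10",
    "198.51.100.11": "192.168.100.11",
}

# Hypothetical hostnames for the public addresses.
HOSTNAMES = {
    "198.51.100.10": "galaxy.example.uab.edu",
    "198.51.100.11": "ngs-trial.example.uab.edu",
}

def zone_records():
    """Yield BIND-style A records for the public IPs, keeping the
    floating IP in a trailing comment for operator reference."""
    for public, floating in sorted(TRANSLATION_TABLE.items()):
        yield "%s. IN A %s  ; floating: %s" % (HOSTNAMES[public], public, floating)

for record in zone_records():
    print(record)
}}}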

Discussion

Reviewed examples of several public research clouds to show how we should present our efforts to the user community. These clouds are a great example both of the kind of science being conducted using clouds as a fabric and of how to engage end users in those services. The public data sets hosted on https://www.opensciencedatacloud.org/publicdata are also key to seeing what kinds of data sets can be made available and how, both for public download (even using UDT) and by way of hosted VM images.

It's interesting to see GlusterFS heavily used in this cloud instance. We are using it too, as a stand-in for some of our space constraints, but plan to move away from it, or rather move it up the stack: Ceph will be our core data manifestation fabric. Granted, if we need more performance we could "compile" a virtual GlusterFS implementation down to real hardware. An aside: Ceph supports a client-side cache, but it should be turned off when layering a distributed file system on top, allowing the distributed FS to handle cache coherency itself.
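
On the client-cache aside: a minimal sketch of what turning the cache off could look like, assuming RBD-backed volumes. `rbd cache` is a standard Ceph client option, but the file path and this Python-driven edit are illustrative only; per the action items above, nova-compute must be restarted for the change to take effect:

{{{#!python
# Sketch: disable the Ceph client-side RBD cache so a distributed file
# system layered on top handles its own cache coherency. Assumes the
# standard /etc/ceph/ceph.conf location; adjust for the real deployment.
import configparser

CEPH_CONF = "/etc/ceph/ceph.conf"

conf = configparser.ConfigParser()
conf.read(CEPH_CONF)

if not conf.has_section("client"):
    conf.add_section("client")
conf.set("client", "rbd cache", "false")  # turn off client-side caching

with open(CEPH_CONF, "w") as f:
    conf.write(f)
}}}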

Worked through several firewall fabric issues. The Galaxy VM still had the old host-specific default route defined, which misrouted return traffic and made inbound connections appear blocked. The fix was simply to correct the default route.
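
A minimal sketch of the route fix, assuming the iproute2 tooling on the VM; the gateway address and interface name are placeholders, not the actual rcsfw values:

{{{#!python
# Sketch: replace a stale host-specific default route with the correct
# gateway so return traffic for inbound connections is routed properly.
import subprocess

GATEWAY = "192.0.2.1"  # hypothetical rcsfw gateway address
INTERFACE = "eth0"     # hypothetical interface name

# "ip route replace" installs the default route, overwriting any
# existing default entry in one step.
subprocess.check_call(
    ["ip", "route", "replace", "default", "via", GATEWAY, "dev", INTERFACE]
)
}}}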

The break in the NAT rules for the OpenStack network occurred because the rcs-srv-02 role change was applied to an older version of the firewall rules that lacked the OpenStack NAT rules. The role change was reapplied on top of the NAT rule changes from June 4, and the OpenStackPlusCeph fabric is now back in operation as before.
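
For reference, a sketch of the one-to-one NAT pattern the restored rules implement: DNAT inbound traffic from a public address to the floating IP and SNAT the return path. The addresses are placeholders from documentation ranges, not the live rcs-srv-02 rules:

{{{#!python
# Sketch: apply a one-to-one public <-> floating NAT pair with iptables.
import subprocess

PUBLIC_IP = "198.51.100.10"     # hypothetical UAB public address
FLOATING_IP = "192.168.100.10"  # hypothetical OpenStack floating IP

rules = [
    # Inbound: rewrite the public destination to the floating IP.
    ["iptables", "-t", "nat", "-A", "PREROUTING",
     "-d", PUBLIC_IP, "-j", "DNAT", "--to-destination", FLOATING_IP],
    # Outbound: rewrite the floating source back to the public IP.
    ["iptables", "-t", "nat", "-A", "POSTROUTING",
     "-s", FLOATING_IP, "-j", "SNAT", "--to-source", PUBLIC_IP],
]
for rule in rules:
    subprocess.check_call(rule)
}}}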

Additionally, there have not been any more interface errors on rcs-srv-01 since the interface was replaced last week. This is a good sign that we may have finally addressed the long-standing and puzzling intermittent access errors on the rcsfw fabric.