Overview and Background

The research computing system (RCS) is built on a collection of distinct hardware systems designed to provide specific services to applications. The RCS hardware includes dedicated compute fabrics that support high performance computing (HPC) applications where hundreds of compute cores can work together on a single application. These clusters of commodity compute hardware make it possible to do data analysis and modelling work in hours that would have taken months on a single computer. The clusters are connected by dedicated high-bandwidth, low-latency networks so that applications can efficiently coordinate their actions across many computers and access a shared high-speed storage system for working with terabytes of data.

Our newest hardware fabric, acquired in 2012Q4, is designed to support emerging data-intensive scientific computing and virtualization paradigms. This hardware is very similar to the commodity computers used by our traditional HPC fabrics; however, in addition to having many compute cores and lots of RAM, each individual computer contains 36TB of built-in disk storage. Taken together, this newest hardware fabric adds 192 cores, 1TB of RAM, and 420TB of storage to the RCS.

The built-in disk storage is designed to support applications running locally on each computer. The data-intensive computing paradigm replaces the external storage networks of traditional HPC clusters with the native, very high speed system buses that provide access to the local hard disks in each computer. Large datasets are distributed across these computers, and applications are then assigned to run on the specific computer that stores the portion of the dataset they are meant to analyze. The hardware requirements for data-intensive computing closely resemble the requirements for virtualization and can benefit tremendously from the configuration flexibility that a virtualization fabric offers.

In order to enhance flexibility and further improve support for scaling research applications, we are engineering our latest hardware cluster to act as a virtualized storage and compute fabric. This enables support for a wide variety of storage and compute use cases: most prominently, ample storage capacity for reliably housing large research data collections, and flexible application development and deployment capabilities that give users direct control over all aspects of the application environment.

In short, we are tooling this hardware to build a cloud computing environment.

We are building this cloud using OpenStack for compute virtualization and Ceph for storage virtualization. Crowbar will provision the raw hardware fabric. This approach is very similar to the model we have been following with our traditional ROCKS-based HPC cluster environment. The new approach enhances our ability to automatically provision hardware and further improves the economics of large-scale computing.

We are implementing this environment with Dell and Inktank. These vendors, and the upstream open source projects on which this platform is built, embrace the DevOps model for systems development. This will support further engineering collaboration with our vendors, enabling the UAB research community to continually enhance our fabric as needed and feed those enhancements upstream for inclusion in future support releases.

This solution rounds out the feature set of the RCS core and will provide a general framework to scale future growth.

Getting Started

Please review these resources to become familiar with Ceph, OpenStack, and Crowbar.


Online documentation for Ceph and OpenStack is available. Be aware that our pilot currently uses the Essex OpenStack release and the XXX Ceph release. These older releases may not have all the features of the latest releases; however, the documentation for the more recent releases is sometimes better (this is especially true for OpenStack), so it is worth reading the current release documentation first to better understand the operation and vision, and then returning to the older documentation for specific steps.

System Sketch

This sketch outlines the VLAN configuration for OpenStack and Ceph. The Nova Fixed VLAN allows isolation of the VMs using OpenStack's default "VLAN networking mode".

Schematic of cloud cluster network with notation

The VLAN configuration is based on the Dell OpenStack reference architecture (high-level summary of components in the July 12, 2012 announcement).
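
For reference, a VLAN-backed fixed network in Essex-era Nova is defined with the nova-manage tool. The sketch below is illustrative only: the label, address range, network sizes, and VLAN ID are hypothetical placeholders, Crowbar drives our actual provisioning, and the exact flags should be checked against the Essex documentation.

{{{
# Illustrative only: define a fixed network for VLAN networking mode (Essex-era nova-manage).
# The label, range, sizes, and VLAN ID below are hypothetical placeholders, not our settings.
nova-manage network create \
  --label=nova_fixed \
  --fixed_range_v4=10.0.0.0/16 \
  --num_networks=8 \
  --network_size=256 \
  --vlan=500
}}}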

IP Ranges

Proposed IP ranges in the public space will be based on the /27 netmask, so we will have "distinct" networks (really IP address groups, since we aren't actually routing). This creates an IP grouping mask of the 3 high bits in the last octet and leaves the lower 5 bits for host numbers. The groups are of the form .32/27, .64/27, .96/27, .128/27, .160/27, .192/27, .224/27. These will be chunks of addresses we can assign down to the OpenStack and Ceph public networks.
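
As a quick sanity check of this numbering, the sketch below simply enumerates the /27 groups and their usable host ranges; the 192.0.2 prefix is a documentation placeholder, not our actual public range.

{{{
# Enumerate the /27 groups within a /24 (192.0.2 is a placeholder prefix, not our range).
# Each /27 holds 32 addresses: network .N, usable hosts .N+1 through .N+30, broadcast .N+31.
PREFIX=192.0.2
for start in 32 64 96 128 160 192 224; do
  echo "${PREFIX}.${start}/27  hosts ${PREFIX}.$((start + 1)) - ${PREFIX}.$((start + 30))"
done
}}}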

Working with OpenStack and Ceph

Accessing the Pilot Platform

Currently, the pilot platform is only accessible from within the Research Computing System (RCS). Effective interaction requires you to set up a cluster desktop. Once you are connected to your cluster desktop, open a terminal (Applications->Accessories->Terminal) and create a tunnel for X11 traffic to the gateway node of our pilot network with ssh -X rcs-srv-02. In this SSH session, start Firefox with the firefox command. This will start Firefox on the gateway, displayed on your cluster desktop via X11 forwarding, from which you can open a connection to the OpenStack controller and log into the OpenStack environment.
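
In terminal form, the steps above look like the following (this assumes you are authorized on rcs-srv-02; the dashboard address on the controller is not listed here):

{{{
# From a terminal on your cluster desktop: open an X11-forwarded session to the pilot gateway.
ssh -X rcs-srv-02

# On the gateway: launch Firefox; it displays back on your cluster desktop via X11 forwarding
# and can reach the OpenStack controller from inside the pilot network.
firefox &
}}}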

Note that this configuration assumes you are authorized to ssh to the rcs-srv-02 gateway from within the RCS. It also assumes you have an account to log into the OpenStack controller. If you are interested in participating in this pilot and feel you qualify, please send a request to support@…. Please understand that at this time only close collaborators will be authorized.