Overview and Background

The research computing system (RCS) is built on a collection of distinct hardware systems designed to provide specific services to applications. The RCS hardware includes dedicated compute fabrics that support high performance computing (HPC) applications in which hundreds of compute cores can work together on a single application. These clusters of commodity compute hardware make it possible to complete data analysis and modelling work in hours that would have taken months on a single computer. The clusters are connected by dedicated high bandwidth, low latency networks so that applications can efficiently coordinate their actions across many computers and access a shared high speed storage system for working with terabytes of data.
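
As a rough illustration of the kind of tightly coupled application these fabrics support, the short sketch below uses the mpi4py Python binding for MPI (an assumed tool choice for the example, not a prescribed one) to split a computation across many processes and combine the partial results over the cluster interconnect.

    # Illustrative sketch: many cooperating processes, one application.
    # Assumes the mpi4py package and an MPI launcher (e.g. mpirun) are available.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this process's id within the job
    size = comm.Get_size()   # total number of cooperating processes

    # Each process computes a partial result on its slice of the problem ...
    partial = sum(range(rank, 10000000, size))

    # ... and the low latency interconnect combines the partials into one answer.
    total = comm.reduce(partial, op=MPI.SUM, root=0)
    if rank == 0:
        print("combined result:", total)

A job like this would typically be launched across the fabric with something along the lines of mpirun -np 192 python example.py, with the batch scheduler placing the processes on the cluster nodes.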

Our newest hardware fabric, acquired in 2012Q4, is designed to support emerging data intensive scientific computing and virtualization paradigms. This hardware is very similar to the commodity computers used by our traditional HPC fabrics; however, in addition to having many compute cores and plenty of RAM, each individual computer contains 36TB of built-in disk storage. Taken together, this newest hardware fabric adds 192 cores, 1TB of RAM, and 420TB of storage to the RCS.

The built-in disk storage is designed to support applications running locally on each computer. The data intensive computing paradigm replaces the external storage networks of traditional HPC clusters with the native, very high speed system buses that provide access to the local hard disks in each computer. Large datasets are distributed across these computers, and each application is then assigned to run on the specific computer that stores the portion of the dataset it is meant to analyze. The hardware requirements for data intensive computing closely resemble those for virtualization, and these workloads can benefit tremendously from the configuration flexibility that a virtualization fabric offers.
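
The scheduling idea behind data intensive computing can be sketched in a few lines of Python. The node and block names below are purely hypothetical; production frameworks implement the same placement logic, at much larger scale, on top of their distributed file systems.

    # Hypothetical sketch of data-locality scheduling: run each analysis task
    # on the node that already stores the block of the dataset it needs.
    block_locations = {
        "dataset.part-00": "node01",
        "dataset.part-01": "node02",
        "dataset.part-02": "node03",
    }

    def place_tasks(blocks):
        """Return (node, block) pairs that keep the work next to its data."""
        return [(block_locations[block], block) for block in blocks]

    for node, block in place_tasks(block_locations):
        # In a real fabric this would dispatch the analysis code to `node`,
        # which then reads `block` from its local disks over the system bus.
        print("run analysis of %s on %s" % (block, node))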

In order to enhance flexibility and further improve support for scaling research applications, we are engineering our latest hardware cluster to act as a virtualized storage and compute fabric. This enables support for a wide variety of storage and compute use cases: most prominently, ample storage capacity for reliably housing large research data collections, and flexible application development and deployment capabilities that give users direct control over all aspects of the application environment.

In short, we are tooling this hardware to build a cloud computing environment.

We are building this cloud using OpenStack for compute virtualization and Ceph for storage virtualization. Crowbar will provision the raw hardware fabric. This approach is very similar to the model we have been following with our traditional ROCKS-based HPC cluster environment, but it enhances our ability to automatically provision hardware and further improves the economics of large scale computing.
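
To give a flavor of what the Ceph storage layer looks like from an application's point of view, the sketch below writes and reads back a single object through the librados Python binding. The pool name "data" and the object name are assumptions made for the example; the configuration file path is the standard Ceph client default.

    # Minimal sketch of talking to the Ceph object store from Python.
    # Assumes the python-rados binding is installed and the client keyring
    # has access to a pool named "data" (an assumption for this example).
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("data")      # open the target pool
        ioctx.write_full("hello-object", b"hello from the RCS cloud fabric")
        print(ioctx.read("hello-object"))       # read the object back
        ioctx.close()
    finally:
        cluster.shutdown()

OpenStack's image and volume services can use the same Ceph cluster as a backend, which is part of what makes the combined compute and storage fabric attractive.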

We are implementing this environment with Dell and Inktank. These vendors, and the upstream open source projects on which this platform is built, embrace the DevOps model for systems development. This will support further engineering collaboration with our vendors, enabling the UAB research community to continually enhance our fabric as needed and feed those enhancements upstream for inclusion in future support releases.

This solution rounds out the feature set of the RCS core and will provide a general framework to scale future growth.

Getting Started

Please review these resources to get familiar with Ceph, OpenStack, and Crowbar.

System Sketch

This sketch outlines the VLAN configuration for OpenStack and Ceph. The Nova Fixed VLAN allows isolation for the VMs using OpenStack's default "VLAN networking mode".

[Image: Schematic of cloud cluster network with notation]

The VLAN configuration is based on the Dell OpenStack reference architecture (high-level summary of components in the July 12, 2012 announcement).