wiki:ResearchStorageSystem

Overview

The goal for the Research Storage System is to provide a single interface to storage services for research, much like we are working on a common interface to HPC capacity. The storage system is being designed to support significant growth and multiple back-end technologies, without exposing these details to users.

Over time, the research storage system will provide multi-protocol interfaces that allow researchers to access files from a large number of clients, including traditional file-sharing protocols (SMB/NFS), web protocols, and grid protocols. This should facilitate collaboration and enable the construction of advanced computational workflows that can harness data resources regardless of their location.

The first phase of the research storage system will make significant capacity available for project data and computing buffers. The storage system will be generally accessible through the 10GigE research network and directly accessible to Cheaha via a new InfiniBand (IB) network being added to that cluster. The IB network will allow each compute node in Cheaha to directly access the capacity and performance of the research storage system, and it will also enable larger-scale jobs that require low-latency communication.

The initial implementation of the storage system will use Lustre. Information on Lustre and documentation of the development activity can be found on the LustreIntroduction page.

Other systems will be able to access the research storage system through the 10GigE research network, but direct IB connectivity would require additional IB hardware components to be purchased for any cluster that wants this level of access. The research storage system is part of the StorageCity.

Research Storage System (RSS)

Systems Perspective

This perspective shows the services provided by the storage system, highlighting some of the underlying technologies:

[Image: Research Storage System (systems perspective)]

Logical Schematic

This perspective shows the schematic of the connectivity between the storage system and other consuming devices. Specifically, it highlights the InfiniBand connectivity to Cheaha's IB network and the direct access Cheaha's compute nodes will have to the Research Storage System.

[Image: Research Storage System (logical schematic)]

Implementation Objectives

A primary objective for the development of a shared storage pool should be to achieve economies of scale. That is, we need our storage costs to go down as the size of the storage pool increases, relative to the costs a group would incur by buying its own disks.

On-going analysis of our costs will be an important metric in determining which solutions make the most sense. We may find that coordinated orchestration of smaller-scale systems is more cost effective than single-unit, enterprise systems (along the lines of the Condo model). Or, we may find the opposite. What's important is that we can measure the effectiveness of our storage investments over time.
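
As a concrete illustration of the kind of cost metric we have in mind, the short sketch below computes cost per usable terabyte for two purchase options. All of the figures are placeholders chosen for illustration only; they are not quotes from any vendor and not part of our design.

{{{#!python
# Illustrative cost-per-usable-TB metric. Every figure below is a
# placeholder for illustration, not a quote from any vendor.

def cost_per_usable_tb(purchase_cost, raw_tb, usable_fraction):
    """Purchase cost divided by the capacity left after RAID/file system overhead."""
    return purchase_cost / (raw_tb * usable_fraction)

# Hypothetical comparison: a larger shared pool vs. a small group-owned array.
shared_pool = cost_per_usable_tb(purchase_cost=200000.0, raw_tb=400, usable_fraction=0.75)
group_array = cost_per_usable_tb(purchase_cost=15000.0, raw_tb=20, usable_fraction=0.75)

print("shared pool: $%.0f per usable TB" % shared_pool)
print("group array: $%.0f per usable TB" % group_array)
}}}

Tracking a metric like this over time, for actual purchases, is what would let us compare the shared pool against group-owned storage on equal terms.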

We have been looking at solutions from Dell, Hitachi, DataDirect Networks, and others to understand what flexibility exists in their various offerings to support the identified requirements and what options we have for integrating their offerings with our existing system environment.

Requirement 1 in the Requirements section below describes a need for parallel file systems. While parallel file systems generally improve performance for shared disk systems, they also add significantly to the cost. While we know there are some applications that would benefit from parallel file systems, we do not yet seem to have enough performance data to justify the additional cost. In light of this, it seems reasonable to focus on the shared storage pool (requirement 2) in this initial purchase. We can use the real-world performance metrics of whichever storage system is purchased first to help us build our requirements understanding for the next purchase.

Requirements

Understanding Storage Needs

We have been investigating how a shared data store could be consumed by clients and what features this storage can provide to those clients. Our primary requirements have focused on two areas:

  1. Use of storage by HPC applications to enable computation. This may require parallel file systems to ensure efficient data access for all nodes involved in a computation. Pre-staging data sets to compute nodes is one way to work around the lack of a parallel file system, but it requires coordination with the compute job and, potentially, the scheduling system (in order to assign jobs to the nodes where the data is pre-staged); a minimal sketch of this staging step follows the list. This use case is focused on building a "/shared/scratch" storage area for Cheaha.
  2. Use of storage by research groups to enable collaboration. This requirement looks at how data can be stored in a common store with transparent access from multiple resources (clusters and labs). The focus is on allowing researchers to "upload" their data once and then have it be accessible across the clusters or across their collaboration space. This requires multi-protocol access to data stores, including NAS services for nearby clients (clusters and/or research labs) and GridFTP/HTTPS access for wider data distribution. This use case has been focused on building a storage pool that can be accessed as "/data" from Cheaha or other clusters on the research network (e.g., CIS is interested in this access from Ferrum) and accessed more widely under a name like "data.uabgrid.uab.edu".
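
To make the pre-staging workaround in requirement 1 more concrete, here is a minimal sketch of a stage-in/compute/clean-up step. The shared-storage path and analysis command are hypothetical placeholders; a real job would take them from the scheduler environment rather than hard-coding them.

{{{#!python
import shutil
import subprocess
import tempfile

# Hypothetical names -- the data set path and analysis command are
# placeholders; a real job would take them from the scheduler environment.
SHARED_DATASET = "/shared/scratch/projects/example/input_set"
ANALYSIS_CMD = ["./run_analysis"]

def run_with_prestaging():
    # Stage the input data from shared storage to node-local scratch,
    # run the computation against the local copy, then clean up.
    local_scratch = tempfile.mkdtemp(prefix="prestage-")
    local_copy = shutil.copytree(SHARED_DATASET, local_scratch + "/input_set")
    try:
        subprocess.run(ANALYSIS_CMD + [local_copy], check=True)
    finally:
        shutil.rmtree(local_scratch)

if __name__ == "__main__":
    run_with_prestaging()
}}}

In practice this logic would live in the job script, so the copy happens on whichever node the scheduler assigns; the coordination burden described above is exactly what a parallel file system would remove.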

From the infrastructure provider perspective, we have identified a number of requirements for supporting the full spectrum of research IT needs, especially as they relate to storage. These requirements have emerged from our UABgrid development efforts. One of our final goals for the UABgrid Pilot is migrating the @lab to UABgrid. The @lab is the development group that has produced much of the UABgrid infrastructure. Moving the @lab to UABgrid involves moving the resources we leverage in our development efforts onto the infrastructure provided by UABgrid. These tools include mailing lists, wikis, Trac, and the virtual machines (VMs) on which the UABgrid services are built. It is the "shared file system" that will ultimately store this data.

Throughout the development of UABgrid, the @lab has served as our model for how other research groups should be able to leverage infrastructure available via UABgrid. We have already migrated portions of the @lab to the existing UABgrid Pilot infrastructure. Completing this migration relies heavily on how we implement our storage planning.

Conducting research and running a lab is about more than just running compute jobs on HPC equipment. It includes communication and planning tools like mailing lists, wikis, code repositories, and other web and non-web applications. It should be possible for research groups to instantiate their resources on demand. We should provide an infrastructure that enables the technology professionals within these research groups to customize our services to address the specific needs of their communities. UABgrid is about providing resources that these groups can control and shape to meet local requirements. Clearly the "cloud" concept has captured this pent-up user demand. We have kept our eyes on the OpenNebula project (from the same folks who develop GridWay, our grid meta-scheduling solution), and we are interested in further exploring its features. (Note that this is one of the motivations for keeping the older Cheaha compute nodes available. There is not a big difference between a cluster that provides compute cycles and a cluster that runs VMs.)

The comments on the @lab and OpenNebula are intended to highlight the path we have been following and to share the requirements that we see as drivers behind our infrastructure development. They are not intended to expand the complexity of our current storage project, but these requirements clearly fall under requirement 2 above: a shared storage pool that hosts the data objects of research groups.

Immediate Demands

The two aspects of the demand for storage from SSG are getting larger data sets online so they are available to the clusters and enabling computation on those data sets:

  1. There are new gene sampling methods coming online which generate much larger data sets. These data sets need to be managed for further analysis and (I suspect) for archival purposes. The data sets are estimated at 100GB, with the expectation that they will grow larger as the data sampling processes improve.
  2. There are new data processing methods (e.g., BirdSuite) coming online which accept these data sets as inputs. These processes consume a significant amount of additional storage during the computation. Increased scratch (temporary) storage space is needed to expand the input data sets. The current estimate is that a 100GB input file will consume 500GB during computation and generate a result data set that is somewhat smaller than the input set. In other words, the scratch space needed is about 0.5TB per computation, scaling linearly with the number of simultaneous computations; a rough capacity estimate follows the list.
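
The rough estimate below simply works through the numbers above: a 100GB input expanding to roughly 500GB in flight, multiplied by the number of simultaneous computations. The job count is an assumption chosen only for illustration.

{{{#!python
# Scratch-space estimate from the SSG figures above (100GB input, roughly
# 5x expansion during computation). The number of simultaneous jobs is an
# assumption chosen only for illustration; demand scales linearly with it.
input_size_gb = 100       # per-data-set input size
expansion_factor = 5      # 100GB input -> ~500GB consumed during computation
simultaneous_jobs = 4     # assumed job count

scratch_needed_tb = input_size_gb * expansion_factor * simultaneous_jobs / 1000.0
print("Estimated scratch space: %.1f TB" % scratch_needed_tb)  # 2.0 TB for 4 jobs
}}}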

From what SSG has shared, they are not sure how dense their computations will be (i.e., how many simultaneous computations will occur) or how intensive their I/O demands will be (i.e., whether the computation is ultimately bound by how fast it can read and write data).

Depending on what SSG discovers during their exploration of BirdSuite, our storage solution may need to address I/O demands in the near future.

Background Information

This material is for reference and was developed during our requirements gathering phase.

System Outline

This is a generic schematic that uses SAN and NAS in the broadest sense: a storage area network (SAN) is a network dedicated to accessing raw storage at the block level, i.e., the storage is presented to the client as a raw block device, while network-attached storage (NAS) presents storage to the client as a logical collection of files.
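
As a small illustration of this block-versus-file distinction, the sketch below classifies a path by how the storage is presented to the client. The device and mount point names are hypothetical examples, not part of our design.

{{{#!python
import os
import stat

def classify_presentation(path):
    """Report whether storage at 'path' is presented as a raw block
    device (SAN-style) or as a directory tree of files (NAS-style)."""
    mode = os.stat(path).st_mode
    if stat.S_ISBLK(mode):
        return "raw block device (SAN-style presentation)"
    if stat.S_ISDIR(mode):
        return "logical collection of files (NAS-style presentation)"
    return "other"

# Hypothetical examples -- device and mount point names are illustrative only.
print(classify_presentation("/dev/sda"))  # a block device, as a SAN would export
print(classify_presentation("/data"))     # a directory on an NFS/SMB mount
}}}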

We have developed a design for our initial investment in the shared storage pool that seeks to balance flexibility and performance. The following diagram is a pictorial representation of this solution. It leaves open some questions about where this unit should be attached, in order to give an idea of how our systems can interface with the storage. (In other words, this is not a strict network schematic.)

[Image: Storage Design]

Implementation Sketch

The sketch below shows how a SAN located in RUST can be connected to a NAS node in BEC, which can provide direct access to the storage for research groups as well as for clusters that want these data files within their file namespaces. As in the diagram above, the NAS will likely be multi-homed in some fashion (directly or via a switch) to facilitate this connectivity.

[Image: Research SAN Sketch]

References

  • NCSA Filesystems - describes the file systems available on the NCSA clusters and their recommended uses. The list includes NFS, GPFS (General Parallel File System from IBM), Lustre, and PVFS.
  • GPFS (General Parallel File System) - Originally for IBM AIX but ported to Linux
  • Lustre - open source parallel file system, with commercial support from Sun.
  • PVFS - parallel virtual file system developed at Argonne and Clemson.
  • NFS - network file system, the standard data sharing mechanism across clusters. Standardized by IETF RFCs and continuing to see active development, e.g., NFSv4, which anticipates support for parallel access (pNFS).