Last modified on 01/08/14 15:40:52

DevOps Weekly Meeting | January 7, 2014

Time & Location: 9:30am-10:30am in LHL164

Attending

jpr, mhanby, rpillai

2014-01-07 Agenda

  • Comparing RCS provisioned Nagios to requirements of IdmOps team

Summary

We reviewed the monitoring requirements that rpillai provided for the IdM project, mapped them to Nagios-check_mk features, and reviewed the existing setup of the RCS Nagios platform as a point of comparison. The goal is either a common Nagios platform or, at the least, a common starting point for local requirements that we could use across a number of distinct instances.

Discussion

The IdM project has outlined a set of requirements for monitoring that rpillai shared. We used them as a launching point to review features of the Nagios platform and specifics of the Nagios implementation used in RCS for monitoring our HPC and ancillary services (HPC++). Nagios (not the check_mk upstream) is also delivered with the Crowbar provisioned fabric for OpenStackPlusCeph. There is also a strong potential for offering monitoring-as-a-service for tenants of our OpenStack cloud platform, e.g. the UAB Galaxy project has expressed an interest in leveraging Nagios for their services.

All of these use-cases could benefit either from a common instance, assuming Nagios can offer autonomy between user groups, or from specific instances of Nagios derived for different consumers, possibly provisioned from a registered image in OpenStack.

In our discussion, Nagios refers to the Nagios check_mk fork by Mathias Kettner unless otherwise indicated. The download for the version of Nagios we use in the RCS is http://omdistro.org/download. This is the source code for the enhancements. The OMD download bundles available via Mathias Kettner's site are supported bundles. According to mhanby, "[Mathias] provides current builds of OMD with the latest CheckMK. For the rest of us, you either have to wait for the public OMD to build (which includes a suite of Nagios extensions) or alternately update check_mk on the side using the individual packages available on Mathias' site. In other words, you can obtain everything without support. It's just not prebundled as often."

Nagios used by RCS

The Research Computing System (RCS) uses Nagios to monitor servers and services. Our main Nagios interface is https://nagios.uabgrid.uab.edu/uabgrid which monitors our traditional HPC and HPC++ services. The Crowbar provisioned Nagios (non-check_mk) for the OpenStackPlusCeph platform is accessible from within the cluster fabric.

The configuration for the RCS provisioned Nagios is maintained in the @lab repo. The service is hosted on a KVM hypervisor and evolved over time from a stock Nagios install to an OMD install. It is due for a rebuild, followed by a side-load of the existing configuration. This DevOps discussion will be a good foundation for rebuilding using the RCS provisioning fabric.

Our existing Nagios is a multisite install, so it includes more "enterprise-y" features. We're using neither LDAP-based authentication nor the LDAP account synchronization feature of multisite. The traditional htpasswd (HTTP Basic Auth) mechanism looks to be a better fit for our RCS web SSO fabric. Also, the documentation for the LDAP plugin suggests search-and-replicate type features, which may not be practical for large user groups. Integrating RCS with the IdM services looks to require the typical design trade-offs.

It's also a good opportunity to take a deeper look at the Crowbar provisioned Nagios and see if that could instead be built from the check_mk release as well.

IdM Monitoring Requirements

The requirements identified for IdM follow. The links to the check_mk configuration documentation and the notes gathered during the meeting are included in-line below.

From a quick overview of the documentation, it seems that Nagios supports these requirements, though some only in a rudimentary way. For example, the admin user can change some of the settings, but they appear to rely on text file editing. That means that opening admin rights to a larger group, or subsetting admin privileges, will require coding a delegation mechanism.

The full documentation is on-line for further mapping to requirements: http://mathias-kettner.com/checkmk.html

  • Planning Considerations:
    • Use roles to segregate responsibilities
    • For the groups created, who can do these operations?
      • Change group membership
      • Grant privileges on the group to other users
    • Who can do these operations on the targets in the group?
      • Add/delete the target
      • Define monitoring settings
      • Define notification settings
      • View/receive notifications for alerts
      • Acknowledge an alert
      • Act on target to resolve alert
      • Blackout target for planned or unplanned downtime
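
One way to work through the planning questions above is to write them down as a role-to-operation matrix before mapping them onto any particular tool. The following is only a minimal sketch under stated assumptions: the role names (group_admin, operator, viewer), the user names, the group names, and the is_allowed helper are all hypothetical and are not part of Nagios or check_mk. Only the operations are taken from the list above.

```python
from enum import Enum, auto


class Op(Enum):
    """Operations taken from the planning considerations list."""
    # Operations on the group itself
    CHANGE_MEMBERSHIP = auto()
    GRANT_PRIVILEGES = auto()
    # Operations on the targets in the group
    ADD_DELETE_TARGET = auto()
    DEFINE_MONITORING = auto()
    DEFINE_NOTIFICATIONS = auto()
    VIEW_NOTIFICATIONS = auto()
    ACKNOWLEDGE_ALERT = auto()
    RESOLVE_ALERT = auto()
    BLACKOUT_TARGET = auto()


# Hypothetical roles; the split of operations is an assumption for
# illustration, not a decision recorded in the meeting.
ROLE_OPS = {
    "group_admin": set(Op),  # every operation
    "operator": {
        Op.VIEW_NOTIFICATIONS,
        Op.ACKNOWLEDGE_ALERT,
        Op.RESOLVE_ALERT,
        Op.BLACKOUT_TARGET,
    },
    "viewer": {Op.VIEW_NOTIFICATIONS},
}

# Hypothetical (user, group) -> role assignments.
ASSIGNMENTS = {
    ("alice", "idm"): "group_admin",
    ("bob", "idm"): "operator",
    ("carol", "hpc"): "viewer",
}


def is_allowed(assignments, user, group, op):
    """Return True if the user holds a role in the group that permits op."""
    role = assignments.get((user, group))
    return role is not None and op in ROLE_OPS[role]


if __name__ == "__main__":
    # Print one row per assignment, listing the permitted operations.
    for (user, group), role in sorted(ASSIGNMENTS.items()):
        allowed = [op.name for op in Op if op in ROLE_OPS[role]]
        print(f"{user}@{group} ({role}): {', '.join(allowed)}")
```

Roles are scoped per group here, so a user with rights in one group has none in another. That matches the autonomy between user groups that a common instance would need.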