wiki:SystemAdministration/Monitoring/Icinga

Version 7 (modified by Dirk Goldhahn, 8 years ago)

--

Monitoring CLARIN infra using Icinga

1. Planning

The extensive CLARIN technical infrastructure, centrally as well as at centres, should be monitored constantly. This serves two goals:

  1. Unavailability is detected, communicated, and corrected as soon as possible.
  2. Automatic monitoring allows us to gather statistics/impressions of overall availability/quality-of-service.

1.1. History and current status

During 2011-2015, "Sander Maijers" <sander@clarin.eu> and "Dieter Van Uytvanck" <dieter@clarin.eu> maintained a Nagios server on ems04 that covered CLARIN-D related services hosted at MPI-PL. Since 2012, "Benedikt von St. Vith" <b.von.st.vieth@fz-juelich.de> has maintained a CLARIN-D-wide Nagios (later Icinga 1.x) server at FZJ, on clarin.fz-juelich.de. During 2015, all CLARIN (i.e., ERIC) related checks were migrated from ems04 to clarin.fz-juelich.de. The Icinga configuration on this host is controlled only indirectly by "Dieter, Sander and Willem" <sysops@clarin.eu> (CLARIN sysops), as described in the following sections. Direct system administration (e.g., of Apache httpd and Icinga daemon state), as well as SSH access, have been exclusive to Benedikt for FZJ policy reasons.

1.2. Road ahead

In 2016, Benedikt set up a new host for CLARIN-wide Icinga-based infrastructure monitoring: fsd-cloud22. This host has been configured as a replica of clarin.fz-juelich.de.

We now aim to realize the following changes:

  1. "Thomas Eckart" <teckart@informatik.uni-leipzig.de>, "Dirk Goldhahn" <dgoldhahn@informatik.uni-leipzig.de> from the ASV centre should take over service responsibility from Benedikt. The service will be administered technically by "Thomas Hynek" <hynek@informatik.uni-leipzig.de>. CLARIN sysops will maintain an advisory, secondary role on the technical level.
  2. The CLARIN sysops should have full administrative access over SSH to the monitoring host fsd-cloud22 for emergency maintenance. This includes being able to schedule downtime in Icinga.
  3. Spurious unavailability notifications/statuses should no longer occur, and cases where real unavailability currently goes unreported should be fixed by adding the missing notifications.
  4. All hosts should allow pings (ICMP traffic). ICMP is a critical part of the Internet infrastructure. Filtering it yields no benefit, but can make it impossible to check whether a host is down or just a service running on it.
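The ping objective above is typically expressed with the check_ping plugin from the standard Nagios plugins. A minimal sketch of an Icinga 1.x service definition follows; the host name and the warning/critical thresholds (round-trip time in ms, packet loss in %) are placeholders, not the actual CLARIN configuration:

```
# Hypothetical Icinga 1.x object definition for an ICMP reachability check.
define service {
    use                  generic-service        ; assumed service template
    host_name            example-centre-host    ; placeholder host
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
}
```

If a centre filters ICMP, this check reports the host as down even when its services are fine, which is exactly the ambiguity the objective aims to remove.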

Subsidiary objectives are:

  1. Hosting the Icinga server in a Docker container, for easier testing and maintenance.
  2. A migration from Icinga 1.x to 2.x.
  3. Reduction of technical complexity.

1.2.1. Activities

Meetings

2. Current state

2.1. Functional information

2.1.1. Monitored types of systems

For each centre, the following systems should be monitored:

  • web services and applications
    • remote API
    • user interface
  • repositories
  • Handle servers (in case centres run their own, like IDS)
  • sample PID URLs
  • websites

2.1.2. User groups

Centre technical contacts, as well as the CLARIN sysops, can view the status of centre services. Each centre's technical contact and possibly additional monitoring contacts, as declared in the Centre Registry, will get an e-mail notification of important status changes (e.g., sudden unavailability). Furthermore, each of these contacts can view a detailed Icinga status for their services. In addition, NagVis geographical overviews are available publicly for Germany and the world.

2.1.3. References

  1. Icinga
  2. NagVis geomap for CLARIN ERIC
  3. NagVis geomap for CLARIN-D
  4. A status map of Germany

2.1.4. Validity and availability checks

The Icinga server is set up to check availability (e.g., does an important HTTP URL work?), validity (e.g., does an endpoint return well-formed and valid XML data?) and security (e.g., does the URL work with appropriate confidentiality and authentication, such as over TLS 1.2?).
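A combined availability-plus-validity check can be sketched as a small plugin-style script that follows the Nagios/Icinga exit-code convention. This is an illustrative sketch, not the actual check deployed on the server; it only tests XML well-formedness (full XSD validation would need an extra library such as lxml):

```python
import sys
import xml.etree.ElementTree as ET
from urllib.request import urlopen
from urllib.error import URLError

# Nagios/Icinga plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def classify_response(body: bytes) -> int:
    """Validity part: OK if the body is well-formed XML,
    CRITICAL if the service is up but returns broken data."""
    try:
        ET.fromstring(body)
        return OK
    except ET.ParseError:
        return CRITICAL

def check_endpoint(url: str, timeout: float = 10.0) -> int:
    """Availability part: fetch the URL, then classify the body."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return CRITICAL
            return classify_response(resp.read())
    except URLError:
        return CRITICAL  # host or service unreachable

if __name__ == "__main__":
    sys.exit(check_endpoint(sys.argv[1]))
```

Returning the state as the process exit code is what lets Icinga consume the script directly as a check command.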

2.2. Technical information

Authentication is based on SAML. A custom SP interoperates with the CLARIN IdP for this. The eduPersonPrincipalName attribute is used for authorization. The value of this attribute is matched against the value recorded in the contact persons table of the Centre Registry. The CLARIN sysops have special access to views on all services. Centre technical contacts have access to views on their centre's services.
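The authorization rule described above (match eduPersonPrincipalName against the Centre Registry contacts, with sysops seeing everything) can be sketched as follows. The contact table and ePPN values are hypothetical; in the real setup the attribute arrives via the Shibboleth SP environment and the table comes from the Centre Registry, not a literal dict:

```python
# Hypothetical snapshot of the Centre Registry contact persons table,
# keyed by eduPersonPrincipalName.
CONTACTS = {
    "jdoe@example-idp.org": {"centre": "EXAMPLE-CENTRE", "sysops": False},
    "admin@clarin.eu":      {"centre": None,             "sysops": True},
}

def visible_centres(eppn: str, all_centres: list) -> set:
    """Return the set of centres whose service views this principal may see."""
    entry = CONTACTS.get(eppn)
    if entry is None:
        return set()                 # unknown principal: no access
    if entry["sysops"]:
        return set(all_centres)      # CLARIN sysops see every centre
    return {entry["centre"]}         # technical contacts see their own centre
```
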

2.2.1. Contacts

Technical specialist on current Icinga setup
"Benedikt von St. Vith" <CLARIN-support@fz-juelich.de>

CLARIN technical contact person, some expertise on current Icinga setup
"Sander Maijers" <sander@clarin.eu>

2.2.2. References

  1. The Icinga configuration is managed through a Git repo on GitHub.
  2. Icinga configuration GitHub-host sync logs
  3. A barebones Icinga 1.x container that runs an Icinga daemon with the CLARIN ERIC and CLARIN-D configuration is available on GitHub.

2.2.3. Implementation details of validity and availability checks

To check that a service which is up is not merely returning undesired data, some validity checks are in place. The response to calling the Identify verb on an OAI-PMH endpoint is validated against a set of XSD schemas. FCS endpoints are currently not validated, because of problems with the approach, even though some SRU XSD schemas are available. In addition, services that use TLS are checked using only the current version of TLS, v1.2. Exceptions are hard-coded and should be minimized.
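The TLS-1.2-only policy can be sketched with Python's ssl module by pinning both ends of the allowed version range before the handshake. This is an illustrative sketch, not the plugin actually used; host and port in check_tls12 are supplied by the caller:

```python
import socket
import ssl

def tls12_only_context() -> ssl.SSLContext:
    """Client context whose handshake permits only TLS 1.2."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.maximum_version = ssl.TLSVersion.TLSv1_2
    return ctx

def check_tls12(host: str, port: int = 443, timeout: float = 10.0) -> bool:
    """True when the server completes a handshake restricted to TLS 1.2."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with tls12_only_context().wrap_socket(sock, server_hostname=host) as tls:
                return tls.version() == "TLSv1.2"
    except (OSError, ssl.SSLError):
        return False
```

A server that only speaks older TLS versions fails this handshake outright, which is how the check surfaces outdated configurations.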

2.2.4. Dependencies, environmental requirements

software      function
NSCA          Receives monitoring information from MPCDF. Requires the Icinga UNIX socket. Important for CLARIN-D because of its distributed infrastructure.
IDO2DB        Writes Icinga information to MySQL.
pnp4nagios    No service of its own, but needs access to Icinga outputs to create graphs.
npcd          Processes Icinga performance data for pnp4nagios.
shibd         Shibboleth SP authentication for the web interface.
PyNag         Python library to process/manipulate Nagios/Icinga 1.x configuration.
NagVis        Visualizes service status geographically.
NRPE package  For checks of host-local properties, such as disk space and memory usage.
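As an illustration of how NRPE ties into the configuration, the fragment below sketches an Icinga 1.x command plus a service that invokes a remote host-local disk check. All names and paths are placeholders, not the actual CLARIN objects:

```
# Hypothetical Icinga 1.x objects wiring an NRPE disk-space check.
define command {
    command_name  check_nrpe_disk
    command_line  $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_disk
}

define service {
    use                  generic-service        ; assumed service template
    host_name            example-centre-host    ; placeholder host
    service_description  Disk space
    check_command        check_nrpe_disk
}
```

The named remote command (check_disk here) must also be declared in the NRPE daemon's configuration on the monitored host itself.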