Monitoring CLARIN infra using Icinga

1. Introduction

The CLARIN infrastructure of networked services and applications is monitored constantly. The infrastructure can be divided into centre-managed and centrally managed services.

The automatic monitoring of services meets two goals:

  1. Unavailability or invalidity of endpoints/frontends is detected, communicated and corrected as soon as possible.
  2. Statistics/impressions of overall availability/quality-of-service can be gathered.

Our Icinga monitoring system does this by periodically launching checks against all of our endpoints and hosts. Icinga's configuration is managed through a Git repo on GitHub. Checks are parametrized probes. A probe is a small executable command: either a basic built-in command (e.g., one that checks whether an HTTP endpoint is up) or a monitoring plugin that can be executed as a command-line tool.
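To illustrate what a parametrized probe can look like, here is a minimal sketch of an HTTP availability probe in Python. It is not the plugin deployed on the monitoring host. The function names, default timeout and the treatment of redirects are assumptions made for illustration; only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual Nagios/Icinga plugin convention.

```python
import urllib.error
import urllib.request

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_http_status(status, expected=200):
    """Map an HTTP status code to a plugin state and a one-line message."""
    if status == expected:
        return OK, f"HTTP OK - status {status}"
    if 300 <= status < 400:
        # Assumption: an unexpected redirect is worth a warning, not an alert.
        return WARNING, f"HTTP WARNING - unexpected redirect, status {status}"
    return CRITICAL, f"HTTP CRITICAL - status {status}"


def check_http(url, timeout=10.0, expected=200):
    """Probe one URL; the arguments are the parameters of the check."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify_http_status(response.status, expected)
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status (4xx/5xx).
        return classify_http_status(error.code, expected)
    except OSError as error:
        # Covers URLError, DNS failures, refused connections and timeouts.
        return CRITICAL, f"HTTP CRITICAL - {error}"
    except ValueError as error:
        # urlopen rejects strings that are not URLs before any network access.
        return UNKNOWN, f"HTTP UNKNOWN - {error}"
```

A wrapper script would print the message and exit with the returned state, so that one probe can serve many checks that differ only in their parameters (URL, timeout, expected status).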

1.1. Roles

The system is primarily administered by the ASV. CLARIN sysops maintain an advisory, secondary role on the technical level. The ASV system administrator(s) and the CLARIN sysops (Sander, Dieter, Willem) have full administrative access over SSH to the monitoring host for emergency maintenance. This includes being able to schedule downtime in Icinga.

For internal discussion, Dieter (CLARIN), Thomas Eckart (ASV), Nathanael Philipp (ASV) and Sander (MPI-PL) are reachable via monitoring@clarin.eu.

The public link to the monitoring is https://monitoring.clarin.eu/. If you have justified reasons for accessing the monitoring, your CLARIN IdP account can be added to the configuration.

Some more details.

2. Activities

2.1. Tickets

Ticket | Priority | Summary | Owner | Created | Modified
#905 | minor | Make OAI-PMH check less dependent on availability of externally hosted XSDs. | Dirk Goldhahn | 8 years ago | 7 years ago
#992 | major | Implement working checks for aai1.clarin.eu and aai2.clarin.eu | alexander.richter@uni-leipzig.de | 7 years ago | 7 years ago
#993 | major | Investigate and fix error for discovery.clarin.eu | alexander.richter@uni-leipzig.de | 7 years ago | 7 years ago
#982 | minor | False alert for resolution of catalog.clarin.eu | Dirk Goldhahn | 8 years ago | 8 years ago
#930 | major | Force endpoint XSD validation to only use cached XSDs | Dirk Goldhahn | 8 years ago | 8 years ago
#952 | major | www.clarin.eu not reachable via ICMP | Sander Maijers | 8 years ago | 8 years ago

2.2. Meetings

Meetings/2016-02-18

Meetings/2016-04-04

3. Details

A detailed technical description of the workflow from GitHub to the monitoring host is given in the README.

3.1. Monitored types of systems

For each centre, the following systems should be monitored:

  • web services and applications
    • remote API
    • user interface
  • repositories
  • Handle servers (in case centres have their own, like IDS?)
  • sample PID URLs
  • websites

In addition, all central services should be monitored at both the frontend and the backends, if any.

3.2. Checks

The Icinga server is set up to check availability (e.g., does an important HTTP URL work?), validity (e.g., does an endpoint return well-formed and valid XML data?) and security (e.g., does the URL work with appropriate confidentiality and authentication, such as TLS 1.2?).
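The security aspect can be sketched with Python's standard ssl module. The example below is an illustration under assumptions, not the check configured in Icinga: the function names, default port and timeout are hypothetical. It builds a client context that verifies the server certificate and host name and accepts TLS 1.2 only, then attempts a handshake.

```python
import socket
import ssl


def tls12_only_context():
    """Build a verifying client context that speaks TLS 1.2 and nothing else."""
    context = ssl.create_default_context()  # certificate and host name checks on
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    context.maximum_version = ssl.TLSVersion.TLSv1_2
    return context


def check_tls12(host, port=443, timeout=10.0):
    """Return (ok, detail): whether a verified TLS 1.2 handshake succeeds."""
    context = tls12_only_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                # tls.version() reports the negotiated protocol, e.g. "TLSv1.2".
                return True, tls.version()
    except OSError as error:
        # ssl.SSLError is a subclass of OSError, so handshake and
        # certificate failures are reported here as well.
        return False, str(error)
```

Pinning both the minimum and the maximum version means that a server offering only older protocol versions fails the handshake instead of silently negotiating something weaker.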

Validity checks are in place to detect a service that is up but returns undesired data. The response to the Identify verb of an OAI-PMH endpoint is validated against a set of XSD schemas. FCS endpoints are currently not validated because of problems with the approach, although some SRU XSD schemas are available. In addition, services that use TLS are checked only with TLS v1.2. Exceptions are hard-coded and should be minimized.
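As a simplified stand-in for the XSD validation described above, the following sketch checks that an Identify response is well-formed and contains the child elements that OAI-PMH 2.0 requires. It uses only the Python standard library, which cannot validate against XSD schemas; a schema-aware XML library would be needed for that. The function name and messages are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Children of <Identify> that OAI-PMH 2.0 requires at least once.
REQUIRED = ("repositoryName", "baseURL", "protocolVersion", "adminEmail",
            "earliestDatestamp", "deletedRecord", "granularity")


def check_identify(xml_text):
    """Return a list of problems found in an Identify response (empty = passed)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as error:
        return [f"not well-formed: {error}"]
    if root.tag != f"{{{OAI_NS}}}OAI-PMH":
        return [f"unexpected root element: {root.tag}"]
    identify = root.find(f"{{{OAI_NS}}}Identify")
    if identify is None:
        return ["no Identify element (possibly an error response)"]
    # Structural check only; this does not replace XSD validation.
    return [f"missing required element: {name}"
            for name in REQUIRED
            if identify.find(f"{{{OAI_NS}}}{name}") is None]
```

A probe built on this would fetch the endpoint URL with `?verb=Identify` appended and report CRITICAL when the returned list is not empty.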

3.3. Users

Centre technical contacts, as well as the CLARIN sysops, can view the status of centre services. Each centre's technical contact and possibly additional monitoring contacts, as declared in the Centre Registry, will get an e-mail notification of important status changes (e.g., sudden unavailability). Furthermore, each of these contacts can view a detailed Icinga status for their services. In addition, NagVis geographical overviews are available publicly for Germany and the world.

3.4. Access

Authentication is based on SAML. A custom SP interoperates with the CLARIN IdP for this. The eduPersonPrincipalName attribute is used for authorization. The value of this attribute is matched against the value recorded in the contact persons table of the Centre Registry. The ASV administrators and CLARIN sysops have special access to views on all services. Centre technical contacts have access to views on their centre's services.
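The authorization rule can be sketched as a lookup. This is an illustration only: the `Contact` record, the function name and the way administrators are listed are assumptions, not the data model of the Centre Registry or the code of the SP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    """Hypothetical row of the contact persons table."""
    eppn: str    # eduPersonPrincipalName recorded for the contact person
    centre: str  # centre whose services this contact may view


def visible_centres(eppn, contacts, admins, all_centres):
    """Return the set of centres whose service views `eppn` may see."""
    if eppn in admins:
        # ASV administrators and CLARIN sysops see views on all services.
        return set(all_centres)
    # Everyone else sees only the centres they are registered for.
    return {contact.centre for contact in contacts if contact.eppn == eppn}
```

A user whose eduPersonPrincipalName matches no contact and no administrator gets an empty set, i.e. no access to any centre view.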

3.5. Dependencies, system requirements

Software | Function
NSCA | Monitoring information from MPCDF. Requires Icinga UNIX socket. Important for CLARIN-D because of the distributed infrastructure.
IDO2DB | Writes Icinga information to MySQL.
pnp4nagios | No service, but needs access to Icinga outputs to create graphs.
npcd | Processes Icinga performance data for pnp4nagios.
shibd | Shibboleth SP authentication for web interface.
PyNag | Python library to process/manipulate Nagios/Icinga 1.x configuration.
NagVis | Visualizes service status geographically.
NRPE package | For checks of host-local properties, such as disk space, memory usage, etc.
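To show how a host-local check and performance data fit together, here is a minimal sketch of a disk usage probe of the kind that could be run through NRPE. It is not one of the plugins installed on the monitoring host; the function names and thresholds are assumptions. The part of the output after the `|` follows the common Nagios plugin performance data format (`label=value[unit];warn;crit;min;max`), which graphing tools can process.

```python
import shutil

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def disk_state(used_pct, warn=80.0, crit=90.0):
    """Map a disk usage percentage to a plugin state and an output line."""
    if used_pct >= crit:
        state, label = CRITICAL, "CRITICAL"
    elif used_pct >= warn:
        state, label = WARNING, "WARNING"
    else:
        state, label = OK, "OK"
    # Human-readable text, then "|", then the performance data.
    line = (f"DISK {label} - {used_pct:.1f}% used"
            f" | used={used_pct:.1f}%;{warn:g};{crit:g};0;100")
    return state, line


def check_disk(path="/", warn=80.0, crit=90.0):
    """Measure usage of the filesystem containing `path` and classify it."""
    usage = shutil.disk_usage(path)
    return disk_state(100.0 * usage.used / usage.total, warn, crit)
```

Keeping the classification in `disk_state` separate from the measurement in `check_disk` makes the thresholds easy to test without touching a real filesystem.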

4. History

During 2011-2015, "Sander Maijers" <sander@clarin.eu> and "Dieter Van Uytvanck" <dieter@clarin.eu> maintained a Nagios server on ems04 that covered CLARIN-D related services hosted at MPI-PL. Since 2012, "Benedikt von St. Vieth" <b.von.st.vieth@fz-juelich.de> has maintained a CLARIN-D-wide Nagios, and later Icinga 1.x, server hosted at FZJ on clarin.fz-juelich.de. During 2015, all CLARIN (i.e., ERIC) related checks were migrated from ems04 to the latter host, clarin.fz-juelich.de. The Icinga configuration on this host is controlled only indirectly by "Dieter, Sander and Willem" <sysops@clarin.eu> (CLARIN sysops), as described in the following sections. Direct system administration, e.g. of the Apache httpd and Icinga daemon state, as well as SSH access, has been exclusive to Benedikt for FZJ policy reasons.

In 2016, Benedikt set up a new host, fsd-cloud22, for CLARIN-wide Icinga-based infra monitoring. This host was configured as a replica of clarin.fz-juelich.de. "Thomas Eckart" <teckart@informatik.uni-leipzig.de> and "Dirk Goldhahn" <dgoldhahn@informatik.uni-leipzig.de> from the ASV centre then took over service responsibility from Benedikt.

Last modified on 05/28/21 13:17:58