wiki:SystemAdministration/Monitoring/Icinga/Outdated

1. CLARIN ERIC & CLARIN-D

We use Icinga, hosted and managed by Forschungszentrum Jülich. Contact: Benedikt von St. Vith <CLARIN-support@fz-juelich.de>

Users can view the status of a number of centre services on NagVis. Each centre's technical contact, and possibly additional monitoring contacts as declared in the Centre Registry, will get an e-mail notification of important status changes (e.g. an outage). Furthermore, each of these contacts can view a detailed Icinga status page about their services. The CLARIN ERIC sysops <sysops@clarin.eu> have access to all of Icinga, except for scheduling downtime and for actions that require shell access and permissions (e.g. restarting the Icinga daemon); the latter are handled by Benedikt von St. Vith.

1.1. Validation of responses and security

To check that a service which is up is not merely returning undesired data, some validity checks are in place. The response to calling the Identify verb on an OAI-PMH endpoint is validated against a set of XSD schemas. FCS endpoints are currently not validated because of problems with the approach, although some SRU XSD schemas are available. In addition, services that use TLS are only checked using the current version of TLS, v1.2; exceptions are hard-coded and should be kept to a minimum.
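
As an illustration of the kind of check involved, here is a minimal sketch (not the production probe) that calls the Identify verb and validates the response against a locally stored copy of the OAI-PMH XSD. The endpoint URL and schema path are placeholders; exit codes follow the usual Nagios/Icinga plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).

    #!/usr/bin/env python
    # Illustrative sketch of an OAI-PMH 'Identify' validity check; not the production probe.
    # ENDPOINT and SCHEMA_PATH are placeholders for a real endpoint and a local copy of the XSD.
    import sys

    import requests
    from lxml import etree

    ENDPOINT = "https://repository.example.org/oai"  # placeholder
    SCHEMA_PATH = "OAI-PMH.xsd"                      # local copy of the OAI-PMH schema

    def check_identify(endpoint, schema_path, timeout=30):
        response = requests.get(endpoint, params={"verb": "Identify"}, timeout=timeout)
        if response.status_code != 200:
            return 2, "CRITICAL: HTTP %d from %s" % (response.status_code, endpoint)
        try:
            doc = etree.fromstring(response.content)
        except etree.XMLSyntaxError as err:
            return 2, "CRITICAL: response is not well-formed XML: %s" % err
        schema = etree.XMLSchema(etree.parse(schema_path))
        if not schema.validate(doc):
            return 2, "CRITICAL: Identify response is not schema-valid: %s" % schema.error_log.last_error
        return 0, "OK: Identify response from %s is schema-valid" % endpoint

    if __name__ == "__main__":
        status, message = check_identify(ENDPOINT, SCHEMA_PATH)
        print(message)
        sys.exit(status)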

1.2. URLs

1.2.1. Centre requirements

The monitoring requirements of CLARIN-D are fairly modest; a simple Icinga installation would be sufficient to fulfill them. A minimal check-plugin sketch follows the list below.

  • SAML Identity Provider (IdP) status pages must be reachable from other servers.
  • Regular checks if hosts are up and reachable (ping/ICMP).
  • Regular checks if essential networked services and applications are working (e.g. HTTP).
  • The server running the monitoring software is also monitored itself.
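
To make the HTTP-related requirement concrete, here is a minimal sketch of a check plugin in the style Icinga expects. In practice the standard check_http (and check_ping) plugins cover this, so the sketch only illustrates the convention; the URL is a placeholder.

    #!/usr/bin/env python
    # Minimal sketch of an Icinga/Nagios-style HTTP service check; in practice the
    # standard check_http plugin is used. URL is a placeholder for a centre service.
    import sys

    import requests

    URL = "https://service.example.org/"  # placeholder

    def check_http(url, timeout=10):
        try:
            response = requests.get(url, timeout=timeout)
        except requests.exceptions.RequestException as err:
            return 2, "CRITICAL: %s is unreachable: %s" % (url, err)
        if response.status_code >= 500:
            return 2, "CRITICAL: HTTP %d from %s" % (response.status_code, url)
        if response.status_code >= 400:
            return 1, "WARNING: HTTP %d from %s" % (response.status_code, url)
        return 0, "OK: HTTP %d from %s in %.2f s" % (
            response.status_code, url, response.elapsed.total_seconds())

    if __name__ == "__main__":
        status, message = check_http(URL)
        print(message)
        sys.exit(status)  # 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN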

1.3. AAI

1.3.1. Requirements

As of September 2015, CLARIN-PLUS Task 2.1 covers work on improving monitoring of authentication and authorization infrastructure (CLARIN SPF & CLARIN Identity and Access Management).

(A ‘working’ endpoint is defined as one that does not return an error-signaling HTTP response code beyond 404 upon a request.)

We should monitor:

  • SAML metadata batches about SPF SPs and IdPs, for availability, security, etc. (a sketch of such a check follows this list). IMPLEMENTED.
  • Discovery Service availability. IMPLEMENTED.
  • Upstream identity federation SAML metadata batches about IdPs, for availability, security, etc. IMPLEMENTED, NEEDS IMPROVEMENT.
  • SAML metadata batches about SPF IdPs: whether they continually contain all IdPs in each identity federation.
  • The SAML Service Provider status pages of SPF SPs, for availability and correct operation, once all of them are reachable from outside.
  • Correct operation of SPs:
    • Does the login page work?
    • Do all endpoints work?
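
As an example of what the metadata-batch checks look like, the sketch below verifies that a batch is reachable and not about to expire. Assumptions: the batch URL is a placeholder, validUntil is formatted as YYYY-MM-DDTHH:MM:SSZ, and signature and completeness checks are left out.

    #!/usr/bin/env python
    # Illustrative sketch of an availability/freshness check for a SAML metadata batch.
    # METADATA_URL is a placeholder; signature and completeness checks are not included.
    import sys
    from datetime import datetime, timedelta

    import requests
    from lxml import etree

    METADATA_URL = "https://example.org/spf-sps-metadata.xml"  # placeholder
    MIN_REMAINING = timedelta(days=2)  # warn when the batch expires sooner than this

    def check_metadata_batch(url, timeout=30):
        response = requests.get(url, timeout=timeout)
        if response.status_code != 200:
            return 2, "CRITICAL: HTTP %d for %s" % (response.status_code, url)
        try:
            root = etree.fromstring(response.content)
        except etree.XMLSyntaxError as err:
            return 2, "CRITICAL: metadata is not well-formed XML: %s" % err
        valid_until = root.get("validUntil")
        if valid_until is None:
            return 1, "WARNING: no validUntil attribute on the metadata root element"
        expiry = datetime.strptime(valid_until, "%Y-%m-%dT%H:%M:%SZ")  # assumed format
        remaining = expiry - datetime.utcnow()
        if remaining < timedelta(0):
            return 2, "CRITICAL: metadata expired at %s" % valid_until
        if remaining < MIN_REMAINING:
            return 1, "WARNING: metadata expires at %s" % valid_until
        return 0, "OK: metadata valid until %s" % valid_until

    if __name__ == "__main__":
        status, message = check_metadata_batch(METADATA_URL)
        print(message)
        sys.exit(status)  # 0 = OK, 1 = WARNING, 2 = CRITICAL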

1.4. Locally at centres

1.4.1. Guidelines

  • web services and applications
    • remote API
    • user interface
  • repositories
  • Handle servers (in case centres have their own, like IDS)
  • sample PID URLs (see the resolution sketch after this list)
  • websites
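
For the "sample PID URLs" item, a minimal sketch of such a check could ask the Handle proxy to resolve a PID and then verify that the registered URL answers. The sample PID below is a placeholder, not a real Handle.

    #!/usr/bin/env python
    # Illustrative sketch of a sample-PID resolution check. SAMPLE_PID is a placeholder.
    import sys

    import requests

    HANDLE_PROXY = "https://hdl.handle.net/"
    SAMPLE_PID = "12345/example-pid"  # placeholder, not a real Handle

    def check_pid(pid, timeout=30):
        # Step 1: the Handle proxy should redirect to the URL registered for the PID.
        resolve = requests.head(HANDLE_PROXY + pid, allow_redirects=False, timeout=timeout)
        if resolve.status_code not in (301, 302, 303, 307):
            return 2, "CRITICAL: Handle proxy returned HTTP %d for %s" % (resolve.status_code, pid)
        target = resolve.headers.get("Location")
        if target is None:
            return 2, "CRITICAL: redirect from the Handle proxy has no Location header"
        # Step 2: the registered URL should answer without an error.
        landing = requests.get(target, timeout=timeout)
        if landing.status_code >= 400:
            return 2, "CRITICAL: %s resolves to %s (HTTP %d)" % (pid, target, landing.status_code)
        return 0, "OK: %s resolves to %s (HTTP %d)" % (pid, target, landing.status_code)

    if __name__ == "__main__":
        status, message = check_pid(SAMPLE_PID)
        print(message)
        sys.exit(status)  # 0 = OK, 1 = WARNING, 2 = CRITICAL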

1.5. Resources

1.5.1. AAI

1.6. Issues to be resolved

  1. Technical
    1. Excessive complexity
      • Inability for non-admins to test configuration changes without pushing them to the Git repo.
      • Complicated workflow in which custom software (PyNag-based) of substantial complexity mutates the configuration.
      • Complicated shell scripts and tools to perform probes.
      • Icinga 1.x configuration is very verbose and error-prone (e.g. whitespace sensitivity, lots of duplication).
      • Dependency on a lot of tools/extensions (see: Dependencies of current setup).
  2. Organizational
    • No policy for when a service is unavailable. Sometimes there is no follow-up, and follow-up is never handled by a single responsible person.

1.7. Work plan

  • Migrate from Icinga 1.x to Icinga 2 daemon and Icinga Web 2 frontend.
    • The configuration can be migrated automatically, with some manual cleanup afterwards.
  • Reduce the large set of dependencies external to the standard Icinga 2 distribution to a minimum.
  • Use a (Docker) container-based virtualized software stack for easy (re)deployment and testing.
  • Simplify the configuration workflow, currently described in the README of the GitHub repository for the current configuration.
    • Use a protected branch for the successfully tested production configuration.

NOTE: the following information is outdated and no longer relevant; all CLARIN ERIC centres now have their services monitored centrally. It is kept for reference for centres interested in the (older) separate plugins or in the historical work.

A centre can monitor its own services. The following example monitoring plugins, written in Python 2, can assess SRU/CQL and OAI-PMH endpoints: https://trac.clarin.eu/browser/monitoring/plugins/mpi
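
The linked plugins remain the reference implementation. Purely as an illustration of what such a probe does, here is a rough sketch of an SRU "explain" check; the endpoint URL is a placeholder and SRU version 1.2 is assumed.

    #!/usr/bin/env python
    # Rough sketch of an SRU 'explain' probe; not the plugin from the repository linked above.
    # ENDPOINT is a placeholder; SRU version 1.2 is assumed.
    import sys

    import requests
    from lxml import etree

    ENDPOINT = "https://fcs.example.org/sru"  # placeholder
    SRU_NS = "http://www.loc.gov/zing/srw/"

    def check_sru_explain(endpoint, timeout=30):
        response = requests.get(endpoint,
                                params={"operation": "explain", "version": "1.2"},
                                timeout=timeout)
        if response.status_code != 200:
            return 2, "CRITICAL: HTTP %d from %s" % (response.status_code, endpoint)
        try:
            root = etree.fromstring(response.content)
        except etree.XMLSyntaxError as err:
            return 2, "CRITICAL: explain response is not well-formed XML: %s" % err
        if root.tag != "{%s}explainResponse" % SRU_NS:
            return 2, "CRITICAL: unexpected root element %s" % root.tag
        return 0, "OK: explain response received from %s" % endpoint

    if __name__ == "__main__":
        status, message = check_sru_explain(ENDPOINT)
        print(message)
        sys.exit(status)  # 0 = OK, 1 = WARNING, 2 = CRITICAL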

Service types and the tests to run against them (possible tests: ping, HTTP, disk space, load, free mem, users, functional check, query duration):

  • AAI Service Providers (SP): ping *, HTTP #, functional check # (IDS probe?)
  • AAI Identity Providers (IdP): ping *, HTTP #, load *, free mem *, functional check # (IDS probe?)
  • AAI Where Are You From (WAYF): ping *, HTTP #, functional check # (MPI DiscoJuice probe?)
  • REST web services (WebLicht): ping *, functional check # (provenance data from TCF?)
  • Federated Content Search endpoints (SRU/CQL): ping *, HTTP #, functional check # (MPI probe?)
  • Federated Content Search Aggregator: ping *, HTTP #, functional check #
  • Repositories: ping *, HTTP #, disk space *, functional check # (test for a Fedora content model?)
  • OAI-PMH Gateway: ping *, functional check # (MPI probe?)
  • Handle Servers: ping *, functional check # (EUDAT/Jülich probe?), query duration # (Eric's timeout probe)
  • resolve a sample PID for each repository: HTTP #, functional check #
  • Centre Registry: ping *, HTTP #
  • WebLicht webserver: ping *, HTTP #
  • other webservers: ping *, HTTP #
  • Nagios servers (selfcheck): ping *, HTTP #, functional check # (check_nagios plugin)
  • Nagios servers crosscheck (from other centre): ping *, functional check # (check_nagios plugin)
  • Workspaces server: not yet (n.a.)

# mandatory; * optional

