
Version 84 (modified by Sander Maijers, 9 years ago)

Current setup

CLARIN ERIC & CLARIN-D

We use Icinga, hosted and managed by Forschungszentrum Jülich. Contact: Benedikt von St. Vith <CLARIN-support@fz-juelich.de>

Users can view the status of each centre and its service(s) on NagVis. Each centre's technical contact, and possibly additional monitoring contacts, as declared in the Centre Registry, receives an e-mail notification of important status changes (e.g. an outage). Each of these contacts can also view a detailed Icinga status page for their services. The CLARIN ERIC sysops <sysops@clarin.eu> have access to all of Icinga, except for scheduling downtime and for actions that require shell access and permissions (e.g. restarting the Icinga daemon); the latter are managed by Benedikt von St. Vith.

URLs

Software stack

Icinga 1.x (a Nagios fork; uses the same format for plugins etc.): https://www.icinga.org/

A barebones Icinga 1.x container that runs an Icinga daemon with the CLARIN ERIC and CLARIN-D configuration is available at: https://github.com/clarin-eric/virtual_debian-monitoring

Configuration

The current configuration is managed through https://github.com/clarin-eric/monitoring.

Dependencies of current setup

  • NSCA: receives monitoring information from RZG; requires the Icinga UNIX socket. Important for CLARIN-D because of its distributed infrastructure.
  • IDO2DB: writes Icinga information to MySQL.
  • pnp4nagios: not a service, but needs access to Icinga outputs to create graphs.
  • npcd: processes Icinga performance data for pnp4nagios.
  • shibd: Shibboleth authentication for the web interface.
  • PyNag: Python library to process/manipulate Nagios/Icinga 1.x configuration.
  • NagVis: used to visualise service status geographically.

CLARIN-D hosts whose services and status are to be checked need the Nagios NRPE package installed (only for local checks such as disk space, memory, etc.) and port 5666 open.
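A quick way to verify the port requirement is a plain TCP connection attempt. The sketch below (the helper name `nrpe_port_open` is hypothetical, not part of the setup above) only checks that port 5666 accepts connections, not that the NRPE daemon itself answers correctly:

```python
import socket

def nrpe_port_open(host, port=5666, timeout=5.0):
    """Return True if a TCP connection to the NRPE port succeeds.

    This only proves the port is open; a full check would also speak
    the NRPE protocol (e.g. via check_nrpe).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `nrpe_port_open("somehost.example.org")` returns False both when the port is firewalled and when NRPE is simply not running, so a False result needs further diagnosis.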

Centre requirements

The monitoring requirements of CLARIN-D are modest; a simple Icinga installation is sufficient to fulfil them.

  • SAML Identity Provider (IdP) status pages must be reachable from other servers.
  • Regular checks whether hosts are up and reachable (ping/ICMP).
  • Regular checks whether essential networked services and applications are working (e.g. HTTP).
  • The server running the monitoring software must itself be monitored as well.
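The HTTP requirement above can be met with a standard Nagios/Icinga-style plugin. The sketch below is a hypothetical minimal example (not one of the plugins used in this setup) that follows the standard plugin exit-code convention: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

```python
#!/usr/bin/env python3
"""Minimal Nagios/Icinga-style HTTP check (illustrative sketch only)."""
import sys
import urllib.error
import urllib.request

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_http(url, timeout=10):
    """Return (exit_code, status_line) for a single HTTP GET."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.getcode()
    except urllib.error.HTTPError as exc:
        # urlopen raises for >= 400; the status code is still meaningful.
        status = exc.code
    except (urllib.error.URLError, OSError) as exc:
        return CRITICAL, "CRITICAL - %s unreachable: %s" % (url, exc)
    if status < 400:
        return OK, "OK - %s returned HTTP %d" % (url, status)
    if status < 500:
        return WARNING, "WARNING - %s returned HTTP %d" % (url, status)
    return CRITICAL, "CRITICAL - %s returned HTTP %d" % (url, status)

if __name__ == "__main__" and len(sys.argv) > 1:
    code, message = check_http(sys.argv[1])
    print(message)
    sys.exit(code)
```

Icinga reads the exit code to set the service state and shows the first output line in its web interface, which is why the plugin prints exactly one status line.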

AAI

Requirements

As of September 2015, CLARIN-PLUS Task 2.1 covers work on improving the monitoring of the AAI infrastructure (CLARIN SPF & CLARIN Identity and Access Management). A ‘working’ endpoint is defined as one that does not return an error-signaling HTTP status code other than 404.
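Under one reading of that definition, the ‘working’ predicate reduces to a one-liner (the function name is hypothetical; the 404 exemption follows the wording above):

```python
def is_working(status_code):
    """One reading of the Task 2.1 definition: an endpoint is 'working'
    unless it returns an error-signaling HTTP status (>= 400),
    with 404 explicitly tolerated."""
    return status_code < 400 or status_code == 404
```

So `is_working(200)` and `is_working(404)` hold, while `is_working(500)` and `is_working(403)` do not.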

We should monitor:

  • SAML metadata batches about SPF SPs and IdPs, for availability, security, etc. IMPLEMENTED.
  • Discovery Service availability. IMPLEMENTED.
  • Upstream identity federation SAML metadata batches about IdPs, for availability, security, etc. IMPLEMENTED, NEEDS IMPROVEMENT.
  • SAML metadata batches about SPF IdPs: whether they continually contain all IdPs in each identity federation.
  • Once all SAML Service Provider status pages of SPF SPs are reachable from outside: these pages, for availability and whether they work.
  • Correct operation of SPs:
    • Does the login page work?
    • Do all endpoints work?
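For the metadata-batch checks, one basic test is freshness: SAML metadata carries a validUntil attribute (xs:dateTime) on its root element, and a batch whose validUntil has passed is no longer trustworthy. The sketch below assumes that approach; the function names are hypothetical and a real check would also verify the XML signature:

```python
import datetime
import xml.etree.ElementTree as ET

def metadata_valid_until(xml_text):
    """Return the validUntil timestamp of the metadata root element, or None
    if the attribute is absent."""
    root = ET.fromstring(xml_text)
    stamp = root.get("validUntil")
    if stamp is None:
        return None
    # SAML metadata uses xs:dateTime in UTC, e.g. 2015-09-30T12:00:00Z.
    return datetime.datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")

def metadata_is_fresh(xml_text, now=None):
    """True if the batch declares a validUntil that lies in the future."""
    now = now or datetime.datetime.utcnow()
    valid_until = metadata_valid_until(xml_text)
    return valid_until is not None and valid_until > now
```

A missing validUntil is treated as not fresh here; whether that should instead be a warning is a policy choice for the check.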

Locally at centres

Guidelines

  • web services and applications
    • remote API
    • user interface
  • repositories
  • Handle servers (in case centres have their own, like IDS?)
  • sample PID URLs
  • websites

Resources

AAI

Future setup

  • Migrate from Icinga 1.x to Icinga 2 daemon and Icinga Web 2 frontend.
    • The configuration can be migrated automatically, with some manual cleanup afterwards.
  • Reduce the large set of dependencies external to the standard Icinga 2 distribution to a minimum.
  • Use a (Docker) container-based virtualised software stack for easy (re)deployment and testing.
  • Simplify the configuration workflow, currently described in the README of the GitHub repository holding the current configuration.
    • Use a protected branch for the successfully tested production configuration.

NOTE: the following information is outdated. All CLARIN ERIC centres now have their services monitored centrally. It is kept for reference for centres interested in the (older) separate plugins or in the historical work.

A centre can monitor its own services. The following example monitoring plugins, written in Python 2, can assess SRU/CQL and OAI-PMH endpoints: https://trac.clarin.eu/browser/monitoring/plugins/mpi
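In the spirit of those plugins, an OAI-PMH endpoint can be checked by issuing an Identify request and verifying that the response parses as OAI-PMH. The sketch below (Python 3 rather than the Python 2 of the linked plugins; the function name and exact messages are assumptions, not the actual plugin code) illustrates the approach:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def check_oai_pmh(base_url, timeout=10):
    """Return (ok, message) for an OAI-PMH Identify request."""
    url = base_url + ("&" if "?" in base_url else "?") + "verb=Identify"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
    except OSError as exc:
        return False, "CRITICAL - %s: %s" % (url, exc)
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        return False, "CRITICAL - %s: invalid XML (%s)" % (url, exc)
    if root.find(OAI_NS + "Identify") is None:
        return False, "CRITICAL - %s: no Identify element in response" % url
    name = root.findtext(OAI_NS + "Identify/" + OAI_NS + "repositoryName", "?")
    return True, "OK - repository '%s' answered Identify" % name
```

A fuller check would also exercise ListMetadataFormats or a sample ListRecords request; Identify alone only proves the endpoint is alive and speaks the protocol.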

Service Types / Tests. The test columns of the original table are: ping, http, disk space, load, free mem, users, functional check, query duration/time. Legend: # mandatory; * optional. The per-row marks below are preserved as in the original table.

  • AAI Service Providers (SP): * # #(IDS probe?)
  • AAI Identity Providers (IdP): * # * * #(IDS probe?)
  • AAI Where Are You From (WAYF): * # #(MPI DiscoJuice probe?)
  • REST web services (WebLicht): * #(provenance data from TCF?)
  • Federated Content Search endpoints (SRU/CQL): * # #(MPI probe?)
  • Federated Content Search Aggregator: * # #
  • Repositories: * # * #(test for a Fedora content model?)
  • OAI-PMH Gateway: * #(MPI probe?)
  • Handle servers: * #(EUDAT/Jülich probe?) #(Eric's timeout probe)
  • Resolve a sample PID for each repository: # #
  • Centre Registry: * #
  • WebLicht webserver: * #
  • Other webservers: * #
  • Nagios servers (self-check): * # #(check_nagios plugin)
  • Nagios servers cross-check (from another centre): * #(check_nagios plugin)
  • Workspaces server (not yet): n.a. n.a. n.a.