[[PageOutline]]

= Monitoring CLARIN infra using [https://www.icinga.org Icinga] =

# Introduction #

The CLARIN infrastructure of networked services and applications is monitored constantly. The infrastructure can be divided into '''centre-managed''' and '''centrally managed''' services. The automatic monitoring of services meets two goals:

 1. '''Unavailability''' or '''invalidity''' of endpoints/frontends is detected, communicated and corrected as soon as possible.
 2. Statistics/impressions of overall availability/quality-of-service can be gathered.

Our [https://www.icinga.org Icinga] monitoring system does this by periodically launching '''checks''' against all of our endpoints and hosts. Icinga's configuration is managed through a Git repo [https://github.com/clarin-eric/monitoring on GitHub]. Checks are parametrized '''probes'''. A probe is a small executable command: either a basic built-in command (e.g., one that checks whether an HTTP endpoint is up) or a monitoring plugin that can be executed as a command-line tool.

## Roles ##

The system is primarily administered by the ASV. CLARIN sysops maintain an advisory, secondary role on the technical level. The ASV system administrator(s) and the CLARIN sysops (Sander, Dieter, Willem) have full administrative access over SSH to the monitoring host for emergency maintenance. This includes being able to schedule downtime in Icinga.

## Links ##

For internal discussion, Dieter (CLARIN), Thomas Eckart (ASV), Nathanael Philipp (ASV) and Sander (MPI-PL) are reachable via monitoring@clarin.eu. The public link to the monitoring is [https://monitoring.clarin.eu/]. If you have justified reasons for accessing the monitoring, your CLARIN IdP account can be added to the configuration. [wiki:SystemAdministration/Hosts/fsd-cloud22.zam.kfa-juelich.de Some more details].
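
To illustrate the distinction the Introduction draws between probes and checks, here is a minimal Python sketch. It is not taken from the CLARIN monitoring repository: the names `classify`, `probe_http`, `run_check` and `example_check`, the example URL, and the mapping of HTTP status codes to states are all assumptions made for illustration. Only the numeric states (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) follow the usual Nagios/Icinga plugin convention.

```python
import urllib.error
import urllib.request

# Plugin states as conventionally used by Nagios/Icinga plugins.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(status):
    """Map an HTTP status code to a plugin state (illustrative mapping)."""
    if 200 <= status < 400:
        return OK
    if 400 <= status < 500:
        return WARNING
    return CRITICAL


def probe_http(url, timeout=10.0):
    """Probe: request `url` once and return (state, one-line message)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        status = err.code
    except (urllib.error.URLError, OSError) as err:
        # No usable answer at all: DNS failure, refused connection, timeout.
        return CRITICAL, f"HTTP CRITICAL - {url}: {err}"
    state = classify(status)
    return state, f"HTTP {LABELS[state]} - {url} returned {status}"


def run_check(check):
    """A check is a probe plus the parameters it is launched with."""
    return check["probe"](**check["params"])


# Hypothetical check definition; the URL is a placeholder, not a CLARIN endpoint.
example_check = {
    "probe": probe_http,
    "params": {"url": "https://example.org/", "timeout": 5.0},
}
```

Calling `run_check(example_check)` performs one HTTP request and returns a state together with a status line. In the real setup, Icinga schedules such launches periodically and the probes are existing commands or plugins rather than Python functions.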
# Activities #

## Tickets ##

[[TicketQuery(col=ticket|priority|summary|owner|created|modified,component=Monitoring,order=modified,desc=true,table)]]

## Meetings ##

 * [[./Meetings/2016-02-18]]
 * [[./Meetings/2016-04-04]]

# Details #

A detailed technical description of the workflow from GitHub to the monitoring host is given in the [https://github.com/clarin-eric/monitoring/blob/master/README.md README].

## Monitored types of systems ##

For each centre, the following systems should be monitored:

 * web services and applications
 * remote API
 * user interface
 * repositories
 * Handle servers (in case centres have their own, like IDS?)
 * sample PID URLs
 * websites

In addition, all central services should be monitored at both the frontend and the backends, if any.

## Checks ##

The Icinga server is set up to check '''availability''' (e.g., does an important HTTP URL work?), '''validity''' (e.g., does an endpoint return well-formed and valid XML data?) and '''security''' (does the URL work with appropriate confidentiality and authentication, using e.g. TLS 1.2?).

To detect services that are up but return undesired data, some validity checks are in place. The response to the Identify verb on an OAI-PMH endpoint is validated against a set of XSD schemas. FCS endpoints are currently not validated because of problems with the approach, although some SRU XSD schemas are available.

In addition, services that use TLS are only checked using the current version of TLS, v1.2. Exceptions are hard-coded and should be minimized.

## Users ##

Centre technical contacts, as well as the CLARIN sysops, can view the status of centre services. Each centre's technical contact, and possibly additional monitoring contacts, as declared in the [wiki:"Centre Registry" Centre Registry], will receive an e-mail notification of important status changes (e.g., sudden unavailability). Furthermore, each of these contacts can view a detailed Icinga status for their services.
In addition, !NagVis geographical overviews are available publicly for [http://clarin-d.net/de/aktuell/status-infrastruktur Germany] and [https://fsd-cloud22.fz-juelich.de/nagvis/frontend/nagvis-js/index.php?mod=Map&act=view&show=Clarin-EU_Geomap the world].

## Access ##

Authentication is based on SAML. A [https://fsd-cloud22.fz-juelich.de/Shibboleth.sso/Metadata custom SP] interoperates with the CLARIN IdP for this. The `eduPersonPrincipalName` attribute is used for authorization. The value of this attribute is matched against the value recorded in the contact persons table of the Centre Registry.

The ASV administrators and CLARIN sysops have special access to views on all services. Centre technical contacts have access to views on their centre's services.

## Dependencies, system requirements ##

||= '''software''' =||= '''function''' =||
|| [http://docs.icinga.org/latest/en/nsca.html NSCA] || Receives monitoring information from MPCDF. Requires the Icinga UNIX socket. Important for CLARIN-D because of the distributed infrastructure. ||
|| [http://docs.icinga.org/latest/en/configido.html IDO2DB] || Writes Icinga information to MySQL. ||
|| [https://docs.pnp4nagios.org/ pnp4nagios] || Not a service, but needs access to Icinga outputs to create graphs. ||
|| [https://wiki.icinga.org/display/howtos/Setting+up+PNP+with+Icinga npcd] || Processes Icinga performance data for pnp4nagios. ||
|| [https://shibboleth.net/products/service-provider.html shibd] || Shibboleth SP authentication for the web interface. ||
|| [http://pynag.org/ PyNag] || Python library to process/manipulate Nagios/Icinga 1.x configuration. ||
|| [http://www.nagvis.org/ NagVis] || Visualizes service status geographically. ||
|| [http://docs.icinga.org/latest/en/nrpe.html NRPE package] || For checks of host-local properties, such as disk space, memory usage, etc. 
||

# History #

During 2011-2015, [[mailto:"Sander Maijers" ]] and [[mailto:"Dieter Van Uytvanck" ]] maintained a Nagios server on [wiki:SystemAdministration/Hosts/ems04.mpi.nl ems04] that covered CLARIN-D related services hosted at [wiki:SystemAdministration/Hosters/MPI-PL MPI-PL].

Since 2012, [[mailto:"Benedikt von St. Vith" ]] has maintained a CLARIN-D-wide Nagios server, and later an Icinga 1.x server, hosted at [wiki:SystemAdministration/Hosters/FZJ FZJ] on [wiki:SystemAdministration/Hosts/clarin.fz-juelich.de clarin.fz-juelich.de]. During 2015, all CLARIN (i.e., ERIC) related checks were migrated from `ems04` to the latter host, `clarin.fz-juelich.de`. The Icinga configuration on this host is controlled only indirectly by [[mailto:"Dieter, Sander and Willem" ]] (CLARIN sysops), as described in the following sections. Direct system administration, e.g. of the Apache httpd and Icinga daemon state, as well as SSH access, has been exclusive to Benedikt for FZJ policy reasons.

In 2016, Benedikt set up a new host, `fsd-cloud22`, for CLARIN-wide Icinga-based infra monitoring. This host has been configured as a replica of `clarin.fz-juelich.de`. [[mailto:"Thomas Eckart" ]] and [[mailto:"Dirk Goldhahn" ]] from the [https://centres.clarin.eu/centre/4 ASV centre] took over service responsibility from Benedikt.