Changes between Version 7 and Version 8 of SystemAdministration/Monitoring/Icinga


Timestamp:
03/31/16 17:32:13 (8 years ago)
Author:
Sander Maijers
Comment:

contents: overhaul

  • SystemAdministration/Monitoring/Icinga

    v7 v8  
     1[[PageOutline]]
     2
    13= Monitoring CLARIN infra using [https://www.icinga.org Icinga]
    24
    3 # Planning #
     5# Introduction #
    46
    5 The extensive CLARIN technical infrastructure, centrally as well as at centres, should be monitored constantly. This serves two goals:
    6 1. Unavailability is detected, communicated and corrected as soon as possible.
    7 2. Automatic monitoring allows us to gather statistics/impressions of overall availability/quality-of-service.
     7The CLARIN infrastructure of networked services and applications is being monitored constantly. The infrastructure can be divided into '''centre-managed''' and '''centrally managed'''.
    88
    9 [[PageOutline(1-5,Table of Contents,pullout)]]
     9The automatic monitoring of services meets two goals:
    1010
    11 ## History and current status ##
     111. '''Unavailability''' or '''invalidity''' of endpoints/frontends is detected, communicated and corrected as soon as possible.
     122. Statistics/impressions of overall availability/quality-of-service can be gathered.
    1213
    13 During 2011-2015, [[mailto:"Sander Maijers" <sander@clarin.eu>]] and [[mailto:"Dieter Van Uytvanck" <dieter@clarin.eu>]] maintained a Nagios server on [wiki:SystemAdministration/Hosts/ems04.mpi.nl ems04] that covered CLARIN-D related services hosted at [wiki:SystemAdministration/Hosters/MPI-PL MPI-PL]. Since 2012, [[mailto:"Benedikt von St. Vith" <b.von.st.vieth@fz-juelich.de>]] has maintained a CLARIN-D wide Nagios hosted at [wiki:SystemAdministration/Hosters/FZJ FZJ], and later Icinga 1.x, server on [wiki:SystemAdministration/Hosts/clarin.fz-juelich.de clarin.fz-juelich.de]. During 2015, all CLARIN (i.e., ERIC) related checks have been migrated from `ems04` to the latter host `clarin.fz-juelich.de`. The Icinga configuration on this host is controlled only indirectly by [[mailto:"Dieter, Sander and Willem" <sysops@clarin.eu>]] (CLARIN sysops), as described in the following sections. Direct system administration, e.g. Apache httpd and Icinga daemon state, as well as SSH access have been exclusive to Benedikt for FZJ policy reasons.
     14Our [https://www.icinga.org Icinga] monitoring system does this by periodically launching '''checks''' against all of our endpoints and hosts. Icinga's configuration is managed through a Git repo [https://github.com/clarin-eric/monitoring on GitHub]. Checks are parametrized '''probes'''. Probes are small executable commands: either some basic builtin command (e.g., one that checks whether an HTTP endpoint is up) or some monitoring plugin that can be executed as a command line tool. Currently, almost all probes we use are handled by the [https://curl.haxx.se curl] utility, controlled via [https://github.com/clarin-eric/monitoring/blob/master/probes/probe_curl.sh a central script].
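As a rough illustration of what "checks are parametrized probes" means, the Python sketch below builds a curl command line from a few parameters. The function name `build_curl_probe`, its parameters and the example URLs are hypothetical and not taken from the repository; the actual `probe_curl.sh` script may well use different options.

```python
from typing import List


def build_curl_probe(url: str, timeout_s: int = 10, require_tls12: bool = True) -> List[str]:
    """Return a curl argument vector for one availability check (illustrative only)."""
    argv = [
        "curl",
        "--silent",                    # suppress progress output
        "--fail",                      # non-zero exit status on HTTP errors (>= 400)
        "--location",                  # follow redirects
        "--max-time", str(timeout_s),  # give up after this many seconds
        "--output", "/dev/null",       # discard the response body
    ]
    if require_tls12:
        argv.append("--tlsv1.2")       # require TLS 1.2 or later
    argv.append(url)
    return argv


# Two hypothetical checks that reuse the same probe with different parameters.
checks = {
    "example-frontend": build_curl_probe("https://www.example.org/"),
    "example-slow-endpoint": build_curl_probe("https://www.example.org/oai", timeout_s=30),
}
```

Each entry in `checks` is the same probe with different arguments, which is the sense in which a check parametrizes a probe.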
    1415
    15 ## Road ahead ##
     16## Roles ##
    1617
    17 In 2016, Benedikt set up a new host for CLARIN-wide Icinga-based infra monitoring: [wiki:SystemAdministration/Hosts/fsd-cloud22.fz-juelich.de fsd-cloud22]. This host has been configured as a replica of `clarin.fz-juelich.de`.
     18The system is primarily administered by the ASV, in the person of [[mailto:"Thomas Hynek" <hynek@informatik.uni-leipzig.de>]]. CLARIN sysops maintain an advisory, secondary role on the technical level. The ASV system administrator(s) and the CLARIN sysops (Sander, Dieter, Willem) have full administrative access over SSH to the monitoring host for emergency maintenance. This includes being able to schedule downtime in Icinga.
    1819
    19 We now aim to realize the following changes:
    20 1. [[mailto:"Thomas Eckart" <teckart@informatik.uni-leipzig.de>]], [[mailto:"Dirk Goldhahn" <dgoldhahn@informatik.uni-leipzig.de>]] from the [https://centres.clarin.eu/centre/4 ASV centre] should take over service responsibility from Benedikt. The service will be administered technically by [[mailto:"Thomas Hynek" <hynek@informatik.uni-leipzig.de>]]. CLARIN sysops will maintain an advisory, secondary role on the technical level.
    21 1. The CLARIN sysops should have full administrative access over SSH to the monitoring host `fsd-cloud22` for emergency maintenance. This includes being able to schedule downtime in Icinga.
    22 1. Spurious unavailability notifications/statuses should no longer occur. Missing unavailability notifications should be added.
    23 1. All hosts should allow pings (ICMP traffic). ICMP is a critical part of the Internet infrastructure. Filtering it yields no benefit, but can make it impossible to check whether a host is down or just a service running on it.
     20All requests and issues relating to monitoring should be processed through the Trac component Monitoring (see ‘Tickets’ below).
    2421
    25 Subsidiary objectives are:
    26 1. Hosting the Icinga server in a Docker container, for easier testing and maintenance.
    27 1. A migration from Icinga 1.x to 2.x.
    28 1. Reduction of technical complexity.
     22## Links ##
    2923
    30 ### Activities ###
    31 '''Meetings'''\\
    32  * [wiki:./Meetings/20160218]
     24For internal discussion, Benedikt (FZJ), Daniel (ASV), Dieter (CLARIN), Thomas Eckart (ASV), Thomas Hynek (ASV) and Sander (MPI-PL) are currently reachable via monitoring@clarin.eu.
    3325
    34 # Current state #
     26To view the current monitoring state, browse the [https://fsd-cloud22.fz-juelich.de/icinga/ Icinga frontend].
    3527
    36 ## Functional information ##
     28[wiki:SystemAdministration/Hosts/fsd-cloud22.zam.kfa-juelich.de About the monitoring host fsd-cloud22].
    3729
    38 ### Monitored types of systems ###
     30To see how changes to the Git repo have been propagated to the monitoring host, see [https://fsd-cloud22.fz-juelich.de:7011/logs/ Icinga configuration GitHub-host sync logs].
     31
     32# Activities #
     33
     34## Tickets ##
     35
     36[[TicketQuery(col=ticket|priority|summary|owner|created|modified,component=Monitoring,order=modified,desc=true,table)]]
     37
     38## Meetings ##
     39[[./Meetings/2016-02-18]]
     40
     41[[./Meetings/2016-04-04]]
     42
     43# Details #
     44
     45## Monitored types of systems ##
    3946
    4047For each centre, the following systems should be monitored:
     
    4754 * websites
    4855
    49 ### User groups ###
     56In addition, all central services should be monitored at both their frontend and their backends, if any.
    5057
    51 Centre technical contacts, as well as the CLARIN sysops, can view the status of centre services. Each centre's technical contact and possibly additional monitoring contacts, as declared in the [wiki:"Centre Registry" Centre Registry], will get an e-mail notification of important status changes (e.g., sudden unavailability). Furthermore, each of these contacts can view a detailed Icinga status for their services.
    52 In addition, !NagVis geographical overviews are available publicly for Germany and the world.
     58## Checks ##
    5359
    54 ### References ###
    55 
    56 1. [https://fsd-cloud22.fz-juelich.de/icinga/ Icinga]
    57 1. [https://fsd-cloud22.fz-juelich.de/nagvis/frontend/nagvis-js/index.php?mod=Map&act=view&show=Clarin-EU_Geomap NagVis geomap for CLARIN ERIC]
    58 1. [https://fsd-cloud22.fz-juelich.de/nagvis/frontend/nagvis-js/index.php?mod=Map&act=view&show=Clarin-DE_Service_Overview NagVis geomap for CLARIN-D]
    59 1. A [http://clarin-d.net/de/aktuell/status-infrastruktur status map of Germany]
    60 
    61 ### Validity and availability checks ###
    6260The Icinga server is set up to check '''availability''' (e.g., does an important HTTP URL work?), '''validity''' (e.g., does an endpoint return well-formed and valid XML data?) and '''security''' (does the URL work with appropriate confidentiality and authentication, e.g. using TLS 1.2?).
    6361
    64 ## Technical information ##
    65 
    66 Authentication is based on SAML. A [https://fsd-cloud22.fz-juelich.de/Shibboleth.sso/Metadata custom SP] interoperates with the CLARIN IdP for this. The eduPersonPrincipalName attribute is used for authorization. The value of this attribute is matched against the value recorded in the contact persons table of the Centre Registry. The CLARIN sysops have special access to views on all services. Centre technical contacts have access to views on their centre's services.
    67 
    68 ### Contacts ###
    69 
    70 '''Technical specialist on current Icinga setup''' \\
    71 [[mailto:"Benedikt von St. Vith" <CLARIN-support@fz-juelich.de>]]
    72 
    73 '''CLARIN technical contact person, some expertise on current Icinga setup''' \\
    74 [[mailto:"Sander Maijers" <sander@clarin.eu>]]
    75 
    76 ### References ###
    77 
    78 1. The Icinga configuration is managed through a Git repo [https://github.com/clarin-eric/monitoring on GitHub].
    79 1. [https://fsd-cloud22.fz-juelich.de:7011/logs/ Icinga configuration GitHub-host sync logs]
    80 1. A barebones Icinga 1.x container that runs an Icinga daemon with the CLARIN ERIC and CLARIN-D configuration is available [https://github.com/clarin-eric/virtual_debian-monitoring on GitHub].
    81 
    82 ### Implementation details of validity and availability checks ###
    8362To check whether a service that is up is not just returning undesired data, some validity checks are in place. The response of calling the Identify verb on an OAI-PMH endpoint is validated with a set of XSD schemas. FCS endpoints are currently not validated because of problems with the approach, though some SRU XSD schemas are available. In addition, services that use TLS are only checked using the current version of TLS, v1.2. Exceptions are hard-coded and should be minimized.
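The idea behind these validity checks can be sketched in Python as follows. This is a simplified stand-in, not the deployed check: it only tests that a response is well-formed XML with an OAI-PMH root element and an `Identify` child, whereas the setup described above validates against XSD schemas. The function names are hypothetical. The second function shows one way to restrict a client to TLS 1.2 using the standard `ssl` module.

```python
import ssl
import xml.etree.ElementTree as ET

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def looks_like_identify_response(xml_text: str) -> bool:
    """Weak validity check: well-formed XML, OAI-PMH root, Identify child."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not well-formed XML at all
    if root.tag != f"{{{OAI_NS}}}OAI-PMH":
        return False  # e.g. an HTML error page served with status 200
    return root.find(f"{{{OAI_NS}}}Identify") is not None


def tls12_only_context() -> ssl.SSLContext:
    """Client context that negotiates TLS 1.2 and nothing else."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.maximum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

A service that is up but answers with an HTML error page would pass a plain availability check and fail `looks_like_identify_response`, which is the case the validity checks are meant to catch.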
    8463
    85 ### Dependencies, environmental requirements ###
     64## Users ##
    8665
    87 || '''software''' || '''function''' ||
     66Centre technical contacts, as well as the CLARIN sysops, can view the status of centre services. Each centre's technical contact and possibly additional monitoring contacts, as declared in the [wiki:"Centre Registry" Centre Registry], will get an e-mail notification of important status changes (e.g., sudden unavailability). Furthermore, each of these contacts can view a detailed Icinga status for their services.
     67In addition, !NagVis geographical overviews are available publicly for [http://clarin-d.net/de/aktuell/status-infrastruktur Germany] and [https://fsd-cloud22.fz-juelich.de/nagvis/frontend/nagvis-js/index.php?mod=Map&act=view&show=Clarin-EU_Geomap the world].
     68
     69## Access ##
     70Authentication is based on SAML. A [https://fsd-cloud22.fz-juelich.de/Shibboleth.sso/Metadata custom SP] interoperates with the CLARIN IdP for this. The `eduPersonPrincipalName` attribute is used for authorization. The value of this attribute is matched against the value recorded in the contact persons table of the Centre Registry. The ASV administrators and CLARIN sysops have special access to views on all services. Centre technical contacts have access to views on their centre's services.
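The authorization rule described above can be illustrated with a small Python sketch. All names and data here are hypothetical: the real setup matches the `eduPersonPrincipalName` value against the Centre Registry's contact persons table, whose actual structure is not shown on this page.

```python
from typing import Dict, Set


def visible_centres(eppn: str,
                    contacts_by_centre: Dict[str, Set[str]],
                    all_access: Set[str]) -> Set[str]:
    """Return the centres whose service views this principal may see."""
    if eppn in all_access:
        # Administrators and sysops see the views on all services.
        return set(contacts_by_centre)
    # Centre contacts see only the centres that list them as a contact.
    return {centre for centre, eppns in contacts_by_centre.items() if eppn in eppns}


# Made-up example data standing in for the contact persons table.
contacts_by_centre = {
    "Centre A": {"alice@example.org"},
    "Centre B": {"bob@example.org", "alice@example.org"},
}
all_access = {"sysop@example.org"}
```

A principal that appears in neither set gets an empty result, so no service views.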
     71
     72## Dependencies, system requirements ##
     73
     74||= '''software''' =||= '''function''' =||
    8875|| [http://docs.icinga.org/latest/en/nsca.html NSCA] || Monitoring information from MPCDF. Requires Icinga UNIX socket. Important for CLARIN-D because of the distributed infrastructure. ||
    8976|| [http://docs.icinga.org/latest/en/configido.html IDO2DB] || writes Icinga information to MySQL ||
     
    9481|| [http://www.nagvis.org/ NagVis] || visualizes service status geographically. ||
    9582|| [http://docs.icinga.org/latest/en/nrpe.html NRPE package] || For checks of host-local properties, such as disk space, memory usage, etc. ||
     83
     84# History #
     85
     86During 2011-2015, [[mailto:"Sander Maijers" <sander@clarin.eu>]] and [[mailto:"Dieter Van Uytvanck" <dieter@clarin.eu>]] maintained a Nagios server on [wiki:SystemAdministration/Hosts/ems04.mpi.nl ems04] that covered CLARIN-D related services hosted at [wiki:SystemAdministration/Hosters/MPI-PL MPI-PL]. Since 2012, [[mailto:"Benedikt von St. Vith" <b.von.st.vieth@fz-juelich.de>]] has maintained a CLARIN-D wide Nagios, and later Icinga 1.x, server hosted at [wiki:SystemAdministration/Hosters/FZJ FZJ] on [wiki:SystemAdministration/Hosts/clarin.fz-juelich.de clarin.fz-juelich.de]. During 2015, all CLARIN (i.e., ERIC) related checks were migrated from `ems04` to the latter host, `clarin.fz-juelich.de`. The Icinga configuration on this host is controlled only indirectly by [[mailto:"Dieter, Sander and Willem" <sysops@clarin.eu>]] (CLARIN sysops), as described in the following sections. Direct system administration, e.g. Apache httpd and Icinga daemon state, as well as SSH access have been exclusive to Benedikt for FZJ policy reasons. In 2016, Benedikt set up a new host for CLARIN-wide Icinga-based infra monitoring, `fsd-cloud22`. This host has been configured as a replica of `clarin.fz-juelich.de`. [[mailto:"Thomas Eckart" <teckart@informatik.uni-leipzig.de>]] and [[mailto:"Dirk Goldhahn" <dgoldhahn@informatik.uni-leipzig.de>]] from the [https://centres.clarin.eu/centre/4 ASV centre] took over service responsibility from Benedikt.