Changes between Version 7 and Version 8 of SystemAdministration/Monitoring/Icinga
- Timestamp: 03/31/16 17:32:13 (8 years ago)
SystemAdministration/Monitoring/Icinga
v7  v8
 -   1  [[PageOutline]]
 -   2
 1   3  = Monitoring CLARIN infra using [https://www.icinga.org Icinga]
 2   4
 3   -  # Planning #
 -   5  # Introduction #
 4   6
 5   -  The extensive CLARIN technical infrastructure, centrally as well as at centres, should be monitored constantly. This serves two goals:
 6   -  1. Unavailabilty is detected, communicated and corrected as soon as possible.
 7   -  2. Automatic monitoring allows us to gather statistics/impressions of overall availability/quality-of-service.
 -   7  The CLARIN infrastructure of networked services and applications is being monitored constantly. The infrastructure can be divided into '''centre-managed''' and '''centrally managed'''.
 8   8
 9   -  [[PageOutline(1-5,Table of Contents,pullout)]]
 -   9  The automatic monitoring of services meets two goals:
10  10
11   -  ## History and current status ##
 -  11  1. '''Unavailability''' or '''invalidity''' of endpoints/frontends is detected, communicated and corrected as soon as possible.
 -  12  2. Statistics/impressions of overall availability/quality-of-service can be gathered.
12  13
13   -  During 2011-2015, [[mailto:"Sander Maijers" <sander@clarin.eu>]] and [[mailto:"Dieter Van Uytvanck" <dieter@clarin.eu>]] maintained a Nagios server on [wiki:SystemAdministration/Hosts/ems04.mpi.nl ems04] that covered CLARIN-D related services hosted at [wiki:SystemAdministration/Hosters/MPI-PL MPI-PL]. Since 2012, [[mailto:"Benedikt von St. Vith" <b.von.st.vieth@fz-juelich.de>]] has maintained a CLARIN-D wide Nagios hosted at [wiki:SystemAdministration/Hosters/FZJ FZJ], and later Icinga 1.x, server on [wiki:SystemAdministration/Hosts/clarin.fz-juelich.de clarin.fz-juelich.de]. During 2015, all CLARIN (i.e., ERIC) related checks have been migrated from `ems04` to the latter host `clarin.fz-juelich.de`. The Icinga configuration on this host is controlled only indirectly by [[mailto:"Dieter, Sander and Willem" <sysops@clarin.eu>]] (CLARIN sysops), as described in the following sections. Direct system administration, e.g. Apache httpd and Icinga daemon state, as well as SSH access have been exclusive to Benedikt for FZJ policy reasons.
 -  14  Our [https://www.icinga.org Icinga] monitoring system does this by periodically launching '''checks''' against all of our endpoints and hosts. Icinga's configuration is managed through a Git repo [https://github.com/clarin-eric/monitoring on GitHub]. Checks are parametrized '''probes'''. Probes are small set of executable commands, i.e. some basic builtin command (e.g., that checks whether a HTTP endpoint is up) or some monitoring plugin that can be executed as command line tool. Currently, almost all probes we use are handled by the [https://curl.haxx.se curl] utility, controlled via [https://github.com/clarin-eric/monitoring/blob/master/probes/probe_curl.sh a central script].
14  15
15   -  ## Road ahead ##
 -  16  ## Roles ##
16  17
17   -  In 2016, Benedikt has set up a new host for CLARIN wide Icinga-based infra monitoringL [wiki:SystemAdministration/Hosts/fsd-cloud22.fz-juelich.de fsd-cloud22]. This host has been configured as a replica of `clarin.fz-juelich.de`.
 -  18  The system is primarily administered by the ASV in the person of [[mailto:"Thomas Hynek" <hynek@informatik.uni-leipzig.de>]]. CLARIN sysops maintain an advisory, secondary role on the technical level. The ASV system administrator(s) and the CLARIN sysops (Sander, Dieter, Willem) have full administrative access over SSH to the monitoring host for emergency maintenance. This includes being able to schedule downtime in Icinga.
18  19
19   -  We now aim to realize the following changes:
20   -  1. [[mailto:"Thomas Eckart" <teckart@informatik.uni-leipzig.de>]], [[mailto:"Dirk Goldhahn" <dgoldhahn@informatik.uni-leipzig.de>]] from the [https://centres.clarin.eu/centre/4 ASV centre] should take over service responsibility from Benedikt. The service will be administered technically by [[mailto:"Thomas Hynek" <hynek@informatik.uni-leipzig.de>]]. CLARIN sysops will maintain an advisory, secondary role on the technical level.
21   -  1. The CLARIN sysops should have full administrative access over SSH to the monitoring host `fsd-cloud22` for emergency maintenance. This includes being able to schedule downtime in Icinga.
22   -  1. Spurious unavailability notifications/statuses should no longer occur. Missing unavailability notifications should be added.
23   -  1. All hosts should allow pings (ICMP traffic). ICMP is a critical part of the Internet infrastructure. Filtering it yields no benefit, but can make it impossible to check whether a host is down or just a service running on it.
 -  20  All requests and issues relating to monitoring should be processed through the Trac component Monitoring (see ‘Tickets’ below).
24  21
25   -  Subsidiary objectives are:
26   -  1. Hosting the Icinga server in a Docker container, for easier testing and maintenance.
27   -  1. A migration from Icinga 1.x to 2.x.
28   -  1. Reduction of technical complexity.
 -  22  ## Links ##
29  23
30   -  ### Activities ###
31   -  '''Meetings'''\\
32   -  * [wiki:./Meetings/20160218]
 -  24  For internal discussion, Benedikt (FZJ), Daniel (ASV), Dieter (CLARIN), Thomas Eckart (ASV), Thomas Hynek (ASV) and Sander (MPI-PL) are currently reachable via monitoring@clarin.eu.
33  25
34   -  # Current state #
 -  26  To view the current monitoring state, browse the [https://fsd-cloud22.fz-juelich.de/icinga/ Icinga frontend].
35  27
36   -  ## Functional information ##
 -  28  [wiki:SystemAdministration/Hosts/fsd-cloud22.zam.kfa-juelich.de About the monitoring host fsd-cloud22].
37  29
38   -  ### Monitored types of systems ###
 -  30  To see how changes to the Git repo have been propagated to the monitoring host, see [https://fsd-cloud22.fz-juelich.de:7011/logs/ Icinga configuration GitHub-host sync logs].
 -  31
 -  32  # Activities #
 -  33
 -  34  ## Tickets ##
 -  35
 -  36  [[TicketQuery(col=ticket|priority|summary|owner|created|modified,component=Monitoring,order=modified,desc=true,table)]]
 -  37
 -  38  ## Meetings ##
 -  39  [[./Meetings/2016-02-18]]
 -  40
 -  41  [[./Meetings/2016-04-04]]
 -  42
 -  43  # Details #
 -  44
 -  45  ## Monitored types of systems ##
39  46
40  47  For each centre, the following systems should be monitored:
 …   …
47  54  * websites
48  55
49   -  ### User groups ###
 -  56  In addition, all central services should be monitored both at frontend and backends, if any.
50  57
51   -  Centre technical contacts, as well as the CLARIN sysops, can view the status of centre services. Each centre's technical contact and possibly additional monitoring contacts, as declared in the [wiki:"Centre Registry" Centre Registry], will get an e-mail notification of important status changes (e.g., sudden unavailability). Furthermore, each of these contacts can view a detailed Icinga status for their services.
52   -  In addition, !NagVis geographical overviews are available publicly for Germany and the world.
 -  58  ## Checks ##
53  59
54   -  ### References ###
55   -
56   -  1. [https://fsd-cloud22.fz-juelich.de/icinga/ Icinga]
57   -  1. [https://fsd-cloud22.fz-juelich.de/nagvis/frontend/nagvis-js/index.php?mod=Map&act=view&show=Clarin-EU_Geomap NagVis geomap for CLARIN ERIC]
58   -  1. [https://fsd-cloud22.fz-juelich.de/nagvis/frontend/nagvis-js/index.php?mod=Map&act=view&show=Clarin-DE_Service_Overview NagVis geomap for CLARIN-D]
59   -  1. A [http://clarin-d.net/de/aktuell/status-infrastruktur status map of Germany]
60   -
61   -  ### Validity and availability checks ###
62  60  The Icinga server is set up to check '''availability''' (e.g., does an important HTTP URL work), '''validity''' (e.g., does an endpoint return well-formed and valid XML data?) and '''security''' (does the URL work with appropriate confidentiality and authentication, using e.g. TLS 1.2).
63  61
64   -  ## Technical information ##
65   -
66   -  Authentication is based on SAML. A [https://fsd-cloud22.fz-juelich.de/Shibboleth.sso/Metadata custom SP] interoperates with the CLARIN IdP for this. The eduPersonPrincipalName attribute is used for authorization. The value of this attribute is matched against the value recorded in the contact persons table of the Centre Registry. The CLARIN sysops have special access to views on all services. Centre technical contacts have access to views on their centre's services.
67   -
68   -  ### Contacts ###
69   -
70   -  '''Technical specialist on current Icinga setup''' \\
71   -  [[mailto:"Benedikt von St. Vith" <CLARIN-support@fz-juelich.de>]]
72   -
73   -  '''CLARIN technical contact person, some expertise on current Icinga setup''' \\
74   -  [[mailto:"Sander Maijers" <sander@clarin.eu>]]
75   -
76   -  ### References ###
77   -
78   -  1. The Icinga configuration is managed through a Git repo [https://github.com/clarin-eric/monitoring on GitHub].
79   -  1. [https://fsd-cloud22.fz-juelich.de:7011/logs/ Icinga configuration GitHub-host sync logs]
80   -  1. A barebones Icinga 1.x container that runs an Icinga daemon with the CLARIN ERIC and CLARIN-D configuration is available [https://github.com/clarin-eric/virtual_debian-monitoring on GitHub].
81   -
82   -  ### Implementation details of validity and availability checks ###
83  62  To check whether a service that is up is not just returning undesired data, some validity checks are in place. The response of calling the Identify verb on an OAI-PMH endpoint is validated with a set of XSD schemas. FCS endpoints are currently not validated, though some SRU XSD schemas are available, because of problems with the approach. In addition, services that use TLS are only checked using the current version of TLS, v1.2. Exceptions are hard-coded and should be minimized.
84  63
85   -  ### Dependencies, environmental requirements ###
 -  64  ## Users ##
86  65
87   -  || '''software''' || '''function''' ||
 -  66  Centre technical contacts, as well as the CLARIN sysops, can view the status of centre services. Each centre's technical contact and possibly additional monitoring contacts, as declared in the [wiki:"Centre Registry" Centre Registry], will get an e-mail notification of important status changes (e.g., sudden unavailability). Furthermore, each of these contacts can view a detailed Icinga status for their services.
 -  67  In addition, !NagVis geographical overviews are available publicly for [http://clarin-d.net/de/aktuell/status-infrastruktur Germany] and [https://fsd-cloud22.fz-juelich.de/nagvis/frontend/nagvis-js/index.php?mod=Map&act=view&show=Clarin-EU_Geomap the world].
 -  68
 -  69  ## Access ##
 -  70  Authentication is based on SAML. A [https://fsd-cloud22.fz-juelich.de/Shibboleth.sso/Metadata custom SP] interoperates with the CLARIN IdP for this. The `eduPersonPrincipalName` attribute is used for authorization. The value of this attribute is matched against the value recorded in the contact persons table of the Centre Registry. The ASV administrators and CLARIN sysops have special access to views on all services. Centre technical contacts have access to views on their centre's services.
 -  71
 -  72  ## Dependencies, system requirements ##
 -  73
 -  74  ||= '''software''' =||= '''function''' =||
88  75  || [http://docs.icinga.org/latest/en/nsca.html NSCA] || Monitoring information from MPCDF. Requires Icinga UNIX socket. Important for CLARIN-D because of the distributed infrastructure. ||
89  76  || [http://docs.icinga.org/latest/en/configido.html IDO2DB] || writes Icinga information to MySQL ||
 …   …
94  81  || [http://www.nagvis.org/ NagVis] || visualizes service status geographically. ||
95  82  || [http://docs.icinga.org/latest/en/nrpe.html NRPE package] || For checks of host-local properties, such as disk space, memory usage, etc. ||
 -  83
 -  84  # History #
 -  85
 -  86  During 2011-2015, [[mailto:"Sander Maijers" <sander@clarin.eu>]] and [[mailto:"Dieter Van Uytvanck" <dieter@clarin.eu>]] maintained a Nagios server on [wiki:SystemAdministration/Hosts/ems04.mpi.nl ems04] that covered CLARIN-D related services hosted at [wiki:SystemAdministration/Hosters/MPI-PL MPI-PL]. Since 2012, [[mailto:"Benedikt von St. Vith" <b.von.st.vieth@fz-juelich.de>]] has maintained a CLARIN-D wide Nagios hosted at [wiki:SystemAdministration/Hosters/FZJ FZJ], and later Icinga 1.x, server on [wiki:SystemAdministration/Hosts/clarin.fz-juelich.de clarin.fz-juelich.de]. During 2015, all CLARIN (i.e., ERIC) related checks have been migrated from `ems04` to the latter host `clarin.fz-juelich.de`. The Icinga configuration on this host is controlled only indirectly by [[mailto:"Dieter, Sander and Willem" <sysops@clarin.eu>]] (CLARIN sysops), as described in the following sections. Direct system administration, e.g. Apache httpd and Icinga daemon state, as well as SSH access have been exclusive to Benedikt for FZJ policy reasons. In 2016, Benedikt set up a new host for CLARIN wide Icinga-based infra monitoring `fsd-cloud22`. This host has been configured as a replica of `clarin.fz-juelich.de`. [[mailto:"Thomas Eckart" <teckart@informatik.uni-leipzig.de>]], [[mailto:"Dirk Goldhahn" <dgoldhahn@informatik.uni-leipzig.de>]] from the [https://centres.clarin.eu/centre/4 ASV centre] took over service responsibility from Benedikt.
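
For readers unfamiliar with how such checks work, the availability, validity and TLS 1.2 checks described in the Checks rows above can be sketched as follows. This is only an illustration under stated assumptions, not the CLARIN `probe_curl.sh` script: the function names `classify_response` and `probe` are hypothetical, the sketch tests XML well-formedness only (weaker than the XSD validation the page describes), and pinning the connection to TLS 1.2 is one possible reading of "only checked using TLS v1.2". The exit codes follow the usual monitoring-plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import ssl
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

# Conventional monitoring-plugin exit codes understood by Nagios and Icinga.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def classify_response(status_code, body):
    """Map an HTTP status and response body to (exit_code, message).

    Hypothetical helper: availability is judged by the status code,
    validity only by well-formedness plus the expected OAI-PMH elements.
    """
    # Availability: anything other than HTTP 200 counts as unavailable here.
    if status_code != 200:
        return CRITICAL, f"CRITICAL - HTTP status {status_code}"
    # Validity (weak form): the body must at least be well-formed XML.
    try:
        root = ET.fromstring(body)
    except ET.ParseError as error:
        return CRITICAL, f"CRITICAL - response is not well-formed XML: {error}"
    # An Identify response has an OAI-PMH root element with an Identify child.
    if root.tag != f"{{{OAI_NS}}}OAI-PMH":
        return WARNING, f"WARNING - unexpected root element {root.tag}"
    if root.find(f"{{{OAI_NS}}}Identify") is None:
        return WARNING, "WARNING - no Identify element in response"
    return OK, "OK - endpoint is up and returns an Identify response"


def probe(url, timeout=10):
    """Fetch url over TLS 1.2 only (an assumption) and classify the result."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    context.maximum_version = ssl.TLSVersion.TLSv1_2
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
            return classify_response(response.status, response.read())
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status.
        return classify_response(error.code, b"")
    except (urllib.error.URLError, OSError) as error:
        # DNS failure, refused connection, TLS handshake failure, timeout.
        return CRITICAL, f"CRITICAL - cannot reach endpoint: {error}"
```

A wrapper script would call `probe()` with an endpoint's `?verb=Identify` URL, print the message and exit with the returned code, which is how a monitoring daemon such as Icinga picks up the state of a check.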