Changes between Version 81 and Version 82 of SystemAdministration/Monitoring/Icinga/Outdated


Timestamp:
10/07/15 10:55:07 (9 years ago)
Author:
Sander Maijers
Comment:

Complete overhaul. Removal of outdated content. Restructuring into distinct monitoring domains (CLARIN-D, ERIC, etc.). Listing of dependencies.

  • SystemAdministration/Monitoring/Icinga/Outdated

    v81 v82  
    1 == Currently reachable via ==
     1= Current setup
     2
     3== CLARIN ERIC & CLARIN-D
     4
     5We use Icinga, hosted and managed by Forschungszentrum Jülich. Contact: Benedikt von St. Vith <CLARIN-support@fz-juelich.de>
     6
     7Users can view the status of each centre and its service(s) on NagVis. Each centre's technical contact, and possibly additional monitoring contacts, as declared in the [wiki:"Centre Registry" Centre Registry], get an e-mail notification of important status changes (e.g. an outage). Furthermore, each of these contacts can view a detailed Icinga status page about their services. The CLARIN ERIC sysops <sysops@clarin.eu> have access to all of Icinga except for scheduling downtime and actions that require shell access and permissions (e.g. restarting the Icinga daemon). These are handled by Benedikt von St. Vith.
     8
     9=== URLs
     10
     11 * Icinga: https://clarin.fz-juelich.de/icinga/
     12 * NagVis geomap for CLARIN ERIC: https://clarin.fz-juelich.de/nagvis/frontend/nagvis-js/index.php?mod=Map&act=view&show=Clarin-EU_Geomap
     13 * NagVis geomap for CLARIN-D: https://clarin.fz-juelich.de/nagvis/frontend/nagvis-js/index.php?mod=Map&act=view&show=Clarin-DE_Service_Overview
    214
    315 * http://clarin-d.de/de/aktuelles/status-infrastruktur.html
    416 * http://clarin-d.de/status (planned)
     17 * Visualisation as a [http://www.clarin-d.de/images/karte.png map of Germany] under http://de.clarin.eu/status. Currently at http://clarin-d.de/de/aktuelles/status-infrastruktur.html (for Joomla users)
    518
    6 == Server + services monitoring tools ==
     19=== Software stack
     20 
     21Icinga 1.x (Nagios fork, same format for plugins, etc.): https://www.icinga.org/
    722
    8  * Nagios: http://www.nagios.org/
    9  * Icinga (nagios fork, same format for plugins etc): https://www.icinga.org/
    10  * clients/remote servers will need to have the Nagios NRPE package (only for local checks like diskspace, mem etc) and port 5666 open
     23A barebones Icinga 1.x container that runs an Icinga daemon with the CLARIN ERIC and CLARIN-D configuration is available at: https://github.com/clarin-eric/virtual_debian-monitoring
    1124
     25==== Configuration
    1226
    13 == AAI monitoring ==
     27The current configuration is managed through https://github.com/clarin-eric/monitoring.
     28
     29==== Dependencies of current setup
     30
     31- NSCA: Monitoring information from RZG. Requires Icinga UNIX socket. Important for CLARIN-D because of the distributed infrastructure.
     32- IDO2DB: Writing Icinga information to MySQL
     33- pnp4nagios: Not a service itself, but needs access to Icinga output to create graphs.
     34- npcd: Processing Icinga performance data for pnp4nagios.
     35- shibd: Shibboleth authentication for web interface.
     36- PyNag: Python library to process/manipulate Nagios/Icinga 1.x configuration.
     37- NagVis: Used to visualise service status geographically.
     38
     39CLARIN-D hosts whose services and status are to be checked need the Nagios NRPE package installed (only for local checks such as disk space, memory, etc.) and port 5666 open.
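
Whether a host actually has port 5666 open can be verified from the monitoring side before NRPE checks are configured. The following is a minimal sketch in Python 3, not part of the CLARIN configuration: the function names `port_open` and `nrpe_port_status` are hypothetical, and the check only tests TCP reachability, not the NRPE protocol itself.

```python
import socket

# Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2

NRPE_PORT = 5666  # port named in the text above


def port_open(host, port=NRPE_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def nrpe_port_status(host, port=NRPE_PORT, timeout=5.0):
    """Map the reachability result to an (exit code, message) pair."""
    if port_open(host, port, timeout):
        return OK, "OK - %s accepts TCP connections on port %d" % (host, port)
    return CRITICAL, "CRITICAL - %s does not accept TCP connections on port %d" % (host, port)
```

A wrapper script could print the message and exit with the returned code, which is the convention Nagios-compatible plugins follow. A host behind a firewall that only admits the monitoring server would need to be tested from that server.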
     40
     41=== Centre requirements
     42
     43The monitoring requirements of CLARIN-D are modest: a simple Icinga installation is sufficient to fulfil them.
     44 * SAML Service Provider (SP) status pages must be reachable from other servers
     45 * Regular checks if hosts are up and reachable (ping/ICMP).
     46 * Regular checks if essential networked services and applications are working (e.g. HTTP).
     47 * The server running the monitoring software is also monitored itself.
     48
     49== AAI
     50
     51=== Requirements
     52
     53As of September 2015, CLARIN-PLUS Task 2.1 covers work on improving monitoring of AAI infrastructure (CLARIN SPF & CLARIN Identity and Access Management).
     54A ‘working’ endpoint is defined as one that does not return an error-signalling HTTP status code beyond 404.
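
The definition of a ‘working’ endpoint can be turned into a small check. The sketch below is one possible reading, not the implemented check: it assumes that "beyond 404" means any error status (400 and above) other than 404 itself, and the names `is_working` and `check_endpoint` are hypothetical.

```python
import urllib.error
import urllib.request

# Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def is_working(status_code):
    """Assumed reading: error codes (>= 400) other than 404 mean 'not working'."""
    return status_code < 400 or status_code == 404


def check_endpoint(url, timeout=10.0):
    """Request url and return an (exit code, message) pair."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the code is still available.
        status = error.code
    except OSError as error:
        # URLError, timeouts and connection failures are all OSError subclasses.
        return CRITICAL, "CRITICAL - %s unreachable: %s" % (url, error)
    if is_working(status):
        return OK, "OK - %s returned HTTP %d" % (url, status)
    return CRITICAL, "CRITICAL - %s returned HTTP %d" % (url, status)
```

Keeping the status classification in its own function makes it easy to adjust if the intended meaning of "beyond 404" turns out to be different.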
     55
     56We should monitor:
     57 * SAML metadata batches about SPF SPs and IdPs for availability, security, etc. '''IMPLEMENTED'''.
     58 * Discovery Service availability. '''IMPLEMENTED'''.
     59 * Upstream identity federation SAML metadata batches about IdPs for availability, security, etc. '''IMPLEMENTED, NEEDS IMPROVEMENT'''.
     60 * SAML metadata batches about SPF IdPs, whether they continually contain all IdPs in each identity federation.
     61 * Once all SAML Service Provider status pages of SPF SPs are reachable from outside, these pages for availability and whether they work.
     62 * Correct operation of SPs:
     63    - Does the login page work?
     64    - Do all endpoints work?
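
For the item about SPF IdP metadata batches continually containing all IdPs, a check could compare the entity IDs in a batch against an expected list. This is a minimal sketch under assumptions: the expected list would have to come from somewhere not specified here, and the function names are hypothetical. It relies only on the standard SAML 2.0 metadata namespace.

```python
import xml.etree.ElementTree as ET

# Standard SAML 2.0 metadata namespace, in ElementTree's {uri}tag notation.
MD_NS = "{urn:oasis:names:tc:SAML:2.0:metadata}"


def idp_entity_ids(metadata_xml):
    """Return the entityID of every entity that has an IDPSSODescriptor."""
    root = ET.fromstring(metadata_xml)
    return {
        entity.get("entityID")
        for entity in root.iter(MD_NS + "EntityDescriptor")
        if entity.find(MD_NS + "IDPSSODescriptor") is not None
    }


def missing_idps(metadata_xml, expected_entity_ids):
    """Return the expected IdP entity IDs absent from the batch, sorted."""
    return sorted(set(expected_entity_ids) - idp_entity_ids(metadata_xml))
```

An empty result from `missing_idps` would map to an OK state, a non-empty one to a warning or critical state that lists the missing entity IDs.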
     65
     66== Locally at centres
     67
     68=== Guidelines
     69
     70 * web services and applications
     71   - remote API
     72   - user interface
     73 * repositories
     74 * Handle servers (in case centres have their own, like IDS?)
     75 * sample PID URLs
     76 * websites
     77
     78== Resources
     79
     80=== AAI
    1481
    1582 * AAI eye: http://www.csc.fi/english/institutions/haka/instructions/services-tech/aaieye
     
    1885 * https://svn.ms.mff.cuni.cz/redmine/projects/dspace-modifications/wiki/AAIShibbie
    1986
    20 == MPI services monitoring ==
     87----
     88'''NOTE''': the following information is outdated and no longer relevant. All CLARIN ERIC centres now have their services monitored centrally. It is kept for reference for centres interested in the (older) separate plugins or the historical work.
    2189
    22  * https://infra.clarin.eu/nagios3/
    23  * see: https://trac.clarin.eu/browser/monitoring/plugins/mpi (includes SRU/CQL and OAI-PMH probe)
    24 
    25 == CLARIN-D monitoring requirements ==
    26 
    27 The monitoring requirements of CLARIN-D are pretty modest - a simple Icinga or Nagios installation would be sufficient to fulfill all the needs:
    28 
    29  * Regular checks if hosts are up and reachable (ping).
    30  * Regular checks if certain network services are working (e.g. http).
    31  * Each center can provide [http://nagiosplug.sourceforge.net/developer-guidelines.html nagios plugins] to assess the state of that center's services.
    32  * Each center can register one or more contact persons. They will get an e-mail with a warning if a host or service is not working correctly.
    33  * The server running the monitoring software is also monitored itself.
    34  * Users (not only the center administrators) should be able to see the status of each center and its service(s) on a website.
    35  * For access to the web interface of icinga/nagios authentication & authorization via shibboleth would be nice.
    36  * (IP) addresses for external probes should be delivered via the [https://centerregistry-clarin.esc.rzg.mpg.de/ Center Registry] ([https://trac.clarin.eu/wiki/CenterRegistry docs])
    37  * As visualisation a [http://www.clarin-d.de/images/karte.png map of Germany] under http://de.clarin.eu/status, currently [http://clarin-d.de/de/aktuelles/status-infrastruktur.html here] (for Joomla users) via NagVis plugin
    38    - with traffic light alarm indication (?) maybe [http://www.laendercheck-wissenschaft.de/archiv/privater_hochschulsektor/status_quo/status_quo_deutschlandkarte.jpg like this] (?)
    39    - with graphs to see how long or how often services have been unavailable in the past? Via link to nagios?
     90A centre can monitor its own services. The following example monitoring plugins in Python 2 can assess SRU/CQL and OAI-PMH endpoints.
     91https://trac.clarin.eu/browser/monitoring/plugins/mpi
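
To illustrate what such an endpoint probe involves, here is a minimal sketch of an OAI-PMH `Identify` check. It is not the plugin from the repository above: it is written for Python 3, the function names are hypothetical, and it only uses the `Identify` verb and response namespace defined by OAI-PMH 2.0.

```python
import urllib.request
import xml.etree.ElementTree as ET

# OAI-PMH 2.0 response namespace, in ElementTree's {uri}tag notation.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def evaluate_identify(xml_text):
    """Judge an Identify response and return an (exit code, message) pair."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as error:
        return CRITICAL, "CRITICAL - response is not well-formed XML: %s" % error
    if root.tag != OAI_NS + "OAI-PMH":
        return CRITICAL, "CRITICAL - unexpected root element %s" % root.tag
    error_element = root.find(OAI_NS + "error")
    if error_element is not None:
        return CRITICAL, "CRITICAL - OAI-PMH error: %s" % error_element.get("code")
    name = root.findtext(OAI_NS + "Identify/" + OAI_NS + "repositoryName")
    if not name:
        return CRITICAL, "CRITICAL - no repositoryName in Identify response"
    return OK, "OK - repository '%s' answered Identify" % name


def check_oai_pmh(base_url, timeout=10.0):
    """Fetch base_url?verb=Identify and evaluate the response."""
    try:
        with urllib.request.urlopen(base_url + "?verb=Identify", timeout=timeout) as response:
            body = response.read()
    except OSError as error:
        # HTTP errors, timeouts and connection failures are all OSError subclasses.
        return CRITICAL, "CRITICAL - %s unreachable: %s" % (base_url, error)
    return evaluate_identify(body)
```

Separating the evaluation from the HTTP request lets the parsing logic be exercised with canned responses. An SRU/CQL probe could follow the same pattern with its own request and response format.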
    4092
    4193||  Service Types / Tests  ||= ping =||= http =||= disk space =||= load =||= free mem =||= users =||= functional check =||= query duration time =||
     
    50102||= Handle Servers=||  *  ||  ||  ||  ||  ||  ||  #(EUDAT/Jülich probe?)  ||  #(Eric's timeout [https://svn.clarin.eu/monitoring/plugins/mpi/HandleSystem/ probe])  ||
    51103||= resolve a sample PID for each repository=|| ||  ||  ||  ||  ||  ||  #  ||  #  ||
    52 ||= Center Registry=||  *  ||  ||  ||  ||  ||  ||  #  || ||
     104||= Centre Registry=||  *  ||  ||  ||  ||  ||  ||  #  || ||
    53105||= WebLicht webserver=||  *  ||  #  ||  ||  ||  ||  ||  || ||
    54 ||= VLO webserver=||  *  ||  #  ||  ||  ||  ||  ||  || ||
    55 ||= TLA webserver=||  *  ||  #  ||  ||  ||  ||  ||  || ||
    56106||= other webservers=||  *  ||  #  ||  ||  ||  ||  ||  || ||
    57107||= Nagios servers (selfcheck)=||  *  ||  #  ||    ||  ||  ||  ||  #(check_nagios plugin)  || ||
    58 ||= Nagios servers crosscheck (from other center)=||  *  ||  ||  ||  ||  ||  ||  #(check_nagios plugin)  || ||
     108||= Nagios servers crosscheck (from other centre)=||  *  ||  ||  ||  ||  ||  ||  #(check_nagios plugin)  || ||
    59109||= Workspaces server (not yet)=||  n.a.  ||  ||  n.a.  ||  ||  ||  ||  n.a.  || ||
    60110
    61111# mandatory; * optional
    62 
    63 == Centre requirements ==
    64 * Shibboleth Identity Providers (IdP) status pages must be reachable from other servers
    65 * Shibboleth Service Providers (SP) status pages must be reachable from other servers
    66 
    67 
    68 == [https://trac.clarin.eu/wiki/CenterRegistry center registry] requirements ==
    69 
    70 '''Current priorities:'''
    71 * Federated content search endpoints  (multiple per center)
    72 * OAI-PMH end points
    73 
    74 Other points:
    75 
    76 add server data for:
    77 * Shibboleth Identity Providers (IdP)
    78 * Shibboleth Service Providers (SP) (multiple per center)
    79 * Shibboleth Where are You From servers (WAYF, currently only one available?)
    80 * REST Webservices (multiple per center)
    81 * Repositories
    82 * Handle servers (in case centers have their own, like IDS?)
    83 * sample PID URLs per center
    84 * nagios servers (if available per center)
    85 and a list of diverse webservers (VLO, TLA, WebLicht)
    86 
    87 == Nagios server requirements ==
    88