1 | | == Currently reachable via == |
| 1 | = Current setup |
| 2 | |
| 3 | == CLARIN ERIC & CLARIN-D |
| 4 | |
| 5 | We use Icinga, hosted and managed by Forschungszentrum Jülich. Contact: Benedikt von St. Vith <CLARIN-support@fz-juelich.de> |
| 6 | |
| 7 | Users can view the status of each centre and its service(s) on NagVis. Each centre's technical contact and possibly additional monitoring contacts, as declared in the [wiki:"Centre Registry" Centre Registry], will get an e-mail notification of important status changes (e.g. outage). Furthermore, each of these contacts can view a detailed Icinga status page about their services. The CLARIN ERIC sysops <sysops@clarin.eu> have access to all of Icinga, except for scheduling downtime, and actions that require shell access and permissions (e.g. restarting the Icinga daemon). The latter is managed by Benedikt von St. Vith. |
| 8 | |
| 9 | === URLs |
| 10 | |
| 11 | * Icinga: https://clarin.fz-juelich.de/icinga/ |
| 12 | * NagVis geomap for CLARIN ERIC: https://clarin.fz-juelich.de/nagvis/frontend/nagvis-js/index.php?mod=Map&act=view&show=Clarin-EU_Geomap |
| 13 | * NagVis geomap for CLARIN-D: https://clarin.fz-juelich.de/nagvis/frontend/nagvis-js/index.php?mod=Map&act=view&show=Clarin-DE_Service_Overview |
13 | | == AAI monitoring == |
| 27 | The current configuration is managed through https://github.com/clarin-eric/monitoring. |
| 28 | |
| 29 | ==== Dependencies of current setup |
| 30 | |
| 31 | - NSCA: Monitoring information from RZG. Requires Icinga UNIX socket. Important for CLARIN-D because of the distributed infrastructure. |
| 32 | - IDO2DB: Writing Icinga information to MySQL |
| 33 | - pnp4nagios: Not a service itself, but needs access to Icinga output to create graphs.
| 34 | - npcd: Processing Icinga performance data for pnp4nagios. |
| 35 | - shibd: Shibboleth authentication for web interface. |
| 36 | - PyNag: Python library to process/manipulate Nagios/Icinga 1.x configuration. |
| 37 | - NagVis: Used to visualise service status geographically. |
| 38 | |
| 39 | CLARIN-D hosts whose services and status are to be checked need the Nagios NRPE package installed (only for local checks such as disk space, memory, etc.) and port 5666 open.
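
Whether port 5666 is reachable from the monitoring host can be tested before any NRPE check is configured. The following is a minimal sketch using only the Python standard library; the function name `port_is_open` and the default timeout are illustrative assumptions, not part of the current setup.

```python
import socket


def port_is_open(host: str, port: int = 5666, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only shows that something accepts connections on the port.
    It does not verify that the listener is actually an NRPE daemon.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # "centre.example.org" is a placeholder host name.
    print(port_is_open("centre.example.org"))
```

A successful connection says nothing about NRPE's allowed-hosts configuration, so a real NRPE check from the Icinga server is still needed afterwards.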
| 40 | |
| 41 | === Centre requirements |
| 42 | |
| 43 | The monitoring requirements of CLARIN-D are modest: a simple Icinga installation, for example, would be sufficient to fulfil them.
| 44 | * SAML Service Provider (SP) status pages must be reachable from other servers.
| 45 | * Regular checks if hosts are up and reachable (ping/ICMP). |
| 46 | * Regular checks if essential networked services and applications are working (e.g. HTTP). |
| 47 | * The server running the monitoring software is also monitored itself. |
| 48 | |
| 49 | == AAI |
| 50 | |
| 51 | === Requirements |
| 52 | |
| 53 | As of September 2015, CLARIN-PLUS Task 2.1 covers work on improving monitoring of AAI infrastructure (CLARIN SPF & CLARIN Identity and Access Management). |
| 54 | A ‘working’ endpoint is defined as one that does not return an error-signalling HTTP status code beyond 404.
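
The definition of a ‘working’ endpoint can be expressed as a small predicate. The sketch below assumes one possible reading of the definition: 4xx and 5xx codes count as errors, except 404, which is tolerated. If the intended reading is instead that only codes above 404 count as errors, the condition would have to change. The function names are hypothetical; the exit codes 0 (OK) and 2 (CRITICAL) follow the usual Nagios plugin convention.

```python
NAGIOS_OK = 0
NAGIOS_CRITICAL = 2


def is_working(status_code: int) -> bool:
    """Apply the assumed reading: any 4xx/5xx code except 404 is a failure."""
    is_error = 400 <= status_code <= 599
    return (not is_error) or status_code == 404


def nagios_exit_code(status_code: int) -> int:
    """Map an HTTP status code to a Nagios plugin exit code."""
    return NAGIOS_OK if is_working(status_code) else NAGIOS_CRITICAL


if __name__ == "__main__":
    for code in (200, 302, 404, 403, 500):
        print(code, is_working(code), nagios_exit_code(code))
```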
| 55 | |
| 56 | We should monitor: |
| 57 | * SAML metadata batches about SPF SPs and IdPs for availability, security, etc. '''IMPLEMENTED'''. |
| 58 | * Discovery Service availability. '''IMPLEMENTED'''. |
| 59 | * Upstream identity federation SAML metadata batches about IdPs for availability, security, etc. '''IMPLEMENTED, NEEDS IMPROVEMENT'''. |
| 60 | * SAML metadata batches about SPF IdPs, whether they continually contain all IdPs in each identity federation. |
| 61 | * Once all SAML Service Provider status pages of SPF SPs are reachable from outside, these pages for availability and whether they work. |
| 62 | * Correct operation of SPs: |
| 63 | - Does the login page work? |
| 64 | - Do all endpoints work? |
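
Two of the metadata checks listed above (batch validity, and whether a batch still contains every expected IdP) can be sketched with the Python standard library. This is an illustration under stated assumptions, not the implementation in the monitoring repository: it only reads the `validUntil` attribute on the root element and does not verify the XML signature, and the names `summarize_metadata` and `missing_idps` are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

MD_NS = "urn:oasis:names:tc:SAML:2.0:metadata"


def parse_valid_until(value):
    """Parse an xs:dateTime such as 2030-01-01T00:00:00Z into an aware datetime."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: a value without a time zone is meant as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def summarize_metadata(xml_text, now=None):
    """Return the expiry state and the entityIDs of all IdPs in a batch."""
    root = ET.fromstring(xml_text)
    now = now or datetime.now(timezone.utc)
    valid_until = root.get("validUntil")
    # None means the batch declares no validUntil at all.
    expired = None if valid_until is None else parse_valid_until(valid_until) <= now
    idps = [
        entity.get("entityID")
        for entity in root.iter(f"{{{MD_NS}}}EntityDescriptor")
        if entity.find(f"{{{MD_NS}}}IDPSSODescriptor") is not None
    ]
    return {"expired": expired, "idp_entity_ids": idps}


def missing_idps(expected_entity_ids, found_entity_ids):
    """Return the expected IdP entityIDs absent from the batch, sorted."""
    return sorted(set(expected_entity_ids) - set(found_entity_ids))
```

Comparing the IdP list of the SPF batch against the list from each upstream federation batch with `missing_idps` would show IdPs that were dropped along the way.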
| 65 | |
| 66 | == Locally at centres |
| 67 | |
| 68 | === Guidelines |
| 69 | |
| 70 | * web services and applications |
| 71 | - remote API |
| 72 | - user interface |
| 73 | * repositories |
| 74 | * Handle servers (in case centers have their own, like IDS?) |
| 75 | * sample PID URLs |
| 76 | * websites |
| 77 | |
| 78 | == Resources |
| 79 | |
| 80 | === AAI |
22 | | * https://infra.clarin.eu/nagios3/ |
23 | | * see: https://trac.clarin.eu/browser/monitoring/plugins/mpi (includes SRU/CQL and OAI-PMH probe) |
24 | | |
25 | | == CLARIN-D monitoring requirements == |
26 | | |
27 | | The monitoring requirements of CLARIN-D are modest: a simple Icinga or Nagios installation would be sufficient to fulfil all of them:
28 | | |
29 | | * Regular checks if hosts are up and reachable (ping). |
30 | | * Regular checks if certain network services are working (e.g. http). |
31 | | * Each center can provide [http://nagiosplug.sourceforge.net/developer-guidelines.html nagios plugins] to assess the state of that center's services. |
32 | | * Each center can register one or more contact persons. They will get an e-mail with a warning if a host or service is not working correctly. |
33 | | * The server running the monitoring software is also monitored itself. |
34 | | * Users (not only the center administrators) should be able to see the status of each center and its service(s) on a website. |
35 | | * For access to the Icinga/Nagios web interface, authentication and authorization via Shibboleth would be nice.
36 | | * (IP) addresses for external probes should be delivered via the [https://centerregistry-clarin.esc.rzg.mpg.de/ Center Registry] ([https://trac.clarin.eu/wiki/CenterRegistry docs])
37 | | * As a visualisation, a [http://www.clarin-d.de/images/karte.png map of Germany] at http://de.clarin.eu/status, currently [http://clarin-d.de/de/aktuelles/status-infrastruktur.html here] (for Joomla users), via the NagVis plugin
38 | | - with traffic light alarm indication (?) maybe [http://www.laendercheck-wissenschaft.de/archiv/privater_hochschulsektor/status_quo/status_quo_deutschlandkarte.jpg like this] (?) |
39 | | - with graphs to see how long or how often services have been unavailable in the past? Via link to nagios? |
| 90 | A centre can monitor its own services. The following example monitoring plugins in Python 2 can assess SRU/CQL and OAI-PMH endpoints. |
| 91 | https://trac.clarin.eu/browser/monitoring/plugins/mpi |
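
As an illustration of what such a probe does, the following Python 3 sketch checks an OAI-PMH endpoint with the `Identify` verb and reports the result using the usual Nagios exit codes. It is not the code of the linked plugins; the function names and the choice of WARNING for a missing repository name are assumptions.

```python
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
OK, WARNING, CRITICAL = 0, 1, 2


def check_identify_response(xml_text):
    """Classify an OAI-PMH Identify response as (exit_code, message)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return CRITICAL, f"CRITICAL - response is not well-formed XML: {exc}"
    error = root.find(f"{{{OAI_NS}}}error")
    if error is not None:
        return CRITICAL, f"CRITICAL - OAI-PMH error '{error.get('code')}'"
    name = root.findtext(f"{{{OAI_NS}}}Identify/{{{OAI_NS}}}repositoryName")
    if name is None:
        return WARNING, "WARNING - no Identify/repositoryName in response"
    return OK, f"OK - repository '{name}' answered Identify"


def probe(base_url, timeout=10.0):
    """Fetch base_url?verb=Identify and classify the response."""
    try:
        with urllib.request.urlopen(f"{base_url}?verb=Identify", timeout=timeout) as resp:
            body = resp.read()
    except OSError as exc:
        # URLError and HTTPError are subclasses of OSError.
        return CRITICAL, f"CRITICAL - request failed: {exc}"
    return check_identify_response(body)


if __name__ == "__main__":
    import sys
    # Usage (placeholder URL): python check_oai.py https://repo.example.org/oai
    code, message = probe(sys.argv[1])
    print(message)
    sys.exit(code)
```

Keeping the response classification separate from the HTTP request makes it possible to test the probe against stored responses without network access.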
62 | | |
63 | | == Centre requirements == |
64 | | * Shibboleth Identity Providers (IdP) status pages must be reachable from other servers |
65 | | * Shibboleth Service Providers (SP) status pages must be reachable from other servers
66 | | |
67 | | |
68 | | == [https://trac.clarin.eu/wiki/CenterRegistry center registry] requirements == |
69 | | |
70 | | '''Current priorities:''' |
71 | | * Federated content search endpoints (multiple per center) |
72 | | * OAI-PMH end points |
73 | | |
74 | | Other points: |
75 | | |
76 | | add server data for: |
77 | | * Shibboleth Identity Providers (IdP) |
78 | | * Shibboleth Service Providers (SP) (multiple per center) |
79 | | * Shibboleth Where are You From servers (WAYF, currently only one available?) |
80 | | * REST Webservices (multiple per center) |
81 | | * Repositories |
82 | | * Handle servers (in case centers have their own, like IDS?) |
83 | | * sample PID URLs per center |
84 | | * nagios servers (if available per center) |
85 | | and a list of various web servers (VLO, TLA, WebLicht)
86 | | |
87 | | == Nagios server requirements == |
88 | | |