= Current setup

== CLARIN ERIC & CLARIN-D

We use Icinga, hosted and managed by Forschungszentrum Jülich. Contact: Benedikt von St. Vith.

Users can view the status of each centre and its service(s) on NagVis. Each centre's technical contact, and possibly additional monitoring contacts, as declared in the [wiki:"Centre Registry" Centre Registry], will receive an e-mail notification of important status changes (e.g. an outage). Furthermore, each of these contacts can view a detailed Icinga status page about their services.

The CLARIN ERIC sysops have access to all of Icinga, except for scheduling downtime and for actions that require shell access and permissions (e.g. restarting the Icinga daemon). The latter are managed by Benedikt von St. Vith.

=== URLs

* Icinga: https://clarin.fz-juelich.de/icinga/
* NagVis geomap for CLARIN ERIC: https://clarin.fz-juelich.de/nagvis/frontend/nagvis-js/index.php?mod=Map&act=view&show=Clarin-EU_Geomap
* NagVis geomap for CLARIN-D: https://clarin.fz-juelich.de/nagvis/frontend/nagvis-js/index.php?mod=Map&act=view&show=Clarin-DE_Service_Overview
* http://clarin-d.de/de/aktuelles/status-infrastruktur.html
* http://clarin-d.de/status (planned)
* As a visualisation, a [http://www.clarin-d.de/images/karte.png map of Germany] under http://de.clarin.eu/status. Currently at http://clarin-d.de/de/aktuelles/status-infrastruktur.html (for Joomla users).

=== Software stack

Icinga 1.x (a Nagios fork, using the same format for plugins etc.): https://www.icinga.org/

A barebones Icinga 1.x container that runs an Icinga daemon with the CLARIN ERIC and CLARIN-D configuration is available at: https://github.com/clarin-eric/virtual_debian-monitoring

==== Configuration

The current configuration is managed through https://github.com/clarin-eric/monitoring.

==== Dependencies of current setup

 - NSCA: monitoring information from RZG. Requires the Icinga UNIX socket. Important for CLARIN-D because of the distributed infrastructure.
 - IDO2DB: writes Icinga information to MySQL.
 - pnp4nagios: not a service itself, but needs access to Icinga outputs to create graphs.
 - npcd: processes Icinga performance data for pnp4nagios.
 - shibd: Shibboleth authentication for the web interface.
 - PyNag: Python library to process/manipulate Nagios/Icinga 1.x configuration.
 - NagVis: used to visualise service status geographically.

CLARIN-D hosts whose services and status are to be checked need to have the Nagios NRPE package installed (only for local checks such as disk space, memory, etc.) and port 5666 open.

=== Centre requirements

The monitoring requirements of CLARIN-D are modest: a simple Icinga installation would be sufficient to fulfill them.

* SAML Service Provider (SP) status pages must be reachable from other servers.
* Regular checks whether hosts are up and reachable (ping/ICMP).
* Regular checks whether essential networked services and applications are working (e.g. HTTP).
* The server running the monitoring software is itself also monitored.

== AAI

=== Requirements

As of September 2015, CLARIN-PLUS Task 2.1 covers work on improving monitoring of the authentication and authorization infrastructure (CLARIN SPF & CLARIN Identity and Access Management).

(A 'working' endpoint is defined as one that does not return an error-signalling HTTP response code beyond 404 upon a request.)

We should monitor:

* SAML metadata batches about SPF SPs and IdPs, for availability, security, etc. '''IMPLEMENTED'''.
* Discovery Service availability. '''IMPLEMENTED'''.
* Upstream identity federation SAML metadata batches about IdPs, for availability, security, etc. '''IMPLEMENTED, NEEDS IMPROVEMENT'''.
* SAML metadata batches about SPF IdPs: whether they continually contain all IdPs in each identity federation.
* Once all SAML Service Provider status pages of SPF SPs are reachable from outside: these pages, for availability and whether they work.
* Correct operation of SPs:
  - Does the login page work?
  - Do all endpoints work?

== Locally at centres

=== Guidelines

* Web services and applications:
  - remote API
  - user interface
* Repositories
* Handle servers (in case centres have their own, like IDS?)
* Sample PID URLs
* Websites

== Resources

=== AAI

* AAI eye: http://www.csc.fi/english/institutions/haka/instructions/services-tech/aaieye
* RAPTOR: http://iam.cf.ac.uk/trac/RAPTOR
* IDS: https://trac.clarin.eu/browser/monitoring/plugins/ids
* https://svn.ms.mff.cuni.cz/redmine/projects/dspace-modifications/wiki/AAIShibbie

= Future setup

* Migrate from the Icinga 1.x daemon to the [https://www.icinga.org/icinga/icinga-2/ Icinga 2] daemon and the Icinga Web 2 frontend.
  - The configuration can be migrated [http://docs.icinga.org/icinga2/latest/doc/module/icinga2/toc#!/icinga2/latest/doc/module/icinga2/chapter/migration automatically], with some manual cleanup afterwards.
* Reduce the large set of dependencies external to the standard Icinga 2 distribution to a minimum.
* Use a (Docker) container-based virtualised software stack for easy (re)deployment and testing.
* Simplify the configuration workflow, currently described in the [https://github.com/clarin-eric/monitoring/blob/master/README.md README] of the GitHub repository for the current configuration.
  - Use a protected branch for the successfully tested production configuration.

----

'''NOTE''': the following information is outdated and no longer relevant. All CLARIN ERIC centres have their services monitored centrally. It is kept for reference for centres interested in (older) separate plugins or in the historical work.

A centre can monitor its own services. The following example monitoring plugins in Python 2 can assess SRU/CQL and OAI-PMH endpoints.
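Such plugins follow the standard Nagios/Icinga 1.x plugin contract: print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). As a hedged illustration of that contract, and not a copy of the linked plugins, a minimal Python 3 sketch of an OAI-PMH availability check might look like this (the command-line interface and messages are invented; the OAI-PMH namespace and `verb=Identify` request are from the OAI-PMH 2.0 specification):

```python
#!/usr/bin/env python3
"""Minimal Nagios/Icinga-style check for an OAI-PMH endpoint (illustrative sketch)."""
import sys
import urllib.request
import xml.etree.ElementTree as ET

# OAI-PMH XML namespace, fixed by the OAI-PMH 2.0 specification.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def parse_identify(body):
    """Map the body of an Identify response to an (exit_code, message) pair.

    Exit codes follow the Nagios plugin convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL (3 = UNKNOWN is not used here).
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        return 2, "CRITICAL - response is not well-formed XML: %s" % exc
    name = root.findtext(OAI_NS + "Identify/" + OAI_NS + "repositoryName")
    if name is None:
        return 1, "WARNING - no Identify/repositoryName element in response"
    return 0, "OK - repository '%s' answered Identify" % name

def check_oai_pmh(base_url, timeout=10):
    """Request <base_url>?verb=Identify and judge the response."""
    try:
        with urllib.request.urlopen(base_url + "?verb=Identify",
                                    timeout=timeout) as resp:
            body = resp.read()
    except Exception as exc:  # connection refused, timeout, HTTP error, ...
        return 2, "CRITICAL - Identify request failed: %s" % exc
    return parse_identify(body)

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: check_oai_pmh.py <OAI-PMH base URL>
    exit_code, message = check_oai_pmh(sys.argv[1])
    print(message)
    sys.exit(exit_code)
```

The same structure (a pure response-classification function wrapped by a small network fetch) applies to SRU/CQL checks; only the request and the parsed elements differ.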
https://trac.clarin.eu/browser/monitoring/plugins/mpi

|| Service Types / Tests ||= ping =||= http =||= disk space =||= load =||= free mem =||= users =||= functional check =||= query duration time =||
||= AAI Service Providers (SP) =|| * || # || || || || || # ([https://trac.clarin.eu/browser/monitoring/plugins/ids IDS probe]?) || ||
||= AAI Identity Providers (IdP) =|| * || # || || * || * || || # ([https://trac.clarin.eu/browser/monitoring/plugins/ids IDS probe]?) || ||
||= AAI Where Are You From (WAYF) =|| * || # || || || || || # ([https://trac.clarin.eu/browser/monitoring/plugins/mpi MPI discojuice probe]?) || ||
||= REST web services (WebLicht) =|| * || || || || || || # (provenance data from TCF?) || ||
||= Federated Content Search endpoints (SRU/CQL) =|| * || # || || || || || # ([https://trac.clarin.eu/browser/monitoring/plugins/mpi MPI probe]?) || ||
||= Federated Content Search Aggregator =|| * || # || || || || || # || ||
||= Repositories =|| * || # || * || || || || # (test for a [http://localhost:8080/fedora/objects/fedora-system:FedoraObject-3.0 fedora content model]?) || ||
||= OAI-PMH Gateway =|| * || || || || || || # ([https://trac.clarin.eu/browser/monitoring/plugins/mpi MPI probe]?) || ||
||= Handle Servers =|| * || || || || || || # (EUDAT/Jülich probe?) || # (Eric's timeout [https://svn.clarin.eu/monitoring/plugins/mpi/HandleSystem/ probe]) ||
||= Resolve a sample PID for each repository =|| || || || || || || # || # ||
||= Centre Registry =|| * || || || || || || # || ||
||= WebLicht webserver =|| * || # || || || || || || ||
||= Other webservers =|| * || # || || || || || || ||
||= Nagios servers (selfcheck) =|| * || # || || || || || # (check_nagios plugin) || ||
||= Nagios servers crosscheck (from other centre) =|| * || || || || || || # (check_nagios plugin) || ||
||= Workspaces server (not yet) =|| n.a. || || n.a. || || || || n.a. || ||

# mandatory; * optional

----
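The "Handle Servers" and "Resolve a sample PID" rows combine a functional check with a query-duration check. A hedged Python 3 sketch of such a combined check follows; it is not the EUDAT/Jülich probe or Eric's timeout probe. It assumes the public Handle proxy REST API (`https://hdl.handle.net/api/handles/<prefix>/<suffix>`, where `responseCode` 1 means a successful resolution), and the 5-second warning threshold is an invented example:

```python
#!/usr/bin/env python3
"""Sketch of a combined 'resolve a sample PID' and 'query duration time' check."""
import json
import time
import urllib.request

def classify(duration, warn_seconds):
    """Turn a resolution duration into a Nagios-style (exit_code, label)."""
    if duration > warn_seconds:
        return 1, "WARNING"
    return 0, "OK"

def check_pid(pid, warn_seconds=5.0, timeout=30):
    """Resolve a Handle via the Handle proxy REST API and time the request.

    Returns (exit_code, message); exit codes follow the Nagios
    convention (0 = OK, 1 = WARNING, 2 = CRITICAL).
    """
    url = "https://hdl.handle.net/api/handles/" + pid
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            record = json.load(resp)
    except Exception as exc:  # network error, timeout, HTTP error, bad JSON
        return 2, "CRITICAL - could not resolve %s: %s" % (pid, exc)
    duration = time.monotonic() - start
    if record.get("responseCode") != 1:  # 1 = success in the Handle API
        return 2, "CRITICAL - Handle API responseCode %s for %s" % (
            record.get("responseCode"), pid)
    code, label = classify(duration, warn_seconds)
    return code, "%s - %s resolved in %.2fs" % (label, pid, duration)
```

In the table's setup such a check would be run with one sample PID per repository; the duration threshold would be tuned per centre.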