[[PageOutline(1-5,Table of Contents,pullout)]]

# CLARIN ERIC & CLARIN-D #

We use Icinga, hosted and managed by Forschungszentrum Jülich. Contact: Benedikt von St. Vith

Users can view the status of a number of centre services on NagVis. Each centre's technical contact, and possibly additional monitoring contacts, as declared in the [wiki:"Centre Registry" Centre Registry], will get an e-mail notification of important status changes (e.g. an outage). Furthermore, each of these contacts can view a detailed Icinga status page about their services.

The CLARIN ERIC sysops have access to all of Icinga, except for scheduling downtime and actions that require shell access and permissions (e.g. restarting the Icinga daemon). The latter is managed by Benedikt von St. Vith.

## Validation of responses and security ##

To check that a service that is up is also returning valid data, some validity checks are in place. The response to calling the Identify verb on an OAI-PMH endpoint is validated against a set of XSD schemas. FCS endpoints are currently not validated because of problems with the approach, although some SRU XSD schemas are available.

In addition, services that use TLS are only checked using the current version of TLS, v1.2. Exceptions are hard-coded and should be minimized.
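The two checks described above can be illustrated with a minimal Python sketch. It is not the probe that is actually deployed: the names `check_identify`, `tls12_only_context` and `REQUIRED` are hypothetical, and a plain structural check with the standard library stands in for the XSD validation, which would need an additional library. The exit codes follow the usual Nagios/Icinga plugin convention (0 for OK, 2 for CRITICAL).

```python
import ssl
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
OK, CRITICAL = 0, 2  # conventional Nagios/Icinga plugin exit codes

# Child elements that OAI-PMH 2.0 requires inside <Identify>.
REQUIRED = ("repositoryName", "baseURL", "protocolVersion", "adminEmail",
            "earliestDatestamp", "deletedRecord", "granularity")


def check_identify(xml_text):
    """Return (exit_code, message) for the body of an Identify response.

    Assumption: this only checks structure; it is a simplified stand-in
    for validating the response against the XSD schemas.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return CRITICAL, "CRITICAL - response is not well-formed XML: %s" % exc
    if root.tag != OAI_NS + "OAI-PMH":
        return CRITICAL, "CRITICAL - unexpected root element %s" % root.tag
    identify = root.find(OAI_NS + "Identify")
    if identify is None:
        return CRITICAL, "CRITICAL - no Identify element (error response?)"
    missing = [n for n in REQUIRED if identify.find(OAI_NS + n) is None]
    if missing:
        return CRITICAL, "CRITICAL - missing elements: %s" % ", ".join(missing)
    return OK, "OK - Identify response has the expected structure"


def tls12_only_context():
    """Build a client context that will negotiate TLS v1.2 and nothing else."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.maximum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

A probe built on this sketch would fetch `<base URL>?verb=Identify` over a connection that uses `tls12_only_context()`, pass the body to `check_identify`, print the message and exit with the returned code.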
## URLs ##

* [https://clarin.fz-juelich.de/icinga/ Icinga]
* [https://clarin.fz-juelich.de/nagvis/frontend/nagvis-js/index.php?mod=Map&act=view&show=Clarin-EU_Geomap NagVis geomap for CLARIN ERIC]
* [https://clarin.fz-juelich.de/nagvis/frontend/nagvis-js/index.php?mod=Map&act=view&show=Clarin-DE_Service_Overview NagVis geomap for CLARIN-D]
* [https://clarin.fz-juelich.de:7011/logs/ Icinga configuration sync logs]
* CLARIN-D infrastructure status
  * [http://clarin-d.de/de/aktuelles/status-infrastruktur.html]
  * [http://clarin-d.de/status] -> [http://clarin-d.de/de/aktuelles/status-infrastruktur] (planned)
  * As a visualisation, a [http://www.clarin-d.de/images/karte.png map of Germany] under http://de.clarin.eu/status. Currently at http://clarin-d.de/de/aktuelles/status-infrastruktur.html (for Joomla users)

### Centre requirements ###

The monitoring requirements of CLARIN-D are modest; for example, a simple Icinga installation would be sufficient to fulfil them.

* SAML Service Provider (SP) status pages must be reachable from other servers.
* Regular checks of whether hosts are up and reachable (ping/ICMP).
* Regular checks of whether essential networked services and applications are working (e.g. HTTP).
* The server running the monitoring software is itself monitored as well.

## AAI ##

### Requirements ###

As of September 2015, CLARIN-PLUS Task 2.1 covers work on improving monitoring of the authentication and authorization infrastructure (CLARIN SPF & CLARIN Identity and Access Management).

(A ‘working’ endpoint is defined as one that does not return an error-signaling HTTP response code beyond 404 upon a request.)

We should monitor:

* SAML metadata batches about SPF SPs and IdPs for availability, security, etc. '''IMPLEMENTED'''.
* Discovery Service availability. '''IMPLEMENTED'''.
* Upstream identity federation SAML metadata batches about IdPs for availability, security, etc. '''IMPLEMENTED, NEEDS IMPROVEMENT'''.
* SAML metadata batches about SPF IdPs, to check whether they continually contain all IdPs in each identity federation.
* Once all SAML Service Provider status pages of SPF SPs are reachable from outside, these pages for availability and whether they work.
* Correct operation of SPs:
  * Does the login page work?
  * Do all endpoints work?

## Locally at centres ##

### Guidelines ###

* web services and applications
* remote API
* user interface
* repositories
* Handle servers (in case centres have their own, like IDS?)
* sample PID URLs
* websites

## Resources ##

### AAI ###

* AAI eye: http://www.csc.fi/english/institutions/haka/instructions/services-tech/aaieye
* RAPTOR: http://iam.cf.ac.uk/trac/RAPTOR
* IDS: https://trac.clarin.eu/browser/monitoring/plugins/ids
* https://svn.ms.mff.cuni.cz/redmine/projects/dspace-modifications/wiki/AAIShibbie

## Issues to be resolved ##

 1. Technical
    1. Excessive complexity
       * Non-admins cannot test configuration changes without pushing them to the Git repo.
       * Complicated workflow in which custom software (PyNag-based) of substantial complexity mutates the configuration.
       * Complicated shell scripts and tools to perform probes.
       * Icinga 1.x configuration is very verbose and error-prone (e.g. whitespace, lots of duplication).
       * Dependency on a lot of tools/extensions (see: ''Dependencies of current setup'').
 1. Organizational
    * No policy for when a service is unavailable. Sometimes there is no follow-up, and follow-up is never handled by a single responsible person.

## Work plan ##

* Migrate from Icinga 1.x to the [https://www.icinga.org/icinga/icinga-2/ Icinga 2] daemon and the Icinga Web 2 frontend.
  * The configuration can be migrated [http://docs.icinga.org/icinga2/latest/doc/module/icinga2/toc#!/icinga2/latest/doc/module/icinga2/chapter/migration automatically], with some manual cleanup afterwards.
* Reduce the large set of dependencies external to the standard Icinga 2 distribution to a minimum.
* Use a (Docker) container-based virtualized software stack for easy (re)deployment and testing.
* Simplify the configuration workflow, currently described in the [https://github.com/clarin-eric/monitoring/blob/master/README.md README] of the GitHub repository for the current configuration.
  * Use a protected branch for successfully tested production configuration.

----

'''NOTE''': the following information is outdated and no longer relevant. All CLARIN ERIC centres have their services monitored centrally. It is kept for reference for those centres interested in (older) separate plugins or the historical work.

A centre can monitor its own services. The following example monitoring plugins in Python 2 can assess SRU/CQL and OAI-PMH endpoints: https://trac.clarin.eu/browser/monitoring/plugins/mpi

|| **Service Types / Tests** ||= **ping** =||= **HTTP** =||= **disk space** =||= **load** =||= **free mem** =||= **users** =||= **functional check** =||= **query duration** =||
||= AAI Service Providers (SP) =|| * || # || || || || || #([https://trac.clarin.eu/browser/monitoring/plugins/ids IDS probe]?) || ||
||= AAI Identity Providers (IdP) =|| * || # || || * || * || || #([https://trac.clarin.eu/browser/monitoring/plugins/ids IDS probe]?) || ||
||= AAI Where are you From (WAYF) =|| * || # || || || || || #([https://trac.clarin.eu/browser/monitoring/plugins/mpi MPI discojuice probe]?) || ||
||= REST-Webservices (WebLicht) =|| * || || || || || || #(provenance data from TCF?) || ||
||= Federated Content Search Endpoints (SRU/CQL) =|| * || # || || || || || #([https://trac.clarin.eu/browser/monitoring/plugins/mpi MPI probe]?) || ||
||= Federated Content Search Aggregator =|| * || # || || || || || # || ||
||= Repositories =|| * || # || * || || || || #(test for a [http://localhost:8080/fedora/objects/fedora-system:FedoraObject-3.0 fedora content model]?) || ||
||= OAI-PMH Gateway =|| * || || || || || || #([https://trac.clarin.eu/browser/monitoring/plugins/mpi MPI probe]?) || ||
||= Handle Servers =|| * || || || || || || #(EUDAT/Jülich probe?) || #(Eric's timeout [https://svn.clarin.eu/monitoring/plugins/mpi/HandleSystem/ probe]) ||
||= resolve a sample PID for each repository =|| || || || || || || # || # ||
||= Centre Registry =|| * || || || || || || # || ||
||= WebLicht webserver =|| * || # || || || || || || ||
||= other webservers =|| * || # || || || || || || ||
||= Nagios servers (selfcheck) =|| * || # || || || || || #(check_nagios plugin) || ||
||= Nagios servers crosscheck (from other centre) =|| * || || || || || || #(check_nagios plugin) || ||
||= Workspaces server (not yet) =|| n.a. || || n.a. || || || || n.a. || ||

Legend: `#` mandatory; `*` optional

----