- Timestamp: 12/01/13 19:04:51
- Location: SMC4LRT/chapters
- Files: 14 edited
SMC4LRT/chapters/Conclusion.tex
r3776 → r4117:

%Irrespective of the additional levels - the user wants and has to get to the resource. (not always) to the "original"
And finally, a visualization tool for exploring the schema-level data of the discussed data collection was developed -- the \emph{SMC Browser}. Considering the feedback received so far from colleagues in the community, it is already a useful tool with high further potential. As detailed in \ref{smc-browser-extensions}, there is a number of features that could enhance its functionality and usefulness: integrate with instance data to be able to directly see which profiles are effectively being used; allow set operations on subgraphs (like intersection and difference) to enable differential views; generalize the matching algorithm; enhance the tool to act as an independent visualization service, accepting external graph data (from any domain).

Within the CLARIN community a number of (permanent) tasks has been identified and corresponding task forces have been established,
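The subgraph set operations envisaged for the SMC Browser can be sketched in a few lines; this is a minimal illustration only (graphs represented simply as sets of nodes and edges), not the tool's actual implementation:

```python
# Minimal sketch of set operations on graphs, represented as
# (nodes, edges) pairs with edges as (source, target) tuples.
# Illustration only -- not the SMC Browser's actual code.

def intersection(g1, g2):
    """Subgraph common to both graphs."""
    nodes = g1[0] & g2[0]
    edges = {e for e in g1[1] & g2[1] if e[0] in nodes and e[1] in nodes}
    return nodes, edges

def difference(g1, g2):
    """Differential view: what g1 has that g2 lacks."""
    return g1[0] - g2[0], g1[1] - g2[1]

# Two hypothetical profiles sharing components C1 and C2:
profile_a = ({"P1", "C1", "C2"}, {("P1", "C1"), ("P1", "C2")})
profile_b = ({"P2", "C1", "C2"}, {("P2", "C1"), ("C1", "C2")})

common = intersection(profile_a, profile_b)
only_a = difference(profile_a, profile_b)
```

A differential view of two profiles then reduces to rendering `common` and `only_a` with distinct styling.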
SMC4LRT/chapters/Data.tex
r3776 → r4117:

\chapter{Analysis of the Data Landscape}
\label{ch:data}
This section gives an overview of existing standards and formats for metadata in the field of Language Resources and Technology, together with a description of their characteristics and their respective usage in the initiatives and data collections. Special attention is paid to the Component Metadata Framework, representing the base data model for the infrastructure this work is part of.
…
The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN Component Metadata Infrastructure. (See \ref{def:CMDI} for information about the infrastructure. The XML schema defining CMD -- the \xne{general-component-schema} -- is featured in appendix \ref{lst:cmd-schema}.)
CMD is used to define the so-called \var{profiles}, constructed out of reusable \var{components} -- collections of metadata fields. Components can contain other components and can be reused in multiple profiles. A profile itself is just a special kind of component (a subclass), with some additional administrative information.
The actual core provision for semantic interoperability is the requirement that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category (cf. \ref{def:DCR}\footnote{in short: a persistently referencable concept definition}), thus indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}.
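Since these data-category references end up as annotations in the generated schemas, they can be picked out mechanically; the following rough sketch uses Python's standard library, with a deliberately simplified schema fragment (the element layout and the \code{dcr:datcat} attribute placement here are assumptions for illustration, not the exact CMDI schema structure):

```python
import xml.etree.ElementTree as ET

# A tiny, simplified fragment in the spirit of a generated CMD profile
# schema: elements annotated with a data-category PID. The layout and
# the dcr namespace URI are assumptions for illustration.
SCHEMA_FRAGMENT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dcr="http://www.isocat.org/ns/dcr">
  <xs:element name="languageName"
              dcr:datcat="http://www.isocat.org/datcat/DC-2484"/>
  <xs:element name="comment"/>
</xs:schema>
"""

DATCAT = "{http://www.isocat.org/ns/dcr}datcat"
XS_ELEMENT = "{http://www.w3.org/2001/XMLSchema}element"

def datcat_map(schema_xml):
    """Map element names to their referenced data-category PIDs (if any)."""
    root = ET.fromstring(schema_xml)
    return {el.get("name"): el.get(DATCAT)
            for el in root.iter(XS_ELEMENT) if el.get(DATCAT)}
```

Such a mapping is exactly the schema-level information the semantic mapping discussed later operates on.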
%This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.

While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted (so-called ``trusted registries''), especially the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}.

Once the profiles are defined, they are transformed into an XML Schema that prescribes the structure of the instance records.
The generated schema also conveys, as annotation, the information about the referenced data categories.
…
The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}} collects records from 69 providers on a daily basis. The complete dataset amounts to 540.065 records.
16 of the providers offer CMDI records, the other 53 provide OLAC/DC records\label{info:olac-records} that are converted into the corresponding CMD profile after harvesting.
Next to these 81.226 original OLAC records, there are a few providers offering their OLAC or DCMI-terms records already converted into CMDI; all in all, OLAC and DCMI-terms records amount to 139.152.
On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. the Meertens Institute uses 12 different profiles). So we encounter both situations: one profile being used by many providers and one provider using many profiles.
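Harvesting via OAI-PMH boils down to simple HTTP requests with a fixed set of verbs; a minimal sketch follows (the endpoint URL is hypothetical, and a real harvester would additionally have to follow resumption tokens for large result sets):

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def listrecords_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL as defined by the OAI-PMH protocol."""
    query = urlencode({"verb": "ListRecords",
                       "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"

def count_records(response_xml):
    """Count <record> elements in a ListRecords response."""
    root = ET.fromstring(response_xml)
    return len(root.findall(f".//{OAI_NS}record"))

# Endpoint and prefix are illustrative; a CMDI provider would expose
# its records under a CMDI-specific metadataPrefix rather than oai_dc.
url = listrecords_url("http://example.org/oai", "oai_dc")
# records = count_records(urlopen(url).read())  # network call, not run here
```

Every OAI-PMH endpoint must support \code{oai\_dc}, which is why the DC-based records dominate the harvested collection.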
…
24.583 & DoBeS archive \\
23.185 & Language and Cognition \\
17.859 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
14.593 & talkbank \\
14.363 & Acquisition \\
12.893 & MPI CGN \\
10.628 & Bavarian Archive for Speech Signals (BAS) \\
…
4.640 & Oxford Text Archive \\
4.492 & Leipzig Corpora Collection \\
3.280 & A Digital Archive of Research Papers in Computational Linguistics \\
3.147 & CLARIN NL \\
3.081 & MPI für Bildungsforschung \\
2.678 & WALS Online \\
\hline
\end{tabu}
…
\end{table}
…

We can also observe a large disparity in the amount of records between individual providers and profiles. Almost half of all records is provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by the MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from \textit{The Language Archive}). On the other hand, there are 25 profiles that have fewer than 10 instances.
This can be owing both to the state of the respective project (resources and records still being prepared) and to the modelled granularity level (collection vs. individual resource).
…
Next to CLARIN and CMDI, there is a host of related previous and concurrent work. In the following, we briefly introduce some formats and data collections established in the field and, where applicable, also sketch the ties with CMDI and existing integration efforts.

As for a comprehensive overview of formats and standards, the CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides an overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. And \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} puts the overwhelming amount of existing metadata standards into a systematic, comprehensive visual overview, analyzing the use of standards from four aspects: community, domain, function, and purpose. Though, despite its aspiration to comprehensiveness, it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE.
\subsection{Dublin Core Metadata Terms}
The work on this metadata format started in 1995 at the Metadata Workshop\furl{http://dublincore.org/workshops/dc1/} organized by OCLC/NCSA in Dublin, Ohio, USA. Nowadays it is maintained by the Dublin Core Metadata Initiative.
…
\end{description}

The DCMI terms format is very widespread nowadays. Thanks to its simplicity, it is used as the common denominator in many applications: content management systems integrate Dublin Core for use in \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), and it is the default minimal description in content repositories (Fedora Commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.

There are multiple possible serializations; in particular, a mapping to RDF is specified\furl{http://dublincore.org/documents/dcq-rdf-xml/}.
Worth noting is Dublin Core's take on the classification of resources\furl{http://dublincore.org/documents/resource-typelist/}.
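Such embedded \code{DC.*} meta tags can be read back with a few lines of standard-library Python; this is only a sketch (real pages vary in capitalization and in whether they use the DC or DCTERMS prefix):

```python
from html.parser import HTMLParser

class DCMetaParser(HTMLParser):
    """Collect <meta name="DC.*" content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.dc = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name", "")
            if name.startswith("DC."):
                self.dc[name] = attributes.get("content")

# A minimal made-up page embedding Dublin Core in its head:
page = ('<html><head>'
        '<meta name="DC.Publisher" content="publisher-name">'
        '<meta name="DC.Title" content="Some Resource">'
        '</head></html>')
parser = DCMetaParser()
parser.feed(page)
```

After \code{feed()}, \code{parser.dc} holds the Dublin Core fields found in the page head.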
The simplicity of the format is also its main drawback when considered as a metadata format for the research communities. It is too general to capture all the specific details with which individual research groups need to describe their different kinds of resources.

\subsection{OLAC}
\label{def:OLAC}

\xne{OLAC Metadata}\furl{http://www.language-archives.org/} format \cite{Bird2001} is an application profile \cite{heery2000application} of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community}, providing a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.

The OLAC schema\furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies for domain-specific semantic annotation (\code{linguistic-field}, \code{role}, \code{linguistic-type}, \code{language}, \code{discourse-type}).
\begin{quotation}
…
OLAC provides a ``search over 100,000 records collected from 44 archives\furl{http://www.language-archives.org/archives}, covering resources in half of the world's living languages''.

Note that OLAC archives are being harvested by the CLARIN harvester, and OLAC records are part of the CMDI joint metadata domain (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC}).
…
\begin{quotation}
The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form \dots [Next to] its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics, \dots the Consortium provides a variety of TEI-related resources, training events and software. [abridged]
\end{quotation}

TEI is a de-facto standard for encoding any kind of digital textual resource, developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description, i.e. metadata encoding (of main concern for us), the complex top-level element \code{teiHeader} is foreseen.
TEI is not prescriptive but rather descriptive: it does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and the inner structure, making it possible to generate custom schemas adapted to projects' needs. Thus there is also not just one fixed \code{teiHeader}.

Some of the data collections encoded in TEI are the corpora of the DWDS\furl{http://www.dwds.de}, the Deutsches Textarchiv\furl{http://www.dwds.de/dta} \cite{Geyken2011deutsches}, and the Oxford Text Archive\furl{http://ota.oucs.ox.ac.uk/}
…
\subsection{ISLE/IMDI -- The Language Archive}

\xne{IMDI}\furl{http://www.mpi.nl/imdi/} (\xne{EAGLES/ISLE Meta Data Initiative}) is an elaborate format for detailed descriptions of multi-media/multi-modal language resources, developed within the corresponding project \cite{wittenburg2000eagles} from 2000 to 2003.

To serve the main goal of the project -- easing access to language resources and fostering their reuse -- resource descriptions in this new format were created for a number of collections and made available via a dedicated \xne{IMDI browser}\furl{http://corpus1.mpi.nl/ds/imdi_browser/} that allowed browsing the collection structure as well as complex advanced search over the deeply structured metadata.
Also, a metadata editor was developed for generating records in this format, with provisions for offline field work and synchronization with the repository.

The project lead, responsible for running the repository and the whole infrastructure, was the Technical Group at the MPI for Psycholinguistics, which has been engaged in a number of projects aimed at building a stable technical infrastructure for long-term archiving of and work with language resources since its foundation (together with the Institute itself) in the 1970s\furl{http://tla.mpi.nl/home/history/}. Recently, the group and the established infrastructure have been renamed to \xne{TLA -- The Language Archive}\furl{http://tla.mpi.nl/} (``Your partner for language data, tools and archiving''): one platform on which a host of language resources and their descriptions are preserved and provided, and tools for working with this data are offered. The archive is also an aggregator itself, offering various collections from different (also external) projects (like DOBES, CGN, RELISH, etc.).
…
\label{def:META-SHARE}

META-SHARE was the subproject (2010-2013) of META-NET, a Network of Excellence consisting of 60 research centres from 34 countries, that covered the technical aspects.
…
\end{quotation}

Within the META-SHARE project, a new metadata format was developed \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types, with a subset of core obligatory elements and many optional components.
%In cooperation between metadata teams from CLARIN and META-SHARE

The original META-SHARE schema actually accommodates four models for different resource types. Consequently, the model has been expressed as 4 CMD profiles, each for a distinct resource type, all four, however, sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements.
(See \ref{reports-meta-share} for more details about the format, based on its integration into CMDI.)

The technical infrastructure of META-SHARE is a distributed network consisting of a number of member repositories that offer their own subset of resources\furl{http://www.meta-share.eu/}.

Selected member repositories\footnote{7 as of 2013-07} play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network'' \cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
The whole network offers approximately 2.000 resources (the numbers differ even across individual managing nodes).

One point of criticism from the community was the fact that the META-SHARE infrastructure does not provide any interface to the outer world, such as an OAI-PMH endpoint.

%? MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
…
The European Language Resources Association\furl{http://elra.info} (ELRA) offers a large collection of language resources (over 1.100), with a focus on spoken resources, but also written, terminological and multimodal resources, mostly under license for a fee (although selected resources are available for free as well).
The available datasets can be searched for via the ELRA Catalog\furl{http://catalog.elra.info/}.
Additionally, ELRA runs the so-called \xne{Universal Catalog} -- a repository comprising information regarding Language Resources (LRs) identified all over the world.
…
ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies.
ELDA\furl{http://www.elda.org/} -- the Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT (Human Language Technology) community.

ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and
…
\subsection{LDC}

The Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/}, hosted by the University of Pennsylvania, is another provider/aggregator of high-quality curated language resources. The data is licensed for a fee; more than 650 resources have been made available since 1993. The catalogue is freely accessible. The metadata is additionally aggregated by OLAC archives.

\section{Formats and Collections in the World of Libraries}
\label{sec:lib-formats}

There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition implying rich experience, and the fact that almost all of the resources in the libraries are language resources.
This argument gets even more relevant in the light of the efforts to digitize large portions of the material, pursued in many (national) libraries in recent years (cf. the discussion on libraries partnering with Google). And given the amounts of data, even the bibliographic records alone constitute sizable language resources in their own right.

%\item[LoC] Library of Congress \url{http://www.loc.gov}
…
There is a long tradition of standardized metadata formats in the world of Libraries and Information Systems (LIS), with a major role in the standardization being assumed for decades by the Library of Congress\furl{http://www.loc.gov/standards/}.

The \xne{MARC}\furl{www.loc.gov/marc/} set of formats (in use since the 1970s) ``are standards for the representation and communication of bibliographic and related information in machine-readable form''.
A number of variants developed over the years; the most widely spread, \xne{MARC 21} (since 1999), is the standard format used for communication among libraries around the world.

MARC 21 consists of 5 ``communication formats'' for specific types of data (Bibliographic, Authority Data, Holdings Data, Classification, and Community Information), which are widely used standards for the representation and exchange of the corresponding data in machine-readable form. In 2002, the Library of Congress developed the \xne{MARCXML} schema for representing MARC records in XML.

\xne{METS -- Metadata Encoding and Transmission Standard} -- a format from the family of Library of Congress standards (since 2001) for encoding descriptive, administrative, and structural metadata regarding objects within a digital library.
…
A METS record acts as a flexible container that accommodates other pieces of data (different levels of metadata, and encoded objects themselves or references to them) in external formats\furl{http://www.loc.gov/standards/mets/mets-extenders.html}.

A number of tools have been developed to author and process the \xne{METS} format\furl{http://www.loc.gov/standards/mets/mets-tools.html}, and numerous projects (online editions, DAM systems) use METS for structuring and recording their data\footnote{\url{http://www.loc.gov/standards/mets/mets-registry.html}, though this seems rather outdated}, among others also \xne{austrian literature online}\furl{http://www.loc.gov/standards/mets/mets-registry.html}.

\xne{Metadata Object Description Schema} (MODS) ``is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications''. It is a simplified subset of MARC 21, using language-based tags rather than numeric ones, and richer than Dublin Core. It is one of the endorsed schemas to extend (be used inside) METS.
283 283 284 There have been efforts to create a conceptually more sound base for the bibliographic data -- in 1998 \xne{Functional Requirements for Bibliographic Records} (FRBR) \cite{FRBR1998} was published, an abstract model for the data expressed as an Entity Relationship Model and a standard based on FRBR, the \xne{Resource Description and Access} (RDA) has been proposed as a n comprehensive standard for resource description and discovery, that howeverwas confronted with opposition from the LIS community, questioning the need of abandoning established cataloging practices \cite{gorman2007rda}.284 There have been efforts to create a conceptually more sound base for the bibliographic data -- in 1998 \xne{Functional Requirements for Bibliographic Records} (FRBR) \cite{FRBR1998} was published, an abstract model for the data expressed as an Entity Relationship Model and a standard based on FRBR, the \xne{Resource Description and Access} (RDA) has been proposed as a comprehensive standard for resource description and discovery that, however, was confronted with opposition from the LIS community, questioning the need of abandoning established cataloging practices \cite{gorman2007rda}. 285 285 And although there is still work on RDA, among others by the Library of Congress, there has been no wider adoption of the standard by the LIS community until now. 286 286 287 287 \subsection{ESE, Europeana Data Model - EDM} 288 288 289 Within the big european initiative \xne{Europeana} (cf. 
\ref{lit:digi-lib})information about digitised objects are collected from a great number of cultural institutions from all of Europe, currently hosting information about 29 million objects from 2.200 institutions from 36 countries\furl{http://www.pro.europeana.eu/web/guest/content}.290 291 For collecting metadata from the content providers, Europeana originally developed and advised the common format \xne{ESE Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation} a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. Soon it became obvious,that this format is too limiting and work started on a Semantic Web compatible RDF-based format -- the Europeana Data Model EDM\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana,haslhofer2011data,doerr2010europeana}.289 Within the big European initiative \xne{Europeana} (cf. \ref{lit:digi-lib}), information about digitised objects is collected from a great number of cultural institutions from all of Europe, currently hosting information about 29 million objects from 2.200 institutions from 36 countries\furl{http://www.pro.europeana.eu/web/guest/content}. 290 291 For collecting metadata from the content providers, Europeana originally developed and recommended the common format \xne{ESE Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation}, a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. Soon it became obvious that this format was too limiting and work started on a Semantic Web compatible RDF-based format -- the Europeana Data Model EDM\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana,haslhofer2011data,doerr2010europeana}. 292 292 EDM is fully compatible with ESE, which is (and will be) accepted from the providers. There is also already a SPARQL endpoint\furl{http://europeana.ontotext.com/sparql} to explore the Europeana data in the new format. 
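For a first impression, an exploratory query against this endpoint could look as follows (an illustrative sketch; the \code{edm} namespace and the class \code{edm:ProvidedCHO} are taken from the EDM documentation, the query itself is not part of it):

\begin{example3}
# list a few provided cultural heritage objects (illustrative query)
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
SELECT ?object WHERE { ?object a edm:ProvidedCHO } LIMIT 10
\end{example3}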
293 293 %https://github.com/europeana … … 297 297 \label{refdata} 298 298 299 One goal of this work being the groundwork for exposing the discussed dataset in the Semantic Web 300 onepreparatory task is to identify external semantic resources like controlled vocabularies or ontologies that the dataset could be linked with\footnote{Similar activity of inventarizing vocabularies and thesauri was conducted in the context of the \xne{Europeana} initiative299 One goal of this work being the groundwork for exposing the discussed dataset in the Semantic Web, 300 a preparatory task is to identify external semantic resources like controlled vocabularies or ontologies that the dataset could be linked with\footnote{A similar activity of inventorying vocabularies and thesauri was conducted in the context of the \xne{Europeana} initiative 301 301 \url{http://europeanalabs.eu/wiki/WP12Vocabularies}, \url{https://europeanalabs.eu/wiki/DesignSemanticThesauri}}. 302 302 303 Conceptually, we want to partition these resources in two types. On the one hand abstract concepts constituting all kinds of classifications, typologies, taxonomies. On the other hand named entities that exist(ed) in real world, like persons, organizations or geographical places. Main motivation for this distinction is the insight, that while for named entities there is (mostly) ``something'' in the (physical) world that gives a solid ground for equivalence relations between references from different sources (sameAs), for concepts we need to accept a plurality of existing conceptualizations and while we can (and have to) try to identify relations between them, the equivalence relation is inherently much weaker. This insight entails a partly different approach -- simply put, while we can aspire to create one large list/index encompassing all named entities, we have to maintain a forest of conceptual trees. 304 305 In the following we inventarize such resources (cf. 
tables \ref{table:data-ne}, \ref{table:data-concepts}) covering the domains expected to be needed for linking the original dataset. (Information about size of the dataset is meant rather as a rough indication of the "general weight" of the dataset, not necessarily a precise up to date information.) The acronyms in the tables are resolved in the glossary \ref{table:vocab-glossary} 306 How this resources will be employed is discussed in \ref{sec:values2entities}. 307 Additionally, some verbose commentary follows. 303 Conceptually, we want to partition these resources in two types. On the one hand, abstract concepts constituting all kinds of classifications, typologies, taxonomies. On the other hand, named entities that exist(ed) in the real world, like persons, organizations or geographical places. The main motivation for this distinction is the insight that, while for named entities there is (mostly) ``something'' in the (physical) world that gives a solid ground for equivalence relations between references from different sources (\code{sameAs}), for concepts we need to accept a plurality of existing conceptualizations, and while we can (and have to) try to identify relations between them, the equivalence relation is inherently much weaker. This insight entails a partly different approach -- simply put, while we can aspire to create one large list/index encompassing all named entities, we have to maintain a forest of conceptual trees. 304 305 In the following, we inventory such resources (cf. tables \ref{table:data-ne}, \ref{table:data-concepts}) covering the domains expected to be needed for linking the original dataset. (The information about the size of the datasets is meant as a rough indication of the ``general weight'' of a dataset, not necessarily as precise up-to-date information.) The acronyms in the tables are resolved in the glossary \ref{table:vocab-glossary}. How these resources will be employed is discussed in \ref{sec:values2entities}.
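The intended difference in link strength can be made concrete with a short RDF sketch (the URIs and the \code{ex1:}/\code{ex2:} prefixes are made up for illustration):

\begin{example3}
# named entity: two references grounded in one real-world organization
<http://viaf.org/viaf/123456>  owl:sameAs  <http://d-nb.info/gnd/1234567-8> .
# concept: two conceptualizations that are close, but not strictly equivalent
ex1:Corpus  skos:closeMatch  ex2:TextCorpus .
\end{example3}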
Additionally, some verbose commentary follows. 308 306 309 307 %\subsubsection{Named entities} 310 308 311 The largest controlled vocabularies of named entities are the authority files of (national) libraries. These are further aggregated into the so-called Virtual International Authority File, a huge resource, with entries from different authority files referring to the same entity being merged. This resource can be explored via a search interface and there is also a search service for applications.312 Other general large-scale resources are the vocabularies curated and provided by Getty Research Institute\furl{http://www.getty.edu/research/tools/vocabularies/index.html} , however there is only a limited free access and licensed and fee for full access. But recently there work wasannounced to publish the vocabularies as LOD\furl{http://www.getty.edu/research/tools/vocabularies/lod/index.html}309 The largest controlled vocabularies of named entities are the authority files of (national) libraries. These are further aggregated into the so-called \xne{Virtual International Authority File}, a huge resource, with entries from different authority files referring to the same entity being merged. This resource can be explored via a search interface and there is also a search service for applications. 310 Other general large-scale resources are the vocabularies curated and provided by the Getty Research Institute\furl{http://www.getty.edu/research/tools/vocabularies/index.html}.
There is only limited free access and a fee is charged for full access, but recently the provider announced plans to publish the vocabularies as LOD\furl{http://www.getty.edu/research/tools/vocabularies/lod/index.html}. 313 311 314 312 Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{Deutsches Forschungszentrum für Künstliche Intelligenz, \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study. \cite{Joerg2010} 315 313 316 Also to mention \xne{Yago} , a large knowledge base created by MPI informatik integrating dbpedia, geonames and wordnet\furl{http://www.mpi-inf.mpg.de/yago-naga/yago/} \cite{Suchanek2007yago}.314 Also worth mentioning is \xne{Yago}\furl{http://www.mpi-inf.mpg.de/yago-naga/yago/}, a large knowledge base created by MPI Informatik integrating the DBpedia, GeoNames and WordNet datasets. \cite{Suchanek2007yago} 317 315 318 316 So we witness a strong general trend towards Semantic Web and Linked Open Data. … … 351 349 352 350 In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology. 353 We also gave an overview of main formats and collections in the domain of Library and Information Services and a inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications), needed as input in section \ref{sec:values2entities} about mapping values to entities.351 We also gave an overview of the main formats and collections in the domain of Library and Information Services and an inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications), needed as input in section \ref{sec:values2entities} about mapping values to entities.
354 352 355 353 … … 451 449 % \hline 452 450 453 AAT & international Architecture and Arts Thesaurus, Getty \\451 AAT & Art and Architecture Thesaurus, Getty \\ 454 452 CONA & Cultural Objects Name Authority \\ 455 453 DAI & Deutsches Archäologisches Institut \\ … … 460 458 FAST & Faceted Application of Subject Terminology \\ 461 459 Getty & Getty Research Institute curating the \href{http://www.getty.edu/research/tools/vocabularies/index.html}{vocabularies}, part of Getty Trust \\ 462 460 GND & \emph{Gemeinsame Norm Datei} - Integrated authority Files of the German National Library \\460 GND & \emph{Gemeinsame Normdatei} - Integrated Authority Files of the German National Library \\ 463 461 GTAA & Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives) \\ 464 462 % {quotation} The thesaurus consists of several facets for describing TV programs: subjects; people mentioned; named entities (Corporation names, music bands etc); locations; genres; makers and presentators. \end{quotation} \\ … … 467 465 LCC & Library of Congress Classification \\ 468 466 LCSH & Library of Congress Subject Headings \\ 469 LoC & Library of Congress\furl{http://loc.gov} \\470 OCLC & Online Computer Library Center\furl{http://www.oclc.org} -- world's biggest library federation \\471 PKND & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{ prometheus} KünstlerNamensansetzungsDatei\\467 LoC & \href{http://loc.gov}{Library of Congress} \\ 468 OCLC & \href{http://www.oclc.org}{Online Computer Library Center} -- world's biggest library federation \\ 469 PKND & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{Prometheus} KünstlerNamensansetzungsDatei\\ 472 470 RKD & Rijksbureau voor Kunsthistorische Documentatie -- Netherlands Institute for Art History \\ 473 471 TGN & Getty Thesaurus of Geographic Names \\ -
SMC4LRT/chapters/Definitions.tex
r3776 r4117 25 25 RDF & \xne{Resource Description Framework} \cite{RDF2004} \\ 26 26 RR & Relation Registry, cf. \ref{def:rr} \\ 27 TEI & \xne{Text Encoding Initiative}, cf. \ref{ tei} \\27 TEI & \xne{Text Encoding Initiative}, cf. \ref{def:tei} \\ 28 28 \end{tabular} 29 29 \end{table} … … 58 58 \end{table} 59 59 60 \section{Formatting conventions}60 \section{Formatting Conventions} 61 61 62 62 Inline formatting for highlighting: \\ -
SMC4LRT/chapters/Design_SMCinstance.tex
r3776 r4117 1 \chapter{Mapping on instance level,\\ CMD as LOD}1 \chapter{Mapping on Instance Level,\\ CMD as LOD} 2 2 \label{ch:design-instance} 3 3 … … 7 7 8 8 And if you can express these all in RDF, which we can for almost all of them (maybe 9 except the actual language resource ... unless it has a schema adorned9 except for the actual language resource ... unless it has a schema adorned 10 10 with ISOcat DC references ... \textless insert a SCHEMAcat plug ;-) \textgreater, but for 11 11 metadata we have that in the CMDI profiles ...) you could load all the … … 18 18 19 19 20 As described in previous chapters (\ref{ch:infra}, \ref{ch:design}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, the established machinery pertains mostly to the schema level, the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more so virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that yet cannot be explicitly exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels, or spelling variants.) prompting an urgent need for better means for harmonizing the constrained-field values.21 22 One potential remedy is the use of reference datasets -- controlled vocabularies, taxonomies, ontologies and such . In fact, this is a very common approach, be it the authority files in libraries world, or domain-specific reference vocabularies maintained by practically every research community. 
Not as strict as schema definitions, they cannot be used for validation, but still help to harmonize the data, by offering preferred labels and identifiers for entities.23 24 In this chapter, we explore how this general approach can be employed for our specific problem of harmonizing the (literal) values in selected instance fields and mapping them to entities defined in corresponding vocabularies. This proposal is furthermore embedded in a more general effort to \textbf{express the whole of the CMD data domain (model and instances) in RDF} constituting one large ontology interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This result lays a foundation for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data} \cite{TimBL2006}20 As described in previous chapters (\ref{ch:infra}, \ref{ch:design}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, the established machinery pertains mostly to the schema level; the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that yet cannot be explicitly exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels, or spelling variants) prompting an urgent need for better means for harmonizing the constrained-field values. 21 22 One potential remedy is the use of reference datasets -- controlled vocabularies, taxonomies, ontologies and suchlike. 
In fact, this is a very common approach, be it the authority files in the library world, or domain-specific reference vocabularies maintained by practically every research community. Not as strict as schema definitions, they cannot be used for validation, but still help to harmonize the data, by offering preferred labels and identifiers for entities. 23 24 In this chapter, we explore how this general approach can be employed for our specific problem of harmonizing the (literal) values in selected instance fields and mapping them to entities defined in corresponding vocabularies. This proposal is furthermore embedded in a more general effort to \textbf{express the whole of the CMD data domain (model and instances) in RDF} constituting one large ontology interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This result lays a foundation for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data} \cite{TimBL2006} 25 25 as well as for real semantic (ontology-driven) search and exploration of the data. 26 26 27 27 The following section \ref{sec:cmd2rdf} lays out how individual parts of the CMD framework can be expressed in RDF. 28 In \ref{sec:values2entities} we investigate in further detail the abovementioned critical aspect of the effort, namely the task of translating the string values in metadata fields to corresponding semantic entities. 
Finally, the technical aspects of providing the resulting ontology as LOD and the implications for an ontology-driven semantic search are tackled briefly in \ref{sec:lod}. 29 29 30 30 \section{CMD to RDF} … … 39 39 \end{itemize} 40 40 41 \subsection{CMD specification}41 \subsection{CMD Specification} 42 42 43 43 The main entity of the meta model is the CMD component, typed as a specialization of \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It would be natural to translate a CMD element to an RDF property, but it needs to be a class, as a CMD element -- next to its value -- can also have attributes. This further implies a property ElementValue to express the actual value of a given CMD element. … … 54 54 55 55 \noindent 56 Th isentities are used for typing the actual profiles, components and elements (as they are defined in the Component Registry):56 These entities are used for typing the actual profiles, components and elements (as they are defined in the Component Registry): 57 57 58 58 \label{table:rdf-cmd} … … 80 80 \end{example3} 81 81 82 \noindent 82 83 That implies that the \code{@ConceptLink} attribute on CMD elements and components as used in the CMD profiles to reference the data category would be modelled as: 83 84 86 87 \end{example3} 87 88 89 \noindent 88 90 Encoding data categories as annotation properties is in contrast to the common approach seen with dublincore terms 89 91 usually used directly as data properties: 94 96 95 97 \noindent 96 However, we argue against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications. 
\cite{Windhouwer2012_LDL}98 However, we argue against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications. \cite{Windhouwer2012_LDL} 97 99 In a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals: 98 100 … … 104 106 105 107 106 \subsection{RELcat - Ontological relations}107 As described in \ref{def:rr} relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module the Relation Registry \xne{RELcat}. The relations here are grouped into relation sets and stored as RDF triples \cite{SchuurmanWindhouwer2011}. A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:108 \subsection{RELcat - Ontological Relations} 109 As described in \ref{def:rr} relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module the Relation Registry \xne{RELcat}. The relations here are grouped into relation sets and stored as RDF triples \cite{SchuurmanWindhouwer2011}. A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms: 108 110 109 111 \begin{example3} … … 112 114 113 115 \noindent 114 By design, the relations in Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. 
The \code{rel:*} properties can be und restood as an upper layer of a taxonony of relation types, implying a subtyping:116 By design, the relations in Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. The \code{rel:*} properties can be understood as an upper layer of a taxonomy of relation types, implying a subtyping: 115 117 116 118 \begin{example3} … … 120 122 121 123 122 \subsection{CMD instances}124 \subsection{CMD Instances} 123 125 In the next step, we want to express the individual CMD instances, the metadata records, making use of the previously defined entities on the schema level, but also entities from external ontologies. 124 126 125 127 \subsubsection {Resource Identifier} 126 128 127 It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections ,that are solely defined by their constituents and don't have any data on their own.) As a fall-back the PID of the MD record ( \code{<lr1.cmd>} from \code{cmd:MdSelfLink} element) could be used as the resource identifier.129 It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections that are solely defined by their constituents and don't have any data on their own.) As a fall-back the PID of the MD record ( \code{<lr1.cmd>} from \code{cmd:MdSelfLink} element) could be used as the resource identifier. 
128 130 If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}. 129 (Note also ,that one MD record can describe multiple resources, this can be also easily accomodated in OpenAnnotation):131 (Note also that one MD record can describe multiple resources, which can also be easily accommodated in OpenAnnotation): 130 132 131 133 \begin{example3} … … 202 204 203 205 %%%%%%%%%%%%%%%%% 204 \section{Mapping field values to semantic entities}206 \section{Mapping Field Values to Semantic Entities} 205 207 \label{sec:values2entities} 206 208 … … 232 234 \end{example3} 233 235 234 However for the needs of the mapping task we propose to reduce and rewrite to retrieve distinct concept, value pairs (cf. figure \ref{fig:smc_cmd2lod}):236 However, for the needs of the mapping task, we propose to reduce and rewrite the query to retrieve distinct (concept, value) pairs (cf. figure \ref{fig:smc_cmd2lod}): 235 237 236 238 \begin{example3} … … 239 241 \end{example3} 240 242 241 \var{lookup} function is a customized version of the \var(map) function , that operates on thisinformation pairs (concept, label).243 The \var{lookup} function is a customized version of the \var{map} function that operates on these information pairs (concept, label). 
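The reduction could be pictured as a query over the RDF representation sketched above; the following is only an illustrative sketch -- the property names \code{dcr:datcat} and \code{cmds:hasElementValue} are assumed here, not the normative vocabulary:

\begin{example3}
# retrieve distinct (concept, value) pairs as input for the lookup
SELECT DISTINCT ?concept ?value WHERE {
  ?elem a ?elemType .
  ?elemType dcr:datcat ?concept .
  ?elem cmds:hasElementValue ?value .
}
\end{example3}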
242 244 243 245 The two steps \var{lookup} and \var{assess} correspond exactly to the two steps in \cite{jimenez2012large} in their system \xne{LogMap2}: a) computation of mapping candidates (maximise recall) and b) assessment of the candidates (maximise precision). … … 252 254 \subsubsection{Identify vocabularies} 253 255 254 One generic way to indicate vocabularies for given metadata fields or data categories being discussed in the CMD community is to use dedicated annotation property in the schema or data category definition (tentatively label ed \code{@clavas:vocabulary}). For such a mechanism to work, the consuming applications (like metadata editor) need to be made aware of this convention and interpret it accordingly.255 256 The primary provider of relevant vocabularies is \xne{ISOcat} and \xne{CLAVAS} -- a service for managing and providing vocabularies in SKOS format (cf. \ref{def:CLAVAS}). Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS and also other relevant vocabularies shall be ingested into this system, so that we can assume OpenSKOS as a first source of vocabularies. However definitely not all of the existing reference data will be hosted by OpenSKOS, so in general we have to assume/consider a number of different sources (cf. \ref{refdata}).256 One generic way to indicate vocabularies for given metadata fields or data categories being discussed in the CMD community is to use a dedicated annotation property in the schema or data category definition (tentatively labelled \code{@clavas:vocabulary}). For such a mechanism to work, the consuming applications (like a metadata editor) need to be made aware of this convention and interpret it accordingly. 257 258 The primary providers of relevant vocabularies are \xne{ISOcat} and \xne{CLAVAS} -- a service for managing and providing vocabularies in SKOS format (cf. \ref{def:CLAVAS}). 
Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS and also other relevant vocabularies shall be ingested into this system, so that we can assume OpenSKOS as a first source of vocabularies. However, definitely not all of the existing reference data will be hosted by OpenSKOS, so in general we have to assume/consider a number of different sources (cf. \ref{refdata}). 257 259 258 260 Data in OpenSKOS is modelled purely in SKOS, so there is no more specific typing of the entities in the vocabularies, but rather all the entities are \code{skos:Concepts}: … … 280 282 In abstract terms, the lookup function takes as input the identifier of a data category (or CMD element) and a literal string value and returns a list of potentially matching entities. Before the actual lookup, there may have to be some string-normalizing preprocessing. 281 283 282 \begin{definition}{ signature of the lookup function}284 \begin{definition}{Signature of the lookup function} 283 285 lookup \ ( \ DataCategory \ , \ Literal \ ) \quad \mapsto \quad ( \ Concept \ | \ Entity \ )* 284 286 \end{definition} 285 287 286 In the implementation there needs to be additional initial configuration input, identifying datasets for given data categories,288 In the implementation, there needs to be additional initial configuration input, identifying datasets for given data categories, 287 289 which will be the result of the previous step. 288 290 … … 303 305 The lookup is the most sensitive step in the process, as that is the gate between strings and semantic entities. In general, the resulting candidates cannot be seen as reliable and should undergo further scrutiny to ensure that the match is semantically correct. 304 306 305 One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences, in a number of countries. 
It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource , to determinewhich specific Academy of Sciences is meant in given resource description.306 307 In some situation th is ambiguities can be resolved algorithmically, but in the end in many cases it will require human curation of the generated data. In this respect, it is worth to note, that the CLARIN search engine VLO provides a feedback link,that allows even the normal user to report on problems or inconsistencies in CMD records.307 One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource to determine which specific Academy of Sciences is meant in a given resource description. 308 309 In some situations these ambiguities can be resolved algorithmically, but in the end in many cases it will require human curation of the generated data. In this respect, it is worth noting that the CLARIN search engine VLO provides a feedback link that allows even the normal user to report on problems or inconsistencies in CMD records. 308 310 309 311 … … 317 319 318 320 The technical base for a semantic web application is usually an RDF triple-store as discussed in \ref{semweb-tech}. 319 Given that our main concern is the data itself, their processing and display, we want to rely on stable, robust feature rich solution minimizing the effort to provide the data online. 
The most promising solution seems to be \xne{Virtuoso}, aintegrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store'').320 321 322 Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data by dereferencable URIs, in practice it is mostly necessary to pool into one data store linked datasets from different sources that shall be queried together due to performance reasons. This implies that the data to be kept by the data store will be decisively larger ,than ``just'' the original dataset.321 Given that our main concern is the data themselves, their processing and display, we want to rely on a stable, robust, feature-rich solution minimizing the effort to provide the data online. The most promising solution seems to be \xne{Virtuoso}, an integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store''). 322 324 Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data by dereferenceable URIs, in practice it is mostly necessary to pool into one data store linked datasets from different sources that shall be queried together for performance reasons. This implies that the data to be kept by the data store will be considerably larger than ``just'' the original dataset. 323 325 324 326 \section{Summary} -
SMC4LRT/chapters/Design_SMCschema.tex
r3776 r4117 1 1 2 \chapter{System design -- concept-based mapping on schema level}2 \chapter{System Design -- Concept-based Mapping on Schema Level} 3 3 \label{ch:design} 4 4 … … 6 6 7 7 We start by drawing an overall view of the system, introducing its individual components and the dependencies among them. 8 In the next section, the internal data model is presented and explained. In section \ref{sec:cx} the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{sec:qx} we elaborate on a search functionality that builds upon the aforementioned service in terms of appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support this enhanced search possibilities. Finally, in section \ref{smc-browser}an advanced interactive user interface for exploring the CMD data domain is proposed.8 In the next section, the internal data model is presented and explained. In section \ref{sec:cx}, the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{sec:qx}, we elaborate on a search functionality that builds upon the aforementioned service in terms of an appropriate query language, a search engine to integrate the search in, and the peculiarities of the user interface that could support these enhanced search possibilities. Finally, in section \ref{smc-browser}, an advanced interactive user interface for exploring the CMD data domain is proposed. 
9 9 10 10 \section{System Architecture} … 14 14 \begin{figure*} 15 15 \includegraphics[width=0.8\textwidth]{images/SMC_modules.png} 16 \caption{The component view on the SMC - modules and their inter -dependencies}16 \caption{The component view on the SMC - modules and their interdependencies} 17 17 \label{fig:smc_modules} 18 18 \end{figure*} … 31 31 The component diagram in \ref{fig:smc_modules} depicts the dependencies between the components of the system. The \xne{crosswalk service} uses the set of XSL-stylesheets \xne{smc-xsl} and accesses the CMDI registries: \xne{Component Registry}, \xne{ISOcat DCR} and \xne{RELcat} to retrieve the data. It exposes an interface \xne{cx} to be used by third party applications. The \xne{query expansion} module uses the crosswalk service to rewrite queries, also exposing a corresponding API \xne{qx}. 32 32 33 \xne{SMC Browser} consists of two parts the \xne{smc-stats} and \xne{smc-graph}and also uses the set of stylesheets for processing the data. \xne{smc-graph} is build on top of a library for interactive visualization of graphs.33 \xne{SMC Browser} consists of two parts, the \xne{smc-stats} and \xne{smc-graph}, and also uses the set of stylesheets for processing the data. \xne{smc-graph} is built on top of a library for interactive visualization of graphs. 34 34 35 35 For broader context see the reference architecture diagram in Figure \ref{fig:ref_arch}. 36 36 37 \section{Data model}37 \section{Data Model} 38 38 39 39 Before we get to the definition of the actual service, we define the internal data model, divided into two parts: … 47 47 In this section, we describe \var{smcIndex} -- the data type to denote indexes used by the components of the system internally, as well as input and output on the interfaces. 48 48 49 An \var{smcIndex} is a human-readable string adhering to a specific syntax, denoting a search index. 
The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix is derived from the way indices are referenced in CQL-syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism, cf. \ref{cql}), e.g. \concept{dc.title} and b) on the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms ,that may not contain whitespaces.49 An \var{smcIndex} is a human-readable string adhering to a specific syntax, denoting a search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix is derived from the way indices are referenced in CQL-syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism, cf. \ref{cql}), e.g. \concept{dc.title} and b) on the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms that may not contain whitespaces. 50 50 51 51 \begin{defcap} … … 73 73 It is important to note that in general \var{smcIndex} can be ambiguous, meaning it can refer to multiple concepts, or CMD entities. This is due to the fact that the labels of the data categories and CMD entities are not guaranteed unique. 74 74 Although it may seem problematic and undesirable to have an ambiguous reference, this is an intentional design decision. The labels are needed for human-readability and ambiguity can be useful, as long as one is aware of it. 75 However there needs to be also the possibility to refer to data categories or CMD entities unambiguously. 
Therefore, the syntax also allows to reference indexes by the corresponding identifier. Following are some explanations to the individual constituents of the grammar:76 77 \var{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular, the \xne{dublincore} set of metadata terms. \var{datcatLabel} is the human-readable name of given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories the verbose descriptor \code{mnemonicIdentifier} is used. However despite its name, it is not guaranteed unique. Therefore, \var{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist, the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace.75 However, there needs to be also the possibility to refer to data categories or CMD entities unambiguously. Therefore, the syntax also allows to reference indexes by the corresponding identifier. Following are some explanations to the individual constituents of the grammar: 76 77 \var{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular, the \xne{dublincore} set of metadata terms. \var{datcatLabel} is the human-readable name of a given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories the verbose descriptor \code{mnemonicIdentifier} is used. However, despite its name, it is not guaranteed unique. Therefore, \var{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist; the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace. 78 78 79 79 \var{profile} is a reference to a CMD profile. 
Again, it can be either the name of the profile \var{profileName} or -- for guaranteed unambiguous reference -- its identifier \var{profileId} as issued by the Component Registry (e.g. \var{clarin.eu:cr1:p\_1272022528363} for \concept{LexicalResourceProfile}). Even if a profile is referenced by its identifier it may and should be prefixed by its name to still ensure human-readability. Or, seen the other way round, the name is disambiguated by suffixing it with the identifier: … … 85 85 86 86 %\noindent 87 \var{dotPath} allows to address a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}) within a metadata description. This allows to easily express search in whole components, instead of having to list all individual fields. The paths don't need to start from the root entity (the profile), they can reference any subtree structure. However longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still this mechanism does not guarantee unique references, it only allows to specify context and thus narrow down the semantic ambiguity.87 \var{dotPath} allows to address a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}) within a metadata description. This allows to easily express search in whole components, instead of having to list all individual fields. The paths don't need to start from the root entity (the profile), they can reference any subtree structure. However, longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. 
Still this mechanism does not guarantee unique references, it only allows to specify context and thus narrow down the semantic ambiguity. 88 88 89 89 \subsection{Terms} … … 95 95 \subsubsection{Type \code{Term}} 96 96 97 \code{Term} is a polymorph data type ,that can have different sets of attributes depending on the type of data it represents.97 \code{Term} is a polymorph data type that can have different sets of attributes depending on the type of data it represents. 98 98 99 99 \begin{table}[h] 100 \caption{Attributes of \code{Term} when encoding data category }100 \caption{Attributes of \code{Term} when encoding data category (enclosed in \code{Concept})} 101 101 \label{table:terms-attributes-datcat} 102 102 \begin{tabu}{ p{0.1\textwidth} p{0.4\textwidth} >{\footnotesize}X } … … 104 104 \rowfont{\itshape\small} attribute & allowed values & sample value\\ 105 105 \hline 106 \var{concept-id} & PID given by DCR & \code{isocat:DC-2522} \\106 % \var{concept-id} & PID given by DCR & \code{isocat:DC-2522} \\ 107 107 \var{set} & identifier of the DCR \emph{dcrID} & \code{isocat} \\ 108 108 \var{type} & one of ['id', 'label', 'mnemonic'] & \code{id}, \code{label}\\ … … 223 223 224 224 \subsubsection{Type \code{Relation}} 225 As explained in \ref{def:rr}, the framework allows to express relations between concepts or data categories. These are maintained in the Relation Registry and fetched from there by SMC upon initialization. Type \code{Relation} is the internal representation of this information. It has attribute \var{type} indicating the type of the relation as delivered by RR (currently only \code{sameAs}). The relations of one relation set are enclosed in \code{Termset} element carrying the identifier of the relation set. The content of \code{Relation} is a sequence of at least two \code{Concepts}. 
Currently, it is always exactly two \code{Concepts} corresponding to the pairs delivered from RR, but by traversing the equivalence relation concept clusters (or ``cliques'') could be generated ,that contain more than two equivalent concepts.225 As explained in \ref{def:rr}, the framework allows to express relations between concepts or data categories. These are maintained in the Relation Registry and fetched from there by SMC upon initialization. Type \code{Relation} is the internal representation of this information. It has attribute \var{type} indicating the type of the relation as delivered by RR (currently only \code{sameAs}). The relations of one relation set are enclosed in \code{Termset} element carrying the identifier of the relation set. The content of \code{Relation} is a sequence of at least two \code{Concepts}. Currently, it is always exactly two \code{Concepts} corresponding to the pairs delivered from RR, but by traversing the equivalence relation concept clusters (or ``cliques'') could be generated that contain more than two equivalent concepts. 226 226 227 227 % role="about" … … 261 261 262 262 %%%%%%%%%%%%%%%%%%%%%% 263 \section{cx -- crosswalk service}263 \section{cx -- Crosswalk Service} 264 264 \label{sec:cx} 265 265 266 The crosswalk service offers the functionality ,that was understood under the term \textit{Semantic Mapping} as conceived in the original plans of the Component Metadata Infrastructure. Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were weaved into the underlying meta-model as well as all the modules of the infrastructure.266 The crosswalk service offers the functionality that was understood under the term \textit{Semantic Mapping} as conceived in the original plans of the Component Metadata Infrastructure. 
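The core idea of such concept-based mapping can be sketched compactly: metadata fields annotated with the same data category form an equivalence cluster, with the data category acting as the pivot, so no pair-wise schema matching is needed. A minimal Python sketch follows (all field and data-category identifiers are invented for illustration; the actual service is implemented in XSLT/XQuery):

```python
# Sketch of concept-based crosswalks: fields annotated with the same data
# category are clustered; the data category acts as the pivotal "bridge".
# (Field names and data-category PIDs below are hypothetical examples.)

def invert(annotations):
    """Build the inverted index datcat -> [fields] from field -> datcat."""
    clusters = {}
    for field, datcat in annotations.items():
        clusters.setdefault(datcat, []).append(field)
    return clusters

def crosswalk(field, annotations):
    """All fields semantically equivalent to `field`, i.e. those sharing
    its data category (the field itself excluded)."""
    clusters = invert(annotations)
    datcat = annotations.get(field)
    return [f for f in clusters.get(datcat, []) if f != field]

# Hypothetical annotations: CMD element paths mapped to data category PIDs.
annotations = {
    "OLAC-DcmiTerms.title": "isocat:DC-2536",
    "TextCorpusProfile.Corpus.Title": "isocat:DC-2536",
    "Session.Name": "isocat:DC-2544",
}
# crosswalk("OLAC-DcmiTerms.title", annotations)
# -> ["TextCorpusProfile.Corpus.Title"]
```

Note that the clusters fall out of a single pass over the annotations; adding a new profile only extends the clusters of the data categories it references.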
Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were woven into the underlying meta-model as well as all the modules of the infrastructure. 267 267 Consequently, the infrastructure has also foreseen this dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to find \textbf{corresponding fields in different metadata schemas}. 268 268 269 269 The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate the mappings, or \textbf{crosswalks}, between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications, representing the base for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{sec:qx}). 270 270 271 The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts ,that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm (cf. \ref{lit:schema-matching}), but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points) instead of a collection of pair-wise links between fields.271 The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm (cf. 
\ref{lit:schema-matching}), but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points) instead of a collection of pair-wise links between fields. 272 272 273 273 \subsection{Interface Specification} … 455 455 The documentation of the XSLT stylesheets and the build process is found in appendix \ref{sec:smc-xsl-docs}. 456 456 457 The service is implemented as a RESTful service, however only supporting the GET operation, as it operates on a data set,that the users cannot change directly. (The changes have to be performed in the upstream registries.)457 The service is implemented as a RESTful service, however, only supporting the GET operation, as it operates on a data set that the users cannot change directly. (The changes have to be performed in the upstream registries.) 458 458 459 459 … 479 479 \item[\xne{termets}] a list of all available Termsets compiled from the CMD profiles, and available DCRs; for \xne{ISOcat} a termset is generated for every available language 480 480 \item[\xne{cmd-terms}] a flat list of \code{Term} elements representing all components and elements in all known profiles; grouped in \code{Termset} elements representing the profiles 481 \item[\xne{cmd-terms-nested}] as above, however the \code{Term} elements are nested reflecting the component structure in the profile481 \item[\xne{cmd-terms-nested}] as above, however, the \code{Term} elements are nested reflecting the component structure in the profile 482 482 \item[\xne{dcr-terms}] a list of \code{Concept} elements representing the data categories with nested \code{Term} elements encoding their properties (\code{id, label}) 
corresponding to a given data category (cf. listing \ref{lst:dcr-cmd-map}) 484 \item[\xne{rr-terms}] Additional index generated based on the relations between data categories as defined in the Relation Registry; the \code{Concept} elements representing the pair of related data categories are wrapped with a \code{Relation} element (with a \code{@type} attribute 484 484 \item[\xne{rr-terms}] Additional index generated based on the relations between data categories as defined in the Relation Registry; the \code{Concept} elements representing the pair of related data categories are wrapped with a \code{Relation} element (with a \code{@type} attribute). 485 485 \end{description} 486 486 487 487 \subsubsection{Operation} 488 For the actual service operation a minimal application has been implemented ,that accesses the cached internal datasets and optionally applies XSL stylesheets for post-processing depending on requested format.488 For the actual service operation a minimal application has been implemented that accesses the cached internal datasets and optionally applies XSL stylesheets for post-processing depending on requested format. 489 489 The application implements the interface as defined in \ref{def:cx-interface} as an XQuery module based on the \xne{restxq} library within an \xne{eXist} XML database. 490 490 … 495 495 Also, use of \emph{other than equivalence} relations will necessitate more complex logic in the query expansion and accordingly also a more complex response of the crosswalk service, either returning the relation types themselves as well or equipping the list of indexes with some kind of similarity ratio. 496 496 497 \section{qx -- concept-based search}497 \section{qx -- Concept-based Search} 498 498 499 499 To recall, the main goal of this work is to enhance the search capabilities of the search engines serving the metadata. 500 In this section we want to explore how this shall be accomplished, i.e. 
how to bring the enhanced capabilities to the user.500 In this section, we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user. 501 501 502 502 The emphasis lies on the query language and the corresponding query input interface. A crucial aspect is the question of how to integrate the additional processing, i.e. how to deal with the even greater amount of information in a user-friendly way without overwhelming the user, while still being verbose about the applied processing on demand for the user to understand how the result came about and, even more importantly, to allow the user to manipulate the processing easily. 503 503 504 Note , that this chapter deals only with the schema level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The instance level is dealt with in \ref{semantic-search}.504 Note that this chapter deals only with the schema level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The instance level is tackled in \ref{sec:values2entities} (and also there only rather superficially). 505 505 506 506 Note also that \emph{query expansion} needs to be distinguished from \emph{query translation}, the task of expressing an input query in another query language (e.g. a CQL query expressed as XPath). … 509 509 \label{cql} 510 510 As the base query language to build upon, the \emph{Context Query Language} (CQL) is used, a well-established standard, designed with extensibility in mind. 511 CQL is the query language defined as part of \xne{SRU/CQL} -- the communication protocol introduced by the Library of Congress. SRU is a simplified, XML- and HTTP-based successor to Z39.50 \cite{Lynch1991}, which is very widely spread in the library networks.512 It was introduced 2002 \cite{Morgan04}. 
The maintenance of SRU/CQL has been513 transfer ed from LoC to OASIS in 2012, and OASIS released a first version of the protocol as Committee Specification in April 2012\cite{OASIS2012sru}.)511 CQL is the query language defined as part of \xne{SRU/CQL} -- the communication protocol introduced by the Library of Congress. SRU is a simplified, XML- and HTTP-based successor to Z39.50 \cite{Lynch1991}, which is very widespread in library networks. 512 It was introduced in 2002 \cite{Morgan04}. The maintenance of SRU/CQL has been 513 transferred from LoC to OASIS in 2012, and OASIS released a first version of the protocol as Committee Specification in April 2012 \cite{OASIS2012sru}. 514 514 515 515 Coming from the libraries world, the protocol has a certain bias in favor of bibliographic metadata. … 525 525 The query language part (CQL - Context Query Language) defines a relatively complex and complete query language. 526 526 The decisive feature of the query language is its inherent extensibility allowing to define own indexes and operators. 527 In particular, CQL introduces so-called \emph{context sets} -- a kind of application profiles that allow to define new indexes or even comparison operators in own namespaces. This feature can be employed to integrate the dynamic indexes adhering to the \var{smcIndex} syntax as proposed in \ref{def:smcIndex}.527 In particular, CQL introduces the so-called \emph{context sets} -- a kind of application profile that allows to define new indexes or even comparison operators in own namespaces. This feature can be employed to integrate the dynamic indexes adhering to the \var{smcIndex} syntax as proposed in \ref{def:smcIndex}. 
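To make the interplay of \var{smcIndex} and CQL concrete, the following is a minimal sketch of how a single search clause using a concept-based index could be expanded into a disjunction over the mapped concrete indexes. The index names and the hard-coded crosswalk table are hypothetical stand-ins for the response of the cx service:

```python
def expand_clause(index, relation, term, crosswalks):
    """Rewrite one CQL search clause `index relation "term"` into a
    disjunction over all concrete indexes the crosswalk table maps the
    smcIndex to. Falls back to the index itself if no mapping exists."""
    targets = crosswalks.get(index, [index])
    clauses = ['%s %s "%s"' % (t, relation, term) for t in targets]
    return clauses[0] if len(clauses) == 1 else "(" + " or ".join(clauses) + ")"

# Hypothetical crosswalk data: one concept-based index mapped to two
# profile-specific element paths (names invented for illustration).
crosswalks = {
    "isocat:DC-2482": ["Session.Language.Name", "TextCorpus.Language"],
}
expanded = expand_clause("isocat:DC-2482", "=", "German", crosswalks)
# -> '(Session.Language.Name = "German" or TextCorpus.Language = "German")'
```

A full implementation would additionally have to parse boolean combinations of clauses and escape quoting in terms; the sketch only shows the per-clause rewriting step.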
528 528 529 529 The SRU/CQL protocol has also been adopted by the CLARIN community as base for a protocol for federated content search\furl{http://clarin.eu/fcs} (FCS) \cite{stehouwer2012fcs}, which is another argument to use this protocol for metadata search as well, given the inherent interrelation between metadata and content search. … … 541 541 542 542 %\begin{note} 543 Alternatively to the -- potentially costly -- on the fly expansion, the concept-based equivalence clusters could be applied already during the indexing of the data. That means that ``virtual'' search indexes are defined for individual data categoriesin which values from all metadata fields annotated with given data category are indexed. Indeed, this approach is already being applied in the search applications VLO and Meertens Institute Search Engine (cf. \ref{cmdi_exploitation}).543 Alternatively to the -- potentially costly -- on-the-fly expansion, the concept-based equivalence clusters could be applied already during the indexing of the data. That means that ``virtual'' search indexes are defined for individual data categories, in which values from all metadata fields annotated with given data category are indexed. Indeed, this approach is already being applied in the search applications VLO and Meertens Institute Search Engine (cf. \ref{cmdi_exploitation}). 544 544 %\end{note} 545 545 546 \subsection{SMC as module for Metadata Repository}546 \subsection{SMC as Module for Metadata Repository} 547 547 548 548 As a concrete proof of concept the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI providing all the metadata records harvested within the CLARIN joint metadata domain (cf. \ref{cmdi_exploitation}). 549 549 550 Metadata repository itself is implemented as custom project within \xne{cr-xq}, a generic web application developed in XQuery running within the eXist XML-database. 
\xne{cr-xq} is developed by the author as part of a larger publication framework \xne{corpus\_shell}. As can be seen in figure \ref{fig:modules-mdrepo} within \xne{cr-xq} the crosswalk service -- implemented as the \xne{smc-xq} module -- is used by the search module \xne{fcs}, which is in turn used by the \xne{query\_input} module,that provides a user interface widget for formulating the query.550 Metadata Repository itself is implemented as custom project within \xne{cr-xq}, a generic web application developed in XQuery running within the eXist XML-database. \xne{cr-xq} is developed by the author as part of a larger publication framework \xne{corpus\_shell}. As can be seen in figure \ref{fig:modules-mdrepo} within \xne{cr-xq} the crosswalk service -- implemented as the \xne{smc-xq} module -- is used by the search module \xne{fcs}, which is in turn used by the \xne{query\_input} module that provides a user interface widget for formulating the query. 551 551 552 552 \begin{figure*} 553 553 \begin{center} 554 554 \includegraphics[width=0.8\textwidth]{images/modules_mdrepo-smc.png} 555 \caption{The component view on the SMC - modules and their inter-dependencies}555 \caption{The component diagram of the integration of SMC as module within the Metadata Repository} 556 556 \label{fig:modules-mdrepo} 557 557 \end{center} … … 561 561 \subsection{User Interface} 562 562 563 A starting point for our considerations is the traditional structure found in many (``advanced'') search interfaces, which is basically a an array of tuples of index, comparison operator, terms combined by a boolean operator. This is reflected in the CQL syntax with the basic \var{searchClause} and the boolean operators to formulate more complex queries.563 A starting point for our considerations is the traditional structure found in many (``advanced'') search interfaces, which is basically an array of tuples of index, comparison operator, terms combined by a boolean operator. 
This is reflected in the CQL syntax with the basic \var{searchClause} and the boolean operators to formulate more complex queries. 564 564 \begin{definition}{Generic data format for structured queries} 565 565 < index, operation, term, boolean >+ … 581 581 582 582 \noindent 583 Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions. 584 Although we concentrate on query input, the use of indexes has to be consistent across the user interface, be it in labeling the fields of the results, or when providing facets to drill down the search. 585 586 A fundamentally different approach is the "content first" paradigm, that, similiar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly, when the user starts typing any string. The difference is, that the suggestions are typed, so that the user is informed from which index given term comes (\concept{person}, \concept{place}, etc.) 587 588 Combining the two approaches, we could arrive at a ``smart'' widget a input field with on the fly query parsing and contextual autocomplete. Though even such a widget would still share the underlying data model of \xne{CQL} in combination with \var{smcIndexes}. 583 Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions. Although we concentrate on query input, the use of indexes has to be consistent across the user interface, be it in labelling the fields of the results, or when providing facets to drill down the search. 584 585 A fundamentally different approach is the "content first" paradigm that, similar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly, when the user starts typing any string. 
The difference is that the suggestions are typed, so that the user is informed from which index a given term comes (\concept{person}, \concept{place}, etc.) 586 587 Combining the two approaches, we could arrive at a ``smart'' widget consisting of one input field with on-the-fly query parsing and contextual autocomplete. Though even such a widget would still share the underlying data model of \xne{CQL} in combination with \var{smcIndexes}. 589 588 590 589 … 595 594 As the CMD dataset keeps growing both in numbers and in complexity, the call from the community to provide enhanced ways for its exploration gets stronger. In the following, some design considerations for an application to answer this need are proposed. 596 595 597 While the Component Registry (cf. \ref{def:CR}) allows to browse, search and view existing profiles and components, it is not possible to easily find out which components are reused in which profiles and also which data categories are referenced by which elements. 
However, this kind of information is crucial during profile creation as well as for curation of the existing profiles, as it enables the data modeller to recognize a) which components and data categories are those most often used, indicating their adoption and popularity within the community and b) the thematic contexts in which individual components are used, providing a hint about their appropriateness for given research data. 598 597 599 598 \subsection{Design} … … 615 614 616 615 \subsubsection{Requirements} 617 Given the size of the data set (currently more than 4.000 nodes and growing) it is obvious ,that it is not possible to overview the whole of the graph in one view. Thus, a general essential requirement is to be able to select and view subgraphs by various means.616 Given the size of the data set (currently more than 4.000 nodes and growing) it is obvious that it is not possible to overview the whole of the graph in one view. Thus, a general essential requirement is to be able to select and view subgraphs by various means. 618 617 619 618 In a basic scenario, user looks for possibly reusable profiles or components, based on some common terms associated with the type of data to be described (e.g. \code{"corpus"}). If the search yields matching profiles or components, the user should be able to view the whole structure of the profiles, explore the definitions for individual components and see which data categories are being referenced for semantic grounding. Furthermore, it has to be possible to view multiple profiles concurrently, in particular to be able to see the components or data categories they share and, vice versa, in which profiles a given data category is referenced. … … 658 657 \end{quotation} 659 658 660 Especially remarkable feature is the possibility to add custom constraints ,that are accomodated with the constraints imposed by the base algorithm. 
This enables flexible customization of the layout, still harnessing the power of the underlying layout algorithm. At the same time this is a quite challenging feature to master, as with different constraint affecting the layout algorithm, it is at times difficult to understand the impact of a specific constraint on the layout.659 An especially remarkable feature is the possibility to add custom constraints that are accommodated with the constraints imposed by the base algorithm. This enables flexible customization of the layout, still harnessing the power of the underlying layout algorithm. At the same time this is a quite challenging feature to master, as with different constraints affecting the layout algorithm, it is at times difficult to understand the impact of a specific constraint on the layout. 661 660 662 661 \subsubsection{Data preprocessing} 663 662 \label{smc-browser-data-preprocessing} 664 The application operates on a set of static XHTML and JSON data files ,that are created in a preprocessing step and deployed with the application. The preprocessing consists of a series of XSLT transformations (cf. figure \ref{fig:smc_processing}), starting from the internal datasets generated during the initialization (cf. \ref{smc_init}). The HTML output for \xne{smc-stats} is generated in two steps (\var{track S}) via an intermediate internal generic XML format for representing tabular data. The JSON data for the \xne{smc-graph} as expected by the \xne{d3} library is also generated in two steps (\var{track G}). First, a XML representation of the graph is generated from the data (\xne{terms2graph.xsl}), on which a generic XSLT-transformation is applied (\xne{graph\_json.xsl}) transforming the XML graph into required JSON format. 
In fact, this track is run multiple times, generating different variants of the graph that feature different aspects of the dataset:

\begin{description}
…
\end{description}

Additionally, a detour pass (\var{track D}) is executed, in which the graph is also transformed into the DOT format and run through the \xne{Graphviz dot} tool to obtain an SVG representation of the graph. In an early stage of development, this was actually the only processing path. However, it soon became obvious that the graph was getting too huge to be displayed in its entirety. Figure \ref{fig:cmd-dep-dotgraph} displays an old version of such a dot-generated graph visualization. Currently, the \xne{dot} output is only used as input for the final graph data, providing initialization coordinates for the nodes in the \code{dot}-layout.

The graph is constructed from all profiles defined in the Component Registry and related datasets. To resolve the (multilingual) names and descriptions of data categories referenced in the CMD element definitions, the specifications of the referenced data categories are fetched from DublinCore and ISOcat.

…

As proposed in the design section, the starting point when using the SMC browser is the node list on the left, listing all nodes grouped by type (profiles, components, elements, data categories) and sorted alphabetically. This list can be filtered by a simple substring search, which is important, as there are already more than 4.000 nodes in the graph. Individual nodes are selected and deselected by a simple click. All selected nodes are displayed in the main graph pane, represented by a circle with a label. The representation is styled by type. Based on the settings in the navigation bar (cf. figure \ref{fig:navbar}), related nodes are displayed next to the selected ones. The \code{depth-before} and \code{depth-after} options govern how many levels in each direction are traversed and displayed, starting from the set of selected nodes.
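The effect of the \code{depth-before} and \code{depth-after} options amounts to a bounded traversal of the directed graph in both directions. A minimal sketch on a toy graph (in Python; the actual implementation is part of the browser's JavaScript code):

```python
from collections import defaultdict

# Toy directed graph along the CMD containment axis:
# profile -> component -> element -> data category
edges = [("profile", "component"), ("component", "element"),
         ("element", "datcat")]

succ, pred = defaultdict(set), defaultdict(set)
for a, b in edges:
    succ[a].add(b)
    pred[b].add(a)

def neighbourhood(selected, depth_before, depth_after):
    """Collect all nodes reachable from the selected set within
    depth_before levels against and depth_after levels along the edges."""
    result = set(selected)
    for rel, depth in ((pred, depth_before), (succ, depth_after)):
        frontier = set(selected)
        for _ in range(depth):
            frontier = {n for f in frontier for n in rel[f]}
            result |= frontier
    return result

print(sorted(neighbourhood({"component"}, 1, 2)))
# → ['component', 'datcat', 'element', 'profile']
```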
Option \code{layout} allows selecting one of the available layouts -- next to the basic \code{force} layout there are also directed layouts that are often better suited for displaying the directed graph.
Other options influence the layout algorithm (\code{link-distance}, \code{charge}, \code{friction}) and the visual representation of the nodes and edges (\code{node-size, labels, curve}).

One special option is \code{graph}, which allows switching between the different graphs listed in \ref{smc-browser-data-preprocessing}.

There is user documentation deployed with the application and featured in the appendix \ref{sec:smc-browser-userdocs}, where all aspects of interaction with the application (\ref{interaction}) and the options in the navigation bar (\ref{options}) are described.

…

\label{smc-browser-extensions}

Next to the basic setup described above, there is a number of possible additional features that could enhance the functionality and usefulness of the discussed tool.

\subsubsection*{Graph operations -- differential views}

…

Equipped with a more flexible or modular matching algorithm (in addition to the initially foreseen identity match), the tool could visualize matches between any given schemas, not only CMD-based ones.
Also, the input format being a graph, with appropriate preprocessing the tool could visualize any structural information that lends itself to being expressed as a graph, like cooccurrence analysis, dependency networks, RDF data in general, etc.

\subsubsection*{Viewer for external data}
The above feature would be even more useful if the application were able to ingest and process external data. The data could be passed either via upload or via a parameter holding a URL of the data. This is especially attractive for providers of other data and applications, who could offer a simple link in their user interface (with the data parameter appropriately set) that would allow visualizing their data in the SMC browser.

One prominent visualization application offering this feature is the geobrowser e4D\furl{http://www.informatik.uni-leipzig.de:8080/e4D/} (currently \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo}, developed in the context of the \xne{europeana connect} initiative), accepting data in KML format.

\subsubsection*{Integrate with instance data}
The usefulness and information gain of the application could be greatly increased by integrating the instance data, i.e. generating and displaying a variant of the graph that contains only profiles for which instance data is actually present in the CLARIN joint metadata domain. Obviously, such a visualization could incorporate the size of the data, in the simplest case by mapping the number of records onto the radius of the nodes, but a number of other metrics could be applied in the visualizations as well.

Also, such a visualization could feature direct search links from individual nodes into the dataset, i.e. from a profile node a link could lead into a search interface listing the metadata records of the given profile.

…

%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Application of \emph{Schema Matching} Techniques in SMC}
\label{sec:schema-matching-app}

…

Or, put in terms of the schema matching methodology, the system relies on explicitly set concept equivalences as the basis for mapping between schema entities. By referencing a data category in a CMD element, the modeller binds this element to a concept, making two elements linked to the same data category trivially equivalent.
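The trivial equivalence through shared concept references can be made concrete with a small sketch (the profile and element names as well as the PID values below are invented for illustration):

```python
# Each CMD element references exactly one data category via a PID.
# (Profile/element names and PIDs are made up.)
datcat_ref = {
    ("ProfileA", "Title"): "hdl:0000/DC-0001",
    ("ProfileB", "Name"):  "hdl:0000/DC-0001",
    ("ProfileB", "Genre"): "hdl:0000/DC-0002",
}

def equivalent(e1, e2):
    """Two elements are trivially equivalent iff they reference
    the same data category."""
    return datcat_ref[e1] == datcat_ref[e2]

assert equivalent(("ProfileA", "Title"), ("ProfileB", "Name"))
assert not equivalent(("ProfileB", "Name"), ("ProfileB", "Genre"))
```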
However, this only holds for schemas already created within the CMD framework (and even for these only to a certain degree, as will be explained later). Given the growing universe of definitions (data categories and components) in the CMD framework, the metadata modeller could very well profit from applying schema mapping techniques as a pre-processing step in the task of integrating existing external schemas into the infrastructure. (User involvement is identified by \cite{shvaiko2012ontology} as one of the promising future challenges to ontology matching.) Already now, we witness a growing proliferation of components in the Component Registry and of data categories in the Data Category Registry.

Let us restate the problem of integrating existing external schemas as an application of the \var{schema matching} method:
the data modeller starts off with an existing schema \var{$S_{x}$}. The system accommodates a set of schemas\footnote{Even though within CMDI the data models are called `profiles', we can still refer to them as `schemas', because every profile has an unambiguous expression in an XML Schema.} \var{$S_{1..n}$}.
It is very improbable that there is a \var{$S_{y} \in S_{1..n}$} that fully matches \var{$S_{x}$}.
Given the heterogeneity of the schemas present in the field of research, full alignments are not achievable at all.
However, thanks to the compositional nature of the CMD data model, the data modeller can reuse just parts of any of the schemas -- the components \var{c}. Thus the task is to find for every entity $e_{x} \in S_{x}$ the set of semantically equivalent candidate components $\{c_{y}\}$, which corresponds to the definition of the mapping function for single entities as given in \cite{EhrigSure2004}.
Given that the modeller does not have to reuse the components as they are, but can use existing components as a base to create her own, even candidates that are not equivalent can be of interest. Thus we can further relax the task and allow candidates that are just similar to a certain degree (operationalized as a threshold $t$ on the output of the \var{similarity} function).
Being only a pre-processing step meant to provide suggestions to the human modeller implies a higher importance of recall than of precision.

…

the mapping function could be enriched with \emph{extensional} features based on the concept clusters as delivered by the crosswalk service \ref{sec:cx}. It would also be worthwhile to test in how far the \var{smcIndex} paths as defined in \ref{def:smcIndex} could be used as a feature (compute the longest matching subpath).
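The relaxed candidate search can be sketched as follows: for every entity of the external schema, all registered components whose similarity exceeds the threshold $t$ are returned, best matches first. The similarity function below is a simple Dice coefficient over character bigrams of the names -- a deliberately crude stand-in for whatever combination of lexical, structural and extensional features is actually chosen; the component names are invented.

```python
def bigrams(s):
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def similarity(a, b):
    """Dice coefficient over character bigrams of two names."""
    ba, bb = bigrams(a), bigrams(b)
    return 2 * len(ba & bb) / (len(ba) + len(bb))

def candidates(entity, components, t=0.3):
    """All components at least t-similar to the entity, best first.
    A low threshold favours recall over precision."""
    scored = [(similarity(entity, c), c) for c in components]
    return [c for s, c in sorted(scored, reverse=True) if s >= t]

components = ["ActorLanguage", "Language", "LanguageName", "Country"]
print(candidates("languageName", components))
# → ['LanguageName', 'Language', 'ActorLanguage']
```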
Although we exemplified the approach on the case of integration of an external schema, it could be applied also to the schemas already integrated in the system. Although there is already a high baseline thanks to the mechanisms of reuse of components and data categories, there certainly still exist semantic proximities that are not explicitly expressed by these mechanisms. This deficiency is rooted in the collaborative creation of the CMD components and profiles, where individual modellers overlooked, deliberately ignored or only partially reused existing components or profiles. This can be seen in the case of the multiple teiHeader profiles that, though they model the same existing metadata format, are completely disconnected in terms of component and data category reuse (cf. \ref{results:tei}).

Note that in the case of reuse of components, in the normal scenario, the semantic equivalence is ensured even though the new component (and all its subcomponents) is a copy of the old one with a new identity, because the references to data categories are copied as well. Thus, by default, the new component shares all data categories with the original one and the modeller has to deliberately change them if required. But even with reuse of components, scenarios are thinkable in which the semantic linking gets broken, or is not established, even though semantic equivalence prevails.

The question is what to do with the new correspondences that would possibly be determined when, as proposed, we apply the schema matching to the integrated schemas. One possibility is to add a data category, if one of the pair is still missing one.
However, if both are already linked to a data category, the data category pair could be added to the relation set in the Relation Registry (cf. \ref{def:rr}).

Once all the equivalences (and other relations) between the profiles/schemas are found, similarity ratios can be determined.
These new similarity ratios could be applied as alternative weights in the profiles-similarity graph \ref{sec:smc-cloud}.

In contrast to the task described here, which -- restricted to matching XML schemas -- can be seen as staying in the ``XML World'',
another aspect within this work is clearly situated in the Semantic Web domain and requires application of ontology matching methods -- the mapping of field values to semantic entities described in \ref{sec:values2entities}.

%This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.

\section{Summary}
In this core chapter, we laid out a design for a system dealing with concept-based crosswalks on the schema level.
The system consists of three main parts: the crosswalk service, the query expansion module and the \xne{SMC Browser} -- a tool for visualizing and exploring the schemas and the corresponding crosswalks.
In addition, we elaborated on the application of schema matching methods to infer mappings between schemas.
SMC4LRT/chapters/Infrastructure.tex
\chapter{Underlying Infrastructure}
\label{ch:infra}

…

\label{def:CLARIN}

CLARIN -- Common Language Resource and Technology Infrastructure \cite{Varadi2008} -- is one of the large research infrastructure initiatives as envisaged by the European Strategy Forum on Research Infrastructures (ESFRI) and fostered by the framework programmes of the European Commission. The mission of this project is to provide

\begin{quote}
\dots easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video or multimodal form) and advanced tools to discover, explore, exploit, annotate, analyse or combine them, independent of where they are located. \cite{CLARIN2013web}
\end{quote}

…

The initiative foresees a federated network of centres providing resources and services in a harmonized, interoperable manner to the academic community in all participating countries.

In the preparation phase of the project (2008--2011), over 180 institutions from 38 countries participated. In the construction phase, the action impetus moved, as projected, more to the individual national initiatives of this federated endeavour, while being kept together by the common principles set up during the preparation phase and the established processes and administrative decision bodies ensuring the flow of information and coherent action on the European level.

Since 2013, CLARIN is also a \emph{European Research Infrastructure Consortium} (ERIC), a new type of legal entity established within the EU, especially designed to give research infrastructure initiatives a more stable status and better means to act independently. This is an important step towards ensuring the continuity of the endeavour, a chronic problem of (international) projects.

…

\label{def:CMDI}

One core pillar of CLARIN is the \emph{Component Metadata Infrastructure} (CMDI)\furl{http://www.clarin.eu/cmdi} -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent, harmonized way. The conceptual foundation of CMDI is the \emph{Component Metadata Framework} \cite{Broeder+2010}, a flexible meta model that supports the creation of metadata schemas, also allowing to accommodate existing schemas (cf. \ref{def:CMD}).

The SMC is part of CMDI and depends on multiple modules on the production side of the infrastructure. Before we describe the SMC and its interaction with these modules in detail in chapter \ref{ch:design}, we introduce the latter and the type of data they provide in \ref{cmdi-registries}:

…

\noindent
All these modules are running services that this work shall directly build upon.

In contrast, the SMC is meant as a provider for the modules on the exploitation side of the infrastructure, i.e. search and exploration services used by the end users. These are briefly introduced in \ref{cmdi_exploitation}.

…

Finally, the Vocabulary Alignment Service, a module playing a crucial role in metadata curation, is treated separately in section \ref{sec:cv}.

\subsection{CMDI Registries}
\label{cmdi-registries}
The CMD framework as data model (cf. \ref{def:CMD}), together with the two registries -- the \emph{Data Category Registry} \xne{ISOcat} and the \emph{Component Registry} -- builds the backbone of the CMD Infrastructure. See figure \ref{fig:cmdi-old} with the rather na\"{i}ve initial vision of the system, contrasted with figure \ref{fig:SMC-linkage} detailing the actual linkage between the data in the individual registries. In the following, we briefly explain their role and interaction.
…

\begin{figure*}[t]
\includegraphics[width=1\textwidth]{images/SMC_CR-DCR-RR_Linkage_v2}
\caption{The diagram depicts the links between pieces of data in the individual registries that serve as the basis for semantic mapping.}
\label{fig:SMC-linkage}
\end{figure*}

…

Next to a web interface for users to browse and manage the data categories, ISOcat provides a REST-style webservice allowing applications to retrieve the data category specifications. By default, they are provided in the \xne{Data Category Interchange Format -- DCIF}, the standardized XML serialization of the data model, but RDF and HTML representations are available as well.

The core data model defining the data category specification is rather complex, consisting of an administrative, a linguistic and a description part, containing language-specific versions of definitions, value domains, examples and other attributes (cf. \ref{fig:DCR_data_model} for the diagram of the full data model). The following types of data categories are recognized (cf. figure \ref{fig:dc_type}): \var{simple}, \var{complex} (\var{closed}, \var{open} or \var{constrained}) and \var{container}. One fundamental aspect to emphasize is that the data categories are assigned a persistent identifier, making them globally and permanently referable.

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=0.7\textwidth]{images/dc_types}
\end{center}
\caption{Data Category types \cite{Windhouwer2011}}
\label{fig:dc_type}
\end{figure*}

…

\label{def:CR}

\emph{Component Registry}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/} (CR) implements the CMD data model (cf. \ref{def:CMD}) and fulfills two functions. For one, it is the actual registry that persistently stores and exposes published CMD profiles via a web interface, allowing to browse and search in them and to view their structure, accompanied by a REST webservice that allows client applications to retrieve the profile definitions. At the same time, the web interface serves as an editor for creating and editing new CMD components and profiles.

The primary user of the CR is the metadata modeller with the task to create a dedicated metadata profile for a given resource type. She can browse and search the CR for components and profiles that are suitable or come close. The registry already contains many general components, e.g., for contact persons, language and geographical information. In general, many of these can be reused as they are or only have to be slightly adapted, i.e., have some metadata elements and/or components added or removed. Also, new components can be created if needed to model the unique aspects of the resources under consideration \cite{Durco2013MTSR}.

Let us reiterate that the actual core provision for semantic interoperability is the requirement that the elements (and as far as possible also components and values) should be linked ``via a PID to exactly one data category (cf. \ref{def:DCR}), thus indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}, or \emph{to make its semantics explicit}.

As dictated by the CMD model, all components needed for the modelled resource description are compiled into one profile.
…

The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions.
However, there needs to be an additional means to capture information about relations between data categories.
This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} -- where arbitrary relations between data categories can be stored and maintained. This design decision is based upon the assumption that the relations need to be under the control of the metadata user, whereas the data categories are under the control of the metadata modeller.

The relations don't need to pass a standardization process; rather, separate research teams may define their own sets of relations according to the specific needs of their project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed.
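Operationally, such a relation set of declared equivalences between data categories induces equivalence classes via its symmetric-transitive closure. A sketch (the data category identifiers below are invented; the real relation sets live in the Relation Registry):

```python
from collections import defaultdict

# A hypothetical relation set: declared equivalences between data categories.
same_as = [("dc:title", "olac:title"), ("olac:title", "tei:titleStmt")]

def equivalence_classes(pairs):
    """Union-find over the declared pairs: the symmetric-transitive
    closure partitions the identifiers into equivalence classes."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    classes = defaultdict(set)
    for x in parent:
        classes[find(x)].add(x)
    return sorted(sorted(c) for c in classes.values())

print(equivalence_classes(same_as))
# → [['dc:title', 'olac:title', 'tei:titleStmt']]
```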
There is a prototypical implementation of such a relation registry called \xne{RELcat}, being developed at MPI, Nijmegen \cite{Windhouwer2011,SchuurmanWindhouwer2011}, that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
This implementation stores the individual relations as RDF triples allowing typed relations, like equivalency (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently. The relations are deliberately defined in a separate namespace, instead of reusing existing ones (\code{skos:exactMatch, owl:sameAs}), with the aim to avoid introducing too specific semantics. These relations can be mapped to appropriate other predicates when integrating the relation sets in concrete applications.

… …

\end{definition}

\subsection{Further Parts of the Infrastructure}
\label{cmdi-other}

… …

\begin{quotation}
RELcat and SCHEMAcat will provide the means to harvest and specify this information in the form of relationships and allow
(search) algorithms to traverse the semantic graph thus made explicit \cite{SchuurmanWindhouwer2011}.
\end{quotation}

\subsubsection*{Schema Parser}
Schema Parser is a service developed at the Meertens Institute, Amsterdam, that processes XML Schemas to generate all possible paths in the instance data. It is used primarily as an auxiliary service to the search engine developed at the same institute, presented in the following subsection.

\subsubsection*{Metadata editors}

… …

Given that the Component Registry generates an XML schema for every profile, basically any generic XML editor with schema validation can be used (e.g. the wide-spread \xne{oXygen}). However, there have been efforts within the CLARIN community to develop dedicated tools, tailor-made for the creation of CMD records.
Two examples are the stand-alone application \xne{Arbil}\furl{http://tla.mpi.nl/tools/tla-tools/arbil/} \cite{withers2012arbil}, developed at the Max Planck Institute for Psycholinguistics, Nijmegen, and the web-based application developed within the project \xne{NaLiDa}\furl{http://www.sfs.uni-tuebingen.de/nalida/en/} \cite{dima2012mdeditor} at the Seminar f\"ur Sprachwissenschaft, University of T\"ubingen.
\subsection{CMDI Exploitation Side}
\label{cmdi_exploitation}
Metadata complying with the CMD data model is being created by a growing number of institutions by various means -- automatic transformation from legacy data or authoring of new metadata records with the help of one of the metadata editors (cf. \ref{md-editors}). The CMD infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints. These are collected daily by a dedicated CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/}. The harvested data is validated against the corresponding schemas (every profile implies a separate schema). In the future a subsequent normalization step will play a bigger role; currently only minimal ad-hoc label normalization is performed for a few organization names.
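A harvesting run against such an endpoint issues standard OAI-PMH requests, e.g. \code{?verb=ListRecords\&metadataPrefix=cmdi}; the response skeleton looks roughly as follows (the record identifier is invented for illustration; the payload of each record is the CMD instance):

\lstset{language=XML}
\begin{lstlisting}
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:resource-0001</identifier>
      </header>
      <metadata>
        <!-- the CMD record as payload -->
        <CMD xmlns="http://www.clarin.eu/cmd/"> ... </CMD>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
\end{lstlisting}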
Finally, the data is made (publicly) available as compressed archive files. These are fetched by the exploitation-side applications that ingest the metadata records, index them and make them available for searching and browsing (cf. figure \ref{fig:cmd-ingestion}).

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=0.8\textwidth]{images/CMDingestion_woVAS}
\caption{Within CMDI, metadata is harvested from content providers via OAI-PMH and made available to consumers/users by search applications.}
\label{fig:cmd-ingestion}
\end{center}
\end{figure*}

The first stable and publicly available application providing access to the collected metadata of CMDI has been the \xne{VLO - Virtual Language Observatory}\furl{http://www.clarin.eu/vlo/} \cite{VanUytvanck2010}, developed by the Technical Group at the MPI for Psycholinguistics, Nijmegen, based on the wide-spread full-text search engine \xne{Apache Solr}\furl{http://lucene.apache.org/solr/}.
The application employs a faceted search with 10 fixed facets (figure \ref{fig:vlo}).
As the processed metadata records are instances of different CMD profiles and thus have very differing structures, the application relies on the data category references in the underlying schemas to map the fields in the records onto the facets, effectively making use of this basic layer of semantic interoperability provided by the infrastructure.

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=0.8\textwidth]{images/screen_VLO_overview.png}
\caption{Screenshot of the faceted browser of the VLO}
\label{fig:vlo}
\end{center}
\end{figure*}

More recently, the team at the Meertens Institute developed a similar application, the \texttt{MI Search Engine}\furl{http://www.meertens.knaw.nl/cmdi/search/}. It is also based on Apache Solr and provides a faceted search, but with a substantially more sophisticated indexing process and search interface \cite{Zhang2012cmdi}.
Instead of reducing the data to a fixed number of indexes or facets, the application employs the aforementioned \xne{Schema Parser} to dynamically generate an index configuration that covers all data, again relying on the data categories to merge information from semantically equivalent metadata fields in the different schemas into a common index.
The application also offers some innovative solutions on the user interface, like search by similarity, content-first search or specialized contextual widgets visualizing the time dimension, the geographic information and other derived data.
% \todoin { describe indexing and search}

And finally, there is the \xne{Metadata Repository}, being developed by the author as an XQuery application in the XML database \xne{eXist}, originally (in the initial blueprints of the infrastructure) foreseen as the main storage of the collected metadata, with the \xne{Metadata Service} on top providing search access to the data, optionally applying \xne{Semantic Mapping} to expand user queries (cf. figure \ref{fig:cmdi-old}) \cite{Durco2011}.
However, the application has not yet reached production quality, and is used rather as an experimentation field for the author. Meanwhile, the functionality of the Metadata Service has been integrated directly into the Metadata Repository, together with the auxiliary use of Semantic Mapping, making it the implementation of the semantic search module as proposed in this work (cf. \ref{sec:qx}).

%%%%%%%%%%%%%%%%%%%%

… …

\label{sec:cv}

\subsection{Motivation \& Broader Context}
The provisions for data harmonization and semantic interoperability as presented until now pertain mostly to the schema level. However, the problem of incoherent labelling and nomenclature is even more virulent in the actual metadata fields on the instance level. While for a number of fields the value domain can be enforced through schema validation, many fields (e.g. \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be explicitly and exhaustively enumerated.
This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the constrained-field values.

This issue is to be seen in the broader context of a general need for reliable community-shared registry services for concepts, controlled vocabularies and reference data in both the LRT and Digital Humanities communities, applicable in a range of applications and tasks like data enrichment and annotation, metadata generation and curation, data analysis, etc.

… …

Consequently, activities with regard to controlled vocabularies are ongoing not only in CLARIN, but also within the sister ESFRI project DARIAH. As there is a substantial overlap in the vocabularies relevant for the various communities and, even more so, a high potential for reusability on the technical level, there is a strong case for tight synergic cooperation between individual initiatives.

It also has to be kept in mind that a host of work on controlled vocabularies has already been done and a large body of data is present in individual specialized communities (taxonomies) as well as -- with more general scope -- in the libraries world (authority files).
\begin{comment}

… …

\label{def:CLAVAS}

In the context of CLARIN (primarily CLARIN-NL), a concrete initiative has been conducted -- \xne{Vocabulary Alignment Service for CLARIN} or CLAVAS -- with the objective to reuse and enhance for CLARIN needs a SKOS-based vocabulary repository and editor, \xne{OpenSKOS}\furl{http://openskos.org}, developed and run within the Dutch programme \xne{CATCHplus}\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}.

%As of spring 2013, the Standing Committee on CLARIN Technical Centres (SCCTC) adopted the issue of Controlled Vocabularies and Concept Registries as one of the infrastructural (A-centre) services to be dealt with.

The basic idea of this repository is to serve as a project-independent manager and provider of controlled vocabularies, as an exchange platform for data in SKOS format.
One important feature of the \xne{OpenSKOS} system is its distributed architecture. Multiple instances can be set up that can synchronize the maintained vocabularies among each other via the OAI-PMH protocol. This caters for a reliable redundant system, in which multiple instances provide identical synchronized data, with the organizations behind individual instances assuming the primary responsibility for individual vocabularies based on their specialization or field of expertise.
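The vocabularies are exchanged in SKOS format; in SKOS terms, the harmonization of, e.g., organization names amounts to one \code{skos:Concept} per organization, carrying the preferred label alongside the known spelling variants (a hypothetical entry; the concept URI and labels are invented for illustration):

\lstset{language=XML}
\begin{lstlisting}
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/orgs/0001">
    <skos:prefLabel xml:lang="en">Example Institute for
      Linguistics</skos:prefLabel>
    <!-- variant labels found in the instance data -->
    <skos:altLabel>EIL</skos:altLabel>
    <skos:altLabel>Example Inst. for Linguistics</skos:altLabel>
  </skos:Concept>
</rdf:RDF>
\end{lstlisting}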
Currently, the Meertens Institute\furl{http://meertens.knaw.nl/} of the Dutch Royal Academy of Sciences (KNAW), the Netherlands Institute for Sound and Vision\furl{http://www.beeldengeluid.nl/}, as well as the Austrian Centre for Digital Humanities at the Austrian Academy of Sciences are running an instance of the OpenSKOS system.

As the work on this vocabulary repository started in the context of a cultural heritage programme, it originally served vocabularies not directly relevant for the LRT-community, such as \concept{GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven} or \concept{AAT - Art \& Architecture Thesaurus}\furl{http://openskos.org/api/collections}. Within CLAVAS, a number of vocabularies relevant for the CLARIN and LRT-community were identified that will be gradually integrated into the vocabulary repository. (See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies.) The following vocabularies have already been integrated into the \xne{CLAVAS} instance of OpenSKOS:
\begin{itemize}
\item the list of language codes \cite{ISO639}
\item organization names for the domain of language resources
\item a number of data categories from ISOcat (see \ref{sec:export-dcr} for details of the process)
\end{itemize}

… …

\label{sec:export-dcr}
Based on the premise that the data in the DCR also represents a kind of controlled vocabulary, there is an effort to export data categories in SKOS format and import them into the Vocabulary Service.

Note that there are two interaction paths between ISOcat and the Vocabulary Service. The first, importing certain data categories from ISOcat into the Vocabulary Service, is described in this section. The second aspect (described in the next section, \ref{interaction-dcr-skos}) is that the value domains of certain data categories are defined by reference to a vocabulary maintained in the Vocabulary Service.

The fact that data categories are basically definitions of concepts may mislead to a na\"{i}ve approach to mapping DCR data to SKOS, namely mapping every data category to a \code{skos:Concept}, all of them belonging to the \code{ISOcat:ConceptScheme}. However, the data in ISOcat as a whole is too disparate in scope for such a vocabulary to be useful.

A more sensible approach is to export only closed DCs (with explicitly defined value domain, cf. \ref{def:DCR}) as separate \code{skos:ConceptSchemes} and their respective simple DCs as \code{skos:Concepts} within that scheme.

\begin{quotation}
The rationale is that if we see a vocabulary as a set of possible values for a
field/element/attribute, complex DCs in ISOcat are the users of such
vocabularies and simple DCs the DCR equivalence of values in such a
vocabulary. \cite{Menzo2013mail}
\end{quotation}

\begin{comment}
Still there are some closed DCs which might be good vocabulary
providers, e.g., /linguistic subject/ (DC-2527/), and still also need to
stay in ISOcat. I think at some point we should create a smaller set of

… …

then 20, 50 or 100 values are exported.

However, it still needs to be assessed how useful this approach is. In the metadata profile
there are many closed DCs with small value domains. How useful are those
in CLAVAS?

… …

\end{figure*}

Another aspect is that a simple DC can be in the value domains of multiple closed DCs.
Also, a \code{skos:Concept} can belong to multiple \code{skos:ConceptSchemes}\furl{http://www.w3.org/TR/skos-primer/\#secscheme}.
So there could be a 1:1 mapping of [complex closed DCs] to [skos:ConceptSchemes] and [simple DCs] to [skos:Concepts].

… …

Alternatively, for each value domain a SKOS concept scheme with SKOS concepts can be created,
i.e., a SKOS concept always belongs to one concept scheme, but multiple SKOS concepts refer to the same simple DC using \code{<dcr:datcat/>} (and \code{<dcterms:source/>}).
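The latter alternative can be sketched as follows: a closed complex DC becomes a \code{skos:ConceptScheme} and each value of its domain a \code{skos:Concept} in that scheme, pointing back to the corresponding simple DC via \code{dcr:datcat} (a hypothetical fragment; all URIs and identifiers are invented for illustration, and the \code{dcr} namespace URI is an assumption):

\lstset{language=XML}
\begin{lstlisting}
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:dcr="http://www.isocat.org/ns/dcr">
  <!-- the closed complex DC becomes a concept scheme -->
  <skos:ConceptScheme rdf:about="http://example.org/scheme/DC-0010"/>
  <!-- each value of its domain becomes a concept in that scheme,
       pointing back to the simple DC -->
  <skos:Concept rdf:about="http://example.org/scheme/DC-0010#written">
    <skos:inScheme rdf:resource="http://example.org/scheme/DC-0010"/>
    <skos:prefLabel>written</skos:prefLabel>
    <dcr:datcat rdf:resource="http://www.isocat.org/datcat/DC-0011"/>
  </skos:Concept>
</rdf:RDF>
\end{lstlisting}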
This is how the export for CLAVAS currently works.\furl{http://www.isocat.org/rest/profile/5.clavas}\furl{https://trac.clarin.eu/browser/cats/ISOcat/trunk/mod-ISOcat-interface-rest/representations/dcs2/clavas.xsl}


\subsection{Linking to Vocabularies in Data Categories and Schemas -- Interaction between ISOcat, CLAVAS and Client Applications}
\label{interaction-dcr-skos}

In the following, we elaborate on the possible ways to model references to vocabularies in data category specifications and to convey that information to the client application. As of this writing, this is work in progress with some design decisions yet to be made.\footnote{Large parts of this subsection come from email correspondence with M. Windhouwer in spring 2013. \cite{Menzo2013mail}}

Providing vocabularies for constrained but large and complex conceptual domains is the main motivation for the vocabulary repository:

\begin{quotation}
Originally, the vocabulary repository has been conceived to manage rather large and complex value domains that do not fit easily in the DCR data model.
Where the value domains are big (ISO 639-3) or can only be
partially enumerated (organization names), ISOcat can't/shouldn't contain
the value domains but just refer to CLAVAS, i.e., ISOcat wouldn't be a
provider. \cite{Menzo2013mail}
\end{quotation}

… …

\end{lstlisting}

A proposal by Windhouwer \cite{Menzo2013mail} for integration with CLAVAS foresees the following extension:

\begin{lstlisting}

… …

\begin{quotation}
\code{@href} points to the vocabulary. Actually a PID should be used in the context
of ISOcat, but it is not clear how persistent the vocabularies are. This may pose a problem, as part of the DC specification may now have a different persistency than the core.

\code{@type} could be \code{closed} or \code{open}. \code{closed}: only values in the vocabulary are
allowed.
\end{quotation}

This yields a definition of the value domain for the data category, where the new rule pointing to the vocabulary is \emph{added} (cf. listing \ref{lst:dcif-conceptualDomain}), so that -- once the information from the DC specification gets into the schema -- tools that don't support vocabulary lookup but are capable of XSD/RNG validation can still use the regular expression based definition.

\lstset{language=XML}
\begin{lstlisting}[label=lst:dcif-conceptualDomain, caption=Definition of the conceptualDomain for the data category \concept{languageID} employing the proposed extension for pointing to a vocabulary]
<dcif:conceptualDomain type="constrained">
  <dcif:dataType>string</dcif:dataType>

… …

\end{figure*}

It is important to emphasize that anything stated in the DC specification is not binding (even if the DC is of type \var{closed}), but rather a non-normative hint or recommendation. The authoritative source is the schema.
A schema modeller binding an element in a schema to a data category can still decide to impose other restrictions on the value domain of that element than the ones suggested in the DC specification. This applies equally to the proposed vocabulary reference mechanism: the author of the data category suggests a vocabulary to be used for values of the given data category, but the metadata modeller decides if and how this vocabulary will be integrated into the modelled schema.

There are basically two options for how the vocabulary can be integrated into the schema.
One approach is to explicitly enumerate all the values from the vocabulary.
Within CMD this has been done in the component for language codes\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:c_1271859438110}. This method allows strict validation of the given metadata field; however, there is clearly a limit to this approach in terms of a) the size of the vocabulary\footnote{e.g. \xne{ISO-639} contains 7.679 items (language codes), adding some 2MB to each schema referencing it}, b) completeness -- most of the vocabularies cannot be seen as closed, i.e.
they represent only a partial enumeration, just providing a recommended label for an entity,
and c) stability or change rate -- even the supposedly fixed list of language codes \xne{ISO-639-*} undergoes regular changes -- it is updated semi-annually, with entries being added, deleted, merged and split.\furl{http://www-01.sil.org/iso639-3/changes.asp}

The other ``soft'' alternative is to convey the information about the data category and vocabulary in the schema as an annotation, either in an \code{<xs:app-info>} element or by some attribute in a dedicated namespace. This method is already being employed in the Component Registry, indicating the data category of a generated element with the \code{@dcr:datcat} attribute.

Once the data category and vocabulary reference end up in the specification of the CMD profile and the derived XSD, the information can finally be used by client applications (like metadata editors)\footnote{Note though that this is not a standard mechanism but rather a convention. The client application must implement it in order to be able to make use of it.}. The tool
can use the reference to the data category to fetch explanations (semantic information) (and translations) from ISOcat, and it can access the autocomplete/search interface of the Vocabulary Service to offer the user suggestions from the recommended vocabulary (cf. figure \ref{fig:concept_linking}).

The drawback of this variant is that we gave up the validation.
This 346 346 isn't a problem if the vocabulary is of \code{@type=open}, e.g. \concept{organisation names}, but 347 it is when the value domain is closed, e.g. \concept{languageI d}. In the latter case,347 it is when the value domain is closed, e.g. \concept{languageID}. In the latter case, 348 348 the XSD generation could support both modes: a lax (smaller) version which 349 349 doesn't contain the closed vocabulary as an enumeration and leaves it to 350 the tool, and a strict version which does contain the vocabulary as an350 the tool, and a strict version, which does contain the vocabulary as an 351 351 enumeration. Probably the latter should stay the default, but the client application could 352 352 request the lax version leading to smaller and quicker XSD validation 353 353 inside the tool. 354 354 355 %However for the presumably default (and recommended) scenario, where the modeller wants to use the information from the data category, the \xne{Component Editor} could offer to take over the data type and the vocabulary reference from the linked DC specification.355 %However, for the presumably default (and recommended) scenario, where the modeller wants to use the information from the data category, the \xne{Component Editor} could offer to take over the data type and the vocabulary reference from the linked DC specification. 356 356 357 357 358 358 %%%%%%%%%%%%%%%%% 359 \section{Other aspects of the infrastructure}360 While this work concentrates solely on the metadata, it is important to acknowledge that it is only one aspect of the infrastructure and its actual purpose -- the availability of resources. To announce and describe the resources by metadata is a necessary first step. However it is of little value, if the resources themselves are not accessible. 
We want to briefly mention at least two other important aspects: content repositories for storing the resources and federated content search for searching in the resources.

\subsubsection{CLARIN Centres}
…
\end{quotation}

CLARIN imposes a number of criteria that each centre needs to fulfill to become a CLARIN Centre\furl{http://www.clarin.eu/node/3767} \cite{CE-2013-0095}.
CLARIN also maintains a central registry, the \xne{Centre Registry}\furl{https://centerregistry-clarin.esc.rzg.mpg.de/}, holding structured information about every centre and meant as the primary entry point into the CLARIN network of centres.

One core service of such centres is the content repository -- a system meant for long-term preservation and online provision of research data and resources. A number of centres have been identified that provide Depositing Services\furl{http://clarin.eu/3773}, i.e. allow third-party researchers (not just the home users) to store research data.

\begin{comment}
…
\subsubsection{Federated Content Search}

Another aspect of the availability of resources is that while metadata can be harvested and indexed locally in one repository, this is not possible with the content itself, partly due to the size of the data, but mainly due to legal obligations (licenses, copyright) restricting the access to and availability of the resources. CLARIN's answer to this problem is the task force \emph{Federated Content Search}\furl{http://www.clarin.eu/fcs} \cite{stehouwer2012fcs}, aiming at establishing an architecture that allows searching simultaneously (via an aggregator) across a number of resources hosted by different content providers via a harmonized interface adhering to a common protocol.
The agreed-upon protocol is a compatible extension of the SRU/CQL protocol, developed and endorsed by the Library of Congress as the XML- (and web-)based successor of Z39.50 \cite{Lynch1991}.

Note that in practice the line between metadata and content data is not so clear -- usually there is a need to filter by metadata even when searching in content. Therefore most content search engines also feature some kind of metadata filters. Thus it seems reasonable to harmonize the search protocol and query language for metadata and content. This proposition is further elaborated on in \ref{cql}.
…
\section{Summary}

In this chapter, we presented the individual parts of the infrastructure: next to the core registries (ISOcat Data Category Registry, Component Registry and Relation Registry) that this work directly builds upon, a number of other services and applications forming the CLARIN ecosystem were briefly introduced. Separate consideration was dedicated to the issue of controlled vocabularies, together with a related module, the Vocabulary Alignment Service (and its implementation OpenSKOS), that allows managing vocabularies and using them in client applications. Finally, a few other aspects of the infrastructure that are equally important but do not pertain to the metadata level were briefly tackled.
SMC4LRT/chapters/Introduction.tex
r3776 r4117

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Motivation / Problem Statement}

While in the Digital Libraries community a consolidation has already taken place and global federated networks of digital library repositories are set up, in the field of Language Resources and Technology the landscape is still scattered, despite a decade of standardization and integration efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types combined with project-specific needs. (Chapter \ref{ch:data} analyses the disparity in the data domain.)

This situation has been identified by the community and numerous standardization initiatives have been undertaken. The process has gained new momentum thanks to large framework programmes introduced by the European Commission aimed at fostering the development of common large-scale international research infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}).
The main objective of this initiative is to make language resources and technologies (LRT) more easily available to scholars by means of a common harmonized architecture. One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (CMDI, cf. \ref{def:CMDI}) -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent, harmonized way.

This work discusses one module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} (SMC) -- dedicated to overcoming, or at least easing, the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, without the reductionist approach of imposing one common description schema for all resources.

\section{Main Goal}
…
Semantic interoperability has been one of the main concerns addressed by the CMDI, and appropriate provisions were woven into the underlying meta-model as well as all the modules of the infrastructure. The task of the crosswalk service -- the primary part of the SMC module -- is to collect the relevant information maintained in the registries of the infrastructure and process it to generate mappings, i.e. \emph{crosswalks} between fields in heterogeneous metadata schemas, that can serve as a basis for concept-based search.

Thus, the goal is not primarily to define new crosswalks but rather to develop a service serving existing ones.
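The idea of serving existing crosswalks can be made concrete with a minimal Python sketch. All concept identifiers and field paths below are invented for illustration -- they are not actual registry entries or CMDI paths:

```python
# Hypothetical crosswalk table: a concept identifier maps to the fields
# of different schemas that were bound to it (its "concept cluster").
# All identifiers and paths are invented for illustration.
CROSSWALK = {
    "datcat:title": [
        "dublincore:title",
        "OLAC-DcmiTerms.title",
        "teiHeader.fileDesc.titleStmt.title",
    ],
    "datcat:organisation": [
        "imdi-session.Project.Contact.Organisation",
    ],
}

def expand_field(field):
    """Return all fields of the concept cluster the given field belongs to;
    an unknown field cannot be expanded and is returned as-is."""
    for cluster in CROSSWALK.values():
        if field in cluster:
            return list(cluster)
    return [field]

# a query against dublincore:title is expanded to all equivalent fields
print(expand_field("dublincore:title"))
```

A real crosswalk service would of course derive these clusters dynamically from the data category references recorded in the registries rather than from a static table.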
\subsubsection*{Concept-based query expansion}
…
\paragraph{Example}
Confronted with a user query searching in the notorious \concept{dublincore:title}, the query has to be \emph{expanded} to all the semantically near fields (the \emph{concept cluster}) that are, however, labelled (or even structured) differently in other schemas, like:

\begin{quote}
…
\end{quote}

The expansion cannot be solved by simple string matching, as there are other fields labelled with the same (sub)strings but with different semantics, which shouldn't be considered:

\begin{quote}
…
\subsubsection*{Semantic interpretation}

The problem of different labels for semantically similar or even identical entities is even more virulent on the level of individual values in the fields of the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities. (As the evidence in the metadata records collected within CMDI shows, some organizations are referred to by more than 20 different labels.) Thus, one goal of this work is to propose a mechanism to map (string) values in selected fields to entities defined in corresponding vocabularies.

\subsubsection*{Ontology-driven data exploration}
…
\section{Method}
We start with examining the existing data and with the description of the existing infrastructure in which this work is embedded.

Building on this groundwork, in accordance with the first subgoal, we lay out the design of the service for handling crosswalks and concept-based query expansion. We describe the workflow, the central methods and the role of the module relative to other parts of the infrastructure.
…
Once the dataset is expressed in RDF, it can be exposed via a semantic web application and published as another nucleus of \emph{Linked Open Data} in the global \emph{Web Of Data}.

A separate evaluation of the usability of the proposed semantic search solution is indicated, examining the user interaction with, and the display of, the relevant additional information in the user search interface; however, this issue can only be tackled marginally here and will have to be deferred to future work.

\section{Expected Results}
…
\end{description}

\section{Structure of the Work}
The work starts with examining the state of the art in the two fields of language resources and technology and semantic web technologies in chapter \ref{ch:lit}. In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components of the infrastructure underlying this work.
…
The results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.

The auxiliary material accompanying the work is found in the appendix. After the administrative chapter \ref{ch:def} explaining the abbreviations and formatting conventions used throughout this work, full specifications of the used data models (\ref{ch:data-model-ref}) and data samples (\ref{ch:cmd-sample}) are listed for reference, as well as the developer and user documentation for the technical solution of this work, the SMC module (\ref{ch:smc-docs}).
SMC4LRT/chapters/Literature.tex
r3776 r4117

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

In this chapter, we give a short overview of the development of large research infrastructures (with a focus on those for language resources and technology), then examine in more detail the host of work (methods and systems) on schema/ontology matching, and review Semantic Web principles and technologies.
…
\xne{FLaReNet}\furl{http://www.flarenet.eu/} -- Fostering Language Resources Network -- running from 2007 to 2010, concentrated rather on ``community and consensus building'', developing a common vision and mapping the field of LRT via a survey.

\xne{CLARIN}\furl{http://clarin.eu} -- Common Language Resources and Technology Infrastructure -- a large research infrastructure providing sustainable access for scholars in the humanities and social sciences to digital language data, and especially its technical core, the Component Metadata Infrastructure (CMDI) -- a comprehensive architecture for harmonized handling of metadata \cite{Broeder2011} -- are the primary context of this work; therefore the description of this underlying infrastructure is detailed in the separate chapter \ref{ch:infra}.
Both above-mentioned projects can be seen as predecessors to CLARIN, the IMDI metadata model being one starting point for the development of CMDI.
…
More of a sister project is the initiative \xne{DARIAH} -- Digital Research Infrastructure for the Arts and Humanities\furl{http://dariah.eu}. It has a broader scope, but many personal ties as well as similar problems and similar solutions as CLARIN. Therefore there are efforts to intensify the cooperation between these two research infrastructures for the digital humanities.

\xne{META-SHARE} is another multinational project aiming to build an infrastructure for language resources \cite{Piperidis2012meta}, however focusing more on the Human Language Technologies domain.\furl{http://meta-share.eu}

\begin{quotation}
\noindent
META-NET is designing and implementing META-SHARE, a sustainable network of repositories of language data, tools and related web services documented with high-quality metadata, aggregated in central inventories allowing for uniform search and access to resources. Data and tools can be both open and with restricted access rights, free and for-a-fee.
\end{quotation}

See \ref{def:META-SHARE} for more details about META-SHARE's catalogue and metadata format.
…
In a broader view we should also regard the activities in the domain of libraries and information sciences (LIS).
Starting already in the 1970s with connecting, exchanging and harmonizing their bibliographic catalogues, libraries were the early adopters of and driving force behind search federation even before the era of the internet (e.g. the \xne{Linked Systems Project} \cite{Fenly1988}); the LIS community certainly has a long tradition, a wealth of experience and robust solutions with respect to metadata aggregation, harmonization and exploitation.
%, starting collaborative efforts in mid 70s
…
The biggest one is \xne{Worldcat}\furl{http://www.worldcat.org/} (totalling 273.7 million records \cite{OCLCAnnualReport2012}), powered by OCLC, a cooperative of over 72.000 libraries worldwide.

In Europe, multiple recent initiatives have pursued similar goals of pooling together the immense wealth of information sheltered in the many libraries:
\xne{The European Library}\furl{http://www.theeuropeanlibrary.org/tel4/} offers a search interface over more than 18 million digital items and almost 120 million bibliographic records from 48 National Libraries and leading European Research Libraries.
…
Most recently, with \xne{Europeana Cloud}\furl{http://pro.europeana.eu/web/europeana-cloud} (2013 to 2015), another initiative in the realm of \xne{Europeana} has been started: a Best Practice Network, coordinated by The European Library, designed to ``establish a cloud-based system for Europeana and its aggregators, providing new content, new metadata, a new linked storage system, new tools and services for researchers and a new platform -- Europeana Research''.

The related catalogues and formats are described in section \ref{sec:lib-formats}.

\section{Existing Crosswalks (Services)}

Crosswalks as lists of equivalent fields from two schemas have been around for a long time, in the world of enterprise systems, e.g. to bridge to legacy systems, as well as in the LIS domain. \cite{Day2002crosswalks} lists a number of mappings between metadata formats, mostly between the Dublin Core and MARC families of formats.\footnote{\url{http://loc.gov/marc/marc2dc.html}, \url{http://www.loc.gov/marc/dccross.html}}

However, besides being restricted in terms of covered formats, these crosswalks are just static correspondence lists, often available only as documents. One effort that comes nearer to our idea of a service delivering crosswalks dynamically is the \xne{Metadata Crosswalk Service}\footnote{\url{http://www.oclc.org/developer/services/metadata-crosswalk-service}, \url{http://www.oclc.org/research/activities/xwalk.html} (SOAP based)} offered by OCLC as part of its \xne{Metadata Schema Transformation Services}\furl{http://www.oclc.org/research/activities/schematrans.html?urlm=160118},

\begin{quotation}
…
\end{quotation}

Although the website states ``Crosswalk Web Service is now a production system that has been incorporated into OCLC products and services'', the demo service\furl{http://errol.oclc.org/schemaTrans.oclc.org.search} is not accessible. Also, this service only offers crosswalks between formats relevant for the LIS community: \xne{Dublin Core, MARCXML, MARC-2709, MODS}. So, altogether the service does not seem suitable to be used as is for the purposes of this work. But it certainly can serve as inspiration for the specification of the planned service.

\begin{comment}
…
\label{lit:schema-matching}

As Shvaiko \cite{shvaiko2012ontology} states, ``\emph{Ontology matching} is a solution to the semantic heterogeneity problem.
It finds correspondences between semantically related entities of ontologies.'' As such, it provides a very suitable methodical foundation for the problem at hand -- the \emph{semantic mapping}. (In sections \ref{sec:schema-matching-app} and \ref{sec:values2entities}, we elaborate on the possible ways to apply these methods to the described problem.)

There is a plethora of work on methods and technology in the field of \emph{schema and ontology matching}, as witnessed by a sizable number of publications providing overviews, surveys and classifications of existing work \cite{Kalfoglou2003, Shvaiko2008, Noy2005_ontologyalignment, Noy2004_semanticintegration, Shvaiko2005_classification} and, most recently, \cite{shvaiko2012ontology, amrouch2012survey}.

Shvaiko and Euzenat also run the web page \url{http://www.ontologymatching.org/} dedicated to this topic and the related OAEI\footnote{Ontology Alignment Evaluation Initiative -- \url{http://oaei.ontologymatching.org/}}, an ongoing effort to evaluate alignment tools based on various alignment tasks from different domains.

Interestingly, \cite{shvaiko2012ontology} somewhat self-critically asks if, after years of research, ``the field of ontology matching [is] still making progress?''.

\subsubsection{Method}
…
\cite{EhrigSure2004} and \cite{amrouch2012survey} instead introduce \var{ontology mapping} when applying the task to individual entities, in the meaning of a function that ``for each concept (node) in ontology A [tries to] find a corresponding concept (node), which has the same or similar semantics, in ontology B and vice versa''. In the meaning of a result, it is a ``formal expression describing a semantic relationship between two (or more) concepts belonging to two (or more) different ontologies''.

\cite{EhrigSure2004} further specify the mapping function as based on a similarity function that, for a pair of entities from two (or more) ontologies, computes a ratio indicating the semantic proximity of the two entities.

\begin{defcap}[!ht]
…
\cite{Algergawy2010} classifies, reviews, and experimentally compares major methods of element similarity measures and their combinations. \cite{shvaiko2012ontology}, comparing a number of recent systems, finds that ``semantic and extensional methods are still rarely employed. In fact, most of the approaches are quite often based only on terminological and structural methods.''

\cite{Ehrig2006} employs this \var{similarity} function over single entities to derive the notion of \var{ontology similarity} as ``based on similarity of pairs of single entities from the different ontologies''. This is operationalized as some kind of aggregating function \cite{ehrig2004qom} that combines all similarity measures (mostly modulated by custom weighting) computed for pairs of single entities into one value (from the \var{[0,1]} range) expressing the similarity ratio of the two ontologies being compared. (The employment of weights makes it possible to apply machine learning approaches for the optimization of the results.)

Thus, \var{ontology similarity} is a much weaker assertion than \var{ontology alignment}; in fact, the computed similarity is interpreted to assert ontology alignment: an aggregated similarity above a defined threshold indicates an alignment.
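This two-level aggregation -- a weighted combination of the measures per entity pair, then aggregation over all pairs -- can be illustrated with a minimal Python sketch; the measure names, values and weights are invented for illustration:

```python
# Minimal sketch of similarity aggregation; all measures, values and
# weights below are invented for illustration.

def combine_measures(measures, weights):
    """Weighted average of the similarity measures (each in [0,1])
    computed for one pair of entities."""
    total = sum(weights[name] for name in measures)
    return sum(weights[name] * value for name, value in measures.items()) / total

def ontology_similarity(pairs, weights):
    """Aggregate the per-pair values into one similarity ratio in [0,1]
    for the two ontologies -- here simply the mean over all pairs."""
    combined = [combine_measures(measures, weights) for measures in pairs]
    return sum(combined) / len(combined)

weights = {"label": 2.0, "structure": 1.0}
pairs = [
    {"label": 0.9, "structure": 0.6},  # e.g. entities with near-identical labels
    {"label": 0.5, "structure": 0.4},  # e.g. a weaker candidate pair
]

similarity = ontology_similarity(pairs, weights)
# a value above a chosen threshold is interpreted as indicating an alignment
print(round(similarity, 3))  # → 0.633
```

The threshold is what turns the graded similarity into a binary alignment assertion; tuning the weights is where the mentioned machine learning approaches come into play.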
… … 149 148 \end{enumerate} 150 149 151 In contrast, \cite{jimenez2012large} in their system \xne{LogMap2} reduce the process into just two steps: computation of mapping candidates (maximise recall) and assessment of the candidates (maximize precision) , that howevercorrespond to the steps 2 and 3 of the above procedure and in fact the other steps are implicitly present in the described system.150 In contrast, \cite{jimenez2012large} in their system \xne{LogMap2} reduce the process into just two steps: computation of mapping candidates (maximise recall) and assessment of the candidates (maximize precision) that however, correspond to the steps 2 and 3 of the above procedure and in fact the other steps are implicitly present in the described system. 152 151 153 152 … … 155 154 A number of existing systems for schema/ontology matching/alignment is collected in the above-mentioned overview publications: 156 155 157 \xne{IF-Map} \cite{kalfoglou2003if}, \xne{QOM} \cite{ehrig2004qom}, \xne{FOAM} \cite{EhrigSure2005}, \xne{Similarity Flooding (SF)} \cite{melnik }, \xne{S-Match} \cite{Giunchiglia2007_semanticmatching}, the \xne{Prompt} tools \cite{Noy2003_theprompt} integrating with \xne{Protégé} or \xne{COMA++} \cite{Aumueller2005}, \xne{Chimaera}. 
Additionally, \cite{shvaiko2012ontology} lists and evaluates some more recent contributions: \xne{SAMBO, Falcon, RiMOM, ASMOV, Anchor-Flood, AgreementMaker}.

All of the tools use multiple methods as described in the previous section, exploiting both element features as well as structural features and applying some kind of composition or aggregation of the computed atomic measures to arrive at an alignment assertion.

Next to OWL as the input format supported by all the systems, some also accept XML Schemas (\xne{COMA++, SF, Cupid, SMatch}), …

\section{Semantic Web -- Linked Open Data}

The Linked Data paradigm \cite{TimBL2006} for publishing data on the web is increasingly being taken up by data providers across many disciplines \cite{bizer2009linked}.
\cite{HeathBizer2011} gives a comprehensive overview of the principles of Linked Data with practical examples and current applications.

\subsubsection{Semantic Web - Technical solutions / Server applications}
\label{semweb-tech}

The provision of the produced semantic resources on the web requires technical solutions to store the RDF triples, query them efficiently via SPARQL \cite{SPARQL2008} and \textit{idealiter} expose them via a web interface to the users.

Meanwhile a number of RDF triple store solutions relying on native, DBMS-backed or hybrid persistence layers are available, open-source solutions like \xne{Jena, Sesame} or \xne{BigData} as well as commercial solutions \xne{AllegroGraph, OWLIM, Virtuoso}.

A qualitative and quantitative study \cite{Haslhofer2011europeana} in the context of Europeana evaluated a number of RDF stores (using the whole Europeana EDM data set = 382,629,063 triples as data load) and came to the conclusion that ``certain RDF stores, notably OpenLink Virtuoso and 4Store'' can handle the large test dataset.

\xne{OpenLink Virtuoso Universal Server}\furl{http://virtuoso.openlinksw.com} is a hybrid storage solution for a range of data models, including relational data, RDF and XML, and free text documents \cite{Erling2009Virtuoso, Haslhofer2011europeana}.
Virtuoso is used to host many important Linked Data sets, e.g. DBpedia\furl{http://dbpedia.org} \cite{auer2007dbpedia}.
Virtuoso is offered under both commercial and open-source license models.

Another solution worth examining is the \xne{Linked Media Framework}\furl{http://code.google.com/p/lmf/} -- an ``easy-to-setup server application that bundles together three Apache open source projects to offer some advanced services for linked media management'': publishing legacy data as linked data, semantic search by enriching data with content from the Linked Data Cloud, and using a SKOS thesaurus for information extraction.

There exists also a sizable number of stand-alone solutions (\xne{Ontorama, FOAFnaut, IsaViz, GKB-Editor} and more) though often bound to a specific dataset or data type (\xne{Wordnet, FOAF, Cyc}).

There is also plenty of general graph visualization tools that can be adopted for viewing the RDF data as a graph, like the traditional graph layouting tool \xne{GraphViz dot}, or more recently \xne{Gephi} \cite{Bastian2009gephi}, a stand-alone interactive tool for graph visualization with a number of layouting algorithms and display options.
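Returning to the query layer mentioned above: a SPARQL SELECT is typically submitted to such a store over plain HTTP, with the query passed as a URL parameter (per the SPARQL Protocol). A minimal sketch -- the endpoint and query serve only as an illustration (DBpedia's public endpoint); no request is actually sent:

```python
from urllib.parse import urlencode

# Illustrative SPARQL query: fetch a few labels of one resource.
# Endpoint and query are examples only, not part of the infrastructure
# described in the text.
query = """
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Vienna> rdfs:label ?label .
} LIMIT 5
"""

endpoint = "http://dbpedia.org/sparql"
# Most endpoints accept the query (and result format) as URL parameters.
request_url = endpoint + "?" + urlencode(
    {"query": query, "format": "application/sparql-results+json"})
print(request_url)
```

Fetching `request_url` with any HTTP client would return the variable bindings as JSON, as specified by the SPARQL Protocol.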
A rather recent generic visualization javascript library \xne{d3}\footnote{\url{http://d3js.org}} % \cite{bostock2011d3}
seems especially appealing thanks to its data-driven paradigm, dedicated support for graphs with an integrated customizable graph layouting algorithm and -- being pure javascript -- allowing web-based solutions.

%Most recently a web-based version of this versatile tool has been released\furl{http://protegewiki.stanford.edu/wiki/WebProtege} that supports collaborative ontology development

The solutions are rather sparse when it comes to more advanced visualizations, beyond the simple one-to-one display of the data model graph as a visual graph, especially the visualization of ontology mapping and alignment. Besides \xne{OLA} \cite{euzenat2004ola}, \xne{PromptViz} \cite{Noy2003_theprompt} and \xne{CogZ} \cite{falconer2009cogz} we would like to point out one solution developed at the IFS of the Technical University in Vienna \cite{lanzenberger2006alviz}, \xne{AlViz}, a tool that visually supports semi-automatic alignment of ontologies. It is implemented as a ``multiple-view plug-in for Protege using J-Trees and Graphs.
Based on similarity measures of an ontology matching algorithm AlViz helps to assess and optimize the alignment results.'' It applies visual clues like colouring to indicate the computed similarity of concepts between two ontologies and clustering for reducing the complexity of the displayed datasets (cf. figure \ref{fig:alviz}). Unfortunately, the development of this very promising research prototype seems to have stalled, the last available version being from 2009.
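Data-driven tools of this kind typically consume the graph as a plain node-link structure. As a small sketch (the triples are invented for illustration), RDF-style triples can be converted into the nodes/links form that, for instance, d3's force layout accepts:

```python
import json

# Toy triples (subject, predicate, object) -- illustrative only.
triples = [
    ("ex:Profile1", "ex:contains", "ex:ComponentA"),
    ("ex:ComponentA", "ex:refersTo", "ex:DC-2484"),
    ("ex:Profile2", "ex:contains", "ex:ComponentA"),
]

# Every distinct resource becomes a node, every triple a labelled link.
nodes = sorted({r for s, _, o in triples for r in (s, o)})
index = {n: i for i, n in enumerate(nodes)}
graph = {
    "nodes": [{"id": n} for n in nodes],
    "links": [{"source": index[s], "target": index[o], "label": p}
              for s, p, o in triples],
}
print(json.dumps(graph, indent=2))
```

The resulting JSON can be handed directly to a force-directed layout in the browser; shared resources (here \xne{ex:ComponentA}) naturally show up as nodes with multiple incident links.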
\begin{figure*}
…

\subsubsection{Linguistic ontologies}

One prominent instance of a linguistic ontology is the \xne{General Ontology for Linguistic Description} or GOLD \cite{Farrar2003}\furl{http://linguistics-ontology.org},
which ``gives a formalized account of the most basic categories and relations (the `atoms') used in the scientific description of human language, attempting to codify the general knowledge of the field''. The motivation is to ``facilitate automated reasoning over linguistic data and help establish the basic concepts, through which intelligent search can be carried out''.

In line with the aspiration ``to be compatible with the general goals of the Semantic Web'', the dataset is provided via a web application as well as a dump in OWL format\furl{http://linguistics-ontology.org/gold-2010.owl} \cite{GOLD2010}.

Founded in 1934, SIL International\furl{http://www.sil.org/about-sil} (originally known as the Summer Institute of Linguistics, Inc.) is a leader in the identification and documentation of the world's languages. Results of this research are published in Ethnologue: Languages of the World\furl{http://www.ethnologue.com/} \cite{grimes2000ethnologue}, a comprehensive catalogue of the world's nearly 7,000 living languages.
SIL also maintains the Language \& Culture Archives, a large collection of all kinds of resources in the ethnolinguistic domain\furl{http://www.sil.org/resources/language-culture-archives}.

The World Atlas of Language Structures (WALS)\furl{http://WALS.info} \cite{wals2011}
is ``a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars)''. First published in 2005, the current online version, published in 2011, provides a compendium of detailed expert definitions of individual linguistic features, accompanied by a sophisticated web interface integrating the information on linguistic features with their occurrence in the world's languages and their geographical distribution.
Simons \cite{Simons2003developing} developed a Semantic Interpretation Language (SIL) that is used to define the meaning of the elements and attributes in an XML markup schema in terms of abstract concepts defined in a formal semantic schema.
Extending on this work, Simons et al. \cite{Simons2004semantics} propose a method for mapping linguistic descriptions in plain XML into semantically rich RDF/OWL, employing the GOLD ontology as the target semantic schema.

These ontologies can be used by (``ontologized'') lexicons, which refer to them to describe linguistic properties of the Lexical Entries, as opposed to linking to Domain Ontologies to anchor Senses/Meanings.

Work on the Semantic Interpretation Language as well as the GOLD ontology can be seen as a conceptual predecessor of the Data Category Registry, an ISO-standardized procedure for defining and standardizing ``widely accepted linguistic concepts'' that is at the core of CLARIN's metadata infrastructure (cf. \ref{def:DCR}).
Although not exactly an ontology in the common sense --
given that this registry (by design) does not contain any relations between concepts --
the central entities are concepts and not lexical items; thus it can be seen as a semantic resource.
Another indication of this heritage is the fact that concepts of the GOLD ontology were migrated into ISOcat (495 items) in 2010.

Notice that although this work is concerned with language resources, it is primarily on the metadata level, thus the overlap with linguistic ontologies codifying the discipline-specific linguistic terminology is rather marginal (perhaps on the level of description of specific linguistic aspects of given resources).

\subsubsection{Lexicalised ontologies, ``ontologized'' lexicons}

The other type of relation between ontologies and linguistics or language are lexicalised ontologies. Hirst \cite{Hirst2009} elaborates on the differences between ontology and lexicon and the possibility to reuse lexicons for the development of ontologies.

In a number of works Buitelaar, McCrae et al. \cite{Buitelaar2009, buitelaar2010ontology, McCrae2010c, buitelaar2011ontology, Mccrae2012interchanging} argue for ``associating linguistic information with ontologies'' or ``ontology lexicalisation'' and draw attention to lexical and linguistic issues in knowledge representation in general. This basic idea lies behind the series of proposed models \xne{LingInfo}, \xne{LexOnto}, \xne{LexInfo} and, most recently, \xne{lemon}, aimed at allowing complex lexical information for such ontologies and at describing the relationship between the lexicon and the ontology.
The most recent in this line, \xne{lemon} or \xne{lexicon model for ontologies}, defines ``a formal model for the proper representation of the continuum between: i) ontology semantics; ii) terminology that is used to convey this in natural language; and iii) linguistic information on these terms and their constituent lexical units''.
In essence, \xne{lemon} enables the creation of a lexicon for a given ontology, adopting the principle of ``semantics by reference''. No complex semantic information needs to be stated in the lexicon, ensuring (or at least fostering) a clear separation of the lexical layer and the ontological layer.
Lemon builds on existing work, next to the LexInfo and LIR ontology-lexicon models, and in particular on global standards: the W3C standard SKOS (Simple Knowledge Organization System) \cite{SKOS2009} and the ISO standards Lexical Markup Framework (ISO 24613:2008 \cite{ISO24613:2008}) and Specification of Data Categories, Data Category Registry (ISO 12620:2009 \cite{ISO12620:2009}).

The Lexical Markup Framework (LMF) \cite{Francopoulo2006LMF, ISO24613:2008} defines a metamodel for representing data in lexical databases used with monolingual and multilingual computer applications. LMF also specifies an RDF serialization.

An overview of current developments in the application of the linked data paradigm to linguistic data collections was given at the workshop Linked Data in Linguistics\furl{http://ldl2012.lod2.eu/} 2012 \cite{ldl2012}.

The primary motivation for linguistic ontologies like \xne{lemon} are the tasks of ontology-based information extraction, ontology learning and population from text, where the entities are often referred to by non-nominal word forms and with ambiguous semantics. Given that the discussed collection contains mainly highly structured data referencing entities in their nominal form, linguistic ontologies are not directly relevant for this work.
\section{Summary}
This chapter concentrated on the current affairs and developments regarding the infrastructures for Language Resources and Technology and, on the other hand, gave an overview of the state of the art regarding methods to be applied in this work: Semantic Web Technologies, Ontology Mapping and Ontology Visualization.
SMC4LRT/chapters/Results.tex
In the subsequent two sections, we explore a few specific aspects of the CMD data domain -- regarding the usage of the data categories (\ref{sec:explore-datcats}) and the integration of existing formats (\ref{sec:explore-formats}). While these topics are not directly results of this work, the presented analyses are. They were made possible by the technical solution of this work, yield a valuable test case for the usefulness of the work and are an indispensable prerequisite for the necessary coordination and curation work being carried out by the CMDI community.

\section{Current Status of the Infrastructure}
Before we get to the results of this work, we briefly summarize the current state of affairs within the CLARIN infrastructure at large to help contextualize the actual results.

\subsection{CMDI -- Services}
The main services of the infrastructure have been in stable production for the last two years.
The Relation Registry is operational as an early prototype.
Three instances of \xne{OpenSKOS} are running, one of them being hosted by \xne{ACDH}.

\subsection{CMDI -- Data}
More than 130 profiles are defined. (See table \ref{table:dev_profiles} for more details about profiles.)
The official CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/} collects data from 69 providers on a daily basis.
The collection amounts to over 550.000 records in more than 60 distinct profiles.

\subsection{ACDH -- The Home of SMC}
\label{acdh}
Within CLARIN-AT a new centre has been brought to life, the Austrian Centre for Digital Humanities, with the mission to foster the digital research paradigm in the humanities.
It is designed to provide depositing and publishing services to the DH community, as well as infrastructural services that are part of the CLARIN Metadata Infrastructure. SMC is one of these services provided by this centre.
Figure \ref{fig:acdh_context} sketches the broader context of \xne{ACDH} and its different roles.

%%%%%%%%%%%%%%%%
\section{Technical Solution}
With this work we delivered a module embedded in a larger metadata infrastructure, aimed at supporting the semantic interoperability across the heterogeneous data in this infrastructure. The module consists of multiple interrelated components. The technical specification of the module can be found in chapter \ref{ch:design}. A prototypical implementation has been developed for the three main parts of the system. The code of this implementation is maintained in the central CMDI code repository\footnote{\url{http://svn.clarin.eu/SMC}}.

\\

\url{http://clarin.oeaw.ac.at/smc/}

\subsection{SMC -- Crosswalks Service}
The crosswalk service as a REST web service exposes an interface that provides mappings between search indexes as defined in \ref{sec:cx}.
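To illustrate how such a crosswalk lookup could be invoked -- note that the parameter names below are hypothetical and merely sketch the idea of mapping a search index between metadata formats; they are not the actual interface of the service:

```python
from urllib.parse import urlencode

BASE = "http://clarin.oeaw.ac.at/smc/cx"

def crosswalk_url(index, source, target):
    """Compose a hypothetical crosswalk request URL: map a search
    index from one metadata format/profile to another."""
    return BASE + "?" + urlencode({"index": index, "from": source, "to": target})

print(crosswalk_url("title", "olac", "cmdi"))
```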
This interface is available via the wrapping smc application:

\url{http://clarin.oeaw.ac.at/smc/cx}

\subsection{SMC -- as a Module within Metadata Repository}
The SMC will also be integrated as a module with the Metadata Repository, enabling \emph{semantic search} over the joint metadata domain.

\url{http://clarin.oeaw.ac.at/mdrepo/}

\subsection{SMC Browser -- Advanced Interactive User Interface}

SMC Browser is an advanced web-based visualization application to explore the complex dataset of the \xne{Component Metadata Infrastructure} by visualizing its structure as an interactive graph. In particular, it enables the metadata modeller to examine the reuse of components or DCs in different profiles. The graph is accompanied by numerical statistics about the dataset as a whole and about individual items (profiles, components, data categories), a set of example results and user documentation. Details about design and implementation can be found in \ref{smc-browser}. The publicly available instance is maintained under:

\url{http://clarin.oeaw.ac.at/smc-browser}

\begin{figure*}
…

%%%%%%%%%%%%%%%
\section{Exploring the CMD Data -- SMC Reports}
SMC reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain that were created making extensive use of the visual and numerical output from the \xne{SMC Browser}. In this section, we deliver a few examples of these analyses. A complete up-to-date listing is maintained on the SMC website:

\url{http://clarin.oeaw.ac.at/smc-browser/docs/reports.html}

\subsection{Usage of Data Categories}
\label{sec:explore-datcats}
At the core of the whole SMC (and CMDI) are the data categories as basic semantic building blocks or anchors.

…
\includegraphics[width=1\textwidth]{images/SMC-export_language_custom_v2c.pdf}
\end{center}
\caption{The four main \concept{Language} data categories and the profiles they are being used in}
\label{fig:language_datcats}
\end{figure*}

…

Again, the main DC \concept{resourceName\#DC-2544}, being used in 74 profiles, together with the semantically near \concept{resourceTitle\#DC-2545}, used in 69 profiles, offers a good coverage over the available data.

Some of the DCs referenced by \code{Name} elements are \concept{author\#DC-4115}, \concept{contact full name\#DC-2454}, \concept{dcterms:Contributor}, \concept{project name\#DC-2536}, \concept{web service name\#DC-4160} and \concept{language name\#DC-2484}.
This implies that a na\"{i}ve search in a \texttt{Name} element would match semantically very heterogeneous fields, and only applying the semantic information provided by the DCs and/or the context of the element (the enclosing components) allows to disambiguate the meaning of the values.

%\subsection{Resource type}
% \subsection{Subject, Genre, Topic}

\subsection{Integration of Existing Formats}
\label{sec:explore-formats}

…

\subsubsection{dublincore / OLAC}
\label{reports:OLAC}
A very widely used (because simple) metadata format (cf. \ref{def:OLAC}).
%\ref{info:olac-records}

Here the problem of proliferation seems especially virulent. Table \ref{table:dcterms-profiles} lists all the profiles modelling dcterms.
As all these profiles link to the corresponding dublincore data categories, this does not pose a major problem on the exploitation side; however, the cluttering of the component registry with structurally identical or almost identical profiles needs to be questioned within the community.

\begin{figure*}[!ht]
…

\begin{table}
\caption{Profiles Modelling Dublincore Terms}
\label{table:dcterms-profiles}
% \begin{tabular}{ |l | l | l | r | r | }
…
\end{table}

Additionally, there is a number of profiles with concept links to dublincore terms.
Some use all of the dublincore elements or terms as one component within a larger profile,
one example being the \xne{data} profile created by the Czech initiative LINDAT, which models the minimal obligatory set of the META-SHARE \xne{resourceInfo} schema (cf. the subsection about META-SHARE below) combined with a simple dublincore record.

\label{results:tei}
TEI is a de-facto standard for encoding any kind of textual resources. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description / metadata the complex element \code{teiHeader} is foreseen.
TEI does not provide just one fixed schema, but allows for a certain flexibility regarding the elements used and the inner structure, allowing to generate custom schemas adapted to projects' needs (\ref{def:tei}).
Thus there is also not just one fixed \xne{teiHeader}.

The widespread use of TEI for encoding textual resources brings about a strong interest of multiple research teams of the CLARIN community to integrate TEI with CMDI. There was a first attempt already in 2010, modelling the recommended \xne{teiHeader}\furl{http://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html\#HD7}, encoding \xne{fileDesc} and \xne{profileDesc} components, leaving out \xne{encodingDesc} and \xne{revisionDesc}. The leaf elements were bound to the most prominent data categories, making it a mixture of both dublincore and isocat.
Part of the project is also to integrate the data and metadata with the CLARIN infrastructure, meaning CMD records need to be generated for the resources. For this the team generated a completely new profile (as yet private) closely modelling the version of the teiHeader\furl{http://www.deutschestextarchiv.de/doku/basisformat_header} used in the project.188 Regarding the question, why another teiHeader-based profile was generated not reusing the existing one, according to a personal note by a member of the project team and author of the profile, Axel Herold \cite{Herold2013} the profile was custom made for this particular project and it seemed undesirable to create a generalised TEI header profile.189 190 \xne{Nederlab} is another large-scale project aiming processing historic Dutch newspaper articles into a platform for search and analysis, starting 2013 in Netherlands\furl{http://www.nederlab.nl}. Within this project, the metadata is also encoded in a \concept{teiHeader} and the data shall be integrated within CLARIN. Here, another set of CMD profiles was created, howeverreusing existing components.183 The large research project \xne{Deutsches Textarchiv}\furl{http://deutschestextarchiv.de/} \cite{Geyken2011deutsches}, digitizing a hoist of historical German texts from the period 1650 - 1900 also uses TEI to encode the material and consequently the teiHeader to hold the metadata information. Part of the project is also to integrate the data and metadata with the CLARIN infrastructure, meaning CMD records need to be generated for the resources. For this the team generated a completely new profile (as yet private) closely modelling the version of the teiHeader\furl{http://www.deutschestextarchiv.de/doku/basisformat_header} used in the project. 
Regarding the question why another teiHeader-based profile was generated instead of reusing the existing one: according to a personal note by a member of the project team and author of the profile, Axel Herold \cite{Herold2013}, the profile was custom-made for this particular project and it seemed undesirable to create a generalised TEI header profile.

\xne{Nederlab} is another large-scale project, started 2013 in the Netherlands\furl{http://www.nederlab.nl}, aiming at processing historic Dutch newspaper articles into a platform for search and analysis. Within this project, the metadata is also encoded in a \concept{teiHeader} and the data shall be integrated within CLARIN. Here, another set of CMD profiles was created, however reusing existing components.
As seen in figure \ref{fig:teiHeader_DBNL}, the components \xne{fileDesc} and \xne{profileDesc} were reused, while the components \xne{encodingDesc} and \xne{revisionDesc}, left out in the original profile, were added.

Another approach was applied within the context of other CLARIN-NL projects \cite{Menzo2013-05tei}.
Based on an ODD file, a data category for every element of the teiHeader (135 datcats) was generated. In a subsequent step, an enriched schema was generated that remodels the original teiHeader schema, but with the individual elements being annotated with the new data categories (\code{dcr:datcat} attribute). This schema is now maintained in SCHEMAcat (cf. \ref{ch:infra}). The next step would be to create yet another profile, but with all the components and elements in it bound to the corresponding data categories, moving the semantic linking into the relation registry, where appropriate relations could be defined between the data categories derived from TEI and the \xne{isocat} and/or \xne{dublincore} DCs.
This yields a more complex, but also more systematic and flexible setup, with a clean separation of the semantic space of TEI and the possibility to map the TEI elements (via their data categories) to multiple and/or different data categories according to the specific needs of a project or research question.
…
%In cooperation between metadata teams from CLARIN and META-SHARE
The original META-SHARE schema actually accommodates four models for different resource types. Consequently, the model has been expressed as 4 CMD profiles, one for each distinct resource type, however all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements.
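The gap between the distinct and the expanded counts comes from component reuse: a component defined once may be referenced from several places in the nesting hierarchy, so every occurrence along the expansion paths is counted. A minimal sketch of this counting, with an invented toy component registry (the names and nesting are purely illustrative, not the actual META-SHARE components):

```python
# Toy component registry: component name -> list of contained components.
# The names are invented for illustration only.
registry = {
    "corpusProfile": ["corpusInfo", "contactPerson"],
    "corpusInfo": ["sizeInfo", "contactPerson"],
    "contactPerson": [],
    "sizeInfo": [],
}

def expanded_count(component):
    """Count component occurrences in the fully expanded tree (incl. the root)."""
    return 1 + sum(expanded_count(child) for child in registry[component])

distinct = len(registry)                    # 4 distinct component definitions
expanded = expanded_count("corpusProfile")  # 5 occurrences after expansion
```

Here \code{contactPerson} is defined once but occurs twice in the expanded tree, which is exactly how 117 distinct components can expand to 419 occurrences.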
In a parallel effort, LINDAT, the Czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\concept{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE \concept{resourceInfo}, however combined with a simple dublincore record.
This way, the information gets partly duplicated, but with the advantage that the minimal information is conveyed in a widely understood format, while retaining the expressivity of the feature-rich schema.

The expression of the META-SHARE schema in CMD allows a direct comparison of the two different approaches taken in the two projects: a metamodel allowing to generate custom profiles with shared semantics vs. the more traditional way of trying to generate one schema to fit in all the information. It nicely shows the trade-off: many custom schemas with the risk of proliferation and problems with semantic interoperability, or one very large schema with the risk of overwhelming the user and still not being able to capture all specific information.
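The idea behind the LINDAT-style dual description can be sketched as a simple projection: the rich, schema-specific record is kept as-is, while a minimal Dublin Core view is derived from it so that generic consumers still understand the essentials. The field names and the mapping below are invented for illustration, not the actual profile definition:

```python
# Invented rich record, loosely inspired by META-SHARE resourceInfo fields.
rich_record = {
    "resourceName": "Example Corpus",
    "languageId": "ces",
    "licenceInfo": "CC-BY",
    "corpusTextInfo": {"size": "1M tokens"},
}

# Assumed projection of rich fields onto Dublin Core terms.
DC_MAPPING = {
    "resourceName": "dc:title",
    "languageId": "dc:language",
    "licenceInfo": "dc:rights",
}

def to_dublincore(record):
    """Derive the minimal, widely understood view; rich detail is dropped."""
    return {dc: record[field] for field, dc in DC_MAPPING.items() if field in record}

dc_view = to_dublincore(rich_record)
# Both views are then shipped together in one metadata instance,
# duplicating the minimal information as described above.
combined = {"dublincore": dc_view, "resourceInfo": rich_record}
```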
\begin{figure*}
…
\includegraphics[width=0.75\textwidth]{images/LINDAT-profile-data.png}
\end{center}
\caption{Profile by LINDAT combining the META-SHARE \xne{resourceInfo} component with dublincore elements}
\label{fig:META-SHARE-LINDAT}
\end{figure*}
…
\includegraphics[height=0.95\textheight]{images/resourceInfoBIG.png}
\end{center}
\caption{The META-SHARE based profile for describing corpora}
\label{fig:META-SHARE-BIG}
\end{figure*}

%%%%%%%%%%%%%%%%%%%%%%%
\subsection{SMC Cloud}
\label{sec:smc-cloud}
As the latest, still experimental, addition, the SMC Browser provides a special type of graph that displays only profiles. The links between them reflect the reuse of components and data categories (i.e. how many components or data categories the linked pairs of profiles share), indicating the degree of similarity or semantic proximity. Figure \ref{fig:SMC_cloud} depicts one possible output of this graph, covering a large part of the defined profiles. It nicely shows the clusters of strongly related profiles in contrast to the greater distances between more loosely connected profiles.
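The link weights of such a profile cloud can be sketched as a simple set comparison: the raw weight is the number of shared data categories, and normalizing by the union gives a similarity in $[0,1]$ (Jaccard index). This is a minimal illustration of the idea, not the SMC Browser implementation; the profile names and data category sets are invented:

```python
# Invented profiles, each reduced to the set of data categories it uses.
profiles = {
    "TextCorpusProfile": {"resourceName", "language", "size", "genre"},
    "LexicalResourceProfile": {"resourceName", "language", "entryCount"},
    "ToolProfile": {"resourceName", "inputFormat"},
}

def shared(a, b):
    """Raw link weight: number of data categories shared by two profiles."""
    return len(profiles[a] & profiles[b])

def jaccard(a, b):
    """Normalized similarity: shared categories over all categories of the pair."""
    return len(profiles[a] & profiles[b]) / len(profiles[a] | profiles[b])

# One (possibly weak) link per pair of profiles.
links = [
    (a, b, shared(a, b), round(jaccard(a, b), 2))
    for i, a in enumerate(profiles)
    for b in list(profiles)[i + 1:]
]
```

Profiles with a high similarity end up close together in the rendered cloud, which produces the clusters visible in figure \ref{fig:SMC_cloud}.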
…
\end{figure*}

\section{Summary}
In this final chapter we presented the results: on the one hand, the technical solution of the module \xne{Semantic Mapping Component}; on the other hand, we spent a good part of the chapter on commented analyses of the processed dataset, made possible by the \xne{SMC Browser}, an interactive visualization tool developed as part of this work for the exploration of the schema-level data of the discussed collection. As such, the analyses can be seen as an evaluation and a proof of concept and usefulness of the presented work.
-
SMC4LRT/chapters/abstract_de.tex
r3665 r4117
\chapter*{Kurzfassung}

Das eigentliche Ziel, der Nutzen dieser Arbeit war die \emph{Verbesserung der Suchmöglichkeiten} in einer großen heterogenen Sammlung von Metadaten.
Diese Aufgabe wurde in zwei separaten, sich ergänzenden Herangehensweisen angegangen: a) Entwurf und Entwicklung eines Dienstes (Service) zur Bereitstellung von \emph{crosswalks} (Entsprechungen zwischen Feldern in unterschiedlichen Metadaten-Formaten) auf der Basis von wohldefinierten Konzepten und die Anwendung dieser \emph{crosswalks} in Suchszenarien, um die Trefferquote zu erhöhen; b) die integrative Kraft des \emph{Linked Open Data}-Paradigmas anerkennend, die Modellierung der Domänendaten als eine \emph{Semantic Web}-Ressource, um die Nutzung von semantischen Technologien auf dem Datensatz zu ermöglichen.

Entsprechend den zwei Herangehensweisen lieferte die Arbeit auch zwei Hauptergebnisse: a) die Spezifikation eines Moduls für \emph{konzept-basierte Suche} zusammen mit dem zugrundeliegenden Dienst \emph{crosswalk service}, begleitet von einer Testimplementierung; b) die Spezifikation der Modellierung der Ausgangsdaten im RDF-Format, womit die Grundlage geschaffen ist, die Daten als \emph{Linked Open Data} bereitzustellen.

Teilweise als Nebenprodukt wurde auch die Anwendung \emph{SMC Browser} entwickelt -- ein interaktives Visualisierungswerkzeug zur Erschließung der Schema-Ebene der Datensammlung. Mit Hilfe dieses Werkzeugs konnte eine Reihe von tiefergehenden Analysen der Daten erstellt werden, die direkt von der Forschergemeinschaft zur Erschließung und Redaktion der komplexen Daten genutzt werden. Somit können die Anwendung und die Analyseberichte als ein wertvoller Beitrag für die Forschergemeinschaft angesehen werden.

Diese Arbeit ist eingebettet in eine große internationale Forschungsinfrastruktur-Initiative, die zur Aufgabe hat, einfachen, stabilen, harmonisierten Zugang zu Sprachressourcen und Technologien in Europa zu ermöglichen: die \emph{Common Language Resource and Technology Infrastructure} oder CLARIN.
Das technische Herzstück dieser Unternehmung ist die \emph{Component Metadata Infrastructure}, ein verteiltes System, das harmonisiertes, kohärentes Erstellen und Verbreiten von Metadaten für Sprachressourcen ermöglicht. Das Ergebnis dieser Arbeit, das Modul \emph{Semantic Mapping Component}, wurde als Bestandteil des Systems erdacht, um unter Ausnutzung der in die Infrastruktur eingebauten Mechanismen das Problem der semantischen Interoperabilität zu überwinden, das sich aus der Heterogenität der Metadaten-Formate ergibt.
-
SMC4LRT/chapters/abstract_en.tex
r3665 r4117
\chapter*{Abstract}

The ultimate objective of this work was to \emph{enhance search functionality} over a large heterogeneous collection of resource descriptions. This objective was pursued in two separate, complementary approaches: a) design a service delivering crosswalks (i.e. equivalences between fields in disparate metadata formats) based on well-defined concepts, and apply these concept-based crosswalks in search scenarios to enhance recall;
b) acknowledging the integrative power of the \emph{Linked Open Data} paradigm, express the domain data as a \emph{Semantic Web} resource, to enable the application of semantic technologies on the dataset.

In parallel with the two approaches, the work delivered two main results: a) the \emph{specification} of the module for \emph{concept-based search} together with the underlying \emph{crosswalk service}, accompanied by a proof-of-concept implementation; and b) the blueprint for expressing the original dataset in RDF format, effectively laying a foundation for providing this dataset as \emph{Linked Open Data}.

Partly as a by-product, the application \emph{SMC Browser} was developed -- an interactive visualization tool to explore the dataset on the schema level. This tool provided the means to generate a number of advanced analyses of the data, directly used by the community for exploration and curation of the complex dataset. As such, the tool and the reports can be considered a valuable contribution to the community.

This work is embedded in the context of a large research infrastructure initiative aimed at providing easy, stable, harmonized access to language resources and technology (LRT) in Europe, the \emph{Common Language Resource and Technology Infrastructure} or CLARIN. A core technical pillar of this initiative is the \emph{Component Metadata Infrastructure}, a distributed system for creating and providing metadata for LRT in a coherent, harmonized way. The outcome of this work, the \emph{Semantic Mapping Component}, was conceived as one module within the infrastructure dedicated to overcoming the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, by harnessing the mechanisms of the semantic layer built into the core of the infrastructure.
-
SMC4LRT/chapters/acknowledgements.tex
r3776 r4117
\chapter*{Acknowledgements}

I would like to thank all the colleagues from the institute and from the CLARIN community for the support, the fruitful discussions and the helpful feedback, especially Menzo Windhouwer, Daan Broeder, Dieter Van Uytvanck, Marc Kemps-Snijders and Hennie Brugman. \\
And all my dear ones, for the extra portion of patience I demanded from them.
\\

\hfill with love to em
-
SMC4LRT/chapters/appendix.tex
r3776 r4117
\chapter{Data model reference}
\label{ch:data-model-ref}
In the following, complete data models and schemas are listed for reference: the diagram of the data model for data category specification in figure~\ref{fig:DCR_data_model}; \xne{Terms.xsd} -- the XML schema used internally by the SMC module -- in listing~\ref{lst:terms-schema} (cf. \ref{datamodel-terms}); and \xne{general-component-schema.xsd}\furl{https://infra.clarin.eu/cmd/general-component-schema.xsd} -- the schema representing the CMD meta model for defining CMD profiles and components -- in listing~\ref{lst:cmd-schema}. Figure \ref{fig:ref_arch} depicts an abstract reference architecture that provides a conceptual frame for this work, and in figure \ref{fig:acdh_context} an overview of the roles and services of the \xne{ACDH -- Austrian Centre for Digital Humanities} -- the home of SMC -- explicates the concrete current situation regarding the architectural context of SMC.

\input{images/Terms.xsd}

\input{images/general-component-schema.xsd}

\begin{figure*}
\begin{center}
\includegraphics[width=1\textwidth]{images/EDC_components_v4.png}
\end{center}
\caption{Reference Architecture}
\label{fig:ref_arch}
\end{figure*}

\begin{figure*}[p]
\begin{center}
\includegraphics[width=1\textwidth]{images/DCR_data_model.jpg}
…
\end{figure*}

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=0.95\textheight, angle=90]{images/acdh-diagram_300dpi.png}
\end{center}
\caption{Austrian Centre for Digital Humanities - the home of SMC - in context}
…
\end{figure*}

\chapter{CMD -- Sample Data}
\label{ch:cmd-sample}
…
\input{chapters/collection_spec.xml.tex}

\section{CMD Record}
The following listing represents a sample CMD record - an instance of the \concept{collection} profile listed above.
…

\chapter{SMC -- Documentation}
\label{ch:smc-docs}

\begin{figure*}
\begin{center}
\includegraphics[width=1.1\textheight, angle=90]{images/build_init.png}
\end{center}
\caption{A graphical representation of the dependencies and calls in the main \xne{ant} build file.}
…
\end{figure*}

\section{Developer Documentation}
\label{sec:smc-xsl-docs}

A developer documentation of the code and the system is included in the source repository:

\noindent
\url{https://svn.clarin.eu/SMC/trunk/SMC/docs}

\noindent
A short introduction can be found online as part of the application:

\noindent
\url{http://clarin.oeaw.ac.at/smc/docs/devdocs.html}

\section{SMC Browser User Documentation}
\label{sec:smc-browser-userdocs}

\input{chapters/userdocs_cleaned}

\clearpage
\section{Sample SMC Graphs}
\label{sec:smc-graphs}
…
\label{fig:cmd-dep-dotgraph}
\end{figure*}

\begin{figure*}[h]
\begin{center}
\includegraphics[width=1\textwidth]{images/SMC-export_sample.png}
\end{center}
\caption{A sample output from the SMC Browser showing a number of frequently used data categories and the clusters of profiles using them.}
\label{fig:smc-sample}
\end{figure*}
-
SMC4LRT/chapters/userdocs_cleaned.tex
r3776 r4117
Explore the \DUroletitlereference{Component Metadata Framework}

In \emph{CMD}, metadata schemas are defined by profiles that are constructed out of reusable components -- collections of metadata fields. The components can contain other components, and they can be reused in multiple profiles.
Furthermore, every CMD element (metadata field) refers via a PID to a data category to indicate unambiguously how the content of the field in a metadata description should
…
SMC Browser visualizes this graph structure in an interactive fashion. You can have a look at the \href{examples.html}{examples} for inspiration.

It is implemented on top of the wonderful js-library \href{https://github.com/mbostock/d3}{d3}; the code is checked in at \href{https://svn.clarin.eu/SMC/trunk/SMC}{clarin-svn} (and needs refactoring). There is also some preliminary \href{devdocs.html}{technical documentation}.
…
The user interface is divided into 4 main parts:
%
\begin{description}
\item[{Index}] \leavevmode
Lists all available Profiles, Components, Elements and used Data Categories.
The lists can be filtered (enter a search pattern in the input box at the top of the index pane).
By clicking on individual items, they are added to the \DUroletitlereference{selected nodes} and get rendered in the graph pane.

\item[{Main (Graph)}] \leavevmode
…
The following data sets are distinguished with respect to the user interaction:
%
\begin{description}
\item[{all data}] \leavevmode
the full graph with all profiles, components, elements and data categories, and the links between them.
Currently this amounts to roughly 4,600 nodes and 7,500 links.

\item[{selected nodes}] \leavevmode
nodes explicitly selected by the user (see below how to \hyperref[select-nodes]{select nodes}).

\item[{data to show}] \leavevmode
…
The navigation pane provides the following options to control the rendering of the graph:
%
\begin{description}