Changeset 3776
Timestamp: 10/16/13 16:06:54
Location: SMC4LRT
Files: 18 edited
SMC4LRT/Outline.tex
r3681 → r3776

@@ lines 76–91 @@
 \listoffigures
-\listoftodos
-\begin{comment}
+%\listoftodos
+%\begin{comment}
 \input{chapters/Introduction}
…
-\input{chapters/Definitions}
-\end{comment}
+
+%\end{comment}
 \input{chapters/Data}

-\begin{comment}
+%\begin{comment}

 \input{chapters/Infrastructure}
…
@@ lines 99–113 @@
 \input{chapters/Conclusion}

-\end{comment}
+%\end{comment}
…
 \appendix

-%\input{chapters/appendix}
+\input{chapters/Definitions}
+\input{chapters/appendix}
SMC4LRT/chapters/Conclusion.tex
r3665 → r3776

@@ lines 8–18 @@
 % Dynamic integration of the information from the Relation Registry into the search interface and search processing.

-A whole separate track is the effort to deliver the CMD data as \emph{Linked Open Data}, for which only the groundwork has been done by specifying the modelling of the data in RDF. Further steps are: setup of a processing workflow to apply the specified model and transform all the data (profiles and instances) into RDF, a server solution to host the data and allow querying it and finally, on top of it offer a web interface for the users to explore the dataset.
+A whole separate track is the effort to deliver the CMD data as \emph{Linked Open Data}, for which only the groundwork has been done by specifying the modelling of the data in RDF. Further steps are: setup of a processing workflow to apply the specified model and to transform all the data (profiles and instances) into RDF, a server solution to host the data and to allow querying it and, eventually, a web interface for the users to explore the dataset.

 %Irrespective of the additional levels - the user wants and has to get to the resource. (not always) to the "original"
-And finally, a visualization tool for the schema level data of the discussed data collection was developed -- the \emph{SMC Browser}.
-Considering the feedback received until now from the colleagues in the community, it is already now a useful tool with high further potential. As detailed in \ref{smc-browser-extensions}, there is a number of features, that could enhance the functionality and usefulness of the tool: integrate with instance data to be able to directly see which profiles are effectively being used; allow set operations on subgraphs (like intersection and difference) to enable differential views; generalize the matching algorithm; enhance the tool to act as an independent visualization service, by accepting external graph data (from any domain).
+And finally, a visualization tool for exploring the schema level data of the discussed data collection was developed -- the \emph{SMC Browser}. Considering the feedback received until now from the colleagues in the community, it is already a useful tool with high further potential. As detailed in \ref{smc-browser-extensions}, there are a number of features that could enhance the functionality and usefulness of the tool: integrate with instance data to be able to directly see which profiles are effectively being used; allow set operations on subgraphs (like intersection and difference) to enable differential views; generalize the matching algorithm; enhance the tool to act as an independent visualization service, by accepting external graph data (from any domain).

 Within the CLARIN community a number of (permanent) tasks has been identified and corresponding task forces have been established,
-one of them being metadata curation. The results of this work represent a directly applicable groundwork for this ongoing effort.
+one of them being metadata curation. The results of this work represent a directly applicable input for this ongoing effort.
 One particularly pressing aspect of the curation is the consolidation of the actual values in the CMD records, a topic explicitly treated in this work.
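The transformation step outlined in the revised conclusion (apply the specified RDF model to all profiles and instances) can be illustrated with a minimal sketch. All URIs and field names below are hypothetical placeholders; the actual modelling is the one specified in the thesis, not this one.

```python
# Minimal sketch: serialize one flat metadata record as N-Triples.
# Record URI, property URIs and values are invented for illustration.
def record_to_ntriples(record_uri, fields):
    """Emit one N-Triples line per (property, value) pair of a record."""
    triples = []
    for prop, value in fields.items():
        literal = '"%s"' % value.replace('"', '\\"')  # escape quotes in literals
        triples.append("<%s> <%s> %s ." % (record_uri, prop, literal))
    return triples

sample = {
    "http://purl.org/dc/terms/title": "Example corpus",
    "http://purl.org/dc/terms/publisher": "Example institute",
}
for line in record_to_ntriples("http://example.org/cmd/record/1", sample):
    print(line)
```

A real workflow would of course have to walk the nested component structure of the CMD records rather than a flat dictionary.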
SMC4LRT/chapters/Data.tex
r3681 → r3776

@@ lines 10–20 @@
 The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN Component Metadata Infrastructure. (See \ref{def:CMDI} for information about the infrastructure. The XML schema defining CMD -- the \xne{general-component-schema} -- is featured in appendix \ref{lst:cmd-schema}.)
 CMD is used to define the so-called \var{profiles} being constructed out of reusable \var{components} -- collections of metadata fields. The components can contain other components and they can be reused in multiple profiles. A profile itself is just a special kind of a component (a subclass), with some additional administrative information.
-The actual core provision for semantic interoperability is the requirement that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category\footnote{persistently referenceable concept definition} (cf. \ref{def:DCR}), thus
+The actual core provision for semantic interoperability is the requirement that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category (cf. \ref{def:DCR}\footnote{in short: persistently referenceable concept definition}), thus
 indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}.
…
 While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted (so-called ``trusted registries''), especially the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}.

-Once the profiles are defined they are transformed into a XML-Schema, that prescribes the structure of the instance records.
+Once the profiles are defined they are transformed into an XML Schema that prescribes the structure of the instance records.
 The generated schema also conveys as annotation the information about the referenced data categories.

@@ lines 24–27 @@
 In the CR 124\footnote{All numbers are as of 2013-06 if not stated otherwise} public Profiles and 696 Components are defined. Table \ref{table:dev_profiles} shows the development of the CR and DCR population over time.

-Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles prove the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles differ also very much in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 419 components and 1587 elements (when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \textit{Contact}) included by three other components (\textit{Project}, \textit{Institution}, \textit{Access}) will appear three times in the instantiated record.}).
+Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles prove the flexibility/expressi\-vi\-ty of the CMD metamodel.
+The individual profiles differ also very much in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\concept{dublincore}, \concept{collection}, the set of \concept{Bamdes}-profiles) there are complex profiles with up to 10 levels (\concept{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 117 distinct components and 337 elements (or 419 components and 1587 elements when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \concept{Contact}) included by three other components (\concept{Project}, \concept{Institution}, \concept{Access}) will appear three times in the instantiated record.}).

@@ lines 135–138 @@
 Next to CLARIN and CMDI, there is a host of related previous and concurrent work. In the following, we briefly introduce some formats and data collections established in the field and, where applicable, we also sketch the ties with CMDI and existing integration efforts.

-Some overview/survey works regarding existing formats are: The CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides an overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. And \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} putting the overwhelming amount of existing metadata standards into a systematic comprehensive overview analyzing the use of standards from four aspects: community, domain, function, and purpose. Though despite its aspiration on comprehensiveness it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE, TEI???
+As for a comprehensive overview of formats and standards, the CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides an overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. And \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} puts the overwhelming amount of existing metadata standards into a systematic, comprehensive visual overview analyzing the use of standards from four aspects: community, domain, function, and purpose. Though despite its aspiration to comprehensiveness it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE.

@@ lines 149–152 @@
 \end{description}

-Today, Dublin Core metadata terms is very widely spread. Thanks to its simplicity it is used as the common denominator in many applications, content management systems integrate Dublin Core to use in \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), it is default minimal description in content repositories (Fedora-commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.
+The DCMI terms format is very widely spread nowadays. Thanks to its simplicity it is used as the common denominator in many applications: content management systems integrate Dublin Core for use in \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), it is the default minimal description in content repositories (Fedora-commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.
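The per-element link to a data category described above (each CMD element refers via a PID to exactly one data category) can be sketched as follows. The component fragment and the ISOcat PIDs are invented for illustration; the element and attribute names (`CMD_Component`, `CMD_Element`, `ConceptLink`) follow the general shape of CMD component specifications, but consult the \xne{general-component-schema} for the authoritative structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of a CMD component specification; the two
# data category PIDs are placeholders, not verified ISOcat entries.
SNIPPET = """
<CMD_Component name="Contact">
  <CMD_Element name="firstName" ConceptLink="http://www.isocat.org/datcat/DC-2556"/>
  <CMD_Element name="email" ConceptLink="http://www.isocat.org/datcat/DC-2521"/>
</CMD_Component>
"""

def datcat_index(xml_text):
    """Map each element name to the data category PID it references."""
    root = ET.fromstring(xml_text)
    return {el.get("name"): el.get("ConceptLink")
            for el in root.iter("CMD_Element")}

print(datcat_index(SNIPPET))
```

Such an index is the basis for the crosswalks: two elements from different profiles pointing to the same PID can be treated as semantically equivalent.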
@@ lines 152–154 @@
 There are multiple possible serializations, in particular a mapping to RDF is specified\furl{http://dublincore.org/documents/dcq-rdf-xml/}.
…
@@ lines 159–164 @@
 \subsection{OLAC}
 \label{def:OLAC}

 \xne{OLAC Metadata}\furl{http://www.language-archives.org/} format \cite{Bird2001} is an application profile \cite{heery2000application} of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community} providing a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.

-The OLAC schema\furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies, for domain specific semantic annotation (\code{linguistic-field, linguistic-type, language, role, discourse-type})
+The OLAC schema\furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies, for domain-specific semantic annotation (\code{linguistic-field}, \code{role}, \code{linguistic-type}, \code{language}, \code{discourse-type}).

 \begin{quotation}
…
@@ lines 233–236 @@
 One point of criticism from the community was the fact that the META-SHARE infrastructure does not provide any interface to the outer world, such as an OAI-PMH endpoint.

-?
-MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
+%? MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}

 \subsection{ELRA}

-European Language Resources Association\furl{http://elra.info} ELRA, offers a large collection of language resources, mostly under license for a fee, although some resources are available for free as well.
+The European Language Resources Association\furl{http://elra.info} (ELRA) offers a large collection of language resources (over 1.100) with a focus on spoken resources, but also written, terminological and multimodal resources, mostly under license for a fee (although selected resources are available for free as well).
 The available datasets can be searched for via the ELRA Catalog\furl{http://catalog.elra.info/}.
 Additionally ELRA runs the so-called \xne{Universal Catalog} -- a repository comprising information regarding Language Resources (LRs) identified all over the world.
…
@@ lines 253–257 @@
 \subsection{LDC}

-Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/} is another provider of high quality curated language resources
+The Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/}, hosted by the University of Pennsylvania, is another provider/aggregator of high-quality curated language resources. The data is provided for a fee; more than 650 resources have been made available since 1993. The catalog is freely accessible. The metadata is additionally aggregated by OLAC archives.

 \section{Formats and Collections in the World of Libraries}

-There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition implying rich experience and the fact that almost all of the resources in the libraries are language resources.
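The controlled-vocabulary attributes that OLAC adds on top of dcterms can be sketched as a simple membership check. The vocabulary subsets below are illustrative only; the full value lists are defined in the OLAC extension schemas referenced above.

```python
# Illustrative subsets -- the real OLAC vocabularies are larger and
# normatively defined in the olac.xsd extension schemas.
OLAC_VOCABS = {
    "linguistic-field": {"syntax", "phonology", "semantics"},
    "linguistic-type": {"lexicon", "primary_text", "language_description"},
    "discourse-type": {"dialogue", "narrative", "singing"},
}

def check_refinement(attribute, value):
    """Return True if value is admissible for the given OLAC attribute."""
    vocab = OLAC_VOCABS.get(attribute)
    return vocab is not None and value in vocab

print(check_refinement("linguistic-type", "lexicon"))   # expected: True
print(check_refinement("linguistic-field", "cooking"))  # expected: False
```

In practice this validation is done by XML Schema against the enumerated types, not by hand-written code; the sketch only makes the mechanism explicit.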
-This argument gets even more relevant in the light of the efforts to digitize large portions of the material pursued in many (national) libraries in the last years (cf. discussion on Libraries partnering with Google). And given the amounts of data, even only the bibliographic records constitute sizable language resources in their own right.
+\label{sec:lib-formats}
+
+There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition implying rich experience and the fact that almost all of the resources in the libraries are language resources. This argument gets even more relevant in the light of the efforts to digitize large portions of the material pursued in many (national) libraries in the last years (cf. discussion on Libraries partnering with Google). And given the amounts of data, even the bibliographic records alone constitute sizable language resources in their own right.

 %\item[LoC] Library of Congress \url{http://www.loc.gov}
…
@@ lines 279–286 @@
 A number of tools have been developed to author and process the \xne{METS} format\furl{http://www.loc.gov/standards/mets/mets-tools.html} and numerous projects (online editions, DAM systems) use METS for structuring and recording the data\footnote{\url{http://www.loc.gov/standards/mets/mets-registry.html} though seems rather outdated}, among others also \xne{austrian literature online}\furl{http://www.loc.gov/standards/mets/mets-registry.html}

-Metadata Object Description Schema - ``is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications''. It is a simplified subset of MARC 21 using language-based tags rather than numeric ones,
+\xne{Metadata Object Description Schema} -- ``is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications''.
+It is a simplified subset of MARC 21 using language-based tags rather than numeric ones,
 more than Dublin Core. One of the endorsed schemas to extend (be used inside) METS.

-In 1998 a new Entity Relationship model - FRBR - Functional Requirements for Bibliographic Records 2002 \cite{FRBR1998}
-and since ?? RDA - Resource Description and Access
+There have been efforts to create a conceptually more sound base for the bibliographic data -- in 1998 the \xne{Functional Requirements for Bibliographic Records} (FRBR) \cite{FRBR1998} was published, an abstract model for the data expressed as an Entity Relationship Model, and a standard based on FRBR, the \xne{Resource Description and Access} (RDA), has been proposed as a comprehensive standard for resource description and discovery, which however was confronted with opposition from the LIS community, questioning the need of abandoning established cataloging practices \cite{gorman2007rda}.
+And although there is still work on RDA, among others by the Library of Congress, there has been no wider adoption of the standard by the LIS community until now.

 \subsection{ESE, Europeana Data Model - EDM}

-Within the big european initiative \xne{Europeana} (cf. \ref{lit:digi-lib}) information about digitised objects is collected from a great number of cultural institutions from all of Europe, currently
-
-originally developed and advised the common format \xne{ESE Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation}, a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. Soon it became obvious that this format is very limiting and work started on a Semantic Web compatible RDF-based format -- the Europeana Data Model EDM\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana,haslhofer2011data,doerr2010europeana}.
-EDM is fully compatible with ESE, which is (and will be) accepted from the providers. There is a SPARQL endpoint\furl{http://europeana.ontotext.com/sparql} to explore the semantic data of Europeana.
+Within the big european initiative \xne{Europeana} (cf. \ref{lit:digi-lib}) information about digitised objects is collected from a great number of cultural institutions from all of Europe, currently hosting information about 29 million objects from 2.200 institutions from 36 countries\furl{http://www.pro.europeana.eu/web/guest/content}.
+
+For collecting metadata from the content providers, Europeana originally developed and advised the common format \xne{ESE Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation}, a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. Soon it became obvious that this format is too limiting and work started on a Semantic Web compatible RDF-based format -- the Europeana Data Model (EDM)\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana,haslhofer2011data,doerr2010europeana}.
+EDM is fully compatible with ESE, which is (and will be) accepted from the providers. There is also already a SPARQL endpoint\furl{http://europeana.ontotext.com/sparql} to explore the Europeana data in the new format.
 %https://github.com/europeana
…
@@ lines 303–304 @@
 Conceptually, we want to partition these resources in two types. On the one hand abstract concepts constituting all kinds of classifications, typologies, taxonomies. On the other hand named entities that exist(ed) in real world, like persons, organizations or geographical places.
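A query against the Europeana SPARQL endpoint mentioned above might look like the sketch below. The EDM class name is taken from the EDM documentation, but the endpoint address and data layout may have changed since; treat this as an illustration of the idea, not as a tested call (the sketch only composes the query, it does not send it).

```python
# Hypothetical SPARQL query for the Europeana endpoint
# (http://europeana.ontotext.com/sparql); not executed here.
QUERY = """
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX dc:  <http://purl.org/dc/elements/1.1/>

SELECT ?object ?title
WHERE {
  ?object a edm:ProvidedCHO ;
          dc:title ?title .
}
LIMIT 10
"""

# Sending it would be a plain HTTP GET with the query URL-encoded
# in the `query` parameter (e.g. via urllib); here we only print it.
print(QUERY.strip())
```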
 Main motivation for this distinction is the insight that while for named entities there is (mostly) ``something'' in the (physical) world that gives a solid ground for equivalence relations between references from different sources (sameAs), for concepts we need to accept a plurality of existing conceptualizations, and while we can (and have to) try to identify relations between them, the equivalence relation is inherently much weaker. This insight entails a partly different approach -- simply put, while we can aspire to create one large list/index encompassing all named entities, we have to maintain a forest of conceptual trees.

-In the following we inventory such resources, covering the domains expected in the dataset. (Information about the size of the dataset is meant rather as a rough indication of the "general weight" of the dataset, not necessarily precise up-to-date information.) The acronyms in the tables are resolved in the subsequent glossary.
+In the following we inventory such resources (cf. tables \ref{table:data-ne}, \ref{table:data-concepts}) covering the domains expected to be needed for linking the original dataset. (Information about the size of the dataset is meant rather as a rough indication of the "general weight" of the dataset, not necessarily precise up-to-date information.) The acronyms in the tables are resolved in the glossary \ref{table:vocab-glossary}.
 How these resources will be employed is discussed in \ref{sec:values2entities}.
+Additionally, some verbose commentary follows.

 %\subsubsection{Named entities}
…
@@ lines 312–318 @@
 Other general large-scale resources are the vocabularies curated and provided by the Getty Research Institute\furl{http://www.getty.edu/research/tools/vocabularies/index.html}; however there is only limited free access, and a license and fee for full access.
 But recently work was announced to publish the vocabularies as LOD\furl{http://www.getty.edu/research/tools/vocabularies/lod/index.html}

-Yago is a large knowledge base integrating dbpedia, geonames and ..??
-
 Regarding existing domain-specific semantic resources \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{Deutsches Forschungszentrum für Künstliche Intelligenz, \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study. \cite{Joerg2010}
+
+Also to mention is \xne{Yago}, a large knowledge base created by MPI Informatik integrating dbpedia, geonames and wordnet\furl{http://www.mpi-inf.mpg.de/yago-naga/yago/} \cite{Suchanek2007yago}.

 So we witness a strong general trend towards Semantic Web and Linked Open Data.
…
 %\subsection{Concepts -- Classifications, Taxonomies, \dots}

@@ lines 327–353 (new) @@
+\begin{comment}
+
+VoID ("Vocabulary of Interlinked Datasets") is an RDF based schema to describe linked datasets\furl{http://semanticweb.org/wiki/VoID}
+
+\subsection{schema.org}
+http://schema.org/docs/datamodel.html
+http://www.w3.org/wiki/WebSchemas/ExternalEnumerations
+
+microdata or
+http://www.w3.org/TR/rdfa-lite/
+Resource Description Framework in attributes
+
+the entire WorldCat cataloging collection made publicly available using Schema.org mark-up with library extensions for use by developers and search partners such as Bing, Google, Yahoo! and Yandex
+
+OCLC begins adding linked data to WorldCat by appending Schema.org descriptive mark-up to WorldCat.org pages, thereby making OCLC member library data available for use by intelligent Web crawlers such as Google and Bing
+
+\end{comment}
+
+\section{Summary}
+
+In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology.
+We also gave an overview of the main formats and collections in the domain of Library and Information Services and an inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications), needed as input in section \ref{sec:values2entities} about mapping values to entities.
354 325 355 326 356 … … 345 375 & & \multicolumn{3}{l}{764,000 persons; 333,000 works; 192,000 organizations; 639,000 geographica } \\ 346 376 Yago \cite{Suchanek2007yago} & MPI Informatik & 1E7 / 1.2E8 & huge semantic KB (aggregated from Wikipedia, Wordnet, Geonames) & \href{http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html}{data dumps} \\ 347 \href{http://lt-world.de}{LT-World} & DFKI & 3.300 persons, 4.600 organizations & ontology-based portal for Language Technology & \href{http://www.lt-world.org/kb/}{portal} \\ 377 \href{http://lt-world.de}{LT-World} & DFKI & 3.300 persons& ontology-based portal for LRT & \href{http://www.lt-world.org/kb/}{portal} \\ 378 & & 4.600 organizations & & \\ 348 379 Geonames & Geonames & \textgreater 1E7 (2.8E6 / 5.5E6) & "modern" place names & data dump + web service \\ 349 380 PKND & prometheus & \textgreater 37,000 & persons, artists & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{XML dump} \\ … … 389 420 GND/s & DNB & 202.000 & subjects (Schlagwörter), universal, lang:de & \\ 390 421 GTAA & NISL & 3.800 & Subjects, describing TV programs & \href{http://datahub.io/de/dataset/gemeenschappelijke-thesaurus-audiovisuele-archieven}{(RDF) data dumps}, \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\ 391 DDC & OCLC & & universal classification by field of study, translated in multiple languages & \href{http://dewey.info/}{dewey.info} \\422 DDC & OCLC & & universal classification by field of study, multi langs & \href{http://dewey.info/}{dewey.info} \\ 392 423 UDC & & & & \\ 393 424 Wiki Categories & Wikipedia & 995,911& classification of Wiki articles as skos:Concepts & SKOS Vocabulary, SPARQL \\ 394 425 DBpedia Ontology & Wikipedia & 529 / 2333 & general classification of Wiki articles as ontology & \href{http://wiki.dbpedia.org/Ontology39?v=g9b}{RDF data}, SPARQL\\ 395 ISOcat & (CLARIN) & \textgreater 6,500 & data categories defining (linguistic) concepts in a number of thematic groups (Metadata, 
Lexical Resources, ...)& \href{http://www.isocat.org}{web-app}, service \\396 Object Names Thes aurus& British Museum & & classification of objects in the collection & \\397 Material Thes aurus& British Museum & & classification of material & \\398 Thes aurus ofMonument Types & British Museum & & types of monuments & \\426 ISOcat & (CLARIN) & \textgreater 6,500 & data categories defining (linguistic) concepts & \href{http://www.isocat.org}{web-app}, service \\ 427 Object Names Thes. & British Museum & & classification of objects in the collection & \\ 428 Material Thes. & British Museum & & classification of material & \\ 429 Thes. Monument Types & British Museum & & types of monuments & \\ 399 430 Hornbostel-Sachs-Systematik & & 300 categories & classification of musical instruments & \href{http://www.music.vt.edu/musicdictionary/texth/Hornbostel-Sachs.html}{web page} \\ 400 431 Oberbegriffsdatei & DMB & & a set of vocabularies for museums, lang:de & \url{museumsvokabular.de}, PDF, XML dumps\\ … … 408 439 \end{landscape} 409 440 410 \begin{description} 411 \item[AAT] international Architecture and Arts Thesaurus, Getty 412 \item[CONA] Cultural Objects Name Authority 413 \item[DAI] Deutsches ArchÀologisches Institut 414 \item[DDC] Dewey Decimal Classification 415 \item[DFKI] Deutsches Forschungszentrum fÃŒr KÃŒnstliche Intellligenz 416 \item[DMB] Deutscher Museumsbund 417 \item[DNB] Deutsche National Bibliothek 418 \item[FAST] Faceted Application of Subject Terminology 419 \item[Getty] Getty Research Institute curating the vocabularies\furl{http://www.getty.edu/research/tools/vocabularies/index.html}, part of Getty Trust 420 \item[GND] \emph{Gemeinsame Norm Datei} - Integrated authority Files of the German National Library 421 \item[GTAA] Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives) 422 \begin{quotation} The thesaurus consists of several facets for describing TV programs: subjects; people mentioned; named 
entities (Corporation names, music bands etc); locations; genres; makers and presentators. \end{quotation} 423 \item[ISO] International Standardization Organization 424 \item[LCCN] Library of Congress Control Number 425 \item[LCC] Library of Congress Classification 426 \item[LCSH] Library of Congress Subject Headings 427 \item[LoC] Library of Congress\furl{http://loc.gov} 428 \item[OCLC] Online Computer Library Center\furl{http://www.oclc.org} -- world's biggest library federation 429 \item[PKND] prometheus KÃŒnstlerNamensansetzungsDatei\furl{http://prometheus-bildarchiv.de/de/tools/pknd} 430 \item[RKD] Rijksbureau voor Kunsthistorische Documentatie -- Netherlands Institute for Art History 431 \item[TGN] Getty Thesaurus of Geographic Names 432 \item[UDC] Universal Decimal Classification 433 \item[ULAN] Union List of Artist Names 434 \item[VIAF] Virtual International Authority File -- union of the authority files of \textgreater 20 national (and prominent research) libraries 435 \end{description} 436 437 438 \begin{comment} 439 440 VoID "Vocabulary of Interlinked Datasets") is an RDF based schema to describe linked datasets\furl{http://semanticweb.org/wiki/VoID} 441 442 \subsection{schema.org} 443 http://schema.org/docs/datamodel.html 444 http://www.w3.org/wiki/WebSchemas/ExternalEnumerations 445 446 microdata or 447 http://www.w3.org/TR/rdfa-lite/ 448 Resource Description Framework in attributes 449 450 the entire WorldCat cataloging collection made publicly 451 available using Schema.org mark-up with library extensions for use by developers and 452 search partners such as Bing, Google, Yahoo! 
and Yandex 453 454 OCLC begins adding linked data to WorldCat by appending 455 Schema.org descriptive mark-up to WorldCat.org pages, thereby 456 making OCLC member library data available for use by intelligent 457 Web crawlers such as Google and Bing 458 459 \end{comment} 460 461 \section{Summary} 462 463 In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology. 464 We also gave an overview of main formats and collections in the domain of Library and Information Services and a inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications). 465 441 442 443 \begin{table} 444 \caption{Glossary of acronyms used in the overview of controlled vocabularies (tables \ref{table:data-ne}, \ref{table:data-concepts}) } 445 \label{table:vocab-glossary} 446 447 % \begin{tabu}{ >{\sffamily}l p{0.8\textwidth} 448 \begin{tabular}{ >{\sffamily}l p{0.8\textwidth}} 449 % \hline 450 %\rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\ 451 % \hline 452 453 AAT & international Architecture and Arts Thesaurus, Getty \\ 454 CONA & Cultural Objects Name Authority \\ 455 DAI & Deutsches ArchÀologisches Institut \\ 456 DDC & Dewey Decimal Classification \\ 457 DFKI & Deutsches Forschungszentrum fÃŒr KÃŒnstliche Intellligenz \\ 458 DMB & Deutscher Museumsbund \\ 459 DNB & Deutsche National Bibliothek \\ 460 FAST & Faceted Application of Subject Terminology \\ 461 Getty & Getty Research Institute curating the \href{http://www.getty.edu/research/tools/vocabularies/index.html}{vocabularies}, part of Getty Trust \\ 462 GND & \emph{Gemeinsame Norm Datei} - Integrated authority Files of the German National Library \\ 463 GTAA & Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for \& Audiovisual Archives) \\ 464 % {quotation} The thesaurus consists of several facets for describing TV programs: subjects; people mentioned; 
named entities (Corporation names, music bands etc); locations; genres; makers and presentators. \end{quotation} \\ 465 ISO & International Organization for Standardization \\ 466 LCCN & Library of Congress Control Number \\ 467 LCC & Library of Congress Classification \\ 468 LCSH & Library of Congress Subject Headings \\ 469 LoC & Library of Congress\furl{http://loc.gov} \\ 470 OCLC & Online Computer Library Center\furl{http://www.oclc.org} -- the world's biggest library federation \\ 471 PKND & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{prometheus} KünstlerNamensansetzungsDatei\\ 472 RKD & Rijksbureau voor Kunsthistorische Documentatie -- Netherlands Institute for Art History \\ 473 TGN & Getty Thesaurus of Geographic Names \\ 474 UDC & Universal Decimal Classification \\ 475 ULAN & Union List of Artist Names \\ 476 VIAF & Virtual International Authority File -- union of the authority files of \textgreater 20 national (and prominent research) libraries \\ 477 \end{tabular} 478 \end{table} 479 -
SMC4LRT/chapters/Definitions.tex
r3680 r3776 74 74 \end{definition} 75 75 76 \noindent 76 \begin{example1} 77 77 Example blocks, simple: 78 \begin{example1} 79 Short piece of sample data 80 78 \end{example1} 81 79 82 \noindent 83 or with tabs (especially for RDF triples): 84 80 \begin{example3} 85 my:work & my:example & my:block 81 or with & tabs (especially for & RDF triples) 86 82 \end{example3} -
SMC4LRT/chapters/Design_SMCinstance.tex
r3680 r3776 12 12 relevant parts in a triple store and do your SPARQL/reasoning on it. Well 13 13 that's where I'm ultimately heading with all these registries related to 14 semantic interoperability ... I hope ;-)\cite{Menzo2013mail} 14 15 semantic interoperability ... I hope ;-) 16 17 \hfill \textit{Menzo Windhouwer} \cite{Menzo2013mail} 15 18 \end{quotation} 19 16 20 17 21 As described in previous chapters (\ref{ch:infra}, \ref{ch:design}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, the established machinery pertains mostly to the schema level; the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that cannot yet be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the constrained-field values. … … 38 41 \subsection{CMD specification} 39 42 40 The main entity of the meta model is the CMD component and is typed as specialization of the \code{ owl:Class}. CMD profile is basically a CMD component with some extra features, implying a specialization relation: 43 The main entity of the meta model is the CMD component, typed as a specialization of \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. 
It would be natural to translate a CMD element to an RDF property, but it needs to be a class, as a CMD element -- next to its value -- can also have attributes. This further implies a property \code{ElementValue} to express the actual value of a given CMD element. 41 44 42 45 \label{table:rdf-spec} 43 46 \begin{example3} 44 cmds:Component & subClassOf & owl:Class. \\ 45 cmds:Profile & subClassOf & cmds:Component. \\ 46 cmds:Element & subClassOf & rdf:Property. \\ 47 \end{example3} 47 cmds:Component & a & rdfs:Class. \\ 48 cmds:Profile & rdfs:subClassOf & cmds:Component. \\ 49 cmds:Element & a & rdfs:Class. \\ 50 cmds:ElementValue & a & rdf:Property \\ 51 cmds:Attribute & a & rdf:Property \\ 52 \end{example3} 53 48 54 49 55 \noindent … … 56 62 & dcterms:identifier & cr:clarin.eu:cr1:p\_1345561703620. \\ 57 63 cmd:Actor & a & cmds:Component. \\ 58 cmd:LanguageName & a & cmds:Element. \\ 59 \end{example3} 60 61 \begin{note} 62 Should the ID assigned in the Component Registry for the CMD entities be used as identifier in RDF, or rather the verbose name? (if yes, how to ensure uniqueness -- generate the name from the cmd-path?) 63 \end{note} 64 cmd:Actor.LanguageName & a & cmds:Element. \\ 65 \end{example3} 66 67 %\begin{note} 68 %Should the ID assigned in the Component Registry for the CMD entities be used as identifier in RDF, or rather the verbose name? (if yes, how to ensure uniqueness -- generate the name from the cmd-path?) 69 %\end{note} 70 64 71 65 72 \subsection{Data Categories} … … 69 76 dcr:datcat & a & owl:AnnotationProperty ; \\ 70 77 & rdfs:label & "data category"@en ; \\ 71 & rdfs:comment & "This resource is equivalent to 78 & rdfs:comment & "This resource is equivalent to this data category."@en ; \\ 72 79 & skos:note & "The data category should be identified by its PID."@en ; \\ 73 80 \end{example3} … … 87 94 88 95 \noindent 89 Analogously, we could model \xne{ISOcat} data categories as data properties, i.e. 
metadata elements referencing ISOcat data categories could be encoded as follows: 90 91 \begin{example3} 92 <lr1> & isocat:DC-2502 & "19th century" 93 \end{example3} 94 95 \noindent 96 However, Windhouwer\cite{Windhouwer2012_LDL} argues against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications. 97 98 This raises the vice-versa question, whether to rather handle all data categories uniformly, which would mean encoding dublincore terms also as annotation properties, but the pragmatic view dictates to encode the data in line with the prevailing approach, i.e. express dublincore terms directly as data properties. 99 100 101 \noindent 102 The REST web service of \xne{ISOcat} provides an RDF representation of the data categories: 103 104 \begin{example3} 105 isocat:languageName & dcr:datcat & isocat:DC-2484; \\ 106 & rdfs:label & "language name"@en; \\ 107 & rdfs:comment & "A human understandable..."@en; \\ 108 & \ldots \\ 109 \end{example3} 110 111 However, this is only meant as a template, as stated in the explanatory comment of the exported data: 112 113 \begin{quotation} 114 By default the RDF export inserts \code{dcr:datcat} annotation properties to maintain the link between the generated RDF resources and the used Data Categories. However, it is possible to also maintain a stronger semantic link when the RDF resources will be used as OWL (2) classes, properties or individuals. 
115 \end{quotation} 116 117 So in a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals: 96 However, we argue against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications.\cite{Windhouwer2012_LDL} 97 In a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals: 118 98 119 99 \begin{example3} … … 132 112 133 113 \noindent 134 By design, the relations in Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. 135 136 \begin{note} 137 Does this mean, that I would say: 138 \begin{example3} 139 rel:sameAs & owl:equivalentProperty & owl:sameAs 140 \end{example3} 141 142 to enable the inference of the equivalences? 143 144 Is this correct: 145 \end{note} 146 ?? 
That means, that to be able to infer that a value in a CMD element also pertains to a given data category, e.g.: 147 148 \begin{example2} 149 cmd:PublicationYear = 2012 $\rightarrow$ & dc:created = 2012 150 \end{example2} 151 152 \noindent 153 following facts need to be present in the ontology : 154 155 \begin{example3} 156 <lr1> & cmd:PublicationYear & 2012\^{}\^{}xs:year \\ 157 cmd:PublicationYear & owl:equivalentProperty & isocat:DC-2538 \\ 158 isocat:DC-2538 & rel:sameAs & dc:created \\ 159 rel:sameAs & owl:equivalentProperty & owl:sameAs \\ 160 $\rightarrow$ \\ 161 <lr1> & dc:created & 2012\^{}\^{}xs:year \\ 162 \end{example3} 163 164 \noindent 165 What about other relations we may want to express? (Do we need them and if yes, where to put them? -- still in RR?) Examples: 166 167 \begin{example3} 168 cmd:MDCreator & owl:subClassOf & dcterms:Agent \\ 169 clavas:Organization & owl:subClassOf & dcterms:Agent \\ 170 <org1> & a & clavas:Organization \\ 171 \end{example3} 114 By design, the relations in Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. The \code{rel:*} properties can be understood as an upper layer of a taxonomy of relation types, implying a subtyping: 115 116 \begin{example3} 117 rel:sameAs & rdfs:subPropertyOf & owl:sameAs 118 \end{example3} 119 120 172 121 173 122 \subsection{CMD instances} … … 177 126 178 127 It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections, that are solely defined by their constituents and don't have any data on their own.) 
As a fall-back, the PID of the MD record ( \code{<lr1.cmd>} from the \code{cmd:MdSelfLink} element) could be used as the resource identifier. 179 If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}: 128 If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}. 129 (Note also that one MD record can describe multiple resources; this can easily be accommodated in OpenAnnotation): 130 131 \begin{example3} 182 132 \_:anno1 & a & oa:Annotation; \\ 183 & oa:hasTarget & <lr1 >; \\ 133 & oa:hasTarget & <lr1a>, <lr1b>; \\ 184 134 & oa:hasBody & <lr1.cmd>; \\ 185 135 & oa:motivatedBy & oa:describing \\ … … 192 142 \begin{example3} 193 143 <lr1.cmd> & dcterms:identifier & <lr1.cmd>; \\ 194 & dcterms:creator ??& "\var{\{cmd:MdCreator\}}"; \\ 195 & dcterms:publisher & <http://clarin.eu> , <provider-oai-accesspoint>; ??\\ 196 & dcterms:created /dcterms:modified? & "\var{\{cmd:MdCreated\}}" ??\\ 144 & dcterms:creator & "\var{\{cmd:MdCreator\}}"; \\ 145 & dcterms:publisher & <http://clarin.eu>\\ 146 & dcterms:created /dcterms:modified? & "\var{\{cmd:MdCreated\}}" \\ 197 147 \end{example3} 198 148 … … 207 157 & ore:aggregates & <lr1.cmd>, <lr2.cmd>; \\ 208 158 \end{example3} 209 210 \noindent 211 ?? 
Should both collection hierarchy and resource-pointers (collection and resource MD records) be encoded as ore:Aggregation? 212 Additionally the flat header field \code{cmd:MdCollectionDisplayName} has been introduced to indicate by simple means the collection of which a given resource is part. 213 This information can be used to generate a separate one-level grouping of the resources, in which the value from the \code{cmd:MdCollectionDisplayName} element would be used as the label of an otherwise undefined \code{ore:ResourceMap}. 214 Even the identifier/URI for these collections is not clear. Although these collections should match with the ResourceProxy hierarchy, there is no guarantee for this, thus a 1:1 mapping cannot be expected. 215 216 \todocode{check consistency for MdCollectionDisplayName vs. IsPartOf in the instance data} 217 218 \begin{example3} 219 \_:mdcoll & a & ore:ResourceMap; \\ 220 & rdfs:label & "Collection 1"; \\ 221 \_:mdcoll\#aggreg & a & ore:Aggregation \\ 222 & ore:aggregates & <lr1.cmd>, <lr2.cmd>; \\ 223 \end{example3} 224 160 225 \subsubsection{Components -- nested structures} 226 161 227 There are two variants to express the tree structure of the CMD records, i.e. 228 the containment relation between the components: 229 230 \begin{enumerate}[a)] 231 \item the components are encoded as object property 232 233 \begin{example3} 234 <lr1> & cmd:Actor & \_:Actor1 \\ 235 <lr1> & cmd:Actor & \_:Actor2 \\ 236 \_:Actor1 & cmd:motherTongue & iso-639:aac \\ 237 \_:Actor2 & cmd:motherTongue & iso-639:deu \\ 238 \_:Actor1 & cmd:role & "Interviewer" \\ 239 \_:Actor2 & cmd:role & "Speaker" \\ 240 \end{example3} 241 \item a dedicated object property is used 162 For expressing the tree structure of the CMD records, i.e. 
the containment relation between the components, a dedicated property \code{cmd:contains} is used: 242 163 243 164 \begin{example3} … … 246 167 \end{example3} 247 168 248 \end{enumerate} 249 250 169 \subsection{Elements, Fields, Values} 251 170 Finally, we also want to integrate the actual field values in the CMD records into the ontology. 252 171 253 \subsubsection{Predicates} 254 As explained before CMD elements are typed as \code{rdf:Property} with the corresponding data category expressed as annotation property: 172 % \subsubsection{Predicates} 173 As explained before, CMD elements have to be typed as \code{rdfs:Class}, the actual value expressed via the \code{cmds:ElementValue} property and the corresponding data category expressed as an annotation property. 174 175 The following example shows the whole chain of statements from the meta model down to the literal value: 255 176 256 177 \begin{example3} 257 178 cmd:timeCoverage & a & cmds:Element \\ 179 cmd:timeCoverageValue & a & cmds:ElementValue \\ 258 180 cmd:timeCoverage & dcr:datcat & isocat:DC-2502 \\ 259 <lr1> & cmd:timeCoverage & "19th century" \\ 260 261 \end{example3} 262 263 \subsubsection{Literal values -- data properties} 264 265 To generate triples with literal values is straightforward: 266 267 \begin{definition}{Literal triples} 268 lr:Resource \ \quad cmds:Property \ \quad xsd:string 269 \end{definition} 270 271 \begin{example3} 272 <lr1> & cmd:Organisation & "MPI" \\ 273 \end{example3} 274 275 \subsubsection{Mapping to entities -- object properties} 276 277 The more challenging but also more valuable aspect is to generate objectProperty triples with the literal values mapped to semantic entities: 278 279 \begin{definition}{new RDF triples} 280 lr:Resource \ \quad cmd:Property \ \quad xsd:anyURI 281 \end{definition} 282 283 \begin{example3} 284 <lr1> & cmd:Organisation\_? 
& <org1> \\ 285 \end{example3} 286 287 \begin{note} 181 <lr1> & cmd:contains & \_:timeCoverage1 \\ 182 \_:timeCoverage1 & a & cmd:timeCoverage \\ 183 \_:timeCoverage1 & cmd:timeCoverageValue & "19th century" \\ 184 \end{example3} 185 186 187 While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples with the literal values mapped to semantic entities: 188 189 \begin{example3} 190 \var{cmds:Element} & \var{cmds:ElementValue\_?} & \var{xsd:anyURI}\\ 191 \_:organisation1 & cmd:OrganisationValue\_? & <org1> \\ 192 \end{example3} 193 194 \begin{comment} 288 195 Don't we need a separate property (predicate) for the triples with object properties pointing to entities, 289 196 i.e. \code{cmd:Organisation\_} additionally to \code{cmd:Organisation} 290 \end{note} 291 292 The mapping process is detailed in \ref{sec:values2entities} 293 294 %%%%%%%%%%%%%%%%%55 197 \end{comment} 198 199 The mapping process is detailed in \ref{sec:values2entities}. 200 201 202 203 %%%%%%%%%%%%%%%%% 295 204 \section{Mapping field values to semantic entities} 296 205 \label{sec:values2entities} … … 310 219 We don't try to achieve complete ontology alignment, we just want to find 311 220 for our ``anonymous'' concepts semantically equivalent concepts from other ontologies. 312 This is very near just other phrasing forthe definition of ontology mapping function as given by \cite{EhrigSure2004, amrouch2012survey}:221 This is almost equivalent to the definition of ontology mapping function as given by \cite{EhrigSure2004, amrouch2012survey}: 313 222 ``for each concept (node) in ontology A [tries to] find a corresponding concept 314 223 (node), which has the same or similar semantics, in ontology B and vice verse''. 315 224 316 225 The first two points in the above enumeration represent the steps necessary to be able to apply the ontology mapping. 
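The cited mapping function -- for each concept (node) in ontology A find a concept with the same or similar semantics in ontology B -- can be sketched for the simplest case, matching by normalized labels. All names and sample data below are invented for illustration; this is not the actual SMC implementation:

```python
# Sketch of the ontology mapping function: for each "anonymous" concept
# extracted from CMD instance data, find a label-equivalent concept in an
# external vocabulary. Sample data and names are invented.

def normalize(label):
    """Fold case and whitespace so spelling variants compare equal."""
    return " ".join(label.lower().split())

def map_concepts(ontology_a, ontology_b):
    """Return {concept_a: concept_b} for label-equivalent concepts.

    Both arguments are {concept_uri: [labels]} dictionaries."""
    index_b = {normalize(lbl): uri
               for uri, labels in ontology_b.items()
               for lbl in labels}
    return {uri_a: index_b[normalize(lbl)]
            for uri_a, labels in ontology_a.items()
            for lbl in labels
            if normalize(lbl) in index_b}

# "anonymous" concepts distilled from CMD instance data ...
cmd_concepts = {"_:org_mpi": ["MPI", "Max Planck Institute"]}
# ... matched against an external vocabulary of organizations
clavas_orgs = {"clavas:org42": ["Max  Planck institute", "MPG"]}

print(map_concepts(cmd_concepts, clavas_orgs))
# {'_:org_mpi': 'clavas:org42'}
```

A real matcher would add fuzzy string similarity and contextual constraints (e.g. the concept type) on top of this exact-match baseline.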
317 The identification of appropriate vocabularies is discussed in the next subsection. In the operationalization, the identified vocabularies could be treated as one aggregated ontology to map all entities against. For the sake of higher precision, it may be sensible to perform the task separately for individual concepts, i.e. organisations, persons etc. and in every run consider only relevant vocabularies. 318 319 320 The transformation of the data has been partly described in previous section: 321 It can be trivially automatically converted into RDF triples as : 322 323 \begin{example3} 324 <lr1> & cmd:Organisation & "MPI" \\ 325 \end{example3} 326 327 However for the needs of the mapping task we propose to reduce and rewrite to retrieve distinct concept , value pairs: 328 329 \begin{example3} 330 \_:1 & a & cmd:Organisation;\\ 226 The identification of appropriate vocabularies is discussed in the next subsection. In the operationalization, the identified vocabularies could be treated as one aggregated semantic resource to map all entities against. For the sake of higher precision, it may be sensible to perform the task separately for individual concepts, i.e. organisations, persons, etc., and in every run consider only the relevant vocabularies. 227 228 The transformation of the data has been partly described in the previous section. It can be trivially and automatically converted into RDF triples as: 229 230 \begin{example3} 231 \_:organisation1 & cmd:OrganisationValue & "MPI" \\ 232 \end{example3} 233 234 However, for the needs of the mapping task, we propose to reduce and rewrite the data to retrieve distinct (concept, value) pairs (cf. figure \ref{fig:smc_cmd2lod}): 235 236 \begin{example3} 237 \_:1 & a & clavas:Organisation;\\ 331 238 & skos:altLabel & "MPI"; 332 239 \end{example3} … … 345 252 \subsubsection{Identify vocabularies} 346 253 347 \todoin{Identify related ontologies, vocabularies? 
- see DARIAH:CV} 348 LT-World \cite{Joerg2010} 349 350 One generic way to indicate vocabularies for given metadata fields or data categories being discussed in the CMD community is to use dedicated annotation property (tentatively \code{@clavas:vocabulary}) in the schema or data category definition. For such a mechanism to work, the consuming applications (like metadata editor) need to be made aware of this convention and interpret it accordingly. 254 One generic way to indicate vocabularies for given metadata fields or data categories being discussed in the CMD community is to use a dedicated annotation property in the schema or data category definition (tentatively labeled \code{@clavas:vocabulary}). For such a mechanism to work, the consuming applications (like the metadata editor) need to be made aware of this convention and interpret it accordingly. 255 256 The primary providers of relevant vocabularies are \xne{ISOcat} and \xne{CLAVAS} -- a service for managing and providing vocabularies in SKOS format (cf. \ref{def:CLAVAS}). Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS, and other relevant vocabularies shall also be ingested into this system, so that we can assume OpenSKOS as a first source of vocabularies. However, definitely not all of the existing reference data will be hosted by OpenSKOS, so in general we have to assume/consider a number of different sources (cf. \ref{refdata}). … … 380 284 \end{definition} 381 285 382 In the implementation , there needs to be additional initial configuration input, identifying datasets for given data categories, 286 In the implementation there needs to be additional initial configuration input identifying the datasets for given data categories, 383 287 which will be the result of the previous step. 
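The initial configuration input just mentioned can be pictured as a simple map from data category to the vocabulary sources to consult; a hypothetical sketch (all identifiers below are invented for illustration):

```python
# Hypothetical configuration: which vocabulary sources to consult for which
# data category. The identifiers are illustrative only, not the actual
# ISOcat/OpenSKOS configuration.

DATCAT_VOCABULARIES = {
    "isocat:DC-2979": ["openskos:clavas-organisations"],   # organization
    "isocat:DC-2484": ["openskos:clavas-language-names"],  # language name
}

def vocabularies_for(datcat, config=DATCAT_VOCABULARIES):
    """Return the vocabulary sources for a data category; an empty list
    means the field value stays an uninterpreted string."""
    return config.get(datcat, [])

print(vocabularies_for("isocat:DC-2979"))  # ['openskos:clavas-organisations']
print(vocabularies_for("isocat:DC-9999"))  # []
```

Restricting each mapping run to the vocabularies registered for the data category at hand is what keeps the precision of the lookup high.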
384 288 … … 409 313 \label{sec:lod} 410 314 411 412 With the new enhanced dataset, as detailed in section \ref{sec:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility for ontology-driven or at least `semantic resources assisted' exploration of the dataset. 413 414 Namely to enhance it by employing ontological resources. 415 Mainly this enhancement shall mean, that the user can access the data indirectly by browsing one or multiple ontologies, with which the data will then be linked. These could be for example ontologies of Organizations and Projects. 416 417 418 SPARQL 419 420 rechercheisidore, dbpedia, ... 421 422 423 \cite{Europeana RDF Store Report} 424 425 Technical aspects (RDF-store?): Virtuoso 426 427 428 semantic search component in the Linked Media Framework 429 430 \todoin{check SARQ}\furl{http://github.com/castagna/SARQ} 431 432 433 %\section {Full semantic search - concept-based + ontology-driven ?} 434 %\label{semantic-search} 435 315 With the new enhanced dataset, as detailed in section \ref{sec:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility of exploring the dataset using external semantic resources. 316 The user can access the data indirectly by browsing external vocabularies/taxonomies with which the data will be linked, like vocabularies of organizations or taxonomies of resource types. 317 318 The technical base for a semantic web application is usually an RDF triple store, as discussed in \ref{semweb-tech}. 319 Given that our main concern is the data itself, its processing and display, we want to rely on a stable, robust, feature-rich solution minimizing the effort to provide the data online. The most promising solution seems to be \xne{Virtuoso}, an integrated, feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store''). 
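To make concrete what such a store does, the following toy matcher implements the core operation behind SPARQL -- matching a triple pattern containing variables against a set of triples. In a real deployment this is delegated to the triple store (e.g. \xne{Virtuoso}); the data below is invented:

```python
# Toy SPARQL-style triple pattern matching over an in-memory list of
# triples -- a stand-in for a real triple store; all data is invented.

TRIPLES = [
    ("<lr1>", "cmd:contains", "_:org1"),
    ("_:org1", "a", "clavas:Organisation"),
    ("_:org1", "cmd:OrganisationValue", "<org42>"),
    ("<org42>", "skos:prefLabel", "Max Planck Institute"),
]

def match(pattern, triples=TRIPLES):
    """Yield variable bindings for one triple pattern; '?x' marks a variable."""
    for triple in triples:
        binding = {}
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                binding[p] = t
            elif p != t:
                break
        else:
            yield binding

# Which entity holds the organisation value of _:org1?
print(list(match(("_:org1", "cmd:OrganisationValue", "?org"))))
# [{'?org': '<org42>'}]
```

A full query engine joins the bindings of several such patterns; the point here is only that once the CMD data is expressed as triples, this uniform lookup works across all profiles alike.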
320 321 322 Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data via dereferenceable URIs, in practice it is mostly necessary, for performance reasons, to pool linked datasets from different sources that shall be queried together into one data store. This implies that the data to be kept by the data store will be decisively larger than ``just'' the original dataset. 436 323 437 324 \section{Summary} 438 325 439 %The task can be also seen as building bridge between XML resources and semantic resources expressed in RDF, OWL. 440 441 The process of expressing the whole of the data as one semantic resource, can be also understood as schema or ontology merging task. Data categories being the primary mapping elements 442 443 444 In this chapter, an expression of the whole of the CMD data domain into RDF was proposed, with special focus on the way how to translate the string values in metadata fields to corresponding semantic entities. Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic ontology-based data exploration. 445 326 In this chapter, an expression of the whole of the CMD data domain into RDF was proposed, with special focus on the method to translate the string values in metadata fields to corresponding semantic entities. 327 This task can also be seen as building a bridge between the world of XML resources and semantic resources expressed in RDF. 328 Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic ontology-based data exploration. 329 330 %The process of expressing the whole of the data as one semantic resource, can be also understood as schema or ontology merging task. Data categories being the primary mapping elements -
SMC4LRT/chapters/Design_SMCschema.tex
r3680 r3776 12 12 The SMC module is part of the CMD Infrastructure. It is a consumer of data from the production-side registries and serves search services on the exploitation side of the infrastructure, as well as third party applications accessing the joint CLARIN metadata domain. 13 13 14 \begin{figure*} [!ht] 14 \begin{figure*} 15 15 \includegraphics[width=0.8\textwidth]{images/SMC_modules.png} 16 16 \caption{The component view on the SMC - modules and their inter-dependencies} … … 45 45 46 46 \subsection{smcIndex}\label{def:smcIndex} 47 In this section, we describe \code{smcIndex} -- the data type to denote indexes used by the components of the system internally, as well as input and output on the interfaces. 48 49 An \code{smcIndex} is a human-readable string adhering to a specific syntax, denoting some search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix is derived from the way indices are referenced in CQL-syntax\footnote{Contextual Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism), e.g. \concept{dc.title} and b) on the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms, that may not contain whitespaces. 47 In this section, we describe \var{smcIndex} -- the data type to denote indexes used by the components of the system internally, as well as input and output on the interfaces. 48 49 An \var{smcIndex} is a human-readable string adhering to a specific syntax, denoting a search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix is derived from the way indices are referenced in CQL-syntax\footnote{Contextual Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism, cf. 
\ref{cql}), e.g. \concept{dc.title} and b) on the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms that may not contain whitespace. 50 51 \begin{defcap} 52 \caption{Grammar of \code{smcIndex}} 52 \caption{Grammar of \var{smcIndex}} 53 53 \begin{align*} 54 54 smcIndex &::= dcrIndex \ | \ cmdIndex \\ … … 67 67 \end{align*} 68 68 \end{defcap} 69 The grammar distinguishes two main types of \code{smcIndex}: a) \code{dcrIndex} referring to data categories and b) \code{cmdIndex} denoting a specific ``CMD entity'', i.e. an element (metadata field), component or whole profile defined within CMD (cf. \ref{def:CMD} for description of the CMD data model). 70 These two types of \code{smcIndex} follow different construction patterns. 71 \code{cmdIndex} has a recursive path-like structure and can be interpreted as a XPath-expression into the instances of CMD profiles. In contrast to it, \code{dcrIndex} consists of just one-level term and is generally not directly applicable on existing data. It can be understood as abstract index referring to well-defined concepts -- the data categories -- and for actual search it needs to be resolved to the set of CMD elements it is referred by. In return, one can expect to match more metadata fields from multiple profiles, all referring to the same data category. 72 73 It is important to note, that in general -- by design -- \code{smcIndex} can be ambiguous, meaning it can refer to multiple concepts, or CMD entities. This is due to the fact that the labels of the data categories and CMD entities are not guaranteed unique. 69 The grammar distinguishes two main types of \var{smcIndex}: a) \var{dcrIndex} referring to data categories and b) \var{cmdIndex} denoting a specific ``CMD entity'', i.e. an element (metadata field), component or whole profile defined within CMD (cf. 
\ref{def:CMD} for description of the CMD data model). 70 These two types of \var{smcIndex} follow different construction patterns. 71 \var{cmdIndex} has a recursive path-like structure and can be interpreted as an XPath expression into the instances of CMD profiles. In contrast to it, \var{dcrIndex} consists of just a one-level term and is generally not directly applicable on existing data. It can be understood as an abstract index referring to well-defined concepts -- the data categories -- and for actual search it needs to be resolved to the set of CMD elements that refer to it. In return, one can expect to match more metadata fields from multiple profiles, all referring to the same data category. 72 73 It is important to note that in general \var{smcIndex} can be ambiguous, meaning it can refer to multiple concepts or CMD entities. This is due to the fact that the labels of the data categories and CMD entities are not guaranteed unique. 74 74 Although it may seem problematic and undesirable to have an ambiguous reference, this is an intentional design decision. The labels are needed for human-readability and ambiguity can be useful, as long as one is aware of it. 75 75 However, there also needs to be the possibility to refer to data categories or CMD entities unambiguously. Therefore, the syntax also allows to reference indexes by the corresponding identifier. Following are some explanations of the individual constituents of the grammar: 76 76 77 \code{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular, the \xne{dublincore} set of metadata terms. \code{datcatLabel} is the human-readable name of given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories the verbose descriptor \code{mnemonicIdentifier} is used. However despite its name, it is not guaranteed unique. 
Therefore, \code{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist, the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace. 78 79 \code{profile} is reference to a CMD profile. Again, dealing with the ambiguity, it can be either the name of the profile \code{profileName} or its identifier \code{profileId} as issued by the Component Registry (e.g. \code{clarin.eu:cr1:p\_1272022528363} for \concept{LexicalResourceProfile}). Even if a profile is referenced by its identifier it may and should be prefixed by its name to still ensure human-readability. Or, seen the other way round, the name is disambiguated by suffixing it with the identifier: 77 \var{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular, the \xne{dublincore} set of metadata terms. \var{datcatLabel} is the human-readable name of a given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories the verbose descriptor \code{mnemonicIdentifier} is used. However, despite its name, it is not guaranteed unique. Therefore, \var{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist; the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace. 78 79 \var{profile} is a reference to a CMD profile. Again, it can be either the name of the profile \var{profileName} or -- for guaranteed unambiguous reference -- its identifier \var{profileId} as issued by the Component Registry (e.g. \var{clarin.eu:cr1:p\_1272022528363} for \concept{LexicalResourceProfile}). Even if a profile is referenced by its identifier, it may and should be prefixed by its name to still ensure human-readability. 
Or, seen the other way round, the name is disambiguated by suffixing it with the identifier:

\begin{example1}
…
\end{example1}

%\noindent
\var{dotPath} allows addressing a leaf element (\concept{Session.Actor.Role}) or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}) within a metadata description. This makes it easy to express a search over whole components, instead of having to list all individual fields. The paths don't need to start from the root entity (the profile); they can reference any subtree structure. However, longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name}, or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still, this mechanism does not guarantee unique references; it only allows to specify context and thus narrow down the semantic ambiguity.

\subsection{Terms}
\label{datamodel-terms}
Here we describe the XML schema for the internal representation of the processed data.
In abstract terms, the internal format is basically a table with information about indexes, collected from the upstream registries or created during preprocessing. \code{Term} is the main entity; it represents either a label of a data category or a CMD entity (a CMD component or element). \code{Termset} represents a logical collection of \code{Terms} (one profile, or the data categories of one type). \code{Concept} represents a data category and groups all corresponding terms. \code{Relation} is used to express a relation between two \code{Concepts}. In the following, we explain the data model of these entities and their use in more detail. For the full \xne{Terms.xsd} XML schema see listing \ref{lst:terms-schema}.

\subsubsection{Type \code{Term}}
…
\code{Term} is a polymorphic data type that can have different sets of attributes, depending on the type of data it represents.
\begin{table}[h]
\caption{Attributes of \code{Term} when encoding a data category}
\label{table:terms-attributes-datcat}
\begin{tabu}{ p{0.1\textwidth} p{0.4\textwidth} >{\footnotesize}X }
\hline
\rowfont{\itshape\small} attribute & allowed values & sample value\\
\hline
\var{concept-id} & PID given by DCR & \code{isocat:DC-2522} \\
…
\var{type} & one of ['id', 'label', 'mnemonic'] & \code{id}, \code{label}\\
\var{xml:lang} & two-letter language code (only for ISOcat) & \code{en}, \code{si} \\
\hline
\end{tabu}
\end{table}

\begin{table}[h]
\caption{Attributes of \code{Term} when encoding a CMD entity}
\label{table:terms-attributes-cmd}
\begin{tabu}{ p{0.1\textwidth} p{0.4\textwidth} >{\footnotesize}X }
\hline
\rowfont{\itshape\small} attribute & allowed values & sample value\\
\hline
\var{id} & \var{cmdEntityId} as defined in \ref{def:smcIndex} & \code{clarin.eu:cr1:c\_1290431694487\#Url} \\
\var{type} & {\footnotesize \code{CMD\_Element} | \code{CMD\_Component} } & \code{CMD\_Element}\\
\var{datcat} & reference to the data category, URL or \var{dcrIndex} & \code{isocat:DC-2546}\\
\var{name} & name of the component or element & \code{Url} \\
\var{path} & \var{dotPath} (cf. \ref{def:smcIndex}) & \code{SpeechCorpus.Access.Contact.Url} \\
\var{parent} & name of the parent component & \code{Contact} \\
\hline
\end{tabu}
\end{table}

\begin{table}
\caption{Attributes of \code{Term} when encoding a CMD entity in the inverted index}
\label{table:terms-attributes-index}
\begin{tabu}{ p{0.1\textwidth} p{0.4\textwidth} >{\footnotesize}X }
\hline
\rowfont{\itshape\small} attribute & allowed values & sample value\\
\hline
\var{id} & \var{cmdEntityId} cf. \ref{def:smcIndex} & \code{clarin.eu:cr1:c\_1359626292113\#ResourceTitle} \\
\var{set} & denotation of the containing termset & \code{cmd} \\
\var{type} & one of \code{full-path} or \code{min-path} & \code{full-path}\\
\var{schema} & \var{profileID} & \code{clarin.eu:cr1:p\_1357720977520} \\
% \var{concept-id} & id of the corresponding data category & \var{isocat:}\code{DC-2545} \\
\var{node-value} & \var{dotPath} & \code{SpeechCorpus.Access.Contact.Url} \\
\hline
\end{tabu}
\end{table}

%\captionsetup{justification=raggedright, singlelinecheck=false}
\lstset{language=XML}
\begin{lstlisting}[label=lst:terms-attributes-datcat, caption=sample \code{Term} element encoding an ISOcat data category]
<Term concept-id="http://www.isocat.org/datcat/DC-2544" set="isocat"
    type="label" xml:lang="fr">nom de ressource</Term>
\end{lstlisting}

\lstset{language=XML}
\begin{lstlisting}[label=lst:terms-attributes-element, caption=sample \code{Term} element encoding a CMD element]
<Term type="CMD_Element" name="Url" id="clarin.eu:cr1:c_1290431694487#Url"
    parent="Contact" datcat="http://www.isocat.org/datcat/DC-2546"
    path="SpeechCorpus.Access.Contact.Url"/>
\end{lstlisting}

\lstset{language=XML}
\begin{lstlisting}[label=lst:terms-attributes-index, caption=sample \code{Term} element encoding a term in the inverted index]
<Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
    id="clarin.eu:cr1:c_1359626292113#ResourceTitle"
    concept-id="http://www.isocat.org/datcat/DC-2545" >
AnnotatedCorpusProfile.GeneralInfo.ResourceTitle
</Term>
\end{lstlisting}

\subsubsection{Type
\code{Concept}}
\code{Concept} represents a data category; its identifier is the PID issued by the DCR, encoded in the \var{id} attribute.
It groups all terms belonging to a given data category.
The content model is a sequence of \code{Terms} followed by a sequence of \code{info} elements.
Initially, after loading from the DCR, a \code{Concept} contains only \code{Term}s of type \code{id}, \code{mnemonic} and \code{label} (in multiple languages), encoding the corresponding attributes of the data category, followed by \code{info} elements holding the definition (also potentially in different languages). In the inverted index, the \code{Concept} is enriched with the \code{Terms} representing corresponding CMD entities (cf. Listing \ref{lst:dcr-cmd-map}).
\lstset{language=XML}
\begin{lstlisting}[label=lst:concept, caption=sample \code{Concept} element representing the data category \concept{resourceTitle}]
<Concept id="http://www.isocat.org/datcat/DC-2545" type="datcat">
  <Term set="isocat" type="mnemonic">resourceTitle</Term>
  <Term set="isocat" type="id">DC-2545</Term>
  <Term set="isocat" type="label" xml:lang="en">resource title</Term>
  <Term set="isocat" type="label" xml:lang="fi">resurssin otsikko</Term>
  ...
  <info xml:lang="en">The title is the complete title
  of the resource without any abbreviations.</info>
  ...
</Concept>
\end{lstlisting}

%\lstset{language=XML}
%\begin{lstlisting}[label=lst:concept-cmd-term, caption=\code{Term} for CMD element added to %\code{Concept}]
% <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1345561703620"
%   id="clarin.eu:cr1:c_1345561703619#Name">collection.CollectionInfo.Name</Term>
%\end{lstlisting}

\lstset{language=XML}
\begin{lstlisting}[label=lst:dcr-cmd-map, caption=Sample of the inverted index \code{Concept} $\mapsto$ \code{Term}]
<Concept id="http://www.isocat.org/datcat/DC-2545" type="datcat">
  <Term set="isocat" type="mnemonic">resourceTitle</Term>
  <Term set="isocat" type="id">DC-2545</Term>
  <Term set="isocat" type="label" xml:lang="en">resource title</Term>
  <Term set="isocat" type="label" xml:lang="hr">naslov resursa</Term>
  <Term set="isocat" type="label" xml:lang="lv">resursa nosaukums</Term>
  ...
  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
      id="clarin.eu:cr1:c_1359626292113#ResourceTitle">
      AnnotatedCorpusProfile.GeneralInfo.ResourceTitle</Term>
  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1297242111880"
      id="clarin.eu:cr1:c_1271859438123#Title">
      AnnotationTool.GeneralInfo.Title</Term>
  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204"
      id="clarin.eu:cr1:c_1271859438201#Title">
      Session.Title</Term>
  ...
</Concept>
\end{lstlisting}
% <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1274880881885"
%   id="clarin.eu:cr1:c_1274880881884#Title">
%   imdi-corpus.Corpus.Title</Term>

\subsubsection{Type \code{Relation}}
As explained in \ref{def:rr}, the framework allows expressing relations between concepts or data categories. These are maintained in the Relation Registry and fetched from there by the SMC upon initialization. Type \code{Relation} is the internal representation of this information. Its attribute \var{type} indicates the type of the relation as delivered by the RR (currently only \code{sameAs}). The relations of one relation set are enclosed in a \code{Termset} element carrying the identifier of the relation set. The content of a \code{Relation} is a sequence of at least two \code{Concepts}. Currently, it is always exactly two \code{Concepts}, corresponding to the pairs delivered by the RR, but by traversing the equivalence relation, concept clusters (or ``cliques'') containing more than two equivalent concepts could be generated.
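The traversal of the pairwise \code{sameAs} relations into such cliques can be sketched as a simple union-find over the delivered concept pairs. The following Python sketch is illustrative only; the pair data (including the \code{olac:language} identifier) is invented for the example and does not come from the actual Relation Registry.

```python
# Sketch: grouping pairwise sameAs relations (as delivered by the
# Relation Registry) into equivalence cliques of concept identifiers.

def build_cliques(pairs):
    """Union-find over sameAs pairs; returns cliques with >= 2 members."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    cliques = {}
    for node in parent:
        cliques.setdefault(find(node), set()).add(node)
    return [c for c in cliques.values() if len(c) > 1]

# Illustrative pairs -- not actual registry content.
pairs = [
    ("isocat:DC-2484", "dc:language"),
    ("dc:language", "olac:language"),   # hypothetical third concept
    ("isocat:DC-2545", "dc:title"),
]
for clique in build_cliques(pairs):
    print(sorted(clique))
```

With transitive pairs present, the two \concept{language} pairs collapse into one clique of three concepts, while the \concept{title} pair stays a two-element clique.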
% role="about"
\begin{lstlisting}[label=lst:relation, caption=Internal representation of the relation between concepts]
<Relation type="sameAs">
  <Concept type="datcat" id="http://www.isocat.org/datcat/DC-2484"/>
  <Concept type="datcat" id="http://purl.org/dc/elements/1.1/language"/>
</Relation>
\end{lstlisting}

\subsubsection{Type \code{Termsets/Termset}}
\code{Termset} groups a set of terms. (Possible termsets are listed in table \ref{table:cx-list-params}.) It is identified by the \code{@set} attribute.
For example, all French labels of ISOcat data categories form a termset under the identifier \code{isocat-fr}, as do all the full-paths of one profile. The content of a \code{Termset} can optionally begin with an \code{info} element (conveying information as provided by the source registry, like definition, creation date or author), followed by a flat or nested list of \code{Term} elements. Finally, \code{Termsets} is a root element grouping \code{Termset} elements.
\lstset{language=XML}
\begin{lstlisting}[label=lst:termset, caption=\code{Termset} element representing a CMD profile]
<Termset name="AnnotatedCorpusProfile" id="clarin.eu:cr1:p_1357720977520"
    type="CMD_Profile">
  <info>
    <id>clarin.eu:cr1:p_1357720977520</id>
    <description>A CMDI profile for annotated text corpus resources.
    </description>
    <name>AnnotatedCorpusProfile</name>
    <registrationDate>2013-01-31T11:57:12+00:00</registrationDate>
    <creatorName>nalida</creatorName>
    ...
  </info>
  <Term type="CMD_Component" name="GeneralInfo" datcat=""
      id="clarin.eu:cr1:c_1359626292113"
      parent="AnnotatedCorpusProfile"
      path="AnnotatedCorpusProfile.GeneralInfo">
    <Term ...
  </Term>
  ...
</Termset>
\end{lstlisting}

%%%%%%%%%%%%%%%%%%%%%%
…
Consequently, the infrastructure has also foreseen this dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to find \textbf{corresponding fields in different metadata schemas}.
The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and to process it to generate the mappings, or \textbf{crosswalks}, between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications, representing the base for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{sec:qx}).

The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}): well-defined atomic concepts that are supposed to be referenced in schemas by annotating fields, to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm (cf.
\ref{lit:schema-matching}), but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points) instead of a collection of pair-wise links between fields.

\subsection{Interface Specification}
…
In this section, we define the abstract interface of the proposed service, in terms of the input parameters and the output data format.

%\todoin{The two interfaces list and map Full definition in appendix and under link!}

\subsubsection*{Method \var{list}}

Method \var{list} lists the available items for a given context or type. This allows client applications to configure the query input and to provide autocompletion functionality. Table \ref{table:cx-list-params} lists the accepted values for the \var{\$context} parameter and the corresponding types of returned data.
\begin{definition}{URI-pattern of the \var{list} method}\label{def:list-method}
/smc/cx/list/\$context
\end{definition}

\begin{table}
\caption{Allowed values for parameters of the \code{list}-method and corresponding return values}
\label{table:cx-list-params}
\begin{tabu}{ l p{0.7\textwidth} }
\hline
\rowfont{\itshape\small} \$context & returns a list of \\
\hline
\code{*,top} & available termsets \\
\var{\{termset\}} & terms (CMD components and elements) of given termset \\
…
\code{cmd-full-paths} & all complete (starting from the profile) \emph{dotPaths} to CMD components and elements\\
\code{cmd-minimal-paths} & reduced but still unique paths to CMD components and elements \\
\code{relsets} & available relation sets (defined in the Relation Registry) \\
\hline
\end{tabu}
\end{table}

\subsubsection*{Method \var{explain}}
The service also has to deliver additional information about the indexes, like a description and a link to the definition of the entity in the source registry.
\begin{definition}{URI-pattern of the \code{explain} method}\label{def:explain-method}
/smc/cx/explain/\{\$context\} \ [ \ /\{\$term\} \ ] \ [ \ ?format=\$format \ ] \ [ \ ?lang=\$lang \ ]
\end{definition}

\begin{example1}
/smc/cx/explain/cmd/clarin.eu:cr1:p\_1357720977520 \\
/smc/cx/explain/isocat/DC-2506?lang=et,pt
\end{example1}

\lstset{extendedchars=false,
escapeinside='', language=XML}
\begin{lstlisting}[label=lst:sample-explain, caption=Sample output of the \var{explain} function for a data category]
<Concept type="datcat" id="http://www.isocat.org/datcat/DC-2506">
  <Term set="isocat" type="mnemonic">annotationMode</Term>
  <Term set="isocat" type="id">DC-2506</Term>
  <Term set="isocat" type="label" xml:lang="et">m'ä'rgendusviis</Term>
  <Term set="isocat" type="label" xml:lang="pt">modo de anota'çã'o</Term>
  <info xml:lang="et">N'ä'itab, kas ressurss m'ä'rgendati
  k'ä'sitsi v'\~{o}'i automaatselt.</info>
  <info xml:lang="pt">Indica se o recurso foi criado manualmente
  ou por processo autom'á'tico.</info>
</Concept>
\end{lstlisting}

%NO (this will be handled by the service as multilingual labels): or the Name-attribute (in any available translation, e.g. \texttt{numero di telefono@it}) of the data category.
% While it is desirable to also allow the Name-attribute of the data category (\texttt{telephone number}), especially also the Names defined in other working languages (\texttt{numero di telefono@it, numer telefonu@pl}), special care has to be taken here as these attributes mostly contain white spaces, which could cause problems in downstream components, when parsing a complex query containing such indices.
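For illustration, a client call to the \var{explain} method could be sketched as follows. The base URL is a placeholder, not an actual deployment location; only the URL construction reflects the pattern defined above.

```python
# Sketch of a client for the explain method; the base URL is a
# placeholder -- an actual deployment of the service would differ.
from urllib.parse import quote
from urllib.request import urlopen

BASE = "http://localhost:8080/smc/cx"  # hypothetical service location

def build_explain_url(context, term, lang=None):
    """Builds /smc/cx/explain/{$context}/{$term}[?lang=...] URLs."""
    # keep ':' unescaped so identifiers like clarin.eu:cr1:p_... survive
    url = f"{BASE}/explain/{quote(context)}/{quote(term, safe=':')}"
    if lang:
        url += "?lang=" + ",".join(lang)
    return url

def explain(context, term, lang=None):
    """Fetches the XML representation of a term, e.g. a <Concept>."""
    with urlopen(build_explain_url(context, term, lang)) as response:
        return response.read().decode("utf-8")

# explain("isocat", "DC-2506", lang=["et", "pt"]) would return a
# <Concept> element like the one shown in the sample listing above.
```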
\subsubsection*{Method \var{map}}

Method \var{map} performs the actual translations:
it accepts any index (adhering to the \var{smcIndex} datatype, cf. \ref{def:smcIndex}) and returns a list of corresponding indexes.
%it returns list of equivalent terms/smcIndexes for a given term/smcIndex.

\begin{definition}{General function definition}\label{def:map-method-general}
smcIndex \mapsto smcIndex*
\end{definition}

\begin{definition}{URI-pattern of the \var{map} method}
/smc/cx/map/\{\$context\}/\{\$term\} \ [ \ ?format=\{\$format\} \ ] \ [ \ \&relset=\{\$relset\} \ ]
\end{definition}

\noindent
Parameter definition:
\begin{description}
\item[\var{\$context}] identifies the context to search in for the \var{\$term}; primarily this is one of \code{[*, isocat, dc, cmd]}, in extended mode any of the terms listed in table \ref{table:cx-list-params} is accepted
\item[\var{\$term}] \var{smcIndex} term (without the context prefix); the term is used to look up a concept, to deliver the list of equivalent indexes; case-insensitive
\item[\var{\$format}] the desired result format can be indicated explicitly, alternatively to default content negotiation; one of \code{[json, rdf, xml]}; \code{xml} is the default
\item[\var{\$relset}] optional; reference to a relation set to be combined with the identified concept to expand the cluster of matching concepts; allows multiple values from \code{list/relsets}; if multiple sets are listed, they are all applied in the expansion
\end{description}

…
Possible return formats:
\begin{description}
\item[\var{default}] internal XML format with all attributes (\xne{Terms.xsd}, cf. listing \ref{lst:map-output})
\item[\var{schema}] distinct schemas (\code{Termset}) referencing given data category or string
\lstset{language=XML}
\begin{lstlisting}
<Termset schema="clarin.eu:cr1:p_1295178776924" name="serviceDescription"/>
\end{lstlisting}
\item[\var{datcat}] distinct data categories, obtained by grouping the \code{Term@datcat} attribute of the matching terms
\lstset{language=XML}
\begin{lstlisting}
…
  set="isocat" type="datcat">creatorFullName</Term>
\end{lstlisting}
\item[\var{cmdid, id}] distinct CMD entities grouped by \code{@id}
\begin{lstlisting}
<Term type="CMD_Element" name="Name" elem="Name" parent="Session"
…
\end{description}

\noindent
Sample request\\*
…
\lstset{language=XML}
\begin{lstlisting}[label=lst:map-output, caption=Corresponding sample output]
<Termset>
  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1297242111880"
      id="clarin.eu:cr1:c_1271859438123#Title">
      AnnotationTool.GeneralInfo.Title</Term>
  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1288172614014"
      id="clarin.eu:cr1:c_1288172614011#resourceTitle">
      BamdesLexicalResource.BamdesCommonFields.resourceTitle
  </Term>
  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1274880881885"
      id="clarin.eu:cr1:c_1274880881884#Title">
      imdi-corpus.Corpus.Title</Term>
  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204"
      id="clarin.eu:cr1:c_1271859438201#Title">
      Session.Title</Term>
  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1272022528363"
      id="clarin.eu:cr1:c_1271859438123#Title">
      LexicalResourceProfile.LexicalResource.GeneralInfo.Title</Term>
  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1284723009187"
      id="clarin.eu:cr1:c_1271859438123#Title">
      collection.GeneralInfo.Title</Term>
\end{lstlisting}

…

\noindent
the components, meaning that besides the ``atomic'' data category for \concept{actorName}, there would be also a data category for the complex concept \concept{Actor}.445 (3) \emph{container data categories} -- further expansions will be possible once the \emph{container data categories} \cite{SchuurmanWindhouwer2011} will be used.\footnote{Although metadata modellers are encouraged to indicate data categories for both components and elements, this is taking up only slowly and currently only around 14 per cent of the components have a data category specified.} The idea is to set a concept link also for the components, meaning that besides the ``atomic'' data category for \concept{actorName}, there would be also a data category for the complex concept \concept{Actor}. 423 446 Having concept links also on components will require a compositional approach for the mapping function, resulting in: 424 447 \begin{example2} … … 429 452 \subsection{Implementation} 430 453 431 The core functionality of the SMC is implemented as a set of XSL-stylesheets432 433 454 At the core of the described module is a set of XSL-stylesheets, governed by an ant-build file and a configuration file holding the information about individual source registries. 434 435 \todoin{generate and reference XSLT-documentation} 455 The documentation of the XSLT stylesheets and the build process is found in appendix \ref{sec:smc-xsl-docs}. 436 456 437 457 The service is implemented as a RESTful service, however only supporting the GET operation, as it operates on a data set, that the users cannot change directly. (The changes have to be performed in the upstream registries.) … … 440 460 \subsubsection{Initialization} 441 461 \label{smc_init} 442 During initialization the application fetches the information from the source modules (cf. \ref{def:CMDI}) and transforms it into the internal \xne{Terms} format (cf. \ref{datamodel-terms}). 
All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories :
443
444 \begin{definition}{Principal structure of the inverted index}
445 datcat URI \mapsto profile.component.element[]
462 During initialization the application fetches the information from the source modules (cf. \ref{def:CMDI}) and transforms it into the internal \xne{Terms} format (cf. \ref{datamodel-terms}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories (cf. \ref{def:inverted-index}).
463
464 \begin{definition}{Principal structure of the inverted index}\label{def:inverted-index}
465 datcatPID \mapsto profile.component.element*
446 466 \end{definition}
447 467
448 468 The collected data categories are enriched with information from corresponding registries (DCRs), adding the label, the description and available translations into other working languages. %, usable as base for multi-lingual search user-interface.
449
450 469 Finally, relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories.
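The initialization just described can be sketched as follows — a minimal sketch in Python with illustrative data structures (the actual SMC implementation is a set of XSL stylesheets; all function and field names here are assumptions, not the real code):

```python
from collections import defaultdict

def build_inverted_index(profiles):
    """Build the inverted map: datcatPID -> [profile.component.element, ...].
    `profiles` is a list of dicts with a `terms` list of (path, concept_link)
    pairs -- an illustrative stand-in for the cmd-terms dataset."""
    index = defaultdict(list)
    for profile in profiles:
        for path, concept_link in profile["terms"]:
            if concept_link:                       # only elements carrying a concept link
                index[concept_link].append(path)
    return dict(index)

def equivalence_clusters(index, sameas_pairs):
    """Merge the index entries of data categories declared equivalent
    (sameAs pairs from the Relation Registry) into shared clusters."""
    clusters = {pid: set(paths) for pid, paths in index.items()}
    for subj, obj in sameas_pairs:
        merged = clusters.get(subj, set()) | clusters.get(obj, set())
        clusters[subj] = merged
        clusters[obj] = merged
    return clusters
```

The first function corresponds to the \xne{dcr-cmd-map} dataset, the second to the enrichment with \xne{rr-terms}.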
451 470 452 \begin{figure*} [!ht]471 \begin{figure*} 453 472 \includegraphics[width=1\textwidth]{images/smc_init.png} 454 473 \caption{The various stages of the data flow during the initialization} … … 461 480 \item[\xne{cmd-terms}] a flat list of \code{Term} elements representing all components and elements in all known profiles; grouped in \code{Termset} elements representing the profiles 462 481 \item[\xne{cmd-terms-nested}] as above, however the \code{Term} elements are nested reflecting the component structure in the profile 463 \item[\xne{dcr-terms}] a list of \code{Concept} elements representing the data categories with nested \code{Term} elements 482 \item[\xne{dcr-terms}] a list of \code{Concept} elements representing the data categories with nested \code{Term} elements encoding its properties (\code{id, label} 464 483 \item[\xne{dcr-cmd-map}] the main inverted index -- a list of concepts as in \xne{dcr-terms}, but with additional \code{Term} elements included in the \code{Concept} elements representing the CMD components or elements corresponding to given data category (cf. 
listing \ref{lst:dcr-cmd-map}) 465 484 \item[\xne{rr-terms}] Additional index generated based on the relations between data categories as defined in the Relation Registry; the \code{Concept} elements representing the pair of related data categories are wrapped with a \code{Relation} element (with a \code{@type} attribute … … 467 486 468 487 \subsubsection{Operation} 469 For the actual service operation a minimal application has been implemented, that accesses the cached internal datasets and optionally applies XSL -stylesheets for post-processing depending on requested format.470 The application implements the interface as defined in \ref{def:cx-interface} as a XQuery module based on the \xne{restxq} -library within a \xne{eXist} XML-database.488 For the actual service operation a minimal application has been implemented, that accesses the cached internal datasets and optionally applies XSL stylesheets for post-processing depending on requested format. 489 The application implements the interface as defined in \ref{def:cx-interface} as a XQuery module based on the \xne{restxq} library within an \xne{eXist} XML database. 471 490 472 491 \subsection{Extensions} … … 474 493 Once there will be overlapping\footnote{i.e. different relations may be defined for one data category in different relation sets} user-defined relation sets in the Relation Registry an additional input parameter will be required to \emph{explicitly restrict the selection of relation sets} to apply in the mapping function. 
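The proposed restriction parameter could look as follows in a minimal Python sketch (purely illustrative: the actual service is an XQuery module, and none of these names are its real API):

```python
def map_datcat(datcat, dcr_cmd_map, relations, relation_sets=None):
    """Return the CMD paths for `datcat`, expanded over data categories
    declared equivalent in the selected relation sets.
    `relations` is a list of (set_name, subject, object) sameAs pairs."""
    related = {datcat}
    changed = True
    while changed:                      # transitive closure over the sameAs pairs
        changed = False
        for set_name, subj, obj in relations:
            if relation_sets is not None and set_name not in relation_sets:
                continue                # explicit restriction of relation sets
            if subj in related and obj not in related:
                related.add(obj); changed = True
            elif obj in related and subj not in related:
                related.add(subj); changed = True
    paths = []
    for dc in sorted(related):
        paths.extend(dcr_cmd_map.get(dc, []))
    return paths
```

Passing `relation_sets=None` applies all known sets; an empty list disables the expansion entirely.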
475 494
476 Also, use of \emph{other than equivalenc y} relations will necessitate more complex logic in the query expansion and accordingly also more complex response of the crosswalk service, either returning the relation types themselves as well or equip the list of indexes with some kind of similarity ratio.
495 Also, use of \emph{other than equivalence} relations will necessitate more complex logic in the query expansion and accordingly also a more complex response of the crosswalk service, either returning the relation types themselves as well or equipping the list of indexes with some kind of similarity ratio.
477 496
478 497 \section{qx -- concept-based search}
479 498 \label{sec:qx}
480 499 To recall, the main goal of this work is to enhance the search capabilities of the search engines serving the metadata.
481 In this section we want to explore ,how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
500 In this section we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
482 501
483 502 The emphasis lies on the query language and the corresponding query input interface. A crucial aspect is the question of how to integrate the additional processing, i.e. how to deal with the even greater amount of information in a user-friendly way without overwhelming the user, while still being verbose about the applied processing on demand for the user to understand how the result came about and, even more important, to allow the user to manipulate the processing easily.
484 503
485 Note, that \emph{query expansion} yet needs to distinguished from \emph{query translation}, a task to express input query in another query language (e.g. CQL query expressed as XPath).
486
487 Note, also that this chapter deals only with the schema-level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms.
The corresponding instance level is tackled in \ref{semantic-search}.
504 Note, that this chapter deals only with the schema level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The instance level is dealt with in \ref{semantic-search}.
505
506 Note, also that \emph{query expansion} yet needs to be distinguished from \emph{query translation}, a task to express input query in another query language (e.g. CQL query expressed as XPath).
488 507
489 508 \subsection{Query language}
509 \label{cql}
490 510 As base query language to build upon the \emph{Context Query Language} (CQL) is used, a well-established standard, designed with extensibility in mind.
511 CQL is the query language defined as part of \xne{SRU/CQL} -- the communication protocol introduced by the Library of Congress. SRU is a simplified, XML- and HTTP-based successor to Z39.50 \cite{Lynch1991}, which is very widespread in the library networks.
512 It was introduced in 2002 \cite{Morgan04}. The maintenance of SRU/CQL has been
513 transferred from LoC to OASIS in 2012, and OASIS released a first version of the protocol as Committee Specification in April 2012 \cite{OASIS2012sru}.
514
515 Coming from the libraries world, the protocol has a certain bias in favor of bibliographic metadata.
516 However, the protocol is defined in a very generic way, with a strong focus on extensibility.
517 It is equally suitable for content search.
518 \begin{comment}
519 The protocol part (SRU) defines three major operations:
520 1) \emph{explain}: in which the target repository announces its particular configuration (e.g. available indices),
521 2) \emph{scan}: informing about terms available in/for given index, and
522 3) \emph{searchRetrieve}: returning a search result based on a CQL query.
523 \end{comment}
524
525 The query language part (CQL - Context Query Language) defines a relatively complex and complete query language.
526 The decisive feature of the query language is its inherent extensibility allowing to define own indexes and operators. 527 In particular, CQL introduces so-called \emph{context sets} -- a kind of application profiles that allow to define new indexes or even comparison operators in own namespaces. This feature can be employed to integrate the dynamic indexes adhering to the \var{smcIndex} syntax as proposed in \ref{def:smcIndex}. 528 529 The SRU/CQL protocol has also been adopted by the CLARIN community as base for a protocol for federated content search\furl{http://clarin.eu/fcs} (FCS) \cite{stehouwer2012fcs}, which is another argument to use this protocol for metadata search as well, given the inherent interrelation between metadata and content search. 491 530 492 531 \subsection{Query Expansion} … … 501 540 \end{example1} 502 541 503 \noindent 542 %\begin{note} 504 543 Alternatively to the -- potentially costly -- on the fly expansion, the concept-based equivalence clusters could be applied already during the indexing of the data. That means that ``virtual'' search indexes are defined for individual data categories in which values from all metadata fields annotated with given data category are indexed. Indeed, this approach is already being applied in the search applications VLO and Meertens Institute Search Engine (cf. \ref{cmdi_exploitation}). 544 %\end{note} 505 545 506 546 \subsection{SMC as module for Metadata Repository} … … 508 548 As a concrete proof of concept the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI providing all the metadata records harvested within the CLARIN joint metadata domain (cf. \ref{cmdi_exploitation}). 509 549 510 Metadata repository is implemented in xquery running within the eXist XML-database as a web application. 
511
512 There is also a XQuery implementation, that is integrated as a module of the SADE/cr-xq - eXist-based web application framework for publishing resources, on which the Metadata Repository is running.
513
514
515 \begin{figure*}[!ht]
550 Metadata repository itself is implemented as a custom project within \xne{cr-xq}, a generic web application developed in XQuery running within the eXist XML-database. \xne{cr-xq} is developed by the author as part of a larger publication framework \xne{corpus\_shell}. As can be seen in figure \ref{fig:modules-mdrepo}, within \xne{cr-xq} the crosswalk service -- implemented as the \xne{smc-xq} module -- is used by the search module \xne{fcs}, which is in turn used by the \xne{query\_input} module, that provides a user interface widget for formulating the query.
551
552 \begin{figure*}
553 \begin{center}
516 554 \includegraphics[width=0.8\textwidth]{images/modules_mdrepo-smc.png}
517 555 \caption{The component view on the SMC - modules and their inter-dependencies}
518 556 \label{fig:modules-mdrepo}
557 \end{center}
519 558 \end{figure*}
520 559
…
522 561 \subsection{User Interface}
523 562
524 A starting point for our considerations is the traditional structure found in many ( advanced) search interface, which is basically a an array of index - term pairs, or in more advanced alternatives: tuples of index, comparison operator, term and boolean operator:
563 A starting point for our considerations is the traditional structure found in many (``advanced'') search interfaces, which is basically an array of tuples of index, comparison operator and term, combined by boolean operators. This is reflected in the CQL syntax with the basic \var{searchClause} and the boolean operators to formulate more complex queries.
525 564 \begin{definition}{Generic data format for structured queries} 526 [ < index, operation, term, boolean > ]565 < index, operation, term, boolean >+ 527 566 \end{definition} 528 567 529 \noindent530 This maps trivially to the main clause of the CQL syntax, the \var{searchClause} \ref{def:searchClause}.531 568 % {Basic clause of the CQL syntax} 532 \begin{definition}{The main clause of the CQL syntax, the \code{searchClause}}569 \begin{definition}{The basic \code{searchClause} of the CQL syntax} 533 570 \label{def:searchClause} 534 571 searchClause \ ::= \ index \ relation \ searchTerm 535 572 \end{definition} 536 573 537 \noindent 538 An alternative would be a smart parsing input field with contextual autocomplete. Though such a widget would still share the underlying data model. 539 540 \begin{figure*}[!ht] 574 \begin{figure*} 575 \begin{center} 541 576 \includegraphics[width=0.8\textwidth]{images/query_input_autocomplete_term.png} 542 577 \caption{A proposed query input interface offering concepts as search indexes} 543 578 \label{fig:query_input} 579 \end{center} 544 580 \end{figure*} 545 581 546 582 \noindent 547 583 Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions. 548 549 A fundementally different approach is the "content first" paradigm, that, similiar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly, when the user starts typing any string. The difference is, that the suggestions are typed, so that the user is informed from which index given term comes (\concept{person}, \concept{place}, etc.) 550 551 Although we concentrate on query input, the use of indexes has to be consistent across, be it in labeling the fields of the results, or when providing facets to drill down the search. 
552 553
584 Although we concentrate on query input, the use of indexes has to be consistent across the user interface, be it in labeling the fields of the results, or when providing facets to drill down the search.
585
586 A fundamentally different approach is the "content first" paradigm, that, similar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly, when the user starts typing any string. The difference is, that the suggestions are typed, so that the user is informed from which index a given term comes (\concept{person}, \concept{place}, etc.)
587
588 Combining the two approaches, we could arrive at a ``smart'' widget: an input field with on the fly query parsing and contextual autocomplete. Though even such a widget would still share the underlying data model of \xne{CQL} in combination with \var{smcIndexes}.
589
590
591 %%%%%%%%%%%%%%%%%%%%%%%%%%
554 592 \section{SMC Browser}
555 593 \label{smc-browser}
…
597 635 \includegraphics[width=1\textwidth]{images/smc-browser_UIsketch.png}
598 636 \end{center}
599 \caption{A sketch of a possible layout for the SMC Browser -- individual parts of the user interface }
637 \caption{A sketch of a possible layout for the SMC Browser -- individual parts of the user interface and the update dependencies}
600 638 \label{fig:smc-browser_sketch}
601 639 \end{figure*}
602 640
603 641 Prospective parts of the application layout (cf.
figure \ref{fig:smc-browser_sketch}): 605 643 \begin{description} 606 \item[index pane l] list of all available nodes (profiles, components, elements, data categories); allows to select nodes to be displayed in the graph pane644 \item[index pane] list of all available nodes (profiles, components, elements, data categories); allows to select nodes to be displayed in the graph pane 607 645 \item[main graph pane] displays the selected subgraph, needs as much space as possible 608 646 \item[graph navigation bar] for manipulation of the displayed graph by various means 609 647 \item[detail view] displaying definition and statistical information for selected nodes 610 648 \item[statistics] a separate view on the data listing the statistical information for whole dataset in tables 649 \item[notifications] a widget to provide feedback about the system status to the user 611 650 \end{description} 612 651 … … 634 673 \item[profiles + datcats + datcats + groups + rr] 635 674 as above but again with profile-groups and relations 636 \item[ only profiles]675 \item[profiles similarity] 637 676 just profiles with links between them representing the degree of similarity based on the reuse of components and data categories 638 677 \end{description} … … 692 731 693 732 %%%%%%%%%%%%%%%%%%%%%%%%% 694 \section{Application of Schema Matchingtechniques in SMC}733 \section{Application of \emph{schema matching} techniques in SMC} 695 734 \label{sec:schema-matching-app} 696 735 697 736 Even though the described module is about ``semantic mapping'', until now we did not directly make use of the traditional ontology/schema mapping/alignment methods and tools as summarized in \ref{lit:schema-matching}. 
This is due
698 to the fact that thein this work we can harness the mechanisms of the semantic interoperability layer built into the core of the CMD Infrastructure, which integrates the task of identifying semantic correspondences directly into the process of schema creation,
737 to the fact that in this work we can harness the mechanisms of the semantic interoperability layer built into the core of the CMD Infrastructure, which integrates the task of identifying semantic correspondences directly into the process of schema creation,
699 738 to a high degree obsoleting the need for a posteriori complex schema matching/mapping techniques.
700 Or put in terms of the schema matching methodology, the system relies on explicit ely set concept equivalences as base for mapping between schema entities. By referencing a data category in a CMD element, the modeller binds this element to a concept, making two elements linked to the same data category trivially equivalent.
739 Or put in terms of the schema matching methodology, the system relies on explicitly set concept equivalences as the basis for mapping between schema entities. By referencing a data category in a CMD element, the modeller binds this element to a concept, making two elements linked to the same data category trivially equivalent.
701 740
702 741 However this only holds for schemas already created within the CMD framework (and even for these only to a certain degree, as will be explained later). Given the growing universe of definitions (data categories and components) in the CMD framework the metadata modeller could very well profit from applying schema mapping techniques as a pre-processing step in the task of integrating existing external schemas into the infrastructure. (User involvement is identified by \cite{shvaiko2012ontology} as one of the promising future challenges to ontology matching.)
Already now, we witness a growing proliferation of components in the Component Registry and of data categories in the Data Category Registry.
703 742
704 743 Let us restate the problem of integrating existing external schemas as an application of the \var{schema matching} method:
705 The data modeller starts off with existing schema \var{$S_{x}$}. The system accomodates a set of schemas\footnote{ We talk of schema even though the creation (and also remodelling) takes place in the component registry by creating CMD profiles and components, because every profile has an unambiguous expression inXML Schema.} \var{$S_{1..n}$}.
706 It is very unprobable, that there is a \var{$S_{y} \in S_{1..n}$} that fully matches \var{$S_{x}$}.
707 Given the heterogen ity of the schemas present in the field of research, full alignments are not achievable at all.
744 The data modeller starts off with an existing schema \var{$S_{x}$}. The system accommodates a set of schemas\footnote{Even though within CMDI the data models are called `profiles', we can still refer to them as `schema', because every profile has an unambiguous expression in an XML Schema.} \var{$S_{1..n}$}.
745 It is very improbable, that there is a \var{$S_{y} \in S_{1..n}$} that fully matches \var{$S_{x}$}.
746 Given the heterogeneity of the schemas present in the field of research, full alignments are not achievable at all.
708 747 However thanks to the compositional nature of the CMD data model, the data modeller can reuse just parts of any of the schemas -- the
709 748 components \var{c}. Thus the task is to find for every entity $e_{x} \in S_{x}$ the set of semantically equivalent candidate components $\{c_{y}\}$, which corresponds to the definition of the mapping function for single entities as defined in \cite{EhrigSure2004}.
710 Given, that the modeller does not have to reuse the components as they are, but can use existing components as base to create his own, she is helped even with candidates that are not equivalent, thus we can further relax the task and allow even candidates that are just similar to a certain degree, that can be operationalized as threshold $t$ on the output of the \var{similarity} function
749 Given, that the modeller does not have to reuse the components as they are, but can use existing components as a basis to create their own, even candidates that are not equivalent can be of interest, thus we can further relax the task and allow even candidates that are just similar to a certain degree (operationalized as a threshold $t$ on the output of the \var{similarity} function).
711 750 Being only a pre-processing step meant to provide suggestions to the human modeller implies a higher importance of recall than of precision.
712 751
…
723 762
724 763 Next to the usual features and measures that can be applied like label equality or string-similarity and structural equality,
725 the mapping function could be enriched with \emph{extensional} features based on the concept clusters as delivered by the crosswalk service \ref{sec:cx}.
726
727 It would be worthwhile to test, in how far the \var{smcIndex} paths as defined in \ref{def:smcIndex} could be used as feature.
728 longest matching subpath.
729
764 the mapping function could be enriched with \emph{extensional} features based on the concept clusters as delivered by the crosswalk service (cf. \ref{sec:cx}). It would also be worthwhile to test to what extent the \var{smcIndex} paths as defined in \ref{def:smcIndex} could be used as a feature (compute the longest matching subpath).
730 765
731 766 Although we exemplified on the case of integration of an external schema, the described approach could be applied also to the schemas already integrated in the system.
Although there is already a high baseline given thanks to the mechanisms of reuse of components and data categories, there certainly still exist semantic proximities that are not explicitly expressed by these mechanisms. This deficiency is rooted in the collaborative creation of the CMD components and profiles, where individual modellers overlooked, deliberately ignored or only partially reused existing components or profiles. This can be seen in the case of multiple teiHeader profiles, that though they are modelling the same existing metadata format, are completely disconnected in terms of components and data category reuse (cf. \ref{results:tei}).
732 767
733 Note, that in the case of reuse of components, in the normal scenario, the semantic equivalenc y is ensured even though the new component (and all its subcomponents) is a copy of the old one with new identity, because the references to data categories are copied as well, thus by default the new component shares all data categories with the original one and the modeller has to deliberately change them if required. But even with reuse of components scenarios are thinkable, in which the semantic linking gets broken, or is not established, even though semantic equivalency pervails.
768 Note, that in the case of reuse of components, in the normal scenario, the semantic equivalence is ensured even though the new component (and all its subcomponents) is a copy of the old one with a new identity, because the references to data categories are copied as well. Thus, by default, the new component shares all data categories with the original one and the modeller has to deliberately change them if required. But even with reuse of components, scenarios are thinkable in which the semantic linking gets broken, or is not established, even though semantic equivalence prevails.
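The relaxed candidate retrieval described above — accepting any component whose similarity to an entity exceeds a threshold $t$, favouring recall over precision — could be sketched like this (a toy illustration only; the concrete \var{similarity} measure is left open in the text, and the label/data-category features used here are assumptions):

```python
import difflib

def similarity(entity, component):
    """Toy similarity: string similarity of the labels, plus a bonus if
    both are linked to the same data category (an extensional feature)."""
    label_sim = difflib.SequenceMatcher(None, entity["label"].lower(),
                                        component["label"].lower()).ratio()
    same_datcat = entity.get("datcat") and entity.get("datcat") == component.get("datcat")
    return min(1.0, label_sim + (0.3 if same_datcat else 0.0))

def candidates(entity, components, t=0.6):
    """Return (score, label) pairs for components scoring at least t,
    best candidates first -- suggestions for the human modeller."""
    scored = [(similarity(entity, c), c["label"]) for c in components]
    return sorted([(s, lbl) for s, lbl in scored if s >= t], reverse=True)
```

Lowering `t` trades precision for recall, which matches the stated preference for a suggestion-oriented pre-processing step.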
734 769
735 770 The question is, what to do with the new correspondences that would possibly be determined, when, as proposed, we would apply the schema matching on the integrated schemas. One possibility is to add a data category, if one of the pair is still missing one.
736 771 However if both are already linked to a data category, the data category pair could be added to a relation set in the Relation Registry (cf. \ref{def:rr}).
737 772
738 Once all the equivalencies (and other relations) between the profiles/schemas were found, simliarity ratios can be determined.
739 This new simliarity ratios could be applied as alternative weights in the just-profiles graph \ref{sec:smc-cloud}.
740
741 In contrast to the task described here, that -- restricted matching XML schemas -- can be seen as staying in the ``XML World'',
742 another aspect within this work is clearly situated in the Semantic Web world and requires application of ontology matching methods, the mapping of field values to semantic entities described in \ref{sec:values2entities}.
743
773 Once all the equivalences (and other relations) between the profiles/schemas were found, similarity ratios can be determined.
774 These new similarity ratios could be applied as alternative weights in the profiles-similarity graph (cf. \ref{sec:smc-cloud}).
775
776 In contrast to the task described here, that -- restricted to matching XML schemas -- can be seen as staying in the ``XML World'',
777 another aspect within this work is clearly situated in the Semantic Web domain and requires application of ontology matching methods -- the mapping of field values to semantic entities described in \ref{sec:values2entities}.
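One simple way to operationalize such a similarity ratio between two profiles — sketched here, as an assumption, as the Jaccard overlap of the data categories they reference — would be:

```python
def profile_similarity(datcats1, datcats2):
    """Jaccard overlap of the sets of data categories referenced by two
    profiles -- one possible operationalization of the similarity ratio
    used as edge weight in the profiles-similarity graph."""
    d1, d2 = set(datcats1), set(datcats2)
    union = d1 | d2
    return len(d1 & d2) / len(union) if union else 0.0
```

The same measure could be computed over shared components instead of (or in addition to) shared data categories.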
744 778 745 779 %This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}. 780 746 781 747 782 -
SMC4LRT/chapters/Infrastructure.tex
r3671 r3776 99 99
100 100 As dictated by the CMD model, all components needed for the modelled resource description are compiled into one profile.
101 Once a profile is finished, the Component Registry provides automatically the corresponding XML schema in the \code{cmd} target namespace \code{http://www.clarin.eu/cmd}, that can be used as base for creating and validating metadata records.
101 Once a profile is created, the Component Registry automatically provides the corresponding XML schema that can be used as a basis for creating and validating metadata records in the \code{cmd} namespace \code{http://www.clarin.eu/cmd}.
102 102
103 103 \subsubsection*{Ontological Relations -- Relation Registry}
…
110 110
111 111 There is a prototypical implementation of such a relation registry called \xne{RELcat} being developed at MPI, Nijmegen \cite{Windhouwer2011,SchuurmanWindhouwer2011}, that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
112 This implementation stores the individual relations as RDF triples
113
114 \begin{example3}
115 subjectDatcat & relationPredicate & objectDatcat
116 \end{example3}
117
118 allowing typed relations, like equivalency (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently. The relations are deliberately defined in a separate namespace, instead of reusing existing ones (\code{skos:exactMatch, owl:sameAs}) with the aim to avoid introducing too specific semantics. These relations can be mapped to appropriate other predicates when integrating the relation sets in concrete applications.
112 This implementation stores the individual relations as RDF triples allowing typed relations, like equivalency (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}).
The relations are grouped into relation sets that can be used independently. The relations are deliberately defined in a separate namespace, instead of reusing existing ones (\code{skos:exactMatch, owl:sameAs}) with the aim to avoid introducing too specific semantics. These relations can be mapped to appropriate other predicates when integrating the relation sets in concrete applications. 113 114 \begin{definition}{The relation triples as stored by the Relation Registry} 115 \textless \ subjectDatcat \ relationPredicate \ objectDatcat \textgreater 116 \end{definition} 119 117 120 118 \subsection{Further parts of the infrastructure} … … 142 140 143 141 144 \subsection{CMDI - Exploitation side}142 \subsection{CMDI exploitation side} 145 143 \label{cmdi_exploitation} 146 144 Metadata complying with the CMD data model is being created by a growing number of institutions by various means -- automatic transformation from legacy data or authoring of new metadata records with the help of one of the metadata editors (cf. \ref{md-editors}). The CMD infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints. These are being collected daily by a dedicated CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/}. The harvested data is validated against the corresponding schemas (every profile implies a separate schema). In the future a subsequent normalization step will play a bigger role, currently only minimal ad-hoc label normalization is performed for a few organization names. Finally, the data is made (publicly) available as compressed archive files. These are being fetched by the exploitation side applications, that ingest the metadata records, index them and make them available for searching and browsing (cf. figure \ref{fig:cmd-ingestion}). 
… … 285 283 \lstset{language=XML} 286 284 \begin{lstlisting} 287 288 289 290 291 285 <dcif:conceptualDomain type="constrained"> 286 <dcif:dataType>string</dcif:dataType> 287 <dcif:ruleType>XML Schema regular expression</dcif:ruleType> 288 <dcif:rule>[a-z]{3}</dcif:rule> 289 </dcif:conceptualDomain> 292 290 \end{lstlisting} 293 291 … … 295 293 296 294 \begin{lstlisting} 297 295 <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/> 298 296 \end{lstlisting} 299 297 … … 319 317 <dcif:ruleType>CLAVAS vocabulary</dcif:ruleType> 320 318 <dcif:rule> 321 <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/> 319 <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" 320 type="closed"/> 322 321 </dcif:rule> 323 322 </dcif:conceptualDomain> … … 359 358 %%%%%%%%%%%%%%%%% 360 359 \section{Other aspects of the infrastructure} 361 While this work concentrates solely on the metadata, it needs to be recognized, that it is only aspect of the infrastructure and its actual purpose the availability of resources. Metadata is a necessary first step to announce and describe the resources. However it is of little value, if the resources themselves are not accessible. Consequently, another pillar of the CLARIN infrastructure are the centres\furl{http://www.clarin.eu/node/3812}: 360 While this work concentrates solely on the metadata, it is important to acknowledge that it is only one aspect of the infrastructure and its actual purpose -- the availability of resources. To announce and describe the resources by metadata is a necessary first step. However it is of little value, if the resources themselves are not accessible. We want to briefly mention at least two other important aspects: content repositories for storing the resources and federated content search for searching in the resources. 
361 362 \subsubsection{CLARIN Centres}
363 One view on the CLARIN infrastructure is that of a network of centres\furl{http://www.clarin.eu/node/3812}:
362 364
363 365 \begin{quotation}
…
368 370 CLARIN also maintains a central registry, the \xne{Centre Registry}\furl{https://centerregistry-clarin.esc.rzg.mpg.de/}, maintaining structured information about every centre, meant as primary entry point into the CLARIN network of centres.
369 371
370 One core service of such centres are the content repositories, systems meant for long-term preservation and publication of research data and resources. A number of centres have been identified that provide Depositing Services\furl{http://clarin.eu/3773}, i.e. allow third parties researchers (not just the home users) to store research data.
371 372 A core service of such centres are the content repositories, systems meant for long-term preservation and online provision of research data and resources. A number of centres have been identified that provide Depositing Services\furl{http://clarin.eu/3773}, i.e. they allow third-party researchers (not just the home users) to store research data.
373 374 \begin{comment}
372 375 In the following a few further well established repositories are mentioned.
373 376
…
379 382 \item[OpenAIRE] - Open Access Infrastructure for Research in Europe \footnote{\url{http://www.openaire.eu/}}
380 383 \end{description}
381 384 \end{comment}
382 385
383 386 \begin{figure*}
…
389 392 \end{figure*}
390 392
391 392 Another aspect of the availability of resources is, that while metadata can be harvested and indexed locally in one repository, this is not possible with the content itself, both due to the size of the data, but mainly due to legal obligations (licenses, copyright), restricting the access to and availability of the resources.
CLARIN's answer to this problem is the task force \emph{Federated Content Search}\furl{http://www.clarin.eu/fcs}\cite{stehouwer2012fcs} aiming at establishing an architecture allowing to search simultaneously (via the aggregator) across a number of resources hosted by different content providers via a harmonized interface adhering to a common protocol. The agreed upon protocol is a compatible extension of the SRU/CQL protocol developed and endorsed by the Library of Congress as the XML- (and web)based successor of the Z39.50. The maintenance of SRU/CQL has been 392 transfered from LoC to OASIS in 2012, and OASIS released a first version of the protocol as Committee Specification in April 2012\cite{OASIS2012sru}.) 393 394 \subsubsection{Federated Content Search} 395 396 Another aspect of the availability of resources is that, while metadata can be harvested and indexed locally in one repository, this is not possible with the content itself, partly due to the size of the data, but mainly due to legal obligations (licenses, copyright) restricting the access to and availability of the resources. CLARIN's answer to this problem is the task force \emph{Federated Content Search}\furl{http://www.clarin.eu/fcs} \cite{stehouwer2012fcs}, aiming at establishing an architecture that allows searching simultaneously (via an aggregator) across a number of resources hosted by different content providers, via a harmonized interface adhering to a common protocol. The agreed-upon protocol is a compatible extension of the SRU/CQL protocol developed and endorsed by the Library of Congress as the XML- (and web-)based successor of Z39.50 \cite{Lynch1991}. 397 398 Note that in practice the line between metadata and content data is not so clear -- usually there is a need to filter by metadata even when searching in content. Therefore, most content search engines also feature some kind of metadata filter.
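To make the harmonized interface tangible: an SRU \emph{searchRetrieve} request is just an HTTP GET carrying a CQL query in a handful of standard parameters. A minimal sketch in Python -- the endpoint URL here is a made-up placeholder, not an actual CLARIN service:

```python
from urllib.parse import urlencode

# Hypothetical SRU endpoint -- a real FCS endpoint would be announced by the
# respective CLARIN centre; this URL is a placeholder for illustration only.
ENDPOINT = "http://repository.example.org/sru"

def sru_search_retrieve(query, start=1, maximum=10, version="1.2"):
    """Build a standard SRU searchRetrieve request URL carrying a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": query,            # the CQL query string
        "startRecord": start,
        "maximumRecords": maximum,
    }
    return ENDPOINT + "?" + urlencode(params)

print(sru_search_retrieve('cql.serverChoice = "Himmel"'))
```

Any SRU-compliant endpoint can be addressed with exactly this request shape, which is what makes the aggregator-based federation feasible.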
Thus it seems reasonable to harmonize the search protocol and query language for metadata and content. This proposition is further elaborated on in \ref{cql}. 394 399 395 400 \section{Summary} -
SMC4LRT/chapters/Introduction.tex
r3665 r3776 109 109 110 110 \section{Structure of the work} 111 The work starts with examining the state of the art work in the two fields language resources and technology and semantic web technologies in chapter \ref{ch:lit}, followed by administrative chapter \ref{ch:def} explaining the abbreviations and formatting conventions used throughout this work. 112 113 In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components of the infrastructure underlying this work. 111 The work starts with examining the state of the art in the two fields of language resources and technology and of semantic web technologies in chapter \ref{ch:lit}. In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components of the infrastructure underlying this work. 114 112 115 113 The main part of the work is found in chapters \ref{ch:design} and \ref{ch:design-instance} laying out the design of the software module and a proposal of how to model the data in RDF, respectively. … … 118 116 The results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future. 119 117 118 The auxiliary material accompanying the work is found in the appendix. After the administrative chapter \ref{ch:def} explaining the abbreviations and formatting conventions used throughout this work, full specifications of the used data models (\ref{ch:data-model-ref}) and data samples (\ref{ch:cmd-sample}) are listed for reference, as well as the developer and user documentation for the technical solution of this work, the SMC module (\ref{ch:smc-docs}). 119 120 120 121 \section{Keywords} 121 122 -
SMC4LRT/chapters/Literature.tex
r3681 r3776 13 13 In recent years, multiple large-scale initiatives have set out to combat the fragmented nature of the language resources landscape in general and the metadata interoperability problems in particular. 14 14 15 \xne{EAGLES/ISLE Meta Data Initiative} (IMDI) \cite{wittenburg2000eagles} 2000 to 2003 proposed a standard for metadata descriptions of Multi-Media/Multi-Modal Language Resources aiming at easing access to Language Resources and thus increases their reusability. 15 \xne{EAGLES/ISLE Meta Data Initiative} (IMDI)\furl{http://www.mpi.nl/imdi/} \cite{wittenburg2000eagles} 2000 to 2003 proposed a standard for metadata descriptions of Multi-Media/Multi-Modal Language Resources aiming at easing access to Language Resources and thus increasing their reusability. 16 16 17 17 \xne{FLaReNet}\furl{http://www.flarenet.eu/} -- Fostering Language Resources Network -- running 2007 to 2010 concentrated rather on ``community and consensus building'', developing a common vision and mapping the field of LRT via a survey. 18 18 19 \xne{CLARIN} -- Common Language Resources and Technology Infrastructure -- large research infrastructure providing sustainable access for scholars in the humanities and social sciences to digital language data, and especially its technical core the Component Metadata Infrastructure (CMDI) -- a comprehensive architecture for harmonized handling of metadata\cite{Broeder2011} -- 19 \xne{CLARIN}\furl{http://clarin.eu} -- Common Language Resources and Technology Infrastructure -- a large research infrastructure providing sustainable access for scholars in the humanities and social sciences to digital language data, and especially its technical core the Component Metadata Infrastructure (CMDI) -- a comprehensive architecture for harmonized handling of metadata\cite{Broeder2011} -- 20 20 are the primary context of this work; therefore, the description of this underlying infrastructure is detailed in a separate chapter \ref{ch:infra}.
21 21 Both above-mentioned projects can be seen as predecessors to CLARIN, the IMDI metadata model being one starting point for the development of CMDI. … … 35 35 \label{lit:digi-lib} 36 36 37 In a broader view we should also regard the activities in the world of libraries. 38 Starting already in 1970's with connecting, exchanging and harmonizing their bibliographic catalogs, they certainly have a long tradition, wealth of experience and stable solutions. 39 40 Mainly driven by national libraries still bigger aggregations of the bibliographic data are being set up. 41 The biggest one being the \xne{Worldcat}\furl{http://www.worldcat.org/} (totalling 273.7 million records \cite{OCLCAnnualReport2012}) 42 powered by OCLC, a cooperative of over 72.000 libraries worldwide. 43 44 In Europe, more recent initiatives have pursuit similar goals: 37 In a broader view we should also regard the activities in the domain of libraries and information sciences (LIS). 38 Starting already in the 1970s with connecting, exchanging and harmonizing their bibliographic catalogs, libraries were early adopters of and a driving force in search federation even before the era of the internet (e.g. the \xne{Linked Systems Project} \cite{Fenly1988}); the LIS community thus has a long tradition, a wealth of experience and robust solutions with respect to metadata aggregation, harmonization and exploitation. 39 %, starting collaborative efforts in mid 70s 40 41 Driven mainly by national libraries, ever bigger aggregations of bibliographic data are being set up. 42 The biggest one is the \xne{Worldcat}\furl{http://www.worldcat.org/} (totalling 273.7 million records \cite{OCLCAnnualReport2012}) powered by OCLC, a cooperative of over 72.000 libraries worldwide.
43 44 In Europe, multiple recent initiatives have pursued similar goals of pooling together the immense wealth of information sheltered in the many libraries: 45 45 \xne{The European Library}\furl{http://www.theeuropeanlibrary.org/tel4/} offers a search interface over more than 18 million digital items and almost 120 million bibliographic records from 48 National Libraries and leading European Research Libraries. 46 46 47 \xne{Europeana}\furl{http://www.europeana.eu/} \cite{purday2009think} is a cultural heritage initiative with even broader scope, serving as ``meta-aggregator and portal for European digitised works'', encompassing material not just from libraries, but
also museums, archives and all other kinds of collections. (In fact, The European Library is the \emph{library aggregator} for Europeana.) 48 49 A large number of projects contribute(d) to \xne{Europeana}. E.g. the auxiliary project \xne{EuropeanaConnect}\furl{http://www.europeanaconnect.eu/} (2009-2011) delivered the core technical components for Europeana as well as further services reusable in other contexts, one of them being the spatio-temporal browser \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo} \cite{janicke2013geotemco}. 50 Most recently, with \xne{Europeana Cloud}\furl{http://pro.europeana.eu/web/europeana-cloud} (2013 to 2015) another initiative in the realm of \xne{Europeana} has been started, a Best Practice Network, coordinated by The European Library, designed to ``establish a cloud-based system for Europeana and its aggregators, providing new content, new metadata, a new linked storage system, new tools and services for researchers and a new platform - Europeana Research''. 51 52 The related catalogs and formats are described in the section \ref{sec:lib-formats}. 53 53 54 54 55 55 \section{Existing crosswalks (services)} 56 56 57 Crosswalks as list of equivalent fields from two schemas have been around already for a long time, in the world of enterprise systems, e.g. to bridge to legacy systems and also in libraries, e.g. \emph{MARC to Dublin Core Crosswalk}\furl{http://loc.gov/marc/marc2dc.html} 58 59 \cite{Day2002crosswalks} lists a number of mappings between metadata formats. 
60 61 Mostly Dublin Core and MARC family of formats 62 63 http://www.loc.gov/marc/dccross.html 64 65 66 static 67 metadata crosswalk repository 68 69 70 OCLC launched \xne{Metadata Schema Transformation Services}\furl{http://www.oclc.org/research/activities/schematrans.html?urlm=160118} 71 in particular \xne{Crosswalk Web Service}\furl{http://www.oclc.org/developer/services/metadata-crosswalk-service} 72 http://www.oclc.org/research/activities/xwalk.html 57 Crosswalks as lists of equivalent fields from two schemas have been around for a long time in the world of enterprise systems, e.g. to bridge to legacy systems, as well as in the LIS domain. \cite{Day2002crosswalks} lists a number of mappings between metadata formats, mostly between Dublin Core and MARC families of formats.\footnote{\url{http://loc.gov/marc/marc2dc.html}, \url{http://www.loc.gov/marc/dccross.html}} 58 59 However, besides being restricted in terms of covered formats, these crosswalks are just static correspondence lists, often available only as documents. One effort that comes nearer to our idea of a service delivering crosswalks dynamically is the \xne{Metadata Crosswalk Service}\footnote{\url{http://www.oclc.org/developer/services/metadata-crosswalk-service}, \url{http://www.oclc.org/research/activities/xwalk.html}, (SOAP based)} offered by OCLC as part of \xne{Metadata Schema Transformation Services}\furl{http://www.oclc.org/research/activities/schematrans.html?urlm=160118}
86 87 For this service, a metadata format is defined as a triple of: 88 89 Standard -- the metadata standard of the record (e.g. MARC, DC, MODS, etc ...) 90 Structure -- the structure of how the metadata is expressed in the record (e.g. XML, RDF, ISO 2709, etc ...) 91 Encoding -- the character encoding of the metadata (e.g. MARC8, UTF-8, Windows 1251, etc ...) 92 93 94 Offered interface!? 95 The Crosswalk Web Service has 4 methods: 96 97 translate(...) - This method translates the records. See the documentation for more information. 98 getSupportedSourceRecordFormats() - This method returns a list of formats that are supported as input formats. 99 getSupportedTargetRecordFormats() - This method returns a list of formats that the input formats can be translated to. 100 getSupportedJavaEncodings() - Some formats will support all of the character encodings that Java supports. This function returns the list of encodings that Java supports. 101 65 Although the website states ``Crosswalk Web Service is now a production system that has been incorporated into OCLC products and services'', the demo service\furl{http://errol.oclc.org/schemaTrans.oclc.org.search} is not accessible. Also, this service only offers crosswalks between formats relevant for the LIS community: \xne{Dublin Core, MARCXML, MARC-2709, MODS}. So, altogether the service does not seem suitable to be used as is for the purposes of this work. But it certainly can serve as inspiration for the specification of the planned service. 66 67 \begin{comment} 68 The Crosswalk Web Service has 4 methods: 69 \begin{description} 70 \item[translate()] This method translates the records. 71 \item[getSupportedSourceRecordFormats()] This method returns a list of formats that are supported as input formats. 72 \item[getSupportedTargetRecordFormats()] This method returns a list of formats that the input formats can be translated to.
73 \item[getSupportedJavaEncodings()] Some formats will support all of the character encodings that Java supports. This function returns the list of encodings that Java supports. 74 \end{description} 75 \end{comment} 102 76 103 77 … … 154 128 This elegant abstraction introduced with the \var{similarity} function provides a general model that can accommodate a broad range of comparison relationships and corresponding similarity measures. And here, again, we encounter a broad range of possible approaches. 155 129 156 \cite{ehrig2004qom} lists a number of basic features and corresponsing similarity measures: 157 Starting from primitive data types, next to value equality, string similarity, edit distance or in general relative distance can be computed. 158 For concepts, next to the directly applicable unambiguous \code{sameAs} statements, label similarity can be determined (again either as string similarity, but also broaded by employing external taxonomies and other semantic resources like WordNet - \emph{extensional} methods), equal (shared) class instances, shared superclasses, subclasses, properties. 159 160 Element-level (terminological) vs structure-level (structural) \cite{Shvaiko2005_classification} 161 162 based on background knowledge... 163 164 subclass--superclass relationships, domains and ranges of properties, analysis of the graph structure of the ontology. 165 166 For properties the degree of the super an subproperties equality, overlapping domain and/or range. 167 Additionally to these measures applicable on individual ontology items, there are approaches (like the \var{Similarity Flooding algorithm} \cite{melnik2002similarity}) to propagate computed similarities across the graph defined by relations between entities (primarily subsumption hierarchy).
130 \cite{ehrig2004qom} lists a number of basic features and corresponding similarity measures, \cite{Shvaiko2005_classification} classifies the features into element-level (terminological), structure-level (structural) and based on background knowledge (extensional): 131 Starting from primitive data types: besides value equality, string similarity, edit distance or in general relative distance can be computed. For concepts, besides the directly applicable unambiguous \code{sameAs} statements, label similarity can be determined (again, either as string similarity, but also by employing external taxonomies and other semantic resources like WordNet -- \emph{extensional} methods), equal (shared) class instances, subclass--superclass relationships, shared properties. For properties, the degree of equality of super- and subproperties, or overlapping domains and/or ranges, can be considered. 132 133 Additionally to these measures applicable on individual ontology items, there are approaches (like the \var{Similarity Flooding algorithm} \cite{melnik2002similarity}) to propagate computed similarities across the graph defined by relations between entities (primarily subsumption hierarchy), or even to analyse and compare the overall graph structure of the ontology. 134 135 \cite{Algergawy2010} classifies, reviews, and experimentally compares major methods of element similarity measures and their combinations. \cite{shvaiko2012ontology}, comparing a number of recent systems, finds that ``semantic and extensional methods are still rarely employed. In fact, most of the approaches are quite often based only on terminological and structural methods.
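To make one of these element-level measures concrete, the following sketch computes the classic edit (Levenshtein) distance between two labels and derives a normalized similarity in $[0,1]$ from it -- one of the simplest building blocks such matchers aggregate. This is an illustration only, not code taken from any of the cited systems:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[len(b)]

def label_similarity(a: str, b: str) -> float:
    """Normalize the edit distance into a similarity score in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

print(edit_distance("organisation", "organization"))              # 1
print(round(label_similarity("organisation", "organization"), 2))  # 0.92
```

In a real matcher such an atomic score would be just one input to the composition or aggregation step discussed below.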
… … 189 155 A number of existing systems for schema/ontology matching/alignment are collected in the above-mentioned overview publications: 190 156 191 IF-Map \cite{kalfoglou2003if}, QOM \cite{ehrig2004qom}, \xne{FOAM} \cite{EhrigSure2005}, Similarity Flooding (SF) \cite{melnik}, S-Match \cite{Giunchiglia2007_semanticmatching}, the Prompt tools \cite{Noy2003_theprompt} integrating with Protégé or \xne{COMA++} \cite{Aumueller2005}, \xne{Chimaera}. Additionally, \cite{shvaiko2012ontology} lists and evaluates some more recent contributions: \xne{SAMBO, Falcon, RiMOM, ASMOV, Anchor-Flood, AgreementMaker}. 157 \xne{IF-Map} \cite{kalfoglou2003if}, \xne{QOM} \cite{ehrig2004qom}, \xne{FOAM} \cite{EhrigSure2005}, \xne{Similarity Flooding (SF)} \cite{melnik}, \xne{S-Match} \cite{Giunchiglia2007_semanticmatching}, the \xne{Prompt} tools \cite{Noy2003_theprompt} integrating with \xne{Protégé} or \xne{COMA++} \cite{Aumueller2005}, \xne{Chimaera}. Additionally, \cite{shvaiko2012ontology} lists and evaluates some more recent contributions: \xne{SAMBO, Falcon, RiMOM, ASMOV, Anchor-Flood, AgreementMaker}. 193 159 All of the tools use multiple methods as described in the previous section, exploiting both element as well as structural features and applying some kind of composition or aggregation of the computed atomic measures, to arrive at an alignment assertion. … … 206 172 207 173 \subsubsection{Semantic Web - Technical solutions / Server applications} 208 209 210 The provision of the produced semantic resources on the web requires technical solutions to store the RDF triples, query them efficiently 211 and idealiter expose them via a web interface to the users. 212 213 Meanwhile a number of RDF triple store solutions relying both on native, DBMS-backed or hybrid persistence layer are available, open-source solutions like \xne{Jena, Sesame} or \xne{BigData} as well as a number of commercial solutions \xne{AllegroGraph, OWLIM, Virtuoso}.
174 \label{semweb-tech} 175 176 The provision of the produced semantic resources on the web requires technical solutions to store the RDF triples, query them efficiently via SPARQL\cite{SPARQL2008} and \textit{idealiter} expose them via a web interface to the users. 177 178 Meanwhile a number of RDF triple store solutions relying on a native, DBMS-backed or hybrid persistence layer are available, open-source solutions like \xne{Jena, Sesame} or \xne{BigData} as well as commercial solutions \xne{AllegroGraph, OWLIM, Virtuoso}. 214 179 215 180 A qualitative and quantitative study\cite{Haslhofer2011europeana} in the context of Europeana evaluated a number of RDF stores (using the whole Europeana EDM data set = 382,629,063 triples as data load) and came to the conclusion that ``certain RDF stores, notably OpenLink Virtuoso and 4Store'' can handle the large test dataset. 216 181 217 182 \xne{OpenLink Virtuoso Universal Server}\furl{http://virtuoso.openlinksw.com} is a hybrid storage solution for a range of data models, including relational data, RDF and XML, and free text documents.\cite{Erling2009Virtuoso, Haslhofer2011europeana} 218 183 Virtuoso is used to host many important Linked Data sets, e.g., DBpedia\furl{http://dbpedia.org} \cite{auer2007dbpedia}. 219 184 Virtuoso is offered under both commercial and open-source license models. 220 185 221 186 Another solution worth examining is the \xne{Linked Media Framework}\furl{http://code.google.com/p/lmf/} -- ``easy-to-setup server application that bundles together three Apache open source projects to offer some advanced services for linked media management'': publishing legacy data as linked data, semantic search by enriching data with content from the Linked Data Cloud, using SKOS thesaurus for information extraction. 222 187 223 One more specific work is that of Noah et.
al \cite{Noah2010} developing a semantic digital library for an academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching. 224 188 One more specific work is that of Noah et al. \cite{Noah2010} developing a semantic digital library for an academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching. Another solution in a related, more specialized domain and already in productive use is \xne{rechercheisidore}\furl{http://rechercheisidore.fr} \cite{pouyllau2011isidore}, a French portal for digital humanities resources. 225 189 226 190 \begin{comment} … 231 195 232 195 Haystack\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)} 197 198 \todoin{check SARQ}\furl{http://github.com/castagna/SARQ} 199 233 200 \end{comment} 234 201 235 202 \subsubsection{Ontology Visualization} 236 203 237 Landscape, Treemap, SOM 238 239 \todoin{check Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf} 204 Complex structured datasets like ontologies require dedicated means for their high-level exploration, like aggregations and interactive visualization techniques. A large variety of solutions has been implemented in the last two decades (cf. overview of the field in \cite{lanzenberger2010ontology}, also for citations of tools listed below). Given the inherent graph structure of the RDF data model, the obvious and most common approach is a tree- or graph-based visualization with concepts being represented as nodes and relations as edges.
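The nodes-and-edges encoding just described is also the shape of input that force-directed graph layouts (e.g. in d3) conventionally consume: a node-link structure. A minimal sketch deriving such a structure from a few concept/relation triples -- the data here is purely illustrative, not taken from any of the discussed datasets:

```python
import json

# Purely illustrative concept graph given as (subject, relation, object) triples.
triples = [
    ("Person", "subClassOf", "Agent"),
    ("Organisation", "subClassOf", "Agent"),
    ("Person", "memberOf", "Organisation"),
]

# Collect each distinct concept as a node and give it a stable index.
nodes = sorted({c for s, _, o in triples for c in (s, o)})
index = {name: i for i, name in enumerate(nodes)}

# Node-link structure: concepts become nodes, relations become typed links.
graph = {
    "nodes": [{"name": n} for n in nodes],
    "links": [{"source": index[s], "target": index[o], "type": r}
              for s, r, o in triples],
}
print(json.dumps(graph, indent=2))
```

Serialized as JSON, such a structure can be handed directly to a web-based, data-driven visualization layer.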
Numerous solutions are realized as plug-ins for the wide-spread open-source ontology editor \xne{Prot\'{e}g\'{e}} \cite{grosso1999protege} developed at Stanford University, like \xne{OntoViz, Jambalaya, TouchGraph, OWLViz, OntoSphere, PromptViz} etc. 205 206 There exists also a sizable number of stand-alone solutions (\xne{Ontorama, FOAFnaut, IsaViz, GKB-Editor} and more) though often bound to a specific dataset or data type (\xne{Wordnet, FOAF, Cyc}). 207 208 There is also plenty of general graph visualization tools, that can be adopted for viewing the RDF data as graph, like the traditional graph layouting tool \xne{GraphViz dot}, or more recently \xne{Gephi} \cite{bastian2009gephi}, a stand-alone interactive tool for graph visualization with a number of layouting algorithms and display options. A rather recent generic visualization javascript library \xne{d3}\footnote{\url{http://d3js.org}} % \cite{bostock2011d3} seems especially appealing thanks to its data-driven paradigm, dedicated support for graphs with integrated customizable graph layouting algorithm and -- being pure javascript -- allowing web-based solutions. 209 210 %Most recently a web-based version of this versatile tool has been released\furl{http://protegewiki.stanford.edu/wiki/WebProtege} that supports collaborative ontology development 211 212 The solutions are rather sparse when it comes to more advanced visualizations, beyond the simple one to one display of the data model graph as a visual graph, especially the visualization of ontology mapping and alignment. Besides \xne{OLA} \cite{euzenat2004ola}, \xne{PromptViz} \cite{Noy2003_theprompt} and \xne{CogZ} \cite{falconer2009cogz} we would like to point out one solution developed at the IFS of the Technical University in Vienna \cite{lanzenberger2006alviz}, \xne{AlViz} -- a tool which visually supports semi-automatic alignment of ontologies. It is implemented as a ``multiple-view plug-in for Protege using J-Trees and Graphs. 
Based on similarity measures of an ontology matching algorithm AlViz helps to assess and optimize the alignment results.'' It applies visual cues like colouring to indicate the computed similarity of concepts between two ontologies and clustering for reducing the complexity of the displayed datasets (cf. figure \ref{fig:alviz}). Unfortunately, the development of this very promising research prototype seems to have stalled, the last available version being from 2009. 213 214 \begin{figure*} 215 \begin{center} 216 \includegraphics[width=0.8\textwidth]{images/AlViz_screenshot.png} 217 \caption{Screenshot of AlViz -- tool for visual exploration of ontology alignment \cite{lanzenberger2006alviz}} 218 \label{fig:alviz} 219 \end{center} 220 \end{figure*} 240 221 241 222 … 243 224 \section{Language and Ontologies} 244 225 245 There are two different relation links betwee language or linguistics and ontologies: a) `linguistic ontologies' domain ontologies conceptualizing the linguistic domain, capturing aspects of linguistic resources; b) `lexicalized' ontologies, where ontology entities are enriched with linguistic, lexical information. 226 There are two different types of links between language or linguistics and ontologies: a) `linguistic ontologies': domain ontologies conceptualizing the linguistic domain, capturing aspects of linguistic resources; b) `lexicalized' ontologies, where ontology entities are enriched with linguistic, lexical information. 246 227 247 228 \subsubsection{Linguistic ontologies} … 270 251 Another indication of the heritage is the fact that concepts of the GOLD ontology were migrated into ISOcat (495 items) in 2010.
271 252 272 Notice that although this work is concerned with language resources, it is primarily on the metadata level, thus the overlap with linguistic ontologies codifying the terminology of the discipline linguisticis rather marginal (perhaps on level of description of specific linguistic aspects of given resources). 253 Notice that although this work is concerned with language resources, it is primarily on the metadata level, thus the overlap with linguistic ontologies codifying the discipline-specific linguistic terminology is rather marginal (perhaps on the level of description of specific linguistic aspects of given resources). 254 274 255 275 \subsubsection{Lexicalised ontologies, ``ontologized'' lexicons} 276 256 277 257 The other type of relation between ontologies and linguistics or language is lexicalised ontologies. Hirst \cite{Hirst2009} elaborates on the differences between ontology and lexicon and the possibility to reuse lexicons for development of ontologies. -
SMC4LRT/chapters/Results.tex
r3681 r3776 31 31 \\ 32 32 33 \url{http://clarin.aac.ac.at/smc} (soon: \url{http://acdh.ac.at/smc}) 33 \url{http://clarin.arz.oeaw.ac.at/smc} (soon: \url{http://acdh.ac.at/smc}) 34 34 35 35 … … 41 41 This interface is available as part of the smc application: 42 42 43 \url{http://clarin.aac.ac.at/smc/cx} 43 \url{http://clarin.arz.oeaw.ac.at/smc/cx} 44 44 45 45 \subsection{SMC - as a module within Metadata Repository} 46 46 The SMC is also integrated as a module within the Metadata Repository, enabling \emph{semantic search} over the joint metadata domain. 47 47 48 \url{http://clarin.aac.ac.at/mdrepo/smc} 48 \url{http://clarin.arz.oeaw.ac.at/mdrepo/} (module not integrated yet) 49 49 50 50 \subsection{SMC Browser -- advanced interactive user interface} … … 52 52 SMC Browser is an advanced web-based visualization application to explore the complex dataset of the \xne{Component Metadata Infrastructure}, by visualizing its structure as an interactive graph. In particular, it enables the metadata modeller to examine the reuse of components or DCs in different profiles. The graph is accompanied by numerical statistics about the dataset as a whole and about individual items (profiles, components, data categories), a set of example results and user documentation. Details about design and implementation can be found in \ref{smc-browser}. The publicly available instance is maintained under: 53 53 54 \url{http://clarin.aac.ac.at/smc/browser} 54 \url{http://clarin.arz.oeaw.ac.at/smc-browser} 55 55 56 56 \begin{figure*} … … 287 287 \begin{figure*}[!ht] 288 288 \begin{center} 289 \includegraphics[width=1\textwidth]{images/just_profiles_6.png} 289 \includegraphics[width=1\textwidth]{images/just_profiles_9.png} 290 290 \end{center} 291 291 \caption{SMC cloud -- graph visualizing the semantic proximity of profiles} -
SMC4LRT/chapters/acknowledgements.tex
r2697 r3776 1 1 \chapter*{Acknowledgements} 2 2 3 I would like to thank all the colleagues from the CLARIN community, for the support, the fruitful discussions and helpful feedback, especially Daan Broeder, Menzo Windhouwer, Marc Kemps-Snijders, Hennie Brugman. 4 3 I would like to thank all the colleagues from my institute and from the CLARIN community for the support, the fruitful discussions and helpful feedback, especially Menzo Windhouwer, Daan Broeder, Dieter Van Uytvanck, Marc Kemps-Snijders and Hennie Brugman. \\ 4 And to all my dear ones, for the extra portion of patience I demanded from them. \\ 6 \\ 5 7 With love to em. -
SMC4LRT/chapters/appendix.tex
r3665 r3776 5 5 6 6 \chapter{Data model reference} 7 \label{ch:data-model-ref} 7 8 In the following, the complete data models and schemas are listed for reference: the diagram of the data model for data category specification in figure~\ref{fig:DCR_data_model}, \xne{Terms.xsd} -- the XML schema used by the SMC module internally in listing~\ref{lst:terms-schema} (cf. \ref{datamodel-terms}) and the \xne{general-component-schema.xsd}\furl{https://infra.clarin.eu/cmd/general-component-schema.xsd} -- the schema representing the CMD meta model for defining CMD profiles and components in listing~\ref{lst:cmd-schema}. Figure \ref{fig:ref_arch} depicts an abstract reference architecture that provides a conceptual frame for this work, and in figure \ref{fig:acdh_context} an overview of the roles and services of the \xne{ACDH -- Austrian Centre for Digital Humanities} -- the home of SMC -- explicates the concrete current situation regarding the architectural context of SMC. … … 37 38 38 39 \chapter{CMD -- sample data} 40 \label{ch:cmd-sample} 39 41 40 42 \section{Definition of a CMD profile} 43 The following listing presents a sample CMD specification for the \concept{collection\#clarin.eu:cr1:p\_1345561703620} profile. 44 45 \input{chapters/collection_spec.xml.tex} 41 46 42 47 \section{CMD record} 48 The following listing represents a sample CMD record -- an instance of the \concept{collection} profile listed above.
49 50 \input{chapters/collection_instance.xml.tex} 43 51 44 52 45 \chapter{SMC Browser -- related material } 53 \chapter{SMC -- documentation} 54 \label{ch:smc-docs} 46 55 56 \begin{figure*} 57 \begin{center} 58 \includegraphics[height=1\textwidth, angle=90]{images/build_init.png} 59 \end{center} 60 \caption{A graphical representation of the dependencies and calls in the main \xne{ant} build file.} 61 \label{fig:smc-build_init} 62 \end{figure*} 47 63 48 \begin{figure*}[!ht] 49 \begin{center} 50 \includegraphics[width=1\textwidth]{images/cmd-deps-graph_part2.png} 51 \end{center} 52 \caption{An early version of a visual representation of (a part of) the \xne{smc-graph} generated with the \code{dot} tool.} 53 \label{fig:cmd-dep-dotgraph} 54 \end{figure*} 64 \section{Documentation of smc-xsl} 65 \label{sec:smc-xsl-docs} 66 \todoin{generate and reference XSLT-documentation} 55 67 56 68 \section{SMC Browser user documentation} … … 62 74 \label{sec:smc-graphs} 63 75 76 \begin{figure*}[h] 77 \begin{center} 78 \includegraphics[width=1\textwidth]{images/cmd-deps-graph_part2.png} 79 \end{center} 80 \caption{An early version of a visual representation of (a part of) the \xne{smc-graph} generated with the \code{dot} tool.} 81 \label{fig:cmd-dep-dotgraph} 82 \end{figure*} 83 84 64 85 \begin{comment} 65 86 66 87 \chapter{SMC Reports} 67 \label{ch:smc-reports}88 %\%label{ch:smc-reports} 68 89 69 90 SMC Reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain based on the visual and numerical output from the SMC Browser \ref{smc-browser}. -
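The sample profile and record above are pulled into the appendix via \input of pre-wrapped .xml.tex files. A minimal sketch of what such a wrapper could look like, assuming the \xne{listings} package is loaded in the preamble (the file name, label and XML content shown here are illustrative, not taken from the repository):

```latex
% Hypothetical wrapper file: chapters/collection_instance.xml.tex
% Wraps the raw XML of the sample record in a captioned, referencable listing.
\begin{lstlisting}[language=XML, label=lst:collection-instance,
    caption={Sample CMD record -- an instance of the collection profile}]
<CMD xmlns="http://www.clarin.eu/cmd/">
  <Header>...</Header>
  <Resources>...</Resources>
  <Components>...</Components>
</CMD>
\end{lstlisting}
```

Keeping each listing in its own wrapper file lets the (possibly regenerated) XML samples be refreshed without touching the appendix chapter itself.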
SMC4LRT/chapters/userdocs_cleaned.tex
r3666 r3776 99 99 The nodes are colour-coded by type: 100 100 101 \includegraphics[height=100px]{ C:/Users/m/3/clarin/_repo/SMC/docs/graph_legend.svg}101 \includegraphics[height=100px]{images/graph_legend.png} 102 102 103 103 \phantomsection\label{select-nodes} -
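The change above replaces a machine-specific absolute path (C:/Users/...) with a repository-relative one. A sketch of how such lookups could be centralized with \cmd{\graphicspath} instead of hard-coding a directory in every call (assuming the images/ directory sits next to the main .tex file):

```latex
% Sketch: resolve all images relative to one configured directory,
% so individual \includegraphics calls stay path-free and portable.
\usepackage{graphicx}
\graphicspath{{images/}}
% ... document body ...
% The file extension can also be left off; the driver picks a suitable one.
\includegraphics[height=100px]{graph_legend}
```

This keeps the document compilable from a fresh checkout on any machine, which is exactly the failure mode the committed fix addresses.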
SMC4LRT/images/Terms.xsd.tex

r3640 r3776 2 2 \begin{lstlisting}[label=lst:terms-schema, caption=Terms.xsd -- schema of the internal data model \ref{datamodel-terms}] 3 3 <?xml version="1.0" encoding="UTF-8"?> 4 <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" xmlns:ns2="http://www.w3.org/1999/xlink"> 4 <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 5 elementFormDefault="qualified" xmlns:ns2="http://www.w3.org/1999/xlink"> 5 6 <xs:import namespace="http://www.w3.org/1999/xlink" schemaLocation="ns2.xsd"/> 6 7 <xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="xml.xsd"/> -
SMC4LRT/thesis.tex
r3666 r3776 28 28 \thesisverfassung{Matej \v{D}ur\v{c}o} % Verfasser 29 29 \thesisauthor{Matej \v{D}ur\v{c}o} % your name 30 \thesisauthoraddress{ Viktorgasse 8/6, 1040 Wien} % your address30 \thesisauthoraddress{Josefstädterstrasse 70/32, 1080 Wien} % your address 31 31 \thesismatrikelno{0005416} % your registration number 32 32 33 \thesisbetreins{ao.Univ.-Prof. ?? Dr. Andreas Rauber}34 \thesisbetrzwei{ Univ.-Prof. Mag. Dr. Gerhard Budin}33 \thesisbetreins{ao.Univ.-Prof. Dr. Andreas Rauber, Univ.-Prof. Mag. Dr. Gerhard Budin} 34 \thesisbetrzwei{} 35 35 %\thesisbetrdrei{Dr. Vorname Familienname} % optional 36 36 … 58 58 59 59 60 %\begin{comment} 60 \begin{comment} 61 \end{comment} 61 62 \input{chapters/Introduction} 62 63 63 64 \input{chapters/Literature} 64 65 65 \input{chapters/Definitions}66 67 68 66 \input{chapters/Data} 69 67 70 68 \input{chapters/Infrastructure} 69 71 70 \input{chapters/Design_SMCschema} 72 73 74 71 75 72 \input{chapters/Design_SMCinstance} 73 77 74 \input{chapters/Results} 78 %\end{comment} 75 79 76 \input{chapters/Conclusion} 80 77 … 90 87 %\bibliography{references} 91 88 %\bibliographystyle{ieeetr} 92 \bibliography{../../2bib/lingua,../../2bib/ontolingua,../../2bib/smc4lrt,../../2bib/semweb,../../2bib/distributed_systems,../../2bib/own }89 \bibliography{../../2bib/lingua,../../2bib/ontolingua,../../2bib/smc4lrt,../../2bib/semweb,../../2bib/distributed_systems,../../2bib/own,../../2bib/diglib,../../2bib/it-misc,../../2bib/infovis} 93 90 94 91 \appendix 92 93 \input{chapters/Definitions} 95 94 96 95 \input{chapters/appendix}
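This revision of thesis.tex toggles chapters on and off by wrapping \cmd{\input} lines in a comment environment. A sketch of the standard alternative for selective compilation, \cmd{\includeonly} (chapter file names taken from the changeset; this assumes the chapters would be switched from \cmd{\input} to \cmd{\include}, i.e. each starting a new page):

```latex
% Sketch: compile only selected chapters while keeping page numbers
% and cross-references of the omitted ones stable (via their .aux files).
\includeonly{chapters/Data,chapters/Results}  % preamble: chapters to build
% ...
\begin{document}
\include{chapters/Introduction}  % skipped, but its .aux is still read
\include{chapters/Data}
\include{chapters/Results}
\end{document}
```

Compared to commenting blocks in and out, this needs no edits in the document body between draft runs, only a change to the \cmd{\includeonly} list.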