Master's programme:
Information & Knowledge Management
Diploma thesis presentation
Semantic Mapping Component for Language Resources and Technology
Matej Ďurčo
Technische Universität Wien
Institute of Software Technology and Interactive Systems
Research group: Information & Software Engineering Group
Supervisor: Ao.Univ.-Prof. Dr. Andreas Rauber
SMC4LRT
This work is embedded in the context of a large research infrastructure initiative aimed at providing easy, stable, harmonized access to language resources and technology (LRT) in Europe: the Common Language Resource and Technology Infrastructure (CLARIN [1]). A core technical pillar of this initiative is the Component Metadata Infrastructure (CMDI [2]), a distributed system for creating and providing metadata for LRT in a coherent, harmonized way. Semantic interoperability has been one of the main concerns addressed by CMDI, and appropriate provisions were woven into the underlying meta-model as well as into all modules of the infrastructure. Consequently, the infrastructure also foresees a dedicated module, the Semantic Mapping Component (SMC), which harnesses these mechanisms to overcome the semantic interoperability problem stemming from the heterogeneity of the resource descriptions. CMDI provides a flexible meta-model for creating metadata schemas, which allows the semantics of individual metadata fields to be bound to well-defined concepts (data categories) that are collaboratively maintained by the community in a dedicated registry (ISOcat [3]).

The ultimate objective was to enhance search functionality over a large heterogeneous collection of metadata about language resources. This objective was pursued in two separate, complementary approaches:
a) design a service delivering crosswalks (i.e. equivalences between fields in disparate metadata formats) based on well-defined concepts, and apply these concept-based crosswalks in search scenarios to enhance recall;
b) acknowledging the integrative power of the Linked Open Data paradigm, express the domain data as a Semantic Web resource to enable the application of semantic technologies to the dataset.

The need for better ways of exploring the complex, growing CMD data domain led to the development of the SMC Browser, a web application that visualizes the CMD entities (profiles/schemas, components, elements and data categories) as an interactive graph. In particular, the tool enables the metadata modeller to examine the reuse of components or data categories in different profiles. The graph is accompanied by statistical information, e.g. how many elements a profile contains, or in how many profiles a given data category is used.

Given the size of the processed dataset (over 4,600 nodes and 7,500 edges), it is not feasible to display all of the data at once. Instead, the user starts by selecting nodes of interest and uses navigation options to investigate the surrounding subgraph.

The SMC module is part of the CMD Infrastructure. It consumes data from the production-side registries (Component Registry [4], ISOcat DCR [3] and RELcat [5]) and serves search services on the exploitation side of the infrastructure, as well as third-party applications accessing the joint CLARIN metadata domain.

The SMC is designed to enable semantic interoperability primarily on the schema level. However, the problem of different labels for semantically equivalent entities is even more acute in the metadata fields on the instance level. Therefore, this work additionally proposes a mechanism for using reference data (such as controlled vocabularies, taxonomies and ontologies) to harmonize the values in the metadata fields of the CMD records. This mechanism is embedded in a more general effort to express the whole CMD data domain (model and instances) in RDF, constituting one large semantic resource with outbound links to other existing external datasets. This effort lays a foundation for providing the original dataset as a Linked Open Data nucleus within the Web of Data, as well as for real semantic (ontology-driven) search and exploration of the data.

This work created a foundation for semantically grounded search and exploration of the CMDI data domain. In the future, more complex processing and responses (similarity ratio, relation types) of the crosswalk service are planned, with implications for the attached query expansion module. Also, the integration of the semantic mapping features into the search user interface is only rudimentary at present and calls for a more elaborate solution.

The auxiliary visualization application SMC Browser provided the means to generate a number of advanced analyses of the data, which the community uses directly for exploration and curation of the complex dataset. As such, the tool and the analyses can be considered a valuable contribution to the community. Nevertheless, a number of new features are planned: integration with instance data, set operations on subgraphs, generalization of the matching algorithm and many more. Finally, a lot of work is needed to further detail and operationalize the proposed transformation of the CMD dataset into an RDF-based semantic resource and to make it available as Linked Open Data.

The SMC module consists of the following components:
crosswalk service: the basic service translating between fields (or indexes)
concept-based query expansion: a module for query expansion based on the crosswalks
smc-xsl: a set of XSLT stylesheets for pre- and post-processing the data
SMC Browser: a web application to explore the CMD data domain, consisting of
  smc-stats: statistical summaries of the CMD data domain
  smc-graph: advanced interactive graph-based user interface
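How the crosswalk service and the concept-based query expansion could work together can be illustrated with a minimal Python sketch. It is not the actual SMC implementation: the field-to-data-category bindings, the function names and the textual query syntax are all assumptions made for illustration, and only the element and component names are borrowed from the poster's example diagram.

```python
from collections import defaultdict

# Hypothetical sample data: (profile, field path) -> data category identifier.
# The bindings are invented for illustration only.
FIELD_TO_DC = {
    ("ProfileA", "ComponentP.ElementX"): "DC1",
    ("ProfileA", "ComponentP.ElementY"): "DC2",
    ("ProfileB", "ComponentQ.ElementW"): "DC1",
    ("ProfileB", "ComponentQ.ComponentR.ElementZ"): "DC3",
}


def build_index(field_to_dc):
    """Invert the mapping: data category -> sorted list of fields bound to it."""
    index = defaultdict(list)
    for field, dc in field_to_dc.items():
        index[dc].append(field)
    return {dc: sorted(fields) for dc, fields in index.items()}


def crosswalk(field, field_to_dc, index):
    """Return all fields sharing the data category of `field` (itself included)."""
    dc = field_to_dc.get(field)
    if dc is None:
        return [field]  # unknown field: nothing to expand
    return index[dc]


def expand_query(field, value, field_to_dc, index):
    """Rewrite `field = value` into an OR over all equivalent fields.

    The output syntax is a made-up notation, not a real query language.
    """
    clauses = [
        f'{profile}:{path} = "{value}"'
        for profile, path in crosswalk(field, field_to_dc, index)
    ]
    return " OR ".join(clauses)


INDEX = build_index(FIELD_TO_DC)
print(expand_query(("ProfileA", "ComponentP.ElementX"), "valueX", FIELD_TO_DC, INDEX))
```

In this toy setup a query on ElementX in ProfileA is also sent to ElementW in ProfileB, because both fields are bound to the same data category. This is the recall gain that the concept-based approach aims for.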
The main results of this work are:
a) the specification of the module for concept-based search together with the underlying crosswalk service, accompanied by a proof-of-concept implementation;
b) a blueprint for expressing the original dataset in RDF, as a foundation for providing this dataset as Linked Open Data;
c) SMC Browser, an interactive web application for exploring the schema level of CMD data. Its numerical and visual output is the basis for SMC reports, a growing set of documents analyzing specific phenomena in the CMD data domain.

Parts of this work have already been presented on various occasions, both at international workshops in the context of the research infrastructures CLARIN and DARIAH and at conferences:
Ďurčo, M.; Broeder, D. & Windhouwer, M. Semantic Mapping - groundwork for query expansion and semantic search. In Proceedings of the Metadata 2012 Workshop on Describing Language Resources with Metadata: Towards Flexibility and Interoperability in the Documentation of Language Resources, Istanbul, LREC 2012
Ďurčo, M. & Windhouwer, M. Semantic Mapping in CLARIN Component Metadata. In 7th Metadata and Semantics Research Conference (MTSR 2013), Thessaloniki, 2013
Ďurčo, M. & Mörth, K. Controlled Vocabularies for Digital Humanities. In Digitale Bibliothek - Kulturelles Erbe in der Cloud, Graz, 2013

[Figure: A special type of graph (still experimental), visualizing the semantic proximity of different schemas (based on the ratio of shared data categories).]
[Figure: A high-level overview of the reuse of a few very common data categories by different schemas. (Or is it rather some mythical creature? ;)]

The visualizations are meant primarily to illustrate the possibilities. Due to the complexity of the underlying data, such snapshots are of limited use for real investigation of the dataset. An interactive interface as provided by the SMC Browser is needed for in-depth exploration.

[Figure: The reuse of the four different data categories for describing the language of a resource (graph manually adjusted).]

[1] http://www.clarin.eu
[2] http://www.clarin.eu/cmdi
[3] http://www.isocat.org
[4] http://catalog.clarin.eu/ds/ComponentRegistry/
[5] http://lux13.mpi.nl/relcat/site/index.html

SMC Browser
Results
System Architecture
Objective and Method
Context
Semantic Entity Resolution and Linked Open Data
Conclusions and Outlook
Conceptual space
http://clarin.oeaw.ac.at/smc/
[Figure: system architecture diagram. Labels: SMC, SMC Browser, crosswalk service, smc-xsl, smc-graph, smc-stats, graph-widget, CMDI, Component Registry, ISOcat, RELcat, concept-based query expansion, cx, qx]
[Figure: diagram of the functional scope. Labels: schema level, instance level, semantic mapping, concept-based crosswalks, concept-based query expansion, semantic interpretation/entity resolution, semantic search, ontology-driven exploration, advanced interactive data visualization/exploration]
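The smc-graph part of the SMC Browser lets the user explore the subgraph around selected nodes instead of rendering the full graph of several thousand nodes. The following Python sketch shows one way such a neighbourhood could be computed. The toy graph, the undirected treatment of edges and the function names are assumptions for illustration, not a description of how the SMC Browser is implemented.

```python
from collections import deque

# Hypothetical toy graph: profile -> component -> element -> data category.
EDGES = [
    ("ProfileA", "ComponentP"),
    ("ComponentP", "ElementX"),
    ("ComponentP", "ElementY"),
    ("ElementX", "DC1"),
    ("ElementY", "DC2"),
    ("ProfileB", "ComponentQ"),
    ("ComponentQ", "ElementW"),
    ("ElementW", "DC1"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (source, target) pairs."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


def neighbourhood(adjacency, start, depth):
    """Return the set of nodes within `depth` hops of `start` (breadth-first)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == depth:
            continue  # do not expand beyond the requested depth
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return set(distance)


ADJACENCY = build_adjacency(EDGES)
print(sorted(neighbourhood(ADJACENCY, "DC1", 1)))
```

Starting from the data category DC1 with depth 1 yields the elements bound to it. Increasing the depth walks up to the components and profiles that reuse it.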
[Figure: concept-based crosswalks between two CMDI metadata descriptions. Labels: DCRs (ISOcat with DataCategory1, DataCategory2, DataCategory3, DataCategory4, ...; dublincore); Component Registry (ProfileA [CMD] with ComponentP, ElementX, ElementY; ProfileB [CMD] with ComponentQ, ElementW, ComponentR, ElementZ); SchemaA, SchemaB; MD Repository (Metadata Description A1 [CMDI]: ComponentP with ElementX = valueX and ElementY = valueY; Metadata Description B1 [CMDI]: ComponentQ with ElementW = valueW and nested ComponentR with ElementZ = valueZ); RelationReg (DC1 = DC2, DC2 ≤ DC3, DC3 = concept1); #Ontology1 (concept1a, concept1b, concept1c); set I, set II, set III]
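The relations shown in the diagram (DC1 = DC2, DC2 ≤ DC3, DC3 = concept1) suggest that a crosswalk can be widened step by step, from exact matches to equivalent and then to broader concepts. The Python sketch below is one possible reading of that idea. Interpreting "≤" as "is narrower than", storing relations as plain pairs, and the function names are all assumptions, not the RELcat data model.

```python
# Relations as shown in the diagram; the storage format is an assumption.
EQUIVALENT = [("DC1", "DC2"), ("DC3", "concept1")]  # "=" relations
NARROWER_THAN = [("DC2", "DC3")]  # "DC2 ≤ DC3", read here as: DC2 is narrower than DC3


def equivalence_class(start, equivalent):
    """All concepts reachable from `start` via symmetric '=' links."""
    seen = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for a, b in equivalent:
            for src, dst in ((a, b), (b, a)):
                if src == current and dst not in seen:
                    seen.add(dst)
                    frontier.append(dst)
    return seen


def expand(start, equivalent, narrower_than, include_broader=False):
    """Expand a concept via '=' links; optionally also follow '≤' links upward."""
    result = equivalence_class(start, equivalent)
    if include_broader:
        changed = True
        while changed:
            changed = False
            for narrow, broad in narrower_than:
                if narrow in result and broad not in result:
                    result |= equivalence_class(broad, equivalent)
                    changed = True
    return result


print(sorted(expand("DC1", EQUIVALENT, NARROWER_THAN)))
print(sorted(expand("DC1", EQUIVALENT, NARROWER_THAN, include_broader=True)))
```

Whether a search should follow "≤" links towards broader or narrower concepts is a design choice that trades precision for recall. The sketch follows them upward only to keep the example short.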
Univ.-Prof. Mag. Dr. Gerhard Budin