Changeset 3233 for SMC4LRT


Timestamp:
08/05/13 13:24:30
Author:
vronk
Message:

restructuring, moved Implementation to Design and Results

Location:
SMC4LRT/chapters
Files:
4 edited

  • SMC4LRT/chapters/CMD2RDF.tex

    r3204 r3233  
    1 
    2 \chapter{CMD to RDF}
     1\chapter{Design - Mapping on instance level}
     2
     3
     4\subsection{Linked Data - Express dataset in RDF}
     5
     6
     7I do think that ISOcat, CLAVAS, RELcat, an actual language
     8resource all provide a part of the semantic network.
     9
     10And if you can express these all in RDF, which we can for almost all of them (maybe
     11except the actual language resource ... unless it has a schema adorned
     12with ISOcat DC references ... \textless insert a SCHEMAcat plug ;-) \textgreater, but for
     13metadata we have that in the CMDI profiles ...) you could load all the
     14relevant parts in a triple store and do your SPARQL/reasoning on it. Well
     15that's where I'm ultimately heading with all these registries related to
     16semantic interoperability ... I hope ;-)
     17\todocite{Menzo}
     18
     19
     20Partly as by-product of the entities-mapping effort we will get the metadata-description rendered in RDF, linked with
     21So theoretically we then only need to provide them ``on the web'', to make them a nucleus of the LinkedData-Cloud.
     22
     23
     24Technical aspects (RDF-store?) / interface (ontology browser?)
     25
     26\todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/}
     27
     28\todocode{check/install: Linked Data browser: LoD p. 81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)}
     29
     30defining the Mapping:
     31\begin{enumerate}
     32\item convert to RDF
     33translate: MDRecord $\rightarrow$ [\#mdrecord \#property literal]
     34\item map: \#mdrecord \#property literal  $\rightarrow$ [\#mdrecord \#property \#entity]
     35\end{enumerate}
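
The two steps of the mapping listed above can be sketched as follows. This is only a minimal illustration, not the SMC implementation: the function names, the flat field dictionary, the example identifiers (\#mdrecord1, \#org1) and the vocabulary table are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def to_literal_triples(record_id: str, fields: Dict[str, List[str]]) -> List[Triple]:
    """Step 1: express every field value of an MD record as a triple
    [#mdrecord #property literal]."""
    return [
        (record_id, prop, value)
        for prop, values in fields.items()
        for value in values
    ]


def map_to_entities(
    triples: List[Triple], vocabulary: Dict[Tuple[str, str], str]
) -> List[Triple]:
    """Step 2: for literals that are known in the (hypothetical) vocabulary,
    generate new triples [#mdrecord #property #entity].
    Literals without a match are simply skipped."""
    mapped: List[Triple] = []
    for subject, prop, literal in triples:
        entity: Optional[str] = vocabulary.get((prop, literal.strip().lower()))
        if entity is not None:
            mapped.append((subject, prop, entity))
    return mapped


# Hypothetical record and vocabulary, for illustration only.
record = {"organisation": ["Example Institute", "Unknown Org"]}
vocabulary = {("organisation", "example institute"): "#org1"}

literal_triples = to_literal_triples("#mdrecord1", record)
entity_triples = map_to_entities(literal_triples, vocabulary)
```

In this sketch the literal triples are kept and the entity triples are generated in addition to them, so values that cannot be mapped are not lost.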
     36
     37\begin{figure*}[!ht]
     38\includegraphics[width=1\textwidth]{images/SMC_CMD2LOD}
     39\caption{The process of transforming the CMD metadata records to an RDF representation}
     40\label{fig:smc_cmd2lod}
     41\end{figure*}
     42
     43
     44\section{CMD to RDF}
     45\label{ch:cmd2rdf}
    346
    447A few modules/components of the CMD infrastructure are dedicated to semantic interoperability: the DCR as a global registry for concepts, CLAVAS for maintaining controlled vocabularies in SKOS format, and the RR for expressing arbitrary relations between concepts.
    548However, the actual values in the CMD instances are ``just strings'' and for the most part cannot be validated by the schema, although they often could be mapped to a corresponding controlled vocabulary.
    649
    7 Thus one aim of this work is to express the whole of the CMD data (model and instances) in RDF. This would allow to map the string values in selected fields to semantic entities, which in turn would allow real semantic search and bring about a linking with the web of data \todocite{Web of Data, TimBL}
     50Thus one aim of this work is to express the whole of the CMD data (model and instances) in RDF. This would allow to map the string values in selected fields to semantic entities, which in turn would allow real semantic (ontology-driven) search and bring about a linking with the web of data \todocite{Web of Data, TimBL}
    851
    952The following chapter lays out how individual parts of the CMD framework can be expressed in RDF.
    1053
    11 \section{CMD specification}
     54\subsection{CMD specification}
    1255The meta model
    1356
     
    3477\end{note}
    3578
    36 \section{Data Categories}
     79\subsection{Data Categories}
    3780Windhouwer (2012) proposes to use the data categories as annotation properties.
    3881Definition of the annotation property \code{dcr:datcat}
     
    93136
    94137
    95 \section{CMD instances}
    96 
    97 
    98 \subsection {Resource Identifier}
     138\subsection{CMD instances}
     139
     140
     141\subsubsection {Resource Identifier}
    99142
    100143It seems natural to use the PID of a Language Resource (\code{<lr1>}) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource necessarily has a PID. Alternatively, we could use the PID of the MD record (\code{<lr1.cmd>} from \code{<cmd:MdSelfLink>}) as the resource identifier.
     
    126169\end{example}
    127170
    128 \subsection{Hierarchy ( Resource Proxy – IsPartOf)}
    129 In CMD, <cmd:ResourceProxyList> is used to express both collection hierarchy and point to resource(s) described by the MD record. This can be modeled as OAI-ORE Aggregation:
     171\subsubsection{Hierarchy ( Resource Proxy – IsPartOf)}
     172In CMD, <cmd:ResourceProxyList> is used to express both collection hierarchy and point to resource(s) described by the MD record. This can be modeled as OAI-ORE Aggregation\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations}
     173\furl{http://openannotation.org/spec/core/core.html\#Motivations}
     174:
    130175
    131176\begin{example}
     
    151196
    152197       
    153 \subsection{Components – nested structures}
     198\subsubsection{Components – nested structures}
    154199
    155200\begin{note}
     
    172217\end{example}
    173218
    174 \subsection{Elements, Fields, Values}
     219\subsubsection{Elements, Fields, Values}
    175220
    176221There are two steps to the modeling of the actual values in the fields of CMD records in RDF. The first one is to express the values as triples with literal values, then for selected fields – using the literal values – try to find corresponding entities in appropriate controlled vocabularies and generate new triples.
     
    182227\end{example}
    183228
    184 \subsubsection{Literal Values}
     229%\subsubsection{Literal Values}
     230\paragraph{Literal Values}
    185231
    186232Usually, RDF-mapping of dublincore descriptions is to data properties (cf. OLAC-DcmiTerms profile )
     
    207253This raises the vice-versa question, whether to rather handle all data categories uniformly, thus encoding dublincore terms also as annotation properties.
    208254
    209 \subsubsection{Mapping to entities – Vocabularies  – CLAVAS}
     255%\subsubsection{Mapping to entities – Vocabularies  – CLAVAS}
     256\paragraph{Mapping to entities – Vocabularies  – CLAVAS}
     257
    210258A major (if not the main) motivation for the CMD to RDF mapping is the wish to have better control over  and better quality of values in metadata fields with constrained value domain like organization or resource type. As the allowed values for these fields often cannot be explicitly enumerated, it is not possible to restrict them by means of an XML schema. This leads to inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.)
    211259Thus, one goal of this work is to map (string) values in selected fields to entities defined in corresponding vocabularies. The main providers of relevant vocabularies are ISOcat and CLAVAS – a service for managing and providing vocabularies in SKOS format. Closed data categories and their corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS. Other relevant vocabularies shall also be ingested into this system, so that for our purposes we can assume OpenSKOS as the one source of vocabularies.
     
    228276<org1>   dcterms:identifier <org1>, <dbpedia/org1>, <lt-world/orgx>;
    229277\end{example}
     278
     279
     280
     281\paragraph{Mapping from strings to Entities}
     282
     283Find matching entities in selected Ontologies based on the textual values in the metadata records.
     284
     285
     286Identify related ontologies:
     287LT-World \cite{Joerg2010}
     288
     289task:
     290\begin{enumerate}
     291\item  express MDRecords in RDF
     292\item  identify related ontologies/vocabularies (category $\rightarrow$ vocabulary)
     293\item  use a lookup/mapping function (Vocabulary Alignment Service? CATCH-PLUS?)
     294
     295%\fbox{ function lookup: Category x String -> ConceptualDomain}
     296\begin{eqnarray*}
     297lookup(Category, Literal) \rightarrow ConceptualDomain??
     298\end{eqnarray*}
     299
     300
     301Normally this would be served by dedicated controlled vocabularies, but expect also some string-normalizing preprocessing etc.
     302\end{enumerate}
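
A minimal sketch of such a lookup function with string-normalizing preprocessing is given below. It assumes that the vocabularies are available as a simple in-memory table (category, then entity, then labels); the function names, the normalization rules and the example labels are assumptions for illustration and do not describe an existing service.

```python
import re
import unicodedata
from typing import Dict, List, Optional


def normalize(label: str) -> str:
    """Reduce a label to a comparable form: strip diacritics, lowercase,
    replace punctuation by spaces and collapse whitespace."""
    text = unicodedata.normalize("NFKD", label)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def build_index(
    vocabularies: Dict[str, Dict[str, List[str]]]
) -> Dict[str, Dict[str, str]]:
    """Turn {category: {entity: [labels]}} into
    {category: {normalized label: entity}}."""
    index: Dict[str, Dict[str, str]] = {}
    for category, entities in vocabularies.items():
        for entity, labels in entities.items():
            for label in labels:
                index.setdefault(category, {})[normalize(label)] = entity
    return index


def lookup(
    index: Dict[str, Dict[str, str]], category: str, literal: str
) -> Optional[str]:
    """lookup(Category, Literal): return the matching entity, or None."""
    return index.get(category, {}).get(normalize(literal))


# Hypothetical vocabulary with two label variants for one organization.
index = build_index(
    {"organisation": {"#org1": ["Example Institute", "Example Inst."]}}
)
```

Restricting the lookup to the vocabulary of the given category keeps identical strings in different fields (for example a name used both as organization and as project) from being mapped to the same entity.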
     303
     304
    230305
    231306\section{RELcat - Ontological relations}
     
    260335
    261336
    262 \section{References}
    263 
    264 Schuurman, I. \& Windhouwer., M.  Explicit Semantics for Enriched Documents. What Do ISOcat, RELcat and SCHEMAcat Have To Offer? 2nd Supporting Digital Humanities conference (SDH 2011), 17-18 November 2011, Copenhagen, Denmark, 2011
    265 Windhouwer, M. \& Wright, S. E. Linking to linguistic data categories in ISOcat Linked Data in Linguistics, Springer, 2012, 99-107
    266 
    267 \furl{http://www.openarchives.org/ore/1.0/primer\#Foundations}
    268 \furl{http://openannotation.org/spec/core/core.html\#Motivations}
    269 
    270 
    271 
    272 
     337
     338\section{SMC LOD}
     339
     340\todoin{read: Europeana RDF Store Report}
     341
     342\todocode{install Jena +  fuseki}\furl{http://jena.apache.org}\furl{http://jena.apache.org/documentation/serving_data/index.html}\furl{http://csarven.ca/how-to-create-a-linked-data-site}
     343
     344\todocode{install older python (2.5?) to be able to install dot2tex - transforming dot files to nicer pgf formatted graphs}\furl{http://dot2tex.googlecode.com/files/dot2tex-2.8.7.zip}\furl{file:/C:/Users/m/2kb/tex/dot2tex-2.8.7/}
     345
     346
     347\todocode{check install siren}\furl{http://siren.sindice.com/}
     348\todocode{check install Virtuoso}\furl{http://ods.openlinksw.com/wiki/ODS/}
     349\todocode{check install Neo4J}
     350\todocode{check install ontology browser}
     351
     352semantic search component in the Linked Media Framework
     353\todocode{!!! check install LMF - kiwi - SemanticSearch !!!}\furl{http://code.google.com/p/kiwi/wiki/SemanticSearch}
     354
     355\todoin{check SARQ}\furl{http://github.com/castagna/SARQ}
     356
     357\todocode{Load data: relcat, clavas, olac-and-dc-providers cmd, lt-world?}
     358
     359
     360
     361
     362
  • SMC4LRT/chapters/Design.tex

    r3204 r3233  
    22\chapter{Semantic Mapping Component - Design}
    33\label{ch:design}
     4
     5\section{System Architecture}
     6
     7The Semantic Mapping module is based on the DCR and CMD framework and is being developed as a separate service alongside the CLARIN Metadata Service, its primary consuming service, but shall be equally usable by other applications.
     8
     9
     10\todoin{appendix: reference architecture}
     11
    412
    513\section{Data Model?}
     
    6371
    6472
    65 \section{Semantic Mapping on concept level}
     73\section{Crosswalks -- Mapping on schema level}
    6674
    6775merging the pieces of information provided by those,
     
    118126
    119127
    120 \subsection*{Extensions}
     128\subsection{Extensions}
    121129
    122130A useful supplementary function of the module would be to provide a list of existing indexes.
     
    127135Also, use of \emph{other than equivalency relations will necessitate more complex logic in the query expansion and accordingly also a more complex response of the SMC, either returning the relation types themselves as well or equipping the list of indexes with some similarity ratio.}
    128136
    129 
    130 
    131 \section{Semantic Mapping on instance level}
    132 
    133 
    134 \subsection{Mapping from strings to Entities}
    135 
    136 Find matching entities in selected Ontologies based on the textual values in the metadata records.
    137 
    138 
    139 Identify related ontologies:
    140 LT-World \cite{Joerg2010}
    141 
    142 task:
    143 \begin{enumerate}
    144 \item  express MDRecords in RDF
    145 \item  identify related ontologies/vocabularies (category $\rightarrow$ vocabulary)
    146 \item  use a lookup/mapping function (Vocabulary Alignement Service? CATCH-PLUS?)
    147 
    148 %\fbox{ function lookup: Category x String -> ConceptualDomain}
    149 \begin{eqnarray*}
    150 lookup(Category, Literal) \rightarrow ConceptualDomain??
    151 \end{eqnarray*}
    152 
    153 
    154 Normally this would be served by dedicated controlled vocabularies, but expect also some string-normalizing preprocessing etc.
    155 \end{enumerate}
    156 
    157 
    158 \subsection{Linked Data - Express dataset in RDF}
    159 
    160 
    161 I do think that ISOcat, CLAVAS, RELcat, an actual language
    162 resource all provide a part of the semantic network.
    163 
    164 And if you can express these all in RDF, which we can for almost all of them (maybe
    165 except the actual language resource ... unless it has a schema adorned
    166 with ISOcat DC references ... \textless insert a SCHEMAcat plug ;-) \textgreater, but for
    167 metadata we have that in the CMDI profiles ...) you could load all the
    168 relevant parts in a triple store and do your SPARQL/reasoning on it. Well
    169 that's where I'm ultimately heading with all these registries related to
    170 semantic interoperability ... I hope ;-)
    171 \todocite{Menzo}
    172 
    173 
    174 Partly as by-product of the entities-mapping effort we will get the metadata-description rendered in RDF, linked with
    175 So theoretically we then only need to provide them ``on the web'', to make them a nucleus of the LinkedData-Cloud.
    176 
    177 
    178 Technical aspects (RDF-store?) / interface (ontology browser?)
    179 
    180 \todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/}
    181 
    182 \todocode{check/install: Linked Data browser: LoD p. 81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)}
    183 
    184 defining the Mapping:
    185 \begin{enumerate}
    186 \item convert to RDF
    187 translate: MDRecord $\rightarrow$ [\#mdrecord \#property literal]
    188 \item map: \#mdrecord \#property literal  $\rightarrow$ [\#mdrecord \#property \#entity]
    189 \end{enumerate}
    190 
    191 
    192 \begin{figure*}[!ht]
    193 \includegraphics[width=1\textwidth]{images/SMC_CMD2LOD}
    194 \caption{The process of transforming the CMD metadata records to and RDF representation}
    195 \label{fig:smc_cmd2lod}
    196 \end{figure*}
    197 
    198 
    199 \section{Semantic Search}
     137\subsection{Initialization}
     138
     139First there is an initialization phase, in which the application fetches the information from the source modules (cf. \ref{components}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories:
     140\newline
     141
     142\textit{datcatURI $\mapsto$ profile.component.element[]}
     143\newline
     144
     145The collected data categories are enriched with information from corresponding registries (DCRs), adding the verbose identifier, the description and available translations into other working languages. %, usable as base for multi-lingual search user-interface.
     146
     147Finally relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories.
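
The initialization described above can be sketched roughly as follows. The sketch assumes that the profiles have already been flattened into pairs of element path and data category URI and that the relation sets are given as plain sets of URIs; all names and the example identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple


def build_inverted_map(
    elements: Iterable[Tuple[str, Optional[str]]]
) -> Dict[str, List[str]]:
    """Construct datcatURI -> profile.component.element[] from
    (element path, datcat URI) pairs; elements without a data
    category reference are ignored."""
    inverted: Dict[str, List[str]] = defaultdict(list)
    for path, datcat in elements:
        if datcat:
            inverted[datcat].append(path)
    return dict(inverted)


def expand(
    datcat: str,
    inverted: Dict[str, List[str]],
    relation_sets: Iterable[Set[str]],
) -> List[str]:
    """Return the element paths for a data category and for all data
    categories that share a relation set with it."""
    related = {datcat}
    for relation_set in relation_sets:
        if datcat in relation_set:
            related |= relation_set
    paths: List[str] = []
    for uri in sorted(related):
        paths.extend(inverted.get(uri, []))
    return paths


# Hypothetical elements and one relation set, for illustration only.
elements = [
    ("ProfileA.Actor.Name", "datcat:1"),
    ("ProfileB.Person.FullName", "datcat:2"),
    ("ProfileB.Person.Note", None),
    ("ProfileC.Contact.Name", "datcat:1"),
]
inverted = build_inverted_map(elements)
paths = expand("datcat:1", inverted, [{"datcat:1", "datcat:2"}])
```

Here every relation set is treated as a set of equivalent data categories; handling other relation types would require keeping the relation type alongside each set.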
     148
     149
     150\section{Concept-based search}
    200151
    201152The main purpose of the undertaking described in the previous two chapters (mapping of concepts and entities) is to enhance the search capabilities of the MDService serving the Metadata/Resources-data, namely by employing ontological resources.
     
    213164Synonym Expansion (via TermExtraction(ContentSet))
    214165
     166
    215167\subsection{Query Expansion}
    216168
     169
     170
     171\subsection{SMC as module for Metadata Repository}
     172
     173(MD)search frameworks:
     174
     175\begin{description}
     176\item[Zebra/Z39.50] JZKit
     177\item[Lucene/Solr]
     178\item[eXist] - xml DB
     179\end{description}
     180
     181
     182
     183\section{User Interface?}
     184
     185\subsection*{Query Input}
     186
     187\subsection*{Columns}
     188
     189\subsection*{Summaries}
     190
     191\subsection*{Differential Views}
     192Visualize impact of given mapping in terms of covered dataset (number of matched records).
     193
     194\subsection*{Visualization}
     195Landscape, Treemap, SOM
     196
     197Ontology Mapping and Alignment / saiks/Ontology4 4auf1.pdf
    217198
    218199\section{Semantic Mapping in Metadata vs. Content/Annotation}
  • SMC4LRT/chapters/Implementation.tex

    r3204 r3233  
    55
    66
    7 The core function of the SMC is implemented as a set of XSL-stylesheets, with auxiliary functionality (like caching or a wrapping web service) being provided by a wrapping application implemented in Java. There is also a plan to provide an XQuery implementation. The SMC module is being maintained in the CMDI code repository\footnote {\url{http://svn.clarin.eu/SMC}}.
    87
    9 The Semantic Mapping module is based on the DCR and CMD framework and is being developed as a separate service on the side of CLARIN  Metadata Service, its primary consuming service, but shall be equally usable by other applications.
    10 
    11 
    12 \section{Initialization}
    13 
    14 First there is an initialization phase, in which the application fetches the information from the source modules (cf. \ref{components}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories:
    15 \newline
    16 
    17 \textit{datcatURI $\mapsto$ profile.component.element[]}
    18 \newline
    19 
    20 The collected data categories are enriched with information from corresponding registries (DCRs), adding the verbose identifier, the description and available translations into other working languages. %, usable as base for multi-lingual search user-interface.
    21 
    22 Finally relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories.
    23 
    24 
    25 \section{SMC as module for Metadata Repository}
    26 
    27 (MD)search frameworks:
    28 
    29 \begin{description}
    30 \item[Zebra/Z39.50] JZKit
    31 \item[Lucene/Solr]
    32 \item[eXist] - xml DB
    33 \end{description}
    34 
    35 
    36 
    37 \section{SMC Browser}
    38 
    39 Explore the Component Metadata Framework
    40 
    41 In CMD, metadata schemas are defined by profiles, that are constructed out of reusable components - collections of metadata fields. The components can contain other components, and they can be reused in multiple profiles. Furthermore, every CMD element (metadata field) refers via a PID to a data category to indicate unambiguously how the content of the field in a metadata description should be interpreted (Broeder et al., 2010).
    42 
    43 Thus, every profile can be expressed as a tree, with the profile component as the root node, the used components as intermediate nodes and elements or data categories as leaf nodes, parent-child relationship being defined by the inclusion (componentA -includes-> componentB) or referencing (elementA -refersTo-> datcat1).The reuse of components in multiple profiles and especially also the referencing of the same data categories in multiple CMD elements leads to a blending of the individual profile trees into a graph (acyclic directed, but not necessarily connected).
    44 
    45 SMC Browser visualizes this graph structure in an interactive fashion. You can have a look at the examples for inspiration.
    46 
    47 It is implemented on top of wonderful js-library d3, the code checked in clarin-svn (and needs refactoring). More technical documentation follows soon.
    48 
    49 The graph is constructed from all profiles defined in the Component Registry. To resolve name and description of data categories referenced in the CMD elements definitions of all (public) data categories from DublinCore and ISOcat (from the Metadata Profile [RDF] - retrieving takes some time!) are fetched. However only data categories used in CMD will get part of the graph. Here is a quantitative summary of the dataset.
    50 
    51 
    52 \begin{figure*}[!ht]
    53 \includegraphics[width=1\textwidth]{images/screen_SMC-Browser_2013-01-23}
    54 \caption{Screenshot of the SMC browser}
    55 \end{figure*}
    56 
    57 
    58 \section{SMC LOD}
    59 
    60 \todoin{read: Europeana RDF Store Report}
    61 
    62 \todocode{install Jena +  fuseki}\furl{http://jena.apache.org}\furl{http://jena.apache.org/documentation/serving_data/index.html}\furl{http://csarven.ca/how-to-create-a-linked-data-site}
    63 
    64 \todocode{check install siren}\furl{http://siren.sindice.com/}
    65 \todocode{check install Virtuoso}\furl{http://ods.openlinksw.com/wiki/ODS/}
    66 \todocode{check install Neo4J}
    67 \todocode{check install ontology browser}
    68 
    69 semantic search component in the Linked Media Framework
    70 \todocode{!!! check install LMF - kiwi - SemanticSearch !!!}\furl{http://code.google.com/p/kiwi/wiki/SemanticSearch}
    71 
    72 \todoin{check SARQ}\furl{http://github.com/castagna/SARQ}
    73 
    74 \todocode{Load data: relcat, clavas, olac-and-dc-providers cmd, lt-world?}
    75 
    76 
    77 \section{User Interface?}
    78 
    79 \subsection*{Query Input}
    80 
    81 \subsection*{Columns}
    82 
    83 \subsection*{Summaries}
    84 
    85 \subsection*{Differential Views}
    86 Visualize impact of given mapping in terms of covered dataset (number of matched records).
    87 
    88 \subsection*{Visualization}
    89 Landscape, Treemap, SOM
    90 
    91 Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf
  • SMC4LRT/chapters/Results.tex

    r3204 r3233  
    156156
    157157
    158 %\section{Usability}
     158\section{SMC-Browser Advanced Interactive User Interface}
     159
     160Explore the Component Metadata Framework
     161
     162In CMD, metadata schemas are defined by profiles, that are constructed out of reusable components - collections of metadata fields. The components can contain other components, and they can be reused in multiple profiles. Furthermore, every CMD element (metadata field) refers via a PID to a data category to indicate unambiguously how the content of the field in a metadata description should be interpreted (Broeder et al., 2010).
     163
     164Thus, every profile can be expressed as a tree, with the profile component as the root node, the used components as intermediate nodes and elements or data categories as leaf nodes, the parent-child relationship being defined by inclusion (componentA -includes-> componentB) or referencing (elementA -refersTo-> datcat1). The reuse of components in multiple profiles, and especially the referencing of the same data categories in multiple CMD elements, leads to a blending of the individual profile trees into a graph (directed and acyclic, but not necessarily connected).
     165
     166SMC Browser visualizes this graph structure in an interactive fashion. You can have a look at the examples for inspiration.
     167
     168It is implemented on top of the wonderful js-library d3; the code is checked in to clarin-svn (and needs refactoring). More technical documentation follows soon.
     169
     170The graph is constructed from all profiles defined in the Component Registry. To resolve the names and descriptions of data categories referenced in the CMD elements, definitions of all (public) data categories from DublinCore and ISOcat (from the Metadata Profile [RDF] - retrieving takes some time!) are fetched. However, only data categories used in CMD become part of the graph. Here is a quantitative summary of the dataset.
     171
     172
     173\begin{figure*}[!ht]
     174\includegraphics[width=1\textwidth]{images/screen_SMC-Browser_2013-01-23}
     175\caption{Screenshot of the SMC browser}
     176\end{figure*}
     177
     178