Changeset 3233
- Timestamp: 08/05/13 13:24:30 (11 years ago)
- Location: SMC4LRT/chapters
- Files: 4 edited
SMC4LRT/chapters/CMD2RDF.tex
r3204 r3233 1 2 \chapter{CMD to RDF} 1 \chapter{Design - Mapping on instance level} 2 3 4 \subsection{Linked Data - Express dataset in RDF} 5 6 7 I do think that ISOcat, CLAVAS, RELcat, an actual language 8 resource all provide a part of the semantic network. 9 10 And if you can express these all in RDF, which we can for almost all of them (maybe 11 except the actual language resource ... unless it has a schema adorned 12 with ISOcat DC references ... \textless insert a SCHEMAcat plug ;-) \textgreater, but for 13 metadata we have that in the CMDI profiles ...) you could load all the 14 relevant parts in a triple store and do your SPARQL/reasoning on it. Well 15 that's where I'm ultimately heading with all these registries related to 16 semantic interoperability ... I hope ;-) 17 \todocite{Menzo} 18 19 20 Partly as by-product of the entities-mapping effort we will get the metadata-description rendered in RDF, linked with 21 So theoretically we then only need to provide them ``on the web'', to make them a nucleus of the LinkedData-Cloud. 22 23 24 Technical aspects (RDF-store?) / interface (ontology browser?) 25 26 \todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/} 27 28 \todocode{check/install: Linked Data browser: LoD p. 81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)} 29 30 defining the Mapping: 31 \begin{enumerate} 32 \item convert to RDF 33 translate: MDRecord $\rightarrow$ [\#mdrecord \#property literal] 34 \item map: \#mdrecord \#property literal $\rightarrow$ [\#mdrecord \#property \#entity] 35 \end{enumerate} 36 37 \begin{figure*}[!ht] 38 \includegraphics[width=1\textwidth]{images/SMC_CMD2LOD} 39 \caption{The process of transforming the CMD metadata records to and RDF representation} 40 \label{fig:smc_cmd2lod} 41 \end{figure*} 42 43 44 \section{CMD to RDF} 45 \label{ch:cmd2rdf} 3 46 4 47 A few modules/components of the CMD infrastructure are dedicated to semantic interoperability. 
The DCR as global registry for concepts, CLAVAS for maintaining controlled vocabularies in SKOS format, RR for expressing arbitrary relations between concepts. 5 48 However, the actual values in the CMD instances are ``just strings'' and for the most part cannot be validated by the schema, although they often could be mapped to a corresponding controlled vocabulary. 6 49 7 Thus one aim of this work is to express the whole of the CMD data (model and instances) in RDF. This would allow to map the string values in selected fields to semantic entities, which in turn would allow real semantic search and bring about a linking with the web of data \todocite{Web of Data, TimBL}50 Thus one aim of this work is to express the whole of the CMD data (model and instances) in RDF. This would allow to map the string values in selected fields to semantic entities, which in turn would allow real semantic (ontology-driven) search and bring about a linking with the web of data \todocite{Web of Data, TimBL} 8 51 9 52 The following chapter lays out, how individual parts of the CMD framework can be expressed in RDF 10 53 11 \s ection{CMD specification}54 \subsection{CMD specification} 12 55 The meta model 13 56 … … 34 77 \end{note} 35 78 36 \s ection{Data Categories}79 \subsection{Data Categories} 37 80 Windhouwer (2012) proposes to use the data categories as annotation properties. 38 81 Definition of the annotation property \code{dcr:datcat} … … 93 136 94 137 95 \s ection{CMD instances}96 97 98 \subs ection {Resource Identifier}138 \subsection{CMD instances} 139 140 141 \subsubsection {Resource Identifier} 99 142 100 143 It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID . Alternatively we could use the PID of the MD record ( \code{<lr1.cmd>} from \code{<cmd:MdSelfLink>}) as the resource identifier. 
… …
126 169 \end{example}
127 170
128 \subsection{Hierarchy (Resource Proxy -- IsPartOf)}
129 In CMD, <cmd:ResourceProxyList> is used both to express the collection hierarchy and to point to the resource(s) described by the MD record. This can be modeled as an OAI-ORE Aggregation:
171 \subsubsection{Hierarchy (Resource Proxy -- IsPartOf)}
172 In CMD, <cmd:ResourceProxyList> is used both to express the collection hierarchy and to point to the resource(s) described by the MD record. This can be modeled as an OAI-ORE Aggregation\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations}
173 \furl{http://openannotation.org/spec/core/core.html\#Motivations}
174 :
130 175
131 176 \begin{example}
… …
151 196
152 197
153 \subsection{Components -- nested structures}
198 \subsubsection{Components -- nested structures}
154 199
155 200 \begin{note}
… …
172 217 \end{example}
173 218
174 \subsection{Elements, Fields, Values}
219 \subsubsection{Elements, Fields, Values}
175 220
176 221 Modeling the actual values in the fields of CMD records in RDF takes two steps. The first is to express the values as triples with literal values; the second, for selected fields, is to use the literal values to find corresponding entities in appropriate controlled vocabularies and generate new triples.
… …
182 227 \end{example}
183 228
184 \subsubsection{Literal Values}
229 %\subsubsection{Literal Values}
230 \paragraph{Literal Values}
185 231
186 232 Usually, Dublin Core descriptions are mapped to RDF data properties (cf. the OLAC-DcmiTerms profile)
… …
207 253 This raises the converse question of whether to handle all data categories uniformly instead, thus also encoding Dublin Core terms as annotation properties.
208 254
209 \subsubsection{Mapping to entities -- Vocabularies -- CLAVAS}
255 %\subsubsection{Mapping to entities -- Vocabularies -- CLAVAS}
256 \paragraph{Mapping to entities -- Vocabularies -- CLAVAS}
257
210 258 A major (if not the main) motivation for the CMD to RDF mapping is the wish for better control over, and better quality of, the values in metadata fields with a constrained value domain, such as organization or resource type. As the allowed values for these fields often cannot be explicitly enumerated, it is not possible to restrict them by means of an XML schema. This leads to inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.)
211 259 Thus, one goal of this work is to map (string) values in selected fields to entities defined in corresponding vocabularies. The main providers of relevant vocabularies are ISOcat and CLAVAS, a service for managing and providing vocabularies in SKOS format. Closed data categories and their corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS; other relevant vocabularies shall also be ingested into this system, so that for our purposes we can assume OpenSKOS as the single source of vocabularies.
… …
228 276 <org1> dcterms:identifier <org1>, <dbpedia/org1>, <lt-world/orgx>;
229 277 \end{example}
278
279
280
281 \paragraph{Mapping from strings to Entities}
282
283 Find matching entities in selected ontologies based on the textual values in the metadata records.
284
285
286 Identify related ontologies:
287 LT-World \cite{Joerg2010}
288
289 task:
290 \begin{enumerate}
291 \item express MDRecords in RDF
292 \item identify related ontologies/vocabularies (category $\rightarrow$ vocabulary)
293 \item use a lookup/mapping function (Vocabulary Alignment Service? CATCH-PLUS?)
294 295 %\fbox{ function lookup: Category x String -> ConceptualDomain} 296 \begin{eqnarray*} 297 lookup(Category, Literal) \rightarrow ConceptualDomain?? 298 \end{eqnarray*} 299 300 301 Normally this would be served by dedicated controlled vocabularies, but expect also some string-normalizing preprocessing etc. 302 \end{enumerate} 303 304 230 305 231 306 \section{RELcat - Ontological relations} … … 260 335 261 336 262 \section{References} 263 264 Schuurman, I. \& Windhouwer., M. Explicit Semantics for Enriched Documents. What Do ISOcat, RELcat and SCHEMAcat Have To Offer? 2nd Supporting Digital Humanities conference (SDH 2011), 17-18 November 2011, Copenhagen, Denmark, 2011 265 Windhouwer, M. \& Wright, S. E. Linking to linguistic data categories in ISOcat Linked Data in Linguistics, Springer, 2012, 99-107 266 267 \furl{http://www.openarchives.org/ore/1.0/primer\#Foundations} 268 \furl{http://openannotation.org/spec/core/core.html\#Motivations} 269 270 271 272 337 338 \section{SMC LOD} 339 340 \todoin{read: Europeana RDF Store Report} 341 342 \todocode{install Jena + fuseki}\furl{http://jena.apache.org}\furl{http://jena.apache.org/documentation/serving_data/index.html}\furl{http://csarven.ca/how-to-create-a-linked-data-site} 343 344 \todocode{install older python (2.5?) to be able to install dot2tex - transforming dot files to nicer pgf formatted graphs}\furl{http://dot2tex.googlecode.com/files/dot2tex-2.8.7.zip}\furl{file:/C:/Users/m/2kb/tex/dot2tex-2.8.7/} 345 346 347 \todocode{check install siren}\furl{http://siren.sindice.com/} 348 \todocode{check install Virtuoso}\furl{http://ods.openlinksw.com/wiki/ODS/} 349 \todocode{check install Neo4J} 350 \todocode{check install ontology browser} 351 352 semantic search component in the Linked Media Framework 353 \todocode{!!! 
check install LMF - kiwi - SemanticSearch !!!}\furl{http://code.google.com/p/kiwi/wiki/SemanticSearch} 354 355 \todoin{check SARQ}\furl{http://github.com/castagna/SARQ} 356 357 \todocode{Load data: relcat, clavas, olac-and-dc-providers cmd, lt-world?} 358 359 360 361 362 -
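The two-step mapping described in CMD2RDF.tex above (first emit triples with literal values, then replace selected literals by entities via lookup(Category, Literal), with some string-normalizing preprocessing) might be sketched roughly as follows. This is only an illustration under stated assumptions: the vocabulary content, the example.org URIs, and the names VOCABULARIES, normalize, lookup and map_triple are hypothetical and not part of the SMC code; a real implementation would presumably query OpenSKOS/CLAVAS instead of an in-memory dictionary.

```python
import unicodedata

# Hypothetical vocabulary: category -> {normalized label -> entity URI}.
# Labels and URIs are invented for illustration only.
VOCABULARIES = {
    "organisation": {
        "max planck institute for psycholinguistics": "http://example.org/org/mpi-nijmegen",
        "mpi for psycholinguistics": "http://example.org/org/mpi-nijmegen",
    },
}


def normalize(literal):
    """Strip accents and punctuation, case-fold, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", literal)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents)
    return " ".join(cleaned.casefold().split())


def lookup(category, literal):
    """Return the entity URI for a literal within a category, or None if unmapped."""
    return VOCABULARIES.get(category, {}).get(normalize(literal))


def map_triple(triple, category):
    """Step 2: turn (record, property, literal) into (record, property, entity).

    If no entity is found, the literal triple is returned unchanged.
    """
    subject, prop, literal = triple
    entity = lookup(category, literal)
    return (subject, prop, entity) if entity else triple
```

For example, map_triple(("<lr1.cmd>", "cmd:Organisation", "MPI for Psycholinguistics"), "organisation") replaces the literal by the (invented) entity URI, while a label missing from the vocabulary leaves the literal triple as it is. Keeping the unmapped literal rather than dropping it is a design choice of this sketch, so that no metadata is lost when the vocabulary is incomplete.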
SMC4LRT/chapters/Design.tex
r3204 r3233 2 2 \chapter{Semantic Mapping Component - Design} 3 3 \label{ch:design} 4 5 \section{System Architecture} 6 7 The Semantic Mapping module is based on the DCR and CMD framework and is being developed as a separate service on the side of CLARIN Metadata Service, its primary consuming service, but shall be equally usable by other applications. 8 9 10 \todoin{appendix: reference architecture} 11 4 12 5 13 \section{Data Model?} … … 63 71 64 72 65 \section{ Semantic Mapping on conceptlevel}73 \section{Crosswalks -- Mapping on schema level} 66 74 67 75 merging the pieces of information provided by those, … … 118 126 119 127 120 \subsection *{Extensions}128 \subsection{Extensions} 121 129 122 130 A useful supplementary function of the module would be to provide a list of existing indexes. … … 127 135 Also, use of \emph{other than equivalency relations will necessitate more complex logic in the query expansion and accordingly also more complex response of the SMC, either returning the relation types themselves as well or equip the list of indexes with some similarity ratio.} 128 136 129 130 131 \section{Semantic Mapping on instance level} 132 133 134 \subsection{Mapping from strings to Entities} 135 136 Find matching entities in selected Ontologies based on the textual values in the metadata records. 137 138 139 Identify related ontologies: 140 LT-World \cite{Joerg2010} 141 142 task: 143 \begin{enumerate} 144 \item express MDRecords in RDF 145 \item identify related ontologies/vocabularies (category $\rightarrow$ vocabulary) 146 \item use a lookup/mapping function (Vocabulary Alignement Service? CATCH-PLUS?) 147 148 %\fbox{ function lookup: Category x String -> ConceptualDomain} 149 \begin{eqnarray*} 150 lookup(Category, Literal) \rightarrow ConceptualDomain?? 151 \end{eqnarray*} 152 153 154 Normally this would be served by dedicated controlled vocabularies, but expect also some string-normalizing preprocessing etc. 
155 \end{enumerate} 156 157 158 \subsection{Linked Data - Express dataset in RDF} 159 160 161 I do think that ISOcat, CLAVAS, RELcat, an actual language 162 resource all provide a part of the semantic network. 163 164 And if you can express these all in RDF, which we can for almost all of them (maybe 165 except the actual language resource ... unless it has a schema adorned 166 with ISOcat DC references ... \textless insert a SCHEMAcat plug ;-) \textgreater, but for 167 metadata we have that in the CMDI profiles ...) you could load all the 168 relevant parts in a triple store and do your SPARQL/reasoning on it. Well 169 that's where I'm ultimately heading with all these registries related to 170 semantic interoperability ... I hope ;-) 171 \todocite{Menzo} 172 173 174 Partly as by-product of the entities-mapping effort we will get the metadata-description rendered in RDF, linked with 175 So theoretically we then only need to provide them ``on the web'', to make them a nucleus of the LinkedData-Cloud. 176 177 178 Technical aspects (RDF-store?) / interface (ontology browser?) 179 180 \todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/} 181 182 \todocode{check/install: Linked Data browser: LoD p. 
81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)} 183 184 defining the Mapping: 185 \begin{enumerate} 186 \item convert to RDF 187 translate: MDRecord $\rightarrow$ [\#mdrecord \#property literal] 188 \item map: \#mdrecord \#property literal $\rightarrow$ [\#mdrecord \#property \#entity] 189 \end{enumerate} 190 191 192 \begin{figure*}[!ht] 193 \includegraphics[width=1\textwidth]{images/SMC_CMD2LOD} 194 \caption{The process of transforming the CMD metadata records to and RDF representation} 195 \label{fig:smc_cmd2lod} 196 \end{figure*} 197 198 199 \section{Semantic Search} 137 \subsection{Initialization} 138 139 First there is an initialization phase, in which the application fetches the information from the source modules (cf. \ref{components}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories: 140 \newline 141 142 \textit{datcatURI $\mapsto$ profile.component.element[]} 143 \newline 144 145 The collected data categories are enriched with information from corresponding registries (DCRs), adding the verbose identifier, the description and available translations into other working languages. %, usable as base for multi-lingual search user-interface. 146 147 Finally relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories. 148 149 150 \section{Concept-based search} 200 151 201 152 Main purpose for the undertaking described in previous two chapters (mapping of concepts and entities) is to enhance the search capabilities of the MDService serving the Metadata/Resources-data. Namely to enhance it by employing ontological resources. 
… … 213 164 Synonym Expansion (via TermExtraction(ContentSet)) 214 165 166 215 167 \subsection{Query Expansion} 216 168 169 170 171 \subsection{SMC as module for Metadata Repository} 172 173 (MD)search frameworks: 174 175 \begin{description} 176 \item[Zebra/Z39.50] JZKit 177 \item[Lucene/Solr] 178 \item[eXist] - xml DB 179 \end{description} 180 181 182 183 \section{User Interface?} 184 185 \subsection*{Query Input} 186 187 \subsection*{Columns} 188 189 \subsection*{Summaries} 190 191 \subsection*{Differential Views} 192 Visualize impact of given mapping in terms of covered dataset (number of matched records). 193 194 \subsection*{Visualization} 195 Landscape, Treemap, SOM 196 197 Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf 217 198 218 199 \section{Semantic Mapping in Metadata vs. Content/Annotation} -
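The initialization phase moved into Design.tex above (building the inverted map datcatURI -> profile.component.element[] and then matching relation sets against it) can be illustrated with a small sketch. It assumes the Component Registry content has already been reduced to (profile, component, element, data category URI) tuples; the sample data, the example.org URIs and the function names build_inverted_map and related_paths are hypothetical, and the actual SMC is implemented as XSL stylesheets rather than Python.

```python
from collections import defaultdict

# Hypothetical stand-in for data read from the Component Registry:
# (profile, component, element, data category URI or None).
ELEMENTS = [
    ("ProfileA", "Language", "LanguageName", "http://example.org/datcat/languageName"),
    ("ProfileB", "Lang", "Name", "http://example.org/datcat/languageName"),
    ("ProfileB", "General", "Language", "http://example.org/datcat/dc-language"),
    ("ProfileA", "General", "Note", None),
]


def build_inverted_map(elements):
    """Map each data category URI to the dotted paths of the elements that reference it."""
    index = defaultdict(list)
    for profile, component, element, datcat in elements:
        if datcat:  # elements without a data category reference are skipped
            index[datcat].append(f"{profile}.{component}.{element}")
    return dict(index)


def related_paths(index, relation_sets, datcat):
    """Collect the paths for a data category and for all categories related to it.

    relation_sets is a list of sets of URIs, standing in for relation sets
    fetched from the Relation Registry (equivalence only, in this sketch).
    """
    related = {datcat}
    for relation_set in relation_sets:
        if datcat in relation_set:
            related |= set(relation_set)
    paths = []
    for uri in sorted(related):
        paths.extend(index.get(uri, []))
    return paths
```

With the sample data, a relation set that declares the two example data categories equivalent makes a query on either of them return all three element paths, which is the effect the query expansion relies on. Relation types other than equivalence would need additional logic, as noted in the Extensions subsection.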
SMC4LRT/chapters/Implementation.tex
r3204 r3233 5 5 6 6 7 The core function of the SMC is implemented as a set of XSL-stylesheets, with auxiliary functionality (like caching or a wrapping web service) being provided by a wrapping application implemented in Java. There is also a plan to provide an XQuery implementation. The SMC module is being maintained in the CMDI code repository\footnote {\url{http://svn.clarin.eu/SMC}}.8 7 9 The Semantic Mapping module is based on the DCR and CMD framework and is being developed as a separate service on the side of CLARIN Metadata Service, its primary consuming service, but shall be equally usable by other applications.10 11 12 \section{Initialization}13 14 First there is an initialization phase, in which the application fetches the information from the source modules (cf. \ref{components}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories:15 \newline16 17 \textit{datcatURI $\mapsto$ profile.component.element[]}18 \newline19 20 The collected data categories are enriched with information from corresponding registries (DCRs), adding the verbose identifier, the description and available translations into other working languages. %, usable as base for multi-lingual search user-interface.21 22 Finally relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories.23 24 25 \section{SMC as module for Metadata Repository}26 27 (MD)search frameworks:28 29 \begin{description}30 \item[Zebra/Z39.50] JZKit31 \item[Lucene/Solr]32 \item[eXist] - xml DB33 \end{description}34 35 36 37 \section{SMC Browser}38 39 Explore the Component Metadata Framework40 41 In CMD, metadata schemas are defined by profiles, that are constructed out of reusable components - collections of metadata fields. 
The components can contain other components, and they can be reused in multiple profiles. Furthermore, every CMD element (metadata field) refers via a PID to a data category to indicate unambiguously how the content of the field in a metadata description should be interpreted (Broeder et al., 2010).42 43 Thus, every profile can be expressed as a tree, with the profile component as the root node, the used components as intermediate nodes and elements or data categories as leaf nodes, parent-child relationship being defined by the inclusion (componentA -includes-> componentB) or referencing (elementA -refersTo-> datcat1).The reuse of components in multiple profiles and especially also the referencing of the same data categories in multiple CMD elements leads to a blending of the individual profile trees into a graph (acyclic directed, but not necessarily connected).44 45 SMC Browser visualizes this graph structure in an interactive fashion. You can have a look at the examples for inspiration.46 47 It is implemented on top of wonderful js-library d3, the code checked in clarin-svn (and needs refactoring). More technical documentation follows soon.48 49 The graph is constructed from all profiles defined in the Component Registry. To resolve name and description of data categories referenced in the CMD elements definitions of all (public) data categories from DublinCore and ISOcat (from the Metadata Profile [RDF] - retrieving takes some time!) are fetched. However only data categories used in CMD will get part of the graph. 
Here is a quantitative summary of the dataset.50 51 52 \begin{figure*}[!ht]53 \includegraphics[width=1\textwidth]{images/screen_SMC-Browser_2013-01-23}54 \caption{Screenshot of the SMC browser}55 \end{figure*}56 57 58 \section{SMC LOD}59 60 \todoin{read: Europeana RDF Store Report}61 62 \todocode{install Jena + fuseki}\furl{http://jena.apache.org}\furl{http://jena.apache.org/documentation/serving_data/index.html}\furl{http://csarven.ca/how-to-create-a-linked-data-site}63 64 \todocode{check install siren}\furl{http://siren.sindice.com/}65 \todocode{check install Virtuoso}\furl{http://ods.openlinksw.com/wiki/ODS/}66 \todocode{check install Neo4J}67 \todocode{check install ontology browser}68 69 semantic search component in the Linked Media Framework70 \todocode{!!! check install LMF - kiwi - SemanticSearch !!!}\furl{http://code.google.com/p/kiwi/wiki/SemanticSearch}71 72 \todoin{check SARQ}\furl{http://github.com/castagna/SARQ}73 74 \todocode{Load data: relcat, clavas, olac-and-dc-providers cmd, lt-world?}75 76 77 \section{User Interface?}78 79 \subsection*{Query Input}80 81 \subsection*{Columns}82 83 \subsection*{Summaries}84 85 \subsection*{Differential Views}86 Visualize impact of given mapping in terms of covered dataset (number of matched records).87 88 \subsection*{Visualization}89 Landscape, Treemap, SOM90 91 Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf -
SMC4LRT/chapters/Results.tex
r3204 r3233
156 156
157 157
158 %\section{Usability}
158 \section{SMC-Browser Advanced Interactive User Interface}
159
160 Explore the Component Metadata Framework
161
162 In CMD, metadata schemas are defined by profiles that are constructed out of reusable components, i.e. collections of metadata fields. The components can contain other components, and they can be reused in multiple profiles. Furthermore, every CMD element (metadata field) refers via a PID to a data category to indicate unambiguously how the content of the field in a metadata description should be interpreted (Broeder et al., 2010).
163
164 Thus, every profile can be expressed as a tree, with the profile component as the root node, the used components as intermediate nodes and elements or data categories as leaf nodes; the parent-child relationship is defined by inclusion (componentA -includes-> componentB) or referencing (elementA -refersTo-> datcat1). The reuse of components in multiple profiles, and especially the referencing of the same data categories in multiple CMD elements, blends the individual profile trees into a graph (directed and acyclic, but not necessarily connected).
165
166 SMC Browser visualizes this graph structure in an interactive fashion. You can have a look at the examples for inspiration.
167
168 It is implemented on top of the JavaScript library d3; the code is checked into clarin-svn (and needs refactoring). More technical documentation follows soon.
169
170 The graph is constructed from all profiles defined in the Component Registry. To resolve the names and descriptions of data categories referenced in the CMD elements, the definitions of all (public) data categories from Dublin Core and ISOcat (from the Metadata Profile [RDF]; retrieval takes some time!) are fetched. However, only data categories used in CMD become part of the graph. Here is a quantitative summary of the dataset.
171 172 173 \begin{figure*}[!ht] 174 \includegraphics[width=1\textwidth]{images/screen_SMC-Browser_2013-01-23} 175 \caption{Screenshot of the SMC browser} 176 \end{figure*} 177 178
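The blending of profile trees into one graph, as described in the Results.tex text above, can be made concrete with a short sketch. It assumes each profile is given as a list of (parent, child) edges covering both the -includes-> and -refersTo-> relations; the node names and the functions build_graph and shared_nodes are hypothetical and do not reflect the d3-based SMC Browser code.

```python
from collections import defaultdict

# Hypothetical profile trees as (parent, child) edge lists.
# componentX is reused by both profiles; datcat1 is referenced by two elements.
PROFILES = {
    "profileA": [
        ("profileA", "componentX"),
        ("componentX", "elementLang"),
        ("elementLang", "datcat1"),
    ],
    "profileB": [
        ("profileB", "componentX"),
        ("profileB", "elementName"),
        ("elementName", "datcat1"),
    ],
}


def build_graph(profiles):
    """Merge all profile trees into one directed graph (node -> set of children)."""
    graph = {}
    for root, edges in profiles.items():
        graph.setdefault(root, set())
        for parent, child in edges:
            graph.setdefault(parent, set()).add(child)
            graph.setdefault(child, set())  # leaf nodes get an empty child set
    return graph


def shared_nodes(profiles):
    """Return the nodes that occur in more than one profile tree."""
    seen = defaultdict(set)
    for root, edges in profiles.items():
        for parent, child in edges:
            seen[parent].add(root)
            seen[child].add(root)
    return sorted(node for node, roots in seen.items() if len(roots) > 1)
```

In this toy dataset the two trees are joined at componentX and datcat1, so the merged structure is no longer a tree: datcat1 has two parents. Profiles that share no component or data category would remain separate parts of the graph, which is why the graph is not necessarily connected.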