% source: SMC4LRT/chapters/SMC.tex @ 2696
% Last change on this file since 2696 was 2696, checked in by vronk, 11 years ago:
% added notes from Menzo about DCR CLAVAS interaction
\chapter{Semantic Mapping Component}


\section{Data Model}

Terms ?
move to SKOS ?

RDF


\subsection{CMD namespace}
Describe the CMD-format?


\subsection{DCR in SKOS}
\label{dcr-skos}
Describe the mapping from DCR into SKOS

The DCR recognizes the following types of data categories:
simple, complex: closed, open, constrained, (container)?

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=0.7\textwidth]{images/dc_types}
\end{center}
\caption{Data Category types}
\end{figure*}
\todo{cite: ISOcat introduction at CLARIN-NL Workshop}

The export to CLAVAS-SKOS only considers closed and simple DCs from the metadata profile.
A closed DC maps to a concept scheme and a simple DC to a SKOS concept in such a concept scheme.
However, it still needs to be assessed how useful this approach is. In the metadata profile
there are many closed DCs with small value domains. How useful are those
in CLAVAS?
Originally, the vocabulary repository was conceived to manage rather large and complex value domains
that do not fit easily into the DCR data model.
Therefore a threshold seems sensible, so that only value domains with more
than 20, 50 or 100 values are exported.

Open or constrained DCs are not exported, as they do not contribute anything to a vocabulary. \todo{cite: Menzo2013-03-12 mail}
However, they can become users of a CLAVAS vocabulary. In fact, providing vocabularies for constrained but large and complex conceptual domains is the main motivation for the vocabulary repository.

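The export rule described above can be illustrated with a few lines of Python. This is only a sketch under stated assumptions, not the actual export: the class \texttt{DataCategory}, the function \texttt{export\_to\_skos} and the parameter \texttt{min\_values} are hypothetical names, and the default threshold of 20 is just one of the candidate values mentioned above.

\begin{lstlisting}[language=Python]
from dataclasses import dataclass, field


@dataclass
class DataCategory:
    """Hypothetical, simplified view of a data category (not the DCIF model)."""
    identifier: str
    dc_type: str  # "simple", "closed", "open" or "constrained"
    # Simple DCs forming the value domain (only filled for closed DCs).
    values: list = field(default_factory=list)


def export_to_skos(dcs, min_values=20):
    """Map closed DCs to concept schemes and their simple DCs to concepts.

    Returns a dict: concept scheme identifier -> list of concept identifiers.
    Open and constrained DCs are skipped, as are closed DCs whose value
    domain does not have more than min_values entries.
    """
    schemes = {}
    for dc in dcs:
        if dc.dc_type != "closed":
            continue  # open/constrained DCs contribute nothing to a vocabulary
        if len(dc.values) <= min_values:
            continue  # value domain too small to be worth exporting
        schemes[dc.identifier] = [value.identifier for value in dc.values]
    return schemes
\end{lstlisting}

Keeping the threshold a parameter makes it easy to compare how many concept schemes remain at 20, 50 or 100 values.
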
Currently (before the integration of VAS and DCR), the only way to constrain the value domain of a data category
is through the means that XML Schema provides, such as a regular expression. For the data category \concept{languageID DC-2482},
the rule looks as follows:
\lstset{language=XML}
\begin{lstlisting}
<dcif:conceptualDomain type="constrained">
  <dcif:dataType>string</dcif:dataType>
  <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
  <dcif:rule>[a-z]{3}</dcif:rule>
</dcif:conceptualDomain>
\end{lstlisting}

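How such a rule constrains values can be checked with a short Python sketch. The function name \texttt{is\_valid} is an assumption made for illustration; the pattern is the one from the listing above. XML Schema patterns always have to match the whole value, which corresponds to \texttt{re.fullmatch} in Python. For this simple pattern the XML Schema and Python regular expression syntaxes coincide, which is not the case in general.

\begin{lstlisting}[language=Python]
import re

# Rule taken from the conceptual domain of languageID shown above.
LANGUAGE_ID_RULE = r"[a-z]{3}"


def is_valid(value, rule=LANGUAGE_ID_RULE):
    """Check a value against a regular-expression rule.

    XML Schema patterns are implicitly anchored to the whole value,
    hence fullmatch rather than search or match.
    """
    return re.fullmatch(rule, value) is not None
\end{lstlisting}

Note that the rule only checks the form of the value: any three lower-case letters pass, whether or not they denote a language.
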
A current proposal by Windhouwer\todo{cite: Menzo2013-03-12 mail} for the integration with CLAVAS foresees the following extension:

\begin{lstlisting}
<clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
\end{lstlisting}

\code{@href} points to the vocabulary. In the context of ISOcat a PID should actually be used,
but it is not clear how persistent the vocabularies are. This may pose a problem, as part of the DC specification may then have a different persistency than the core.

\code{@type} could be \code{closed} or \code{open}. \code{closed}: only values in the vocabulary are
valid. \code{open}: the values in the vocabulary are hints/preferred values. Basically the DC itself is then open.

This would yield the following definition of the conceptualDomain for the data category:

\lstset{language=XML}
\begin{lstlisting}
<dcif:conceptualDomain type="constrained">
  <dcif:dataType>string</dcif:dataType>
  <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
  <dcif:rule>[a-z]{3}</dcif:rule>
</dcif:conceptualDomain>
<dcif:conceptualDomain type="constrained">
  <dcif:dataType>string</dcif:dataType>
  <dcif:ruleType>CLAVAS vocabulary</dcif:ruleType>
  <dcif:rule>
    <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
  </dcif:rule>
</dcif:conceptualDomain>
\end{lstlisting}

That is, the new rule pointing to the vocabulary would be \emph{added}, so that tools that do not support CLAVAS
lookup but are capable of XSD/RNG validation can still use the definition based on the regular expression.


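The interplay of the two rules can be sketched in Python as follows. The sketch rests on two assumptions that the proposal does not spell out: the regular expression continues to apply when a vocabulary is available, and the values of the vocabulary have already been retrieved into a set (how they are fetched from CLAVAS is left out). The function \texttt{validate} and its parameters are hypothetical names.

\begin{lstlisting}[language=Python]
import re


def validate(value, regex_rule, vocabulary=None, vocabulary_type="closed"):
    """Return True if value is acceptable for the data category.

    vocabulary is a set of values already fetched from the vocabulary
    service, or None for a tool without CLAVAS lookup.
    """
    if re.fullmatch(regex_rule, value) is None:
        return False      # the regular-expression rule always applies
    if vocabulary is None:
        return True       # regex-only tool: nothing more to check
    if vocabulary_type == "open":
        return True       # vocabulary entries are only hints
    return value in vocabulary  # closed: value must be in the vocabulary
\end{lstlisting}

A tool without vocabulary support thus accepts a superset of the values accepted by a CLAVAS-aware tool, which is what makes adding the second rule backwards compatible.
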
\begin{note}

\noindent
Something similar for the link to an EBNF grammar in SCHEMAcat:

%\begin{lstlisting}
\begin{verbatim}
<scr:valueSchema
  xmlns:scr="http://www.isocat.org/ns/scr"
  pid="http://hdl.handle.net/1839/00-SCHM-0000-0000-004A-A"
  type="ISO 14977:1996 EBNF"/>
\end{verbatim}
%\end{lstlisting}
\end{note}



\subsection{smcIndex}\label{indexes}
In this section we describe \emph{smcIndex} -- the data type for input and output of the proposed application.
An smcIndex is a human-readable string adhering to a specific syntax, denoting some search index.
The generic syntax is:
\begin{eqnarray*}
smcIndex ::= context \ contextSep \ conceptLabel
\end{eqnarray*}

We distinguish two types of smcIndexes: (i) \emph{dcrIndex} referring to data categories and (ii) \emph{cmdIndex} denoting a specific
``CMD-entity'', i.e. a metadata field, component or whole profile defined within CMD. The \textit{cmdIndex} can be interpreted as an XPath into the instances of CMD-profiles. In contrast, the \textit{dcrIndexes} are generally not directly applicable to existing data, but can be understood as abstract indexes referring to well-defined concepts -- the data categories -- and for actual search they need to be resolved to the metadata fields that refer to them. In return, one can expect to match more metadata fields from multiple profiles, all referring to the same data category.

These two types of smcIndex also follow different construction patterns:
\begin{eqnarray*}
smcIndex & ::= & dcrIndex \ | \ cmdIndex \\
dcrIndex & ::= & dcrID \ contextSep \ datcatLabel \\
cmdIndex & ::= & profile \ \\
& & | \ [\ profile \ contextSep \ ] \ dotPath \\
dotPath & ::= & [\ dotPath \ pathSep \ ] \ elemName \\
contextSep & ::= & \texttt{`.`} \ | \ \texttt{`:`} \\
pathSep & ::= & \texttt{`.`} \\
dcrID & ::= & \texttt{`isocat`} \ | \ \texttt{`dc`}
\end{eqnarray*}

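To make the grammar more tangible, the following Python sketch classifies an index string and splits it into its parts. It is an illustration under assumptions, not the parser of the application: the function names and the returned dictionary layout are hypothetical, and an index is treated as a \textit{dcrIndex} exactly when its context is one of the two \textit{dcrID} values listed in the grammar.

\begin{lstlisting}[language=Python]
import re

DCR_IDS = {"isocat", "dc"}  # dcrID alternatives from the grammar


def split_context(index):
    """Split at the first contextSep ('.' or ':'): (head, separator, rest)."""
    match = re.search(r"[.:]", index)
    if match is None:
        return index, "", ""
    return index[:match.start()], match.group(), index[match.end():]


def parse_smc_index(index):
    """Classify an smcIndex and split it into its parts."""
    head, sep, rest = split_context(index)
    if sep and head in DCR_IDS:
        return {"type": "dcrIndex", "dcrID": head, "datcatLabel": rest}
    if sep == ":":
        # ':' is never a pathSep, so it marks an explicit profile.
        return {"type": "cmdIndex", "profile": head,
                "dotPath": rest.split(".")}
    # With '.' (or no separator) the grammar alone cannot tell a profile
    # from the first path step, so everything is kept in the path.
    return {"type": "cmdIndex", "profile": None,
            "dotPath": index.split(".")}
\end{lstlisting}

The last branch shows that the grammar is syntactically ambiguous wherever \texttt{`.`} serves both as \textit{contextSep} and as \textit{pathSep}; deciding whether the first step is a profile requires knowledge of the existing profile names.
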
The grammar is based on the way indices are referenced in CQL syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (\texttt{dc.title}) and on the dot-notation used in the IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} (\texttt{Session.Location.Country}).

\textit{dcrID} is a shortcut referring to a data category registry
%\footnote{Next to ISOcat other registries can function as a DCR, e.g., the Dublin Core set of metadata terms.}
similar to the namespace-mechanism in XML-documents. \textit{datcatLabel} is the verbose Identifier- (e.g. \texttt{telephoneNumber}) or the Name-attribute (in any available translation, e.g. \texttt{numero di telefono@it}) of the data category.
% While it is desirable to also allow the Name-attribute of the data category (\texttt{telephone number}), especially also the Names defined in other working languages (\texttt{numero di telefono@it, numer telefonu@pl}), special care has to be taken here as these attributes mostly contain white spaces, which could cause problems in downstream components, when parsing a complex query containing such indices.
\textit{profile} is the name of the profile. % (despite the danger of ambiguity).
\textit{dotPath} allows addressing a leaf element (\texttt{Session.Actor.Role}), or any intermediary XML-element corresponding to a CMD-component (\texttt{Session.Actor}) within a metadata description. %This allows to easily express search in whole components, instead of having to list all individual fields.

Generally, smcIndexes can be ambiguous, meaning they can refer to multiple concepts or entities (CMD-elements). This is because the names of the data categories and CMD-entities are not guaranteed to be unique. The module will have to cope with this by providing, on demand, the list of identifiers corresponding to a given smcIndex.

%As an important sidenote -- cmdIndexes can be ambiguous, meaning they can refer to multiple entities (metadata fields), examples of valid indexes:
%\begin{verbatim}
%Name
%Actor.Name, Project.Name
%Session.Actor.Name, Drama.Actor.Name
%\end{verbatim}

%So we disambiguate (or narrow down the ambiguity) by prefixing context.


\subsection{Query language}
CQL?


\section{Semantic Mapping on concept level}

merging the pieces of information provided by those,
offering them semi-transparently to the user (or application) on the consumption side.

a module of the Component Metadata Infrastructure performing semantic mapping on search indexes. This forms the basis for query expansion to facilitate semantic search and enhance recall when querying the Metadata Repository.


In this section, we describe the actual task of the proposed application -- \textbf{mapping indexes to indexes} -- in abstract terms. The returned mappings can be used by other applications to expand or translate the original user query, to match elements in other schemas.\footnote{Though tightly related, mapping of terms and query expansion are to be seen as two separate functions.}
% \footnote{This primary usage of SMC for work with user-created query strings explains the need for human-readability of the indices.}

In operation, the application accepts any index (\textit{smcIndex}, cf. \ref{indexes}) and returns a list of corresponding indexes (or only the input index, if no correspondences were found):
\newline

\textit{smcIndex $\mapsto$ smcIndex[ ]}
\newline

We can distinguish the following levels for this mapping function:

(1) \emph{data category identity} -- for the resolution, only the basic data category map derived from the Component Registry is employed. Accordingly, only indexes denoting CMD-elements (\textit{cmdIndexes}) bound to a given data category are returned:
\newline

\texttt{isocat.size $\mapsto$ } \newline
\verb| [teiHeader.extent, |\newline
\verb| TextCorpusProfile.Number]|
\newline

A \textit{cmdIndex} as input is also possible. It is translated to the corresponding data category, and the resolution proceeds as above:
\newline

\texttt{imdi-corpus.Name $\mapsto$ } \newline
\verb| (isocat.resourceName) |$\mapsto$ \newline
\verb| TextCorpusProfile.GeneralInfo.Name|
\newline

(2) \emph{relations between data categories} -- employing also information from the Relation Registry, related (equivalent) data categories are retrieved and subsequently both the input and the related data categories are resolved to cmdIndexes:
\newline

\texttt{isocat.resourceTitle $\mapsto$ }
\verb| (+ dc.title) |$\mapsto$ \newline
\verb| [imdi-corpus.Title, | \newline
\verb| TextCorpusProfile.GeneralInfo.Title,| \newline
\verb| teiHeader.titleStmt.title,| \newline
\verb| teiHeader.monogr.title]|
\newline

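Levels (1) and (2) can be sketched in Python with two small lookup tables. The index names are taken from the examples above; which of the title fields is bound to \texttt{isocat.resourceTitle} and which to \texttt{dc.title} is an assumption made for the sketch, as are the names \texttt{DATCAT\_TO\_CMD}, \texttt{EQUIVALENT} and \texttt{map\_index}. In the real setting the two tables would be derived from the Component Registry and the Relation Registry.

\begin{lstlisting}[language=Python]
# Data category -> cmdIndexes bound to it (assumed split, see text).
DATCAT_TO_CMD = {
    "isocat.size": ["teiHeader.extent", "TextCorpusProfile.Number"],
    "isocat.resourceName": ["imdi-corpus.Name",
                            "TextCorpusProfile.GeneralInfo.Name"],
    "isocat.resourceTitle": ["imdi-corpus.Title",
                             "TextCorpusProfile.GeneralInfo.Title"],
    "dc.title": ["teiHeader.titleStmt.title", "teiHeader.monogr.title"],
}

# Equivalence relations between data categories.
EQUIVALENT = {
    "isocat.resourceTitle": ["dc.title"],
    "dc.title": ["isocat.resourceTitle"],
}


def map_index(index, use_relations=False):
    """Map an smcIndex to the list of corresponding cmdIndexes."""
    if index in DATCAT_TO_CMD:
        datcats = [index]  # dcrIndex: stands for itself
    else:
        # cmdIndex: find the data categories it is bound to.
        datcats = [dc for dc, fields in DATCAT_TO_CMD.items()
                   if index in fields]
    if use_relations:  # level (2): add related data categories
        datcats = datcats + [rel for dc in datcats
                             for rel in EQUIVALENT.get(dc, [])]
    result = []
    for dc in datcats:
        for cmd_index in DATCAT_TO_CMD.get(dc, []):
            if cmd_index != index and cmd_index not in result:
                result.append(cmd_index)
    return result or [index]  # no correspondence: return the input
\end{lstlisting}

With these tables, \texttt{map\_index("isocat.size")} reproduces the example of level (1), and \texttt{map\_index("isocat.resourceTitle", use\_relations=True)} the example of level (2).
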
(3) \emph{container data categories} -- further expansions will be possible once container data categories \cite{SchuurmanWindhouwer2011} are in use. Currently only fields (leaf nodes) in metadata descriptions are linked to data categories. However, at times there is a need to conceptually bind the components as well, meaning that besides the ``atomic'' data category for \texttt{actorName}, there would also be a data category for the complex concept \texttt{Actor}.
Having concept links on components as well will require a compositional approach to the task of semantic mapping, resulting in:
\newline
\texttt{Actor.Name $\mapsto$ }\newline
\verb| [Actor.Name, Actor.FullName, |\newline
\verb| Person.Name, Person.FullName]|


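One way to read ``compositional'' is that every step of the path is expanded on its own and the alternatives are then combined. The following Python sketch does this with a Cartesian product; the two tables and the function \texttt{expand\_path} are hypothetical and merely chosen so that the example above is reproduced.

\begin{lstlisting}[language=Python]
from itertools import product

# Assumed equivalences: components and elements sharing a concept.
COMPONENT_EQUIV = {"Actor": ["Actor", "Person"]}
ELEMENT_EQUIV = {"Name": ["Name", "FullName"]}


def expand_path(dot_path):
    """Expand each step of a dotPath and combine all alternatives."""
    *components, element = dot_path.split(".")
    alternatives = [COMPONENT_EQUIV.get(c, [c]) for c in components]
    alternatives.append(ELEMENT_EQUIV.get(element, [element]))
    return [".".join(steps) for steps in product(*alternatives)]
\end{lstlisting}

Since the result is a product over all steps, the number of returned indexes grows quickly with the depth of the path, which may have to be taken into account when the expansion is used in queries.
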
\subsection*{Extensions}

A useful supplementary function of the module would be to provide a list of existing indexes.
That would allow the search user interface to equip the query input with autocompletion. The application should also deliver additional information about the indexes, such as a description and a link to the definition of the underlying entity in the source registry.

Once there are overlapping\footnote{i.e. different relations may be defined for one data category in different relation sets} user-defined relation sets in the Relation Registry, an additional input parameter will be required to \emph{explicitly restrict the selection of relation sets} to apply in the mapping function.

Also, the use of relations \emph{other than equivalency} will necessitate more complex logic in the query expansion and, accordingly, a more complex response from the SMC: either returning the relation types themselves as well, or equipping the list of indexes with some similarity ratio.



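The additional parameter and the richer response could look as in the following Python sketch. Everything in it is made up for illustration: the relation set names, the relation type labels and the function \texttt{related\_datcats} are assumptions and do not reflect the content or interface of the Relation Registry.

\begin{lstlisting}[language=Python]
# (relation set, source, relation type, target); all values are made up.
RELATIONS = [
    ("setA", "isocat.resourceTitle", "equivalent", "dc.title"),
    ("setB", "isocat.resourceTitle", "related", "isocat.resourceName"),
]


def related_datcats(datcat, relation_sets=None):
    """Return (target, relation type) pairs for a data category.

    relation_sets restricts the relation sets that are applied;
    None means that all relation sets are used.
    """
    return [
        (target, rel_type)
        for rel_set, source, rel_type, target in RELATIONS
        if source == datcat
        and (relation_sets is None or rel_set in relation_sets)
    ]
\end{lstlisting}

Returning the relation type next to each target leaves it to the caller to decide whether, for instance, non-equivalent matches should be ranked lower.
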
\section{Semantic Mapping on instance level}


\subsection{Mapping from strings to Entities}

Find matching entities in selected ontologies based on the textual values in the metadata records.


Identify related ontologies:
LT-World \cite{Joerg2010}

task:
\begin{enumerate}
\item express MDRecords in RDF
\item identify related ontologies/vocabularies (category $\rightarrow$ vocabulary)
\item use a lookup/mapping function (Vocabulary Alignment Service? CATCH-PLUS?)

%\fbox{ function lookup: Category x String -> ConceptualDomain}
\begin{eqnarray*}
lookup(Category, Literal) \rightarrow ConceptualDomain??
\end{eqnarray*}


Normally this would be served by dedicated controlled vocabularies, but expect also some string-normalizing preprocessing etc.
\end{enumerate}


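The lookup function with a string-normalizing preprocessing step can be sketched in Python as follows. The vocabulary content, the entity identifier and the particular normalization (lower-casing, removing accents, collapsing whitespace) are assumptions chosen for illustration; a real implementation would query a vocabulary service instead of a local dictionary.

\begin{lstlisting}[language=Python]
import unicodedata

# Hypothetical vocabulary per category: normalized label -> entity id.
VOCABULARIES = {
    "organisation": {"example institute": "ex:ExampleInstitute"},
}


def normalize(literal):
    """Lower-case, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", literal)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def lookup(category, literal):
    """Return the entity for literal in the category's vocabulary.

    Returns None if the category has no vocabulary or nothing matches.
    """
    vocabulary = VOCABULARIES.get(category, {})
    return vocabulary.get(normalize(literal))
\end{lstlisting}

Normalizing both the vocabulary labels and the incoming literals in the same way keeps the lookup itself a plain exact match.
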
\subsection{Linked Data - Express dataset in RDF}

Partly as a by-product of the entities-mapping effort we will get the metadata-description rendered in RDF, linked with
So theoretically we then only need to provide them ``on the web'' to make them a nucleus of the LinkedData-Cloud.


Technical aspects (RDF-store?) / interface (ontology browser?)

defining the Mapping:
\begin{enumerate}
\item convert to RDF
translate: MDRecord $\rightarrow$ [\#mdrecord \#property literal]
\item map: \#mdrecord \#property literal $\rightarrow$ [\#mdrecord \#property \#entity]
\end{enumerate}


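The two steps of the mapping can be sketched in Python, with triples represented as plain tuples rather than with an RDF library. The function names, the flat dictionary used for a metadata record and the \texttt{lookup} callback are assumptions made for this sketch.

\begin{lstlisting}[language=Python]
def translate(record_id, record):
    """Step 1: one (subject, property, literal) triple per field."""
    return [(record_id, prop, literal)
            for prop, literal in record.items()]


def map_entities(triples, lookup):
    """Step 2: replace literals by entities where lookup finds one.

    lookup(prop, literal) returns an entity identifier or None;
    literals without a matching entity are kept unchanged.
    """
    mapped = []
    for subject, prop, literal in triples:
        entity = lookup(prop, literal)
        mapped.append(
            (subject, prop, entity if entity is not None else literal))
    return mapped
\end{lstlisting}

Keeping the unmatched literals means that no information from the original record is lost when only part of the values can be linked to entities.
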
\begin{figure*}[!ht]
\includegraphics[width=1\textwidth]{images/SMC_CMD2LOD}
\caption{The process of transforming the CMD metadata records to an RDF representation}
\end{figure*}


\section{Semantic Search}

The main purpose of the undertaking described in the previous two sections (mapping of concepts and entities) is to enhance the search capabilities of the MDService serving the Metadata/Resources-data, namely by employing ontological resources.
Mainly, this enhancement shall mean that the user can access the data indirectly by browsing one or multiple ontologies
with which the data will then be linked. These could be, for example, ontologies of Organizations and Projects.

In this section we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
A crucial aspect is the question of how to deal with the even greater amount of information in a user-friendly way, i.e. how to avoid overwhelming, intimidating or frustrating the user.

Semi-transparently means that the semantic mapping shall primarily integrate seamlessly into the interaction with the service, but it shall ``explain'' itself on demand, i.e. offer enough information for the user to understand its role and to be able to manipulate it easily.

?
Facets
Controlled Vocabularies
Synonym Expansion (via TermExtraction(ContentSet))

\subsection{Query Expansion}


\section{Semantic Mapping in Metadata vs. Content/Annotation}
AF + DCR + RR

