\section{Semantic Mapping on concept level}

The SMC merges the pieces of information provided by the source modules,
offering them semi-transparently to the user (or application) on the consumption side.

It is a module of the Component Metadata Infrastructure that performs semantic mapping on search indexes. This forms the basis for query expansion, which facilitates semantic search and improves recall when querying the Metadata Repository.


\subsection{smcIndex}\label{indexes}
In this section we describe \emph{smcIndex} -- the data type for input and output of the proposed application.
An smcIndex is a human-readable string adhering to a specific syntax, denoting some search index.
The generic syntax is:
\begin{eqnarray*}
smcIndex ::= context \ contextSep \ conceptLabel
\end{eqnarray*}

We distinguish two types of smcIndexes: (i) \emph{dcrIndex} referring to data categories and (ii) \emph{cmdIndex} denoting a specific
``CMD-entity'', i.e. a metadata field, component or whole profile defined within CMD. The \textit{cmdIndex} can be interpreted as an XPath into the instances of CMD-profiles. In contrast, \textit{dcrIndexes} are generally not directly applicable to existing data. They can be understood as abstract indexes referring to well-defined concepts -- the data categories -- and for actual search they need to be resolved to the metadata fields that refer to them. In return, one can expect to match more metadata fields from multiple profiles, all referring to the same data category.

These two types of smcIndex also follow different construction patterns:
\begin{eqnarray*}
smcIndex & ::= & dcrIndex \ | \ cmdIndex  \\
dcrIndex & ::= & dcrID \ contextSep \ datcatLabel \\
cmdIndex & ::= & profile \  \\
                      &  &  | \  [\ profile \ contextSep \ ] \ dotPath \\
dotPath  & ::= & [\ dotPath \ pathSep \ ] \ elemName \\
contextSep & ::= & \texttt{`.'} \ | \  \texttt{`:'} \\
pathSep & ::= & \texttt{`.'} \\
dcrID & ::= & \texttt{`isocat'} \ | \ \texttt{`dc'}
\end{eqnarray*}

The grammar is based on the way indices are referenced in CQL syntax\footnote{Contextual Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (\texttt{dc.title}) and on the dot-notation used in the IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} (\texttt{Session.Location.Country}).

\textit{dcrID} is a shortcut referring to a data category registry
%\footnote{Next to ISOcat other registries can function as a DCR, e.g., the Dublin Core set of metadata terms.}
similar to the namespace-mechanism in XML-documents.  \textit{datcatLabel} is the verbose Identifier- (e.g. \texttt{telephoneNumber}) or the Name-attribute (in any available translation, e.g. \texttt{numero di telefono@it}) of the data category.
% While it is desirable to also allow the Name-attribute of the data category (\texttt{telephone number}), especially also the Names defined in other working languages (\texttt{numero di telefono@it, numer telefonu@pl}), special care has to be taken here as these attributes mostly contain white spaces, which could cause problems in downstream components, when parsing a complex query containing such indices.
\textit{profile} is the name of the profile. % (despite the danger of ambiguity).
\textit{dotPath} allows addressing a leaf element (\texttt{Session.Actor.Role}) or any intermediate XML-element corresponding to a CMD-component (\texttt{Session.Actor}) within a metadata description. %This allows to easily express search in whole components, instead of having to list all individual fields.
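
The grammar can be exercised with a small parser. The following Python sketch is illustrative only and not part of the proposed application: the names \texttt{SmcIndex}, \texttt{parse\_smc\_index} and \texttt{KNOWN\_PROFILES} are hypothetical. It assumes that the set of profile names is known in advance, because the grammar alone cannot tell a leading profile name from the first element of a dotPath.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

DCR_IDS = {"isocat", "dc"}  # the dcrID alternatives from the grammar
# Assumption: profile names are known up front (illustrative values).
KNOWN_PROFILES = {"teiHeader", "imdi-corpus", "TextCorpusProfile"}


@dataclass(frozen=True)
class SmcIndex:
    kind: str                # "dcrIndex" or "cmdIndex"
    context: Optional[str]   # dcrID or profile name, if one was given
    path: Tuple[str, ...]    # datcatLabel (one item) or dotPath elements


def parse_smc_index(text: str) -> SmcIndex:
    # contextSep may be ':' or '.'; pathSep is always '.'.
    head, sep, rest = text.partition(":")
    if not sep:
        head, sep, rest = text.partition(".")
    if sep and head in DCR_IDS:
        kind, context, path = "dcrIndex", head, (rest,)
    elif sep and head in KNOWN_PROFILES:
        kind, context, path = "cmdIndex", head, tuple(rest.split("."))
    elif text in KNOWN_PROFILES:
        # A bare profile name is a valid cmdIndex on its own.
        return SmcIndex("cmdIndex", text, ())
    else:
        # No recognised context: treat the whole string as a dotPath.
        kind, context, path = "cmdIndex", None, tuple(text.split("."))
    if not all(path):
        raise ValueError(f"empty path element in {text!r}")
    return SmcIndex(kind, context, path)
```

For instance, \texttt{parse\_smc\_index("isocat.size")} yields a dcrIndex with context \texttt{isocat}, whereas \texttt{Session.Actor.Role} is read as a cmdIndex without a profile prefix.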

Generally, smcIndexes can be ambiguous, meaning they can refer to multiple concepts or entities (CMD-elements). This is because the names of data categories and CMD-entities are not guaranteed to be unique. The module has to cope with this by providing, on demand, the list of identifiers corresponding to a given smcIndex.

%As an important sidenote -- cmdIndexes can be ambiguous, meaning they can refer to multiple entities (metadata fields), examples of valid indexes:
%\begin{verbatim}
%Name
%Actor.Name, Project.Name
%Session.Actor.Name, Drama.Actor.Name
%\end{verbatim}

%So we disambiguate (or narrow down the ambiguity) by prefixing context.


\subsection{Function}\label{method}
In this section, we describe the actual task of the proposed application -- \textbf{mapping indexes to indexes} -- in abstract terms. The returned mappings can be used by other applications to expand or translate the original user query so that it matches elements in other schemas.%
\footnote{Though tightly related, mapping of terms and query expansion are to be seen as two separate functions.}
% \footnote{This primary usage of SMC for work with user-created query strings explains the need for human-readability of the indices.}


\subsubsection*{Initialization}

First there is an initialization phase, in which the application fetches the information from the source modules (cf. \ref{components}). All profiles and components from the Component Registry are read, and all URIs of data categories are extracted to construct an inverted map of data categories:
\newline

\textit{datcatURI $\mapsto$ profile.component.element[]}
\newline

The collected data categories are enriched with information from the corresponding registries (DCRs), adding the verbose identifier, the description and available translations into other working languages. %, usable as base for multi-lingual search user-interface.

Finally, the relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories.
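
The initialization steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the registry contents are toy data, the URIs are placeholders rather than real registry identifiers, the function names are hypothetical, and relations are treated as plain (transitive) equivalences.

```python
from collections import defaultdict

# Toy stand-in for the Component Registry: (cmdIndex, datcatURI) pairs.
# The URIs are placeholders, not real registry identifiers.
COMPONENT_REGISTRY = [
    ("teiHeader.extent", "urn:example:dcr/size"),
    ("TextCorpusProfile.Number", "urn:example:dcr/size"),
    ("imdi-corpus.Title", "urn:example:dcr/resourceTitle"),
    ("teiHeader.titleStmt.title", "urn:example:dc/title"),
]
# Toy stand-in for one relation set from the Relation Registry.
RELATIONS = [("urn:example:dcr/resourceTitle", "urn:example:dc/title")]


def build_datcat_map(entries):
    """Invert the registry: datcatURI -> list of cmdIndexes bound to it."""
    inverted = defaultdict(list)
    for cmd_index, uri in entries:
        inverted[uri].append(cmd_index)
    return dict(inverted)


def build_equivalence_sets(uris, relations):
    """Group data categories connected by relations (union-find)."""
    parent = {u: u for u in uris}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in relations:
        if a in parent and b in parent:
            parent[find(a)] = find(b)
    groups = defaultdict(set)
    for u in uris:
        groups[find(u)].add(u)
    return list(groups.values())


datcat_map = build_datcat_map(COMPONENT_REGISTRY)
equivalence_sets = build_equivalence_sets(datcat_map, RELATIONS)
```

Non-equivalence relations would need a richer structure than disjoint sets, for example a typed graph of data categories.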

\subsubsection*{Operation}
In operation mode, the application accepts any index (\textit{smcIndex}, cf. \ref{indexes}) and returns a list of corresponding indexes (or only the input index, if no correspondences were found):
\newline

\textit{smcIndex $\mapsto$ smcIndex[ ]}
\newline

We can distinguish the following levels of this mapping function:

(1) \emph{data category identity} -- for the resolution only the basic data category map derived from the Component Registry is employed. Accordingly, only indexes denoting CMD-elements (\textit{cmdIndexes}) bound to a given data category are returned:
\newline

\texttt{isocat.size $\mapsto$ } \newline 
\verb|   [teiHeader.extent, |\newline 
\verb|    TextCorpusProfile.Number]|
\newline

A \textit{cmdIndex} as input is also possible. It is translated to the corresponding data category, and resolution then proceeds as above:
\newline

\texttt{imdi-corpus.Name   $\mapsto$ } \newline 
\verb|   (isocat.resourceName) |$\mapsto$  \newline 
\verb|   TextCorpusProfile.GeneralInfo.Name|
\newline 

(2) \emph{relations between data categories} -- information from the Relation Registry is employed as well: related (equivalent) data categories are retrieved, and subsequently both the input and the related data categories are resolved to cmdIndexes:
\newline

\texttt{isocat.resourceTitle  $\mapsto$ } \newline
\verb|   (+ dc.title) |$\mapsto$  \newline 
\verb|   [imdi-corpus.Title, | \newline 
\verb|    TextCorpusProfile.GeneralInfo.Title,| \newline 
\verb|    teiHeader.titleStmt.title,| \newline 
\verb|    teiHeader.monogr.title]|
\newline 
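
Levels (1) and (2) can be sketched as a single Python function. This is an illustrative sketch only: \texttt{map\_index} and both tables are hypothetical, and the assignment of the individual cmdIndexes to \texttt{isocat.resourceTitle} or \texttt{dc.title} is an assumption, since the example above lists only the combined result.

```python
# Level 1 map: dcrIndex -> cmdIndexes bound to that data category.
# Which title field is bound to which category is assumed for illustration.
DATCAT_TO_CMD = {
    "isocat.size": ["teiHeader.extent", "TextCorpusProfile.Number"],
    "isocat.resourceTitle": ["imdi-corpus.Title",
                             "TextCorpusProfile.GeneralInfo.Title"],
    "dc.title": ["teiHeader.titleStmt.title", "teiHeader.monogr.title"],
}
# Level 2: equivalence relations between data categories.
RELATED = {
    "isocat.resourceTitle": ["dc.title"],
    "dc.title": ["isocat.resourceTitle"],
}


def map_index(smc_index, use_relations=False):
    """Map an smcIndex to the list of corresponding cmdIndexes."""
    if smc_index in DATCAT_TO_CMD:
        datcats = [smc_index]
    else:
        # A cmdIndex is first translated to the data categories it is bound to.
        datcats = [d for d, cmds in DATCAT_TO_CMD.items() if smc_index in cmds]
    if use_relations:
        for d in list(datcats):
            datcats += [r for r in RELATED.get(d, []) if r not in datcats]
    result = []
    for d in datcats:
        for cmd in DATCAT_TO_CMD.get(d, []):
            if cmd not in result:
                result.append(cmd)
    # If no correspondences were found, only the input index is returned.
    return result or [smc_index]
```

In this sketch a cmdIndex given as input also appears in its own result list; whether the input should be included in the response is a design decision the text leaves open.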

(3) \emph{container data categories} -- further expansions will be possible once container data categories \cite{SchuurmanWindhouwer2011} are in use. Currently only fields (leaf nodes) in metadata descriptions are linked to data categories. However, at times there is a need to conceptually bind the components as well, meaning that besides the ``atomic'' data category for \texttt{actorName}, there would also be a data category for the complex concept \texttt{Actor}.
Having concept links on components as well will require a compositional approach to the task of semantic mapping, resulting in:
\newline 
\texttt{Actor.Name $\mapsto$ }\newline
\verb|    [Actor.Name, Actor.FullName, |\newline
\verb|     Person.Name, Person.FullName]|
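
One possible reading of the compositional approach is to expand every path segment independently and combine the alternatives. The Python sketch below illustrates this reading only; the table \texttt{EQUIVALENTS} and the function \texttt{expand\_compositionally} are hypothetical, and a real implementation would presumably keep only those combinations that exist as indexes in the Component Registry.

```python
from itertools import product

# Hypothetical equivalents per path segment, as they could be derived from
# concept links on components (Actor) and on fields (Name).
EQUIVALENTS = {
    "Actor": ["Actor", "Person"],
    "Name": ["Name", "FullName"],
}


def expand_compositionally(dot_path):
    """Expand each segment of a dotPath and combine all alternatives."""
    segments = dot_path.split(".")
    alternatives = [EQUIVALENTS.get(s, [s]) for s in segments]
    return [".".join(combo) for combo in product(*alternatives)]
```

With the toy table, \texttt{expand\_compositionally("Actor.Name")} produces the four indexes listed in the example above.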


\subsection*{Extensions}

A useful supplementary function of the module would be to provide a list of existing indexes.
This would allow the search user-interface to equip the query input with autocompletion. The application should also deliver additional information about the indexes, such as a description and a link to the definition of the underlying entity in the source registry.

Once there are overlapping\footnote{i.e. different relations may be defined for one data category in different relation sets} user-defined relation sets in the Relation Registry, an additional input parameter will be required to \emph{explicitly restrict the selection of relation sets} to apply in the mapping function.

Also, the use of \emph{other than equivalency relations} will necessitate more complex logic in the query expansion and, accordingly, a more complex response of the SMC, which would either return the relation types themselves as well or equip the list of indexes with some similarity ratio.



\section{SMC on instance level}


\subsection{Mapping from strings to Entities}

Based on the textual values in the metadata descriptions, find matching entities in selected ontologies.

Identify related ontologies:
LT-World \cite{Joerg2010}

Task:
\begin{enumerate}
\item  express MDRecords in RDF
\item  identify related ontologies/vocabularies (category $\rightarrow$ vocabulary)
\item  use a lookup/mapping function (Vocabulary Alignment Service? CATCH-PLUS?)

%\fbox{ function lookup: Category x String -> ConceptualDomain}
\begin{eqnarray*}
lookup(Category, Literal) \rightarrow ConceptualDomain??
\end{eqnarray*}


Normally this would be served by dedicated controlled vocabularies, but some string-normalizing preprocessing etc. is to be expected as well.
\end{enumerate} 
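
The lookup function of the last step could look roughly like the following Python sketch. It is an illustration under assumptions, not a proposal for the actual service: the vocabulary, its entries and the entity identifiers are invented, and the normalization (Unicode NFKC, case folding, whitespace collapsing) is just one plausible choice of preprocessing.

```python
import unicodedata

# Hypothetical controlled vocabulary per category; the entries and the
# entity identifiers are placeholders.
VOCABULARIES = {
    "organisation": {
        "example institute of linguistics": "ex:org/1",
        "example institute": "ex:org/1",
    },
}


def normalize(literal):
    """Normalize a string: Unicode NFKC, case folding, single spaces."""
    text = unicodedata.normalize("NFKC", literal).casefold()
    return " ".join(text.split())


def lookup(category, literal):
    """Return the entity for a literal in a category, or None if unknown."""
    vocabulary = VOCABULARIES.get(category, {})
    return vocabulary.get(normalize(literal))
```

Exact matching after normalization will miss spelling variants and abbreviations; fuzzy matching or a dedicated alignment service would be needed for those.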


\subsection{Semantic Search}

The main purpose of the undertaking described in the previous two chapters (mapping of concepts and entities) is to enhance the search capabilities of the MDService serving the Metadata/Resources-data, namely by employing ontological resources.
Mainly, this enhancement means that the user can access the data indirectly by browsing one or multiple ontologies
with which the data will then be linked. These could be, for example, ontologies of Organizations and Projects.

In this section we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
A crucial aspect is the question of how to deal with the even greater amount of information in a user-friendly way, i.e. how to avoid overwhelming, intimidating or frustrating the user.

Semi-transparently means that the semantic mapping shall primarily integrate seamlessly into the interaction with the service, but it shall ``explain'' itself -- offer enough information -- on demand, so that the user can understand its role and also manipulate it easily.

?
Facets
Controlled Vocabularies
Synonym Expansion (via TermExtraction(ContentSet))

\subsection{Linked Data - Express dataset in RDF}

Partly as a by-product of the entity-mapping effort we will get the metadata descriptions rendered in RDF, linked with
So theoretically we then only need to provide them ``on the web'' to make them a nucleus of the LinkedData-Cloud.

Practically this won't be that straightforward, as the mapping to entities will require considerable effort.
But once that is solved, or for the subsets for which it is solved, the publication of that data on the ``SemanticWeb'' should be easy.

Technical aspects (RDF-store?) / interface (ontology browser?)

Defining the mapping:
\begin{enumerate}
\item convert to RDF
translate: MDRecord $\rightarrow$ [\#mdrecord \#property literal]
\item map: \#mdrecord \#property literal  $\rightarrow$ [\#mdrecord \#property \#entity]
\end{enumerate}
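
The two steps can be sketched with plain tuples standing in for RDF triples. The Python code below is a minimal illustration under assumptions: the function names, the record layout and the entity table are hypothetical, and no RDF library or serialization is involved.

```python
# Hypothetical entity table: (property, literal) -> entity identifier.
ENTITIES = {("organisation", "Example Institute"): "#org-1"}


def record_to_triples(record_id, fields):
    """Step 1: one (subject, property, literal) triple per metadata field."""
    return [(record_id, prop, value) for prop, value in fields.items()]


def link_entities(triples, entities):
    """Step 2: replace a literal by an entity wherever one is known."""
    linked = []
    for subject, prop, literal in triples:
        entity = entities.get((prop, literal))
        # Literals without a matching entity are kept unchanged.
        linked.append((subject, prop, entity if entity is not None else literal))
    return linked


record = {"title": "A corpus", "organisation": "Example Institute"}
triples = record_to_triples("#mdrecord-1", record)
linked = link_entities(triples, ENTITIES)
```

Keeping unmatched literals in place allows publishing partially linked data for the subsets where the mapping is already solved.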

\subsection{Content/Annotation}
AF + DCR + RR


\subsection{Visualization}
Landscape, Treemap, SOM

Ontology Mapping and Alignment / saiks/Ontology4 4auf1.pdf
