Changeset 3140


Timestamp:
07/15/13 19:02:41 (11 years ago)
Author:
vronk
Message:

update chapters
added some info about META-SHARE, TEI, Data in general

Location:
SMC4LRT/chapters
Files:
8 edited

  • SMC4LRT/chapters/Data.tex

    r2704 r3140  
    1010\subsection{CMD-Framework}
    1111
     12
     13
     14\subsubsection{CMD Profiles }
     15In the CR 124\footnote{All numbers are as of 2013-06 if not stated otherwise} public Profiles and 696 Components are defined. Table \ref{table:dev} shows the development of the CR and DCR population over time.
     16
     17Next to the `native' CMD profiles, a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles prove the flexibility and expressi\-vi\-ty of the CMD metamodel. The individual profiles also differ considerably in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 419 components and 1587 elements
     18(when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \textit{Contact}) included by three other components (\textit{Project}, \textit{Institution}, \textit{Access}) will appear three times in the instantiated record.}).
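The effect of component reuse on the expanded counts can be illustrated with a minimal sketch. The reuse pattern (\textit{Contact} included by \textit{Project}, \textit{Institution} and \textit{Access}) follows the footnote above; the \texttt{Component} class, the element names and both counting functions are hypothetical and are not part of the CMD tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical stand-in for a CMD component: own elements plus included components."""
    name: str
    elements: list = field(default_factory=list)
    children: list = field(default_factory=list)


def expanded_elements(component):
    """Count elements as they appear in an instantiated record:
    a reused component contributes its elements at every inclusion."""
    return len(component.elements) + sum(
        expanded_elements(child) for child in component.children
    )


def distinct_elements(component, seen=None):
    """Collect (component name, element name) pairs, so that a reused
    component is counted only once."""
    seen = set() if seen is None else seen
    for element in component.elements:
        seen.add((component.name, element))
    for child in component.children:
        distinct_elements(child, seen)
    return seen


# Contact is defined once and included by three other components.
contact = Component("Contact", ["name", "email"])
profile = Component("Profile", ["title"], [
    Component("Project", ["projectName"], [contact]),
    Component("Institution", ["institutionName"], [contact]),
    Component("Access", ["availability"], [contact]),
])

print(len(distinct_elements(profile)), expanded_elements(profile))
```

In this toy profile the two \textit{Contact} elements are counted once among the distinct elements but three times in the expanded count, which is the same mechanism that separates the distinct and expanded figures reported for the CR.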
     19
     20
     21\begin{table}
     22\caption{The development of defined profiles and DCs over time}
     23\label{table:dev}
     24  \begin{tabular}{ l | r | r | r | r }
     25    \hline
     26date     & 2011-01 & 2012-06 & 2013-01 & 2013-06  \\
     27    \hline
     28Profiles & 40 & 53 & 87 & 124 \\
     29Distinct Components & 164 & 298 & 542 & 828 \\
     30Expanded Components & 1055 & 1536 & 2904 & 5757 \\
     31Distinct Elements & 511 & 893 & 1505 & 2399 \\
     32Expanded Elements & 1971 & 3030 & 5754 & 13232 \\
     33Distinct data categories & 203 & 266 & 436 & 499 \\
     34Data categories in the Metadata profile & 277 & 712 & 774 & 791 \\
     35Ratio of elements without DCs & 24,7\% & 17,6\% & 21,5\% & 26,5\% \\
     36Components with DCs & 28 & 67 & 115 & 140 \\
     37
     38    \hline
     39  \end{tabular}
     40\end{table}
     41
     42
     43\subsection{Instance Data}
     44
     45
     46\todoin{probably more numbers about CMD records (collections, used profiles, ...) (in historical perspective?)}
     47
     48On the instance level, in the harvested data 60 distinct profiles can be found.
     49
     50The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
     51collects records from 69 providers on a daily basis. The complete dataset amounts to 540.065 records.
     5216 of the providers offer CMDI records; the other 53 provide OLAC/DC records\label{info:olac-records}, which are converted into the corresponding CMD profile after harvesting. Next to these 81.226 original OLAC records, there are a few providers offering their OLAC or DCMI-terms records already converted into CMDI, so that OLAC and DCMI-terms records together amount to 139.152.
     53On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different profiles). So we encounter both situations: one profile being used by many providers and one provider using many profiles.
     54
     55
     56\begin{table}
     57\caption{Top 20 profiles, with the respective number of records}
    1258\begin{center}
    13   \begin{tabular}{ l | r }
    14     \hline
    15 created & 2013-01-26 \\ \hline
    16 Profiles & 87 \\
    17 Components & 2904 \\
    18 distinct Components & 542 \\
    19 Elements & 5754 \\
    20 distinct Elements & 1505 \\
    21 distinct DatCats & 436 \\
    22 Elements with DatCats & 1183 \\
    23 Elements without DatCats & 323 \\
    24 ratio of elements without DatCats & 21.46 \% \\
    25 available Concepts & 893 \\
    26 used Concept & 474 \\
    27 blind Concepts (not in public ISOcat) & 190 \\
    28 Concepts not used in CMD & 539 \\
     59  \begin{tabular}{ r | l }
     60\# records & profile \\
     61    \hline
     62155.403 & Song \\
     63138.257 & Session \\
     6492.996 & OLAC-DcmiTerms \\
     6546.156 & DcmiTerms \\
     6628.448 & SongScan \\
     6721.256 & SourceScan \\
     6819.059 & LiteraryCorpusProfile \\
     6916.519 & Source \\
     7013.626 & imdi-corpus \\
     7110.610 & media-session-profile \\
     727.961 & SongAudio \\
     737.557 & SymbolicMusicNotation \\
     744.485 & LCC DataProviderProfile \\
     754.485 & SourceProfile \\
     764.417 & Text \\
     771.982 & Soundbites-recording \\
     781.530 & Performer \\
     791.475 & ArthurianFiction \\
     80939 & LrtInventoryResource \\
     81873 & teiHeader \\
    2982    \hline
    3083  \end{tabular}
    3184\end{center}
    32 
    33 \todoin{Collect number about CMD-Framework (profiles, datcats) + historical development}
    34 
    35 \todoin{Collect numbers about CMD records (collections, used profiles, ...) in historical perspective}
     85\end{table}
     86
     87We can also observe a large disparity in the number of records between individual providers and profiles. Almost half of all records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by the MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from \textit{The Language Archive}). On the other hand, there are 25 profiles that have fewer than 10 instances. This can be due both to the state of the respective project (resources and records still being prepared) and to the modelled granularity level (collection vs. individual resource).
    3688
    3789
     
    4092DC, OLAC
    4193
     94openarchives register: \url{http://www.openarchives.org/Register/BrowseSites}
     95 2006 OAI-repositories
     96
    4297DublinCore Resource Types\furl{http://dublincore.org/documents/resource-typelist/}
     98
     99DublinCore to RDF mapping\furl{http://dublincore.org/documents/dcq-rdf-xml/}
     100
     101\label{def:OLAC}
     102A more specific version of the Dublin Core terms, adapted to the needs of the linguistic community, is the
     103OLAC\furl{http://www.language-archives.org/} format \cite{Bird2001}.
     104
     105OLAC \cite{Simons2003OLAC}.
     106
     107\todoin{check http://www.language-archives.org/OLAC/metadata.html}
     108
     109\begin{quotation}
     110The OLAC metadata set is the set of metadata elements that participating archives have agreed to use for describing language resources. Uniform description across archives is ensured by limiting the values of certain metadata elements to the use of terms from agreed-upon controlled vocabularies. The OLAC metadata set is equally applicable whether the resources are available online or not. The metadata set consists of the fifteen elements of the Dublin Core Metadata Set, plus the refinements and encoding schemes of the DCMI Metadata Terms—a widely accepted standard for describing resources of all types. To this general standard, OLAC adds encoding schemes that are designed specifically for describing language resources, such as subject language and linguistic data type. The OLAC Metadata Usage Guidelines describe (with examples) all the elements, refinements, and encoding schemes that may be used in OLAC metadata descriptions. The OLAC Metadata standard defines the XML format that is used for the interchange of metadata descriptions among participating archives.
     111\end{quotation}
     112
     113
     114
    43115
    44116\subsection{TEI / teiHeader}
     
    50122
    51123\subsection{Europeana Data Model - EDM}
     124
     125\subsection{META-SHARE}
     126META-SHARE is another multinational project aiming to build an infrastructure for language resources \cite{Piperidis2012meta}, however focusing more on the Human Language Technologies domain.\furl{http://meta-share.eu}
     127
     128\begin{quotation}
     129META-NET is designing and implementing META-SHARE, a sustainable network of repositories of language data, tools and related web services documented with high-quality metadata, aggregated in central inventories allowing for uniform search and access to resources. Data and tools can be both open and with restricted access rights, free and for-a-fee. META-SHARE targets existing but also new and emerging language data, tools and systems required for building and evaluating new technologies, products and services.
     130\end{quotation}
     131
     132\begin{quotation}
     133META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role.
     134
     135META-SHARE is implemented in the framework of the META-NET Network of Excellence. It is designed as a network of distributed repositories of LRs, including language data and basic language processing tools (e.g., morphological analysers, PoS taggers, speech recognisers, etc.).
     136
     137\end{quotation}
     138
     139The distributed network of repositories consists of a number of member repositories that each offer their own subset of resources.
     140
     141A few\footnote{7 as of 2013-07} of the member repositories play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
     142The whole network offers approximately 2.000 resources (the numbers differ even across individual managing nodes).
     143
     144
     145MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
    52146
    53147
     
    84178
    85179AAT - international Architecture and Arts Thesaurus
    86 GND - Gemeinsame Norm Datei
     180GND - Gemeinsame Normdatei (GND ontology\furl{http://d-nb.info/standards/elementset/gnd})
    87181GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
    88182VIAF - Virtual International Authority File
     
    147241\section{Other Metadata Catalogs/Collections}
    148242
    149 Digital Libraries
    150 \subsubsection{(Digital) Libraries}
     243\subsection{(Digital) Libraries}
    151244
    152245
  • SMC4LRT/chapters/Definitions.tex

    r2703 r3140  
    11\chapter{Definitions}
     2
     3Meanings of ``mapping'':
     4\begin{itemize}
     5\item transform 
     6\item match (schemas)
     7\item  overview (browser)
     8\end{itemize} 
     9
    210
    311\section {Namespaces}
     
    1119\section {Abbreviations}
    1220
     21\begin{description}
     22\item[CLARIN] \textit{Common Language Resources and Technology Infrastructure} \ref{def:CLARIN}
     23\item[CLAVAS] \textit{Vocabulary Alignment Service for CLARIN} \ref{def:CLAVAS}
     24\item[CMD] \textit{Component Metadata} \ref{def:CMD}
     25\item[CMDI] \textit{Component Metadata Infrastructure} \ref{def:CMDI}
     26\item[ERIC] \textit{European Research Infrastructure  Consortium} - a legal entity for long-term research infrastructure initiatives
     27\item[DC] data category
     28\item[DCR] data category registry \cite{ISO12620:2009}
     29\item[OLAC] \textit{Open Language Archives Community}\furl{http://www.language-archives.org/}\ref{def:OLAC}
     30\item[PID] persistent identifier \todocite{PID}
     31\item[PURL] persistent uniform resource locator \todocite{PURL}
     32\item[RDF] \textit{Resource Description Framework} \todocite{RDF}
     33\item[RR] Relation Registry\ref{def:rr} 
     34\item[TEI] \textit{Text Encoding Initiative}
     35\end{description}
    1336
    1437\section {Terms}
     
    1740
    1841\begin{description}
    19 \item[Concept]  sense, idea, philosophical problem, which we don't need to discuss here. For our purposes we say: Basic "entity" in an ontology? that of what an ontology is build
    20 \item[Ontology]  ``an explicit specification of a conceptualization'' \todoin{cite!}, but for us mainly a collection of concepts as opposed to lexicon, which is a collection of words.
     42\item[Concept]  Basic ``entity'' in an ontology; that of which an ontology is built
     43\item[Ontology]  ``an explicit specification of a conceptualization'' \todocite {Ontology!}, but for us mainly a collection of concepts as opposed to lexicon, which is a collection of words.
    2144\item[Word]  a lexical unit, a word in a language, something that has a surface realization (writtenForm) and is a carrier of sense. Thus a relation holds: hasSense(Word, Concept)
    2245\item[Lexicon]  a collection of words, a (lexical) vocabulary
  • SMC4LRT/chapters/Evaluation.tex

    r2672 r3140  
    2727Project, Institution, Person, Publisher
    2828
    29 \section{Usability}
     29
     30\section{Exploring Data Categories}
     31In the ISOcat DCR, 791 DCs are defined in the Metadata thematic profile, out of which 222 were created by the \textit{Athens Core} group. \todoin{need to check, how many of these athens-core data categories are being employed} In the following we describe two showcases -- \textit{Language} and \textit{Name} -- in more detail.
     32
     33\subsection{Language}
     34While there are 69 components and 97 elements containing the substring `language' defined in the CR,
     35only 19 distinct DCs with a `language' substring are being used\footnote{Here the term `used' means referenced in CMD components and elements.}. The most commonly used ones,
     36\textit{languageID} (\texttt{DC-2482}) and \textit{languageName} (\texttt{DC-2484}), are referenced by more than 80 profiles.
     37Additionally, these two DCs are linked to the Dublin Core term \textit{Language} in the RR.
     38Thus a search engine capable of interpreting RR information could offer the user a simple Dublin Core-based search interface, while -- by expanding the query -- still searching over all available data, and, moreover, offer the user on demand a more fine-grained semantic interpretation of the matches based on the originally assigned DCs. Figure \ref{fig:language_datcats} depicts the relations between the language data categories and their usage in the profiles. We encounter all types of situations: some profiles use only \textit{dc:Language} or \textit{dcterms:Language}, others only \textit{isocat:languageId} or \textit{isocat:languageName};
     39most profiles use both \textit{isocat:languageId} and \textit{isocat:languageName}, and there are even profiles that refer to both \textit{isocat} and \textit{dublincore} data categories (\textit{data}, \textit{HZSKCorpus}, \textit{ToolService}).
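The query expansion described here can be sketched as follows. This is only an illustration under stated assumptions: the \texttt{RELATIONS} table stands in for a relation set retrieved from the RR, the string form of the identifiers (\texttt{dc:language}, \texttt{isocat:DC-2482}, ...) and the toy records are hypothetical, and only the link between the Dublin Core term and the two ISOcat DCs is taken from the text.

```python
# Hypothetical relation set: a Dublin Core term mapped to the data categories
# linked to it in the Relation Registry (identifier syntax is an assumption).
RELATIONS = {
    "dc:language": {"isocat:DC-2482", "isocat:DC-2484", "dcterms:language"},
}


def expand_query(field, value, relations=RELATIONS):
    """Return the (field, value) pairs to search: the field the user chose
    plus every data category related to it."""
    fields = {field} | relations.get(field, set())
    return sorted((f, value) for f in fields)


def search(records, field, value, relations=RELATIONS):
    """Match records on any expanded field and report which data category
    produced the hit, so the finer-grained semantics stay available."""
    hits = []
    for record_id, fields in records.items():
        for expanded_field, expanded_value in expand_query(field, value, relations):
            if fields.get(expanded_field) == expanded_value:
                hits.append((record_id, expanded_field))
    return hits


# Toy records, each annotated with a different language data category.
RECORDS = {
    "rec1": {"dc:language": "nld"},
    "rec2": {"isocat:DC-2482": "nld"},     # languageID
    "rec3": {"isocat:DC-2484": "Dutch"},   # languageName
}

print(search(RECORDS, "dc:language", "nld"))
```

The user queries a single Dublin Core field, yet records annotated only with \textit{languageID} are found as well, and each hit retains the data category it was matched on.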
     40
     41
     42\begin{figure*}[!ht]
     43\begin{center}
     44\includegraphics[width=1\textwidth]{images/SMC-export_language_custom_v2c.pdf}
     45\end{center}
     46\caption{The four main \textit{Language} data categories and in which profiles they are being used}
     47\label{fig:language_datcats}
     48\end{figure*}
     49
     50It requires further inspection, and in the end a case-by-case decision, whether the other, less often used `language' DCs can be treated as equivalent to the above-mentioned ones.
     51\textit{languageScript}, \textit{implementationLanguage}, as well as \textit{noLanguages} or  \textit{sizePerLanguage} clearly do not belong to the language cluster.
     52But \textit{sourceLanguage}, \textit{languageMother} or \textit{participantDominantLanguage} can at least be expected to share the same value domain (natural languages) and even if they do not describe the language of the resource, they could be considered when one aims at maximizing the recall (i.e., trying to find anything related to a given language). This is exactly the scenario the RR was conceived for -- allowing custom relation sets to be defined based on the specific needs of a project or of a research question.
     53
     54
     55\subsection{Name}
     56There are as many as 72 CMD elements with the label \texttt{Name}, referring to 12 different DCs.
     57Again, the main DC \textit{resourceName} (\texttt{DC-2544}), used in 74 profiles, together with the semantically close \textit{resourceTitle} (\texttt{DC-2545}), used in 69 profiles, offers good coverage of the available data.
     58
     59Some of the DCs referenced by \texttt{Name} elements are \textit{author} (\texttt{DC-4115}), \textit{contact full name} (\texttt{DC-2454}), \textit{dcterms:Contributor}, \textit{project name} (\texttt{DC-2536}), \textit{web service name} (\texttt{DC-4160}) and \textit{language name} (\texttt{DC-2484}). This implies that a na\"{i}ve search in a \texttt{Name} element would match semantically very heterogeneous fields; only by applying the semantic information provided by the DCs and/or the context of the element (the enclosing components) can the meaning of the values be disambiguated.
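The difference between a label-based and a DC-based lookup can be sketched in a few lines. The data category identifiers are the ones listed above; the component paths and the two helper functions are hypothetical and serve only as an illustration.

```python
# (hypothetical component path, element label, referenced data category)
ELEMENTS = [
    ("Resource/Name", "Name", "DC-2544"),    # resourceName
    ("Project/Name", "Name", "DC-2536"),     # project name
    ("Contact/Name", "Name", "DC-2454"),     # contact full name
    ("Language/Name", "Name", "DC-2484"),    # language name
    ("WebService/Name", "Name", "DC-4160"),  # web service name
]


def by_label(elements, label):
    """Naive lookup: every element carrying the given label."""
    return [path for path, element_label, _ in elements if element_label == label]


def by_datcat(elements, datcats):
    """Semantic lookup: only elements bound to one of the given data categories."""
    return [path for path, _, datcat in elements if datcat in datcats]


print(by_label(ELEMENTS, "Name"))        # all five heterogeneous fields
print(by_datcat(ELEMENTS, {"DC-2544"}))  # only the resource name
```

The label alone selects all five fields, whereas restricting the lookup to \texttt{DC-2544} isolates the name of the resource.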
     60
     61\subsection{Resource type}
     62
     63\subsection{Subject, Genre, Topic}
     64
     65\section{Mapping existing Formats}
     66
     67\subsection{dublincore / OLAC}
     68
     69Very widely used format
     70\ref{info:olac-records}
     71
     72There are 4-5 CMD profiles modelling OLAC/dcmi-terms
     73
     74
     75
     76\subsection{teiHeader}
     77
     78TEI is a de-facto standard for encoding any kind of textual resource. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description / metadata, the complex element \code{teiHeader} is provided.
     79TEI does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and the inner structure, making it possible to generate custom schemas adapted to projects' needs.
     80Thus there is also not just one fixed \xne{teiHeader}.
     81
     82The widespread use of TEI for encoding textual resources  brings about a strong interest of multiple research teams of the CLARIN community to integrate TEI with CMDI. There was a first attempt already in 2010, modelling the recommended \xne{teiHeader}\furl{http://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html\#HD7}, encoding \xne{fileDesc} and \xne{profileDesc} components, leaving out \xne{encodingDesc} and \xne{revisionDesc}. The leaf elements were bound to the most prominent data categories, making it a mixture of both dublincore and isocat.
     83
     84The large research project \xne{Deutsches Text Archiv}\furl{http://deutschestextarchiv.de/}\todocite{DTA}, digitizing a host of historical German texts from the period 1650 - 1900, also uses TEI to encode the material and consequently the teiHeader to hold the metadata information.
     85\todoin{Why a separate cmd-profile}
     86
     87\xne{Nederlab} is another large-scale project concerned with \todoin{dutch? historic texts}, starting in 2013 in the Netherlands\todocite{Nederlab}. Within this project another set of CMD profiles was created, however reusing existing components.
     88As seen in figure \ref{fig:teiHeader_DBNL}, components \xne{fileDesc} and \xne{profileDesc} were reused, while the components \xne{encodingDesc} and \xne{revisionDesc}, left out in the original profile, were added.
     89
     90Another approach was applied within the context of other CLARIN-NL projects: \todocite{Windhouwer, 2012} generated, based on an ODD-file, a data category for every element of the teiHeader (135 datcats), creating a dedicated data category selection: \xne{TEI Header (2.1.0)}. In a subsequent step, an enriched schema was generated that remodels the original teiHeader-schema, but with the individual elements being annotated with the new data categories (\code{dcr:datcat}-attribute). This schema is now maintained in the SCHEMAcat (cf. \ref{ch:components}). The next step would be to create yet another profile, but with all the components and elements in it bound to the corresponding data categories, moving the semantic linking into the relation registry, where appropriate relations could be defined between the data categories derived from TEI and the \xne{isocat} and/or \xne{dublincore} DCs.
     91This yields a more complex, but also a more systematic and flexible setup, with a clean separation/boundary/interface of the semantic space of TEI and the possibility to map the TEI elements (via their data categories) to multiple and/or different data categories according to the specific needs of a project or research question.
     92
     93\begin{figure*}[!ht]
     94\begin{center}
     95\includegraphics[width=0.75\textwidth]{images/teiHeader_DBNL.png}
     96\end{center}
     97\caption{The reuse of components between the original teiHeader-profile (2010) and the profiles used in the Nederlab project}
     98\label{fig:teiHeader_DBNL}
     99\end{figure*}
     100
     101\begin{table}
     102\caption{Overview of TEI-related CMD profiles}
     103\label{table:tei-profiles}
     104  \begin{tabular}{ l | r | l | r | r }
     105    \hline
     106project, author & created & profile name & comp/elem/datcats & instances \\
     107    \hline
     108Deutsches Text Archiv & 2012 & teiHeader & 56/82/10 & 857 \\
     109ICLTT, Durco & 2010 & teiHeader & 16/35/13 & 467 \\
     110Leipzig Corpora, Eckart & 2012 & TEIDocumentDescription & 16/35/13 & ? \\
     111Nederlab, Zhang & 2013 & DBNL\_Tekst & 20/38/15 & ? \\
     112  & & DBNL\_Tekst\_Onzelfstandig & 20/47/21 & ? \\
     113    \hline
     114  \end{tabular}
     115\end{table}
     116
     117\todoin{DBNL\_Tekst\_Onzelfstandig - how many instances?}
     118
     119DBNL\_Tekst clarin.eu:cr1:p\_1361876010678,
     120clarin.eu:cr1:p\_1366279029218 (private)
     121
     122\subsection{META-SHARE}
     123
     124
     125META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a minimal core subset of obligatory metadata elements and with many optional components.
     126%In cooperation between metadata teams from CLARIN and META-SHARE
     127The model has been expressed as 4 CMD profiles for distinct resource types sharing most of the components. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 419 components and 1587 elements (when expanded). Although most of the elements are optional
     128
     129resourceInfo    419     1587    72      790     797     50.22 \%
     130\todoin{how many distinct components/elements}
     131This? shows nicely the trade-off between the two different approaches of CMD and META-SHARE: many custom schemas or one very large one.
     132
     133In a parallel effort, LINDAT, the Czech national infrastructure initiative with ties to both CLARIN and META-SHARE, created a CMD profile modelling the minimal obligatory set of META-SHARE, combined with dublincore.
     134So the information is partly duplicated, but with the advantage that a minimal set of information is conveyed in the widely understood format, while retaining the expressivity of the feature-rich schema.
     135
     136resourceInfo    65      92      21      82      10      10.87 \%
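The percentages in the two \texttt{resourceInfo} rows can be reproduced with a short calculation. Since the rows carry no column headers, the sketch assumes that the last three columns are the number of elements with a data category, the number of elements without one, and the ratio of the latter to the total. The figures are consistent with this reading ($790 + 797 = 1587$ and $82 + 10 = 92$), but it remains an assumption, and the function name is hypothetical.

```python
def ratio_without_dcs(with_dcs, without_dcs):
    """Percentage of elements that carry no data category reference."""
    total = with_dcs + without_dcs
    return round(100 * without_dcs / total, 2)


# Values copied from the two resourceInfo rows; the column meaning is assumed.
corpus_profile = ratio_without_dcs(790, 797)  # 790 + 797 = 1587 elements
lindat_profile = ratio_without_dcs(82, 10)    # 82 + 10 = 92 elements

print(corpus_profile, lindat_profile)
```

Under this reading, about half of the elements of the large corpus profile lack a data category reference, compared with roughly one in nine for the LINDAT profile.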
     137
     138
     139\begin{figure*}[!ht]
     140\begin{center}
     141\includegraphics[width=0.75\textwidth]{images/LINDAT-profile-data.png}
     142\end{center}
     143\caption{profile by LINDAT combining META-SHARE \xne{resourceInfo} component with dublincore elements }
     144\label{fig:META-SHARE-LINDAT}
     145\end{figure*}
     146
     147\begin{figure*}[!ht]
     148\begin{center}
     149\includegraphics[height=1\textheight]{images/resourceInfoBIG.png}
     150\end{center}
     151\caption{the META-SHARE based profile for describing corpora}
     152\label{fig:META-SHARE-BIG}
     153\end{figure*}
     154
     155
     156
     157%\section{Usability}
  • SMC4LRT/chapters/Infrastructure.tex

    r2703 r3140  
    44
    55\section{CLARIN / CMDI}
    6 
     6\label{def:CLARIN}
    77CLARIN - Common Language Resources and Technology Infrastructure - is constituted by over 180 members from around 38 countries. The mission of this project is to
    88
     
    2020\item Data Category Registry
    2121\item Relation Registry
    22 \item Schema Registry
     22\item Schema Registry (SCHEMAcat\furl{http://lux13.mpi.nl/schemacat/site/index.html})
    2323\item Component Registry
    2424\item Vocabulary Alignment Service (OpenSKOS)
     
    4747The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions.
    4848However there needs to be an additional means to capture information about relations between data categories.
    49 This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. We expect that the RR should be under control of the metadata user whereas the DCR is under control of the metadata modeler.
     49This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. We expect that the RR should be under control of the metadata user whereas the DCR is under control of the metadata modeler.
    5050% These relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed.
    5151
     
    6868from the traditional methods of schema matching that try to establish pairwise alignments between schemas only after they were created and published. % -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.
    6969
    70 Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this novel mechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules is described in the following chapter \ref{method}.
     70Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules are described in the following chapter \ref{method}.
    7171
    7272\subsection{Vocabulary Service / Reference Data Registry}
     
    321321The first stable and publicly available application providing access to the collected metadata of CMDI has been the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, being developed within the CLARIN project. This application operates on the same collection of data as is discussed in this work, however it employs a faceted search, manually mapping the appropriate metadata fields from the different schemas to 10? fixed facets. The underlying search engine is the widely used full-text search engine Apache Solr\footnote{\url{http://lucene.apache.org/solr/}}. Although this is a very reductionist approach, it is certainly a great starting point, offering a core set of categories together with an initial set of category mappings.
    322322
    323 More recently, the team at Meertens Institute developed a similar application the \texttt{MI Search Engine}\footnote{\url{http://www.meertens.knaw.nl/cmdi/search/}}. It too is based on the Apache Solr and provides a faceted search, but with a substantially more sophisticated both indexing process and search interface. \todoin { describe indexing and search}
     323More recently, the team at Meertens Institute developed a similar application, the \texttt{MI Search Engine}\furl{http://www.meertens.knaw.nl/cmdi/search/}. It too is based on Apache Solr and provides a faceted search, but with a substantially more sophisticated indexing process and search interface. \todoin { describe indexing and search}
    324324\todocite {MI Search Engine}
    325325
     
    330330\section{Content Repositories}
    331331Metadata is only one aspect of the availability of resources. It is the first step to announce and describe the resources. However it is of little value, if the resources themselves are not equally well accessible. Thus another pillar of the CLARIN infrastructure are Content Repositories - centres to ensure availability of resources.
     332
     333RDF-stores in Content Repositories (Fedora, ..)
    332334
    333335The requirements for these repositories: PIDs, CMD, OAI-PMH
     
    344346\item[OAI-PMH]
    345347\end{description}
     348
     349\section{Summary}
  • SMC4LRT/chapters/Introduction.tex

    r2703 r3140  
    1919\subsection{Problem statement}
    2020
    21 While in the Digital Libraries community a consolidation generally already happened and big federated networks of digital libary repository are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardizing efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming for one from the wide range of resource types additionally complicated by dependence of different schools of thought.
     21While in the Digital Libraries community a consolidation has generally already happened and big federated networks of digital library repositories are set up, in the field of Language Resources and Technology the landscape is still scattered, although it can meanwhile look back at a decade of standardizing efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types and additionally complicated by the influence of different schools of thought (cf. \ref{ch:data}).
    2222
    2323\todoin{Need some number about the disparity in the field, number of institutes, resources, formats.}
    2424
    25 This situation has been identified by the community and multiple standardization initiatives had been conducted/undertaken. This process seems to have gained a new momentum thanks to large Research Infrastructure Programmes introduced by European Commission, aimed at fostering Research communities developing large-scale pan-european common infrastructures. One key player in this development is the project CLARIN.
     25This situation has been identified by the community and multiple standardization initiatives have been undertaken. This process has gained a new momentum thanks to large Research Infrastructure Programmes introduced by the European Commission, aimed at fostering research communities developing large-scale pan-European common infrastructures. One key player in this development is the project CLARIN.
    2626
    2727
     
    5454
    5555Finally, in a prototypical implementation of the two components we want to deliver a proof of concept, supported by an evaluation
    56 in which we apply a set of test queries and compare a traditional search with a semantically expanded query in terms of recall/precision indicators. A separate evaluation of the usability of the Semantic Search component is indicated; however, this issue can only be touched upon here and is left to future work.
     56in which we apply a set of test queries and compare a traditional search with a semantically expanded query in terms of recall/precision measures. A separate evaluation of the usability of the Semantic Search component is indicated; however, this issue can only be touched upon here and is left to future work.
    5757
    5858\begin{itemize}
     
    6666This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components and the results and findings of the \emph{evaluation}.
    6767
    68 One promising by-product of the work will be the original dataset expressed as RDF with links into existing external resources (ontologies, knowledge bases, vocabularies), effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\footnote{\url{http://linkeddata.org/}} in the \emph{Web of Data}.
     68One promising by-product of the work will be the original dataset expressed as RDF with links into existing external resources (ontologies, knowledge bases, vocabularies), effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\furl{http://linkeddata.org/} in the \emph{Web of Data}.
    6969
    7070
  • SMC4LRT/chapters/Literature.tex

    r2703 r3140  
    6060This is very much in line with the broad \textit{Linked Open Data} effort as proposed by Berners-Lee \cite{TimBL2006} and being pursued across many disciplines. (This topic is also supported by the EU Commission within the FP7.\footnote{\url{http://cordis.europa.eu/fetch?CALLER=PROJ\_ICT&ACTION=D&CAT=PROJ&RCN=95562}}) A very recent comprehensive overview of the principles of Linked Data and current applications is the book by Heath and Bizer \cite{HeathBizer2011}, which shall serve as a practical guide for this specific task.
    6161
     62Formats:
     63Turtle \furl{http://www.w3.org/TeamSubmission/turtle/\#sec-grammar-comments}
     64RDFa\furl{http://en.wikipedia.org/wiki/RDFa}
     65EDM\furl{http://europeana.ontotext.com/resource/edm/hasType?role=all}
     66
     67
     68\todocite{http://ldl2012.lod2.eu/program/proceedings}
     69\todoin{check LDpath}\furl{http://code.google.com/p/ldpath/}
     70
    6271
    6372\subsection{Schema / Ontology Mapping}
  • SMC4LRT/chapters/SMC.tex

    r2703 r3140  
    180180\todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/}
    181181
     182\todocode{check/install: Linked Data browser: LoD p. 81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)}
     183
    182184defining the Mapping:
    183185\begin{enumerate}
     
    217219AF + DCR + RR
    218220
    219 
    220 
    221 
     221\section{Summary}
     222
     223
     224
  • SMC4LRT/chapters/System.tex

    r2703 r3140  
    6262\todocode{install Jena +  fuseki}\furl{http://jena.apache.org}\furl{http://jena.apache.org/documentation/serving_data/index.html}\furl{http://csarven.ca/how-to-create-a-linked-data-site}
    6363
     64\todocode{check install siren}\furl{http://siren.sindice.com/}
     65\todocode{check install Virtuoso}\furl{http://ods.openlinksw.com/wiki/ODS/}
     66\todocode{check install Neo4J}
     67\todocode{check install ontology browser}
     68
     69semantic search component in the Linked Media Framework
     70\todocode{!!! check install LMF - kiwi - SemanticSearch !!!}\furl{http://code.google.com/p/kiwi/wiki/SemanticSearch}
     71
     72\todoin{check SARQ}\furl{http://github.com/castagna/SARQ}
     73
    6474\todocode{Load data: relcat, clavas, olac-and-dc-providers cmd, lt-world?}
    6575
     
    6777\section{User Interface?}
    6878
    69 \subsection{Query Input}
     79\subsection*{Query Input}
    7080
    71 \subsection{Columns}
     81\subsection*{Columns}
    7282
    73 \subsection{Summaries}
     83\subsection*{Summaries}
    7484
    75 \subsection{Differential Views}
     85\subsection*{Differential Views}
    7686Visualize impact of given mapping in terms of covered dataset (number of matched records).
    7787
    78 \subsection{Visualization}
     88\subsection*{Visualization}
    7989Landscape, Treemap, SOM
    8090