Changeset 3140

Timestamp:

07/15/13 19:02:41 (11 years ago)

Author:

vronk

Message:

update chapters
added some info about META-SHARE, TEI, Data in general

Location:

SMC4LRT/chapters

Files:

: 8 edited

Data.tex (modified) (5 diffs)
Definitions.tex (modified) (3 diffs)
Evaluation.tex (modified) (1 diff)
Infrastructure.tex (modified) (7 diffs)
Introduction.tex (modified) (3 diffs)
Literature.tex (modified) (1 diff)
SMC.tex (modified) (2 diffs)
System.tex (modified) (2 diffs)


SMC4LRT/chapters/Data.tex

-                      r2704
+                      r3140
 \subsection{CMD-Framework}
+\subsubsection{CMD Profiles }
+In the CR 124\footnote{All numbers are as of 2013-06 if not stated otherwise} public Profiles and 696 Components are defined. Table \ref{table:dev} shows the development of the CR and DCR population over time.
+Next to the `native' CMD profiles, a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles prove the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles also differ very much in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 419 components and 1587 elements
+(when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \textit{Contact}) included by three other components (\textit{Project}, \textit{Institution}, \textit{Access}) will appear three times in the instantiated record.}).
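The difference between distinct and expanded counts described in the footnote can be illustrated with a minimal sketch. It assumes a simplified in-memory component tree; the class \texttt{Component}, the profile name and the element names (\texttt{Name}, \texttt{Email}, \texttt{Title}, ...) are hypothetical and are not taken from the Component Registry.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Simplified stand-in for a CMD component (hypothetical structure)."""
    name: str
    elements: list[str] = field(default_factory=list)
    children: list["Component"] = field(default_factory=list)


def expanded_element_count(component):
    """Count elements as they appear in the instantiated record (with repetition)."""
    return len(component.elements) + sum(
        expanded_element_count(child) for child in component.children
    )


def distinct_elements(component):
    """Collect (component name, element name) pairs, each counted only once."""
    pairs = {(component.name, element) for element in component.elements}
    for child in component.children:
        pairs |= distinct_elements(child)
    return pairs


# Contact is reused by three other components, as in the footnote's example.
contact = Component("Contact", ["Name", "Email"])
project = Component("Project", ["Title"], [contact])
institution = Component("Institution", ["Department"], [contact])
access = Component("Access", ["Availability"], [contact])
profile = Component("ExampleProfile", [], [project, institution, access])

if __name__ == "__main__":
    print("expanded:", expanded_element_count(profile))
    print("distinct:", len(distinct_elements(profile)))
```

In this toy profile the two elements of \texttt{Contact} are counted three times in the expanded view but only once in the distinct view, which is the same effect that separates the `Distinct' and `Expanded' rows of Table \ref{table:dev}.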
+\begin{table}
+\caption{The development of defined profiles and DCs over time}
+\label{table:dev}
+  \begin{tabular}{ l | r | r | r | r }
+    \hline
+date     & 2011-01 & 2012-06 & 2013-01 & 2013-06  \\
+    \hline
+Profiles & 40 & 53 & 87 & 124 \\
+Distinct Components & 164 & 298 & 542 & 828 \\
+Expanded Components & 1055 & 1536 & 2904 & 5757 \\
+Distinct Elements & 511 & 893 & 1505 & 2399 \\
+Expanded Elements & 1971 & 3030 & 5754 & 13232 \\
+Distinct data categories & 203 & 266 & 436 & 499 \\
+Data categories in the Metadata profile & 277 & 712 & 774 & 791 \\
+Ratio of elements without DCs & 24,7\% & 17,6\% & 21,5\% & 26,5\% \\
+Components with DCs & 28 & 67 & 115 & 140 \\
+    \hline
+  \end{tabular}
+\end{table}
+\subsection{Instance Data}
+\todoin{probably more numbers about CMD records (collections, used profiles, ...) (in historical perspective?)}
+On the instance level, in the harvested data 60 distinct profiles can be found.
+The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
+collects records from 69 providers on a daily basis. The complete dataset amounts to 540.065 records.
+16 of the providers offer CMDI records, the other 53 provide OLAC/DC records\label{info:olac-records}, which are converted into the corresponding CMD profile after harvesting. Next to these 81.226 original OLAC records, there are a few providers offering their OLAC or DCMI-terms records already converted into CMDI; thus, all in all, OLAC and DCMI-terms records amount to 139.152.
+On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different profiles.) So we encounter both situations: one profile being used by many providers and one provider using many profiles.
+\begin{table}
+\caption{Top 20 profiles, with the respective number of records}
 \begin{center}
+  \begin{tabular}{ l | r }
+    \hline
+created & 2013-01-26 \\ \hline
+Profiles & 87 \\
+Components & 2904 \\
+distinct Components & 542 \\
+Elements & 5754 \\
+distinct Elements & 1505 \\
+distinct DatCats & 436 \\
+Elements with DatCats & 1183 \\
+Elements without DatCats & 323 \\
+ratio of elements without DatCats & 21.46 \% \\
+available Concepts & 893 \\
+used Concept & 474 \\
+blind Concepts (not in public ISOcat) & 190 \\
+Concepts not used in CMD & 539 \\
+  \begin{tabular}{ r | l }
+\# records & profile \\
+    \hline
+.403 & Song \\
+.257 & Session \\
+.996 & OLAC-DcmiTerms \\
+.156 & DcmiTerms \\
+.448 & SongScan \\
+.256 & SourceScan \\
+.059 & LiteraryCorpusProfile \\
+& Source \\
+& imdi-corpus \\
+& media-session-profile \\
+& SongAudio \\
+& SymbolicMusicNotation \\
+& LCC DataProviderProfile \\
+& SourceProfile \\
+& Text \\
+& Soundbites-recording \\
+& Performer \\
+& ArthurianFiction \\
+& LrtInventoryResource \\
+& teiHeader \\
     \hline
   \end{tabular}
 \end{center}
+\todoin{Collect number about CMD-Framework (profiles, datcats) + historical development}
+\todoin{Collect numbers about CMD records (collections, used profiles, ...) in historical perspective}
+\end{table}
+We can also observe a large disparity in the number of records between individual providers and profiles. Almost half of all records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by the MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from \textit{The Language Archive}). On the other hand, there are 25 profiles that have fewer than 10 instances. This can be owing both to the state of the respective project (resources and records still being prepared) and to the modelled granularity level (collection vs. individual resource).
 …
 DC, OLAC
+openarchives register: \url{http://www.openarchives.org/Register/BrowseSites}
+OAI-repositories
 DublinCore Resource Types\furl{http://dublincore.org/documents/resource-typelist/}
+DublinCore to RDF mapping\furl{http://dublincore.org/documents/dcq-rdf-xml/}
+\label{def:OLAC}
+A more specific version of the dublincore terms, adapted to the needs of the linguistic community is the
+OLAC\furl{http://www.language-archives.org/}format\cite{Bird2001}
+OLAC \cite{Simons2003OLAC}.
+\todoin{check http://www.language-archives.org/OLAC/metadata.html}
+\begin{quotation}
+The OLAC metadata set is the set of metadata elements that participating archives have agreed to use for describing language resources. Uniform description across archives is ensured by limiting the values of certain metadata elements to the use of terms from agreed-upon controlled vocabularies. The OLAC metadata set is equally applicable whether the resources are available online or not. The metadata set consists of the fifteen elements of the Dublin Core Metadata Set, plus the refinements and encoding schemes of the DCMI Metadata Terms -- a widely accepted standard for describing resources of all types. To this general standard, OLAC adds encoding schemes that are designed specifically for describing language resources, such as subject language and linguistic data type. The OLAC Metadata Usage Guidelines describe (with examples) all the elements, refinements, and encoding schemes that may be used in OLAC metadata descriptions. The OLAC Metadata standard defines the XML format that is used for the interchange of metadata descriptions among participating archives.
+\end{quotation}
 \subsection{TEI / teiHeader}
 …
 \subsection{Europeana Data Model - EDM}
+\subsection{META-SHARE}
+META-SHARE is another multinational project aiming to build an infrastructure for language resources \cite{Piperidis2012meta}, however focusing more on the Human Language Technologies domain.\furl{http://meta-share.eu}
+\begin{quotation}
+META-NET is designing and implementing META-SHARE, a sustainable network of repositories of language data, tools and related web services documented with high-quality metadata, aggregated in central inventories allowing for uniform search and access to resources. Data and tools can be both open and with restricted access rights, free and for-a-fee. META-SHARE targets existing but also new and emerging language data, tools and systems required for building and evaluating new technologies, products and services.
+\end{quotation}
+\begin{quotation}
+META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role.
+META-SHARE is implemented in the framework of the META-NET Network of Excellence. It is designed as a network of distributed repositories of LRs, including language data and basic language processing tools (e.g., morphological analysers, PoS taggers, speech recognisers, etc.).
+\end{quotation}
+The distributed network of repositories consists of a number of member repositories that each offer their own subset of resources.
+A few\footnote{7 as of 2013-07} of the member repositories play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
+The whole network offers approximately 2.000 resources (the numbers differ even across individual managing nodes).
+MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
 …
 AAT - international Architecture and Arts Thesaurus
 GND - Gemeinsame Norm Datei
+GND - Gemeinsame Norm Datei (GND ontology\furl{http://d-nb.info/standards/elementset/gnd})
 GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
 VIAF - Virtual International Authority File
 …
 \section{Other Metadata Catalogs/Collections}
+Digital Libraries
+\subsubsection{(Digital) Libraries}
+\subsection{(Digital) Libraries}

SMC4LRT/chapters/Definitions.tex

-                      r2703
+                      r3140
 \chapter{Definitions}
+Meanings of ``mapping'':
+\begin{itemize}
+\item transform
+\item match (schemas)
+\item  overview (browser)
+\end{itemize}
 \section {Namespaces}
 …
 \section {Abbreviations}
+\begin{description}
+\item[CLARIN] \textit{Common Language Resources and Technology Infrastructure} \ref{def:CLARIN}
+\item[CLAVAS] \textit{Vocabulary Alignment Service for CLARIN} \ref{def:CLAVAS}
+\item[CMD] \textit{Component Metadata} \ref{def:CMD}
+\item[CMDI] \textit{Component Metadata Infrastructure} \ref{def:CMDI}
+\item[ERIC] \textit{European Research Infrastructure  Consortium} - a legal entity for long-term research infrastructure initiatives
+\item[DC] data category
+\item[DCR] data category registry \cite{ISO12620:2009}
+\item[OLAC] \textit{Open Language Archive Community}\furl{http://www.language-archives.org/}\ref{def:OLAC}
+\item[PID] persistent identifier \todocite{PID}
+\item[PURL] persistent uniform resource locator \todocite{PURL}
+\item[RDF] \textit{Resource Description Framework} \todocite{RDF}
+\item[RR] Relation Registry\ref{def:rr}
+\item[TEI] \textit{Text Encoding Initiative}
+\end{description}
 \section {Terms}
 …
 \begin{description}
 \item[Concept]  sense, idea, philosophical problem, which we don't need to discuss here. For our purposes we say: Basic "entity" in an ontology? that of what an ontology is build
 \item[Ontology]  ``an explicit specification of a conceptualization'' \todoin{cite!}, but for us mainly a collection of concepts as opposed to lexicon, which is a collection of words.
+\item[Concept]  Basic ``entity'' in an ontology? that of which an ontology is built
+\item[Ontology]  ``an explicit specification of a conceptualization'' \todocite {Ontology!}, but for us mainly a collection of concepts as opposed to lexicon, which is a collection of words.
 \item[Word]  a lexical unit, a word in a language, something that has a surface realization (writtenForm) and is a carrier of sense. so a relation holds: hasSense(Word, Concept)
 \item[Lexicon]  a collection of words, a (lexical) vocabulary

SMC4LRT/chapters/Evaluation.tex

-                      r2672
+                      r3140
 Project, Institution, Person, Publisher
+\section{Usability}
+\section{Exploring Data Categories}
+In the ISOcat DCR 791 DCs are defined in the Metadata thematic profile, out of which 222 were created by the \textit{Athens Core} group. \todoin{need to check, how many of these athens-core data categories are being employed} In the following we describe two showcases -- \textit{Language} and \textit{name} -- in more detail.
+\subsection{Language}
+While there are 69 components and 97 elements containing a substring `language' defined in the CR
+still only 19 distinct DCs with a `language' substring are being used\footnote{Here the term `used' means referenced in CMD components and elements.}. The most commonly used ones:
+\textit{languageID} (\texttt{DC-2482}) and \textit{languageName} (\texttt{DC-2484}) are referenced by more than 80 profiles.
+Additionally, these two DCs are linked to the Dublin Core term \textit{Language} in the RR.
+Thus a search engine capable of interpreting RR information could offer the user a simple Dublin Core-based search interface, while -- by expanding the query -- still searching over all available data, and, moreover, on demand offer the user a more fine-grained semantic interpretation of the matches based on the originally assigned DCs. Figure \ref{fig:language_datcats} depicts the relations between the language data categories and their usage in the profiles. We encounter all types of situations: some profiles use only \textit{dc:Language} or \textit{dcterms:Language}, others only \textit{isocat:languageId} or \textit{isocat:languageName};
+most profiles use both \textit{isocat:languageId} and \textit{isocat:languageName}, and there are even profiles that refer to both \textit{isocat} and \textit{dublincore} data categories (\textit{data}, \textit{HZSKCorpus}, \textit{ToolService}).
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=1\textwidth]{images/SMC-export_language_custom_v2c.pdf}
+\end{center}
+\caption{The four main \textit{Language} data categories and in which profiles they are being used}
+\label{fig:language_datcats}
+\end{figure*}
+It requires further inspection, and in the end a case-by-case decision, whether the other, less often used `language' DCs can be treated as equivalent to the above-mentioned ones.
+\textit{languageScript}, \textit{implementationLanguage}, as well as \textit{noLanguages} or  \textit{sizePerLanguage} clearly do not belong to the language cluster.
+But \textit{sourceLanguage}, \textit{languageMother} or \textit{participantDominantLanguage} can at least be expected to share the same value domain (natural languages) and even if they do not describe the language of the resource, they could be considered when one aims at maximizing the recall (i.e., trying to find anything related to a given language). This is exactly the scenario the RR was conceived for -- to allow the definition of custom relation sets based on the specific needs of a project or of a research question.
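The query expansion described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the RR's actual interface: the dictionaries \texttt{RELATION\_SET} and \texttt{ELEMENTS\_BY\_DATCAT}, the identifier spelling and the element paths are hypothetical; only the data category numbers \texttt{DC-2482} and \texttt{DC-2484} come from the text above.

```python
# Hypothetical relation set, standing in for what the RR might store:
# a Dublin Core term mapped to the ISOcat data categories related to it.
RELATION_SET = {
    "dc:language": {"isocat:DC-2482", "isocat:DC-2484"},  # languageID, languageName
}

# Hypothetical index from data category to the CMD element paths referencing it.
ELEMENTS_BY_DATCAT = {
    "dc:language": ["OLAC-DcmiTerms/language"],
    "isocat:DC-2482": ["Session/Content_Language/Id"],
    "isocat:DC-2484": ["Session/Content_Language/Name"],
}


def expand_datcats(term, relation_set):
    """Return the term itself plus all data categories related to it."""
    return {term} | relation_set.get(term, set())


def expand_to_elements(term, relation_set, elements_by_datcat):
    """Return every element path reachable from the term, in a stable order."""
    paths = []
    for datcat in sorted(expand_datcats(term, relation_set)):
        paths.extend(elements_by_datcat.get(datcat, []))
    return paths


if __name__ == "__main__":
    for path in expand_to_elements("dc:language", RELATION_SET, ELEMENTS_BY_DATCAT):
        print(path)
```

A project-specific relation set aiming at maximal recall would simply add further entries (e.g. for \textit{sourceLanguage}) to the set of related data categories, without touching the profiles or the DCR.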
+\subsection{Name}
+There are as many as 72 CMD elements with the label \texttt{Name}, referring to 12 different DCs.
+Again, the main DC \textit{resourceName} (\texttt{DC-2544}), used in 74 profiles, together with the semantically close \textit{resourceTitle} (\texttt{DC-2545}), used in 69 profiles, offers good coverage of the available data.
+Some of the DCs referenced by \texttt{Name} elements are \textit{author} (\texttt{DC-4115}), \textit{contact full name} (\texttt{DC-2454}), \textit{dcterms:Contributor}, \textit{project name} (\texttt{DC-2536}), \textit{web service name} (\texttt{DC-4160}) and \textit{language name} (\texttt{DC-2484}). This implies that a na\"{i}ve search in a \texttt{Name} element would match semantically very heterogeneous fields, and that only the semantic information provided by the DCs and/or the context of the element (the enclosing components) makes it possible to disambiguate the meaning of the values.
+\subsection{Resource type}
+\subsection{Subject, Genre, Topic}
+\section{Mapping existing Formats}
+\subsection{dublincore / OLAC}
+Very widely used format
+\ref{info:olac-records}
+There are 4-5 CMD profiles modelling OLAC/dcmi-terms
+\subsection{teiHeader}
+TEI is a de-facto standard for encoding any kind of textual resources. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description / metadata the complex element \code{teiHeader} is foreseen.
+TEI does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and the inner structure, allowing custom schemas to be generated that are adapted to projects' needs.
+Thus there is also not just one fixed \xne{teiHeader}.
+The widespread use of TEI for encoding textual resources  brings about a strong interest of multiple research teams of the CLARIN community to integrate TEI with CMDI. There was a first attempt already in 2010, modelling the recommended \xne{teiHeader}\furl{http://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html\#HD7}, encoding \xne{fileDesc} and \xne{profileDesc} components, leaving out \xne{encodingDesc} and \xne{revisionDesc}. The leaf elements were bound to the most prominent data categories, making it a mixture of both dublincore and isocat.
+The large research project \xne{Deutsches Text Archiv}\furl{http://deutschestextarchiv.de/}\todocite{DTA}, digitizing a host of historical German texts from the period 1650 - 1900, also uses TEI to encode the material and consequently the teiHeader to hold the metadata information.
+\todoin{Why a separate cmd-profile}
+\xne{Nederlab} is another large-scale project concerned with \todoin{dutch? historic texts}, starting in 2013 in the Netherlands\todocite{Nederlab}. Within this project another set of CMD profiles was created, however reusing existing components.
+As seen in figure \ref{fig:teiHeader_DBNL}, components \xne{fileDesc} and \xne{profileDesc} were reused, while the components \xne{encodingDesc} and \xne{revisionDesc}, left out in the original profile, were added.
+Another approach was applied within the context of other CLARIN-NL projects: \todocite{Windhouwer, 2012} generated, based on an ODD-file, a data category for every element of the teiHeader (135 datcats), creating a dedicated data category selection: \xne{TEI Header (2.1.0)}. In a subsequent step, an enriched schema was generated that remodels the original teiHeader-schema, but with the individual elements being annotated with the new data categories (\code{dcr:datcat}-attribute). This schema is now maintained in the SCHEMAcat (cf. \ref{ch:components}). The next step would be to create again a new profile, but with all the components and elements in it bound to the corresponding data categories, moving the semantic linking into the relation registry, where appropriate relations could be defined between the data categories derived from TEI and the \xne{isocat} and/or \xne{dublincore} DCs.
+This yields a more complex, but also a more systematic and flexible setup, with a clean separation/boundary/interface of the semantic space of TEI and the possibility to map the TEI elements (via their data categories) to multiple and/or different data categories according to the specific needs of a project or research question.
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=0.75\textwidth]{images/teiHeader_DBNL.png}
+\end{center}
+\caption{The reuse of components between the original teiHeader-profile (2010) and the profiles used in the Nederlab project}
+\label{fig:teiHeader_DBNL}
+\end{figure*}
+\begin{table}
+\caption{Overview of TEI-related CMD profiles}
+\label{table:tei-profiles}
+  \begin{tabular}{ l | r | l | r | r }
+    \hline
+project, author & created & profile name & components/elements/datcats & instances \\
+    \hline
+Deutsches Text Archiv & 2012 & teiHeader & 56/82/10 & 857 \\
+ICLTT, Durco & 2010 & teiHeader & 16/35/13 & 467 \\
+Leipzig Corpora, Eckart & 2012 & TEIDocumentDescription & 16/35/13 & ? \\
+Nederlab, Zhang & 2013 & DBNL\_Tekst & 20/38/15 & ? \\
+  & & DBNL\_Tekst\_Onzelfstandig & 20/47/21 & ? \\
+    \hline
+  \end{tabular}
+\end{table}
+\todoin{DBNL\_Tekst\_Onzelfstandig - how many instances?}
+DBNL\_Tekst clarin.eu:cr1:p\_1361876010678,
+clarin.eu:cr1:p 1366279029218 (private)
+\subsection{META-SHARE}
+META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a minimal core subset of obligatory metadata elements and with many optional components.
+%In cooperation between metadata teams from CLARIN and META-SHARE
+The model has been expressed as 4 CMD profiles for distinct resource types sharing most of the components. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 419 components and 1587 elements (when expanded). Although most of the elements are optional
+resourceInfo    419     1587    72      790     797     50.22 %
+\todoin{how many distinct components/elements}
+This? shows nicely the trade-off between the two approaches of CMD and META-SHARE: many custom schemas or one very large one.
+In a parallel effort, LINDAT, the Czech national infrastructure initiative with ties to both CLARIN and META-SHARE, created a CMD profile modelling the minimal obligatory set of META-SHARE, combined with dublincore.
+So the information is partly duplicated, but with the advantage that minimal information is conveyed in the widely understood format, while retaining the expressivity of the feature-rich schema.
+resourceInfo    65      92      21      82      10      10.87 %
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=0.75\textwidth]{images/LINDAT-profile-data.png}
+\end{center}
+\caption{profile by LINDAT combining META-SHARE \xne{resourceInfo} component with dublincore elements }
+\label{fig:META-SHARE-LINDAT}
+\end{figure*}
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[height=1\textheight]{images/resourceInfoBIG.png}
+\end{center}
+\caption{the META-SHARE based profile for describing corpora}
+\label{fig:META-SHARE-BIG}
+\end{figure*}
+%\section{Usability}

SMC4LRT/chapters/Infrastructure.tex

-                      r2703
+                      r3140
 \section{CLARIN / CMDI}
+\label{def:CLARIN}
 CLARIN - Common Language Resources and Technology Infrastructure - constituted by over 180 members from around 38 countries. The mission of this project is to
 …
 \item Data Category Registry
 \item Relation Registry
 \item Schema Registry
+\item Schema Registry (SCHEMAcat\furl{http://lux13.mpi.nl/schemacat/site/index.html})
 \item Component Registry
 \item Vocabulary Alignement Service (OpenSKOS)
 …
 The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions.
 However there needs to be an additional means to capture information about relations between data categories.
 This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. We expect that the RR should be under control of the metadata user whereas the DCR is under control of the metadata modeler.
+This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. We expect that the RR should be under control of the metadata user whereas the DCR is under control of the metadata modeler.
 % These relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed.
 …
 from the traditional methods of schema matching that try to establish pairwise alignments between schemas only after they were created and published. % -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.
 Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this novel mechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules is described in the following chapter \ref{method}.
+Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules are described in the following chapter \ref{method}.
 \subsection{Vocabulary Service / Reference Data Registry}
 …
 The first stable and publicly available application providing access to the collected metadata of CMDI has been the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, being developed within the CLARIN project. This application operates on the same collection of data as is discussed in this work, however it employs a faceted search, mapping manually the appropriate metadata fields from the different schemas to 10? fixed facets. Underlying search engine is the widely used full-text search engine Apache Solr\footnote{\url{http://lucene.apache.org/solr/}}. Although this is a very reductionist approach it is certainly a great starting point offering a core set of categories together with an initial set of category mappings.
 More recently, the team at Meertens Institute developed a similar application the \texttt{MI Search Engine}\footnote{\url{http://www.meertens.knaw.nl/cmdi/search/}}. It too is based on the Apache Solr and provides a faceted search, but with a substantially more sophisticated both indexing process and search interface. \todoin { describe indexing and search}
+More recently, the team at the Meertens Institute developed a similar application, the \texttt{MI Search Engine}\furl{http://www.meertens.knaw.nl/cmdi/search/}. It too is based on Apache Solr and provides a faceted search, but with both a substantially more sophisticated indexing process and search interface. \todoin { describe indexing and search}
 \todocite {MI Search Engine}
 …
 \section{Content Repositories}
 Metadata is only one aspect of the availability of resources. It is the first step to announce and describe the resources. However it is of little value, if the resources themselves are not equally well accessible. Thus another pillar of the CLARIN infrastructure are Content Repositories - centres to ensure availability of resources.
+RDF-stores in Content Repositories (Fedora, ..)
 The requirements for these repositories: PIDs, CMD, OAI-PMH
 …
 \item[OAI-PMH]
 \end{description}
+\section{Summary}

SMC4LRT/chapters/Introduction.tex

-                      r2703
+                      r3140
 \subsection{Problem statement}
 While in the Digital Libraries community a consolidation generally already happened and big federated networks of digital libary repository are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardizing efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming for one from the wide range of resource types additionally complicated by dependence of different schools of thought.
+While in the Digital Libraries community a consolidation has generally already happened and big federated networks of digital library repositories have been set up, in the field of Language Resources and Technology the landscape is still scattered, despite a decade of standardization efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming for one from the wide range of resource types, additionally complicated by the influence of different schools of thought (cf. \ref{ch:data}).
 \todoin{Need some number about the disparity in the field, number of institutes, resources, formats.}
 This situation has been identified by the community and multiple standardization initiatives had been conducted/undertaken. This process seems to have gained a new momentum thanks to large Research Infrastructure Programmes introduced by European Commission, aimed at fostering Research communities developing large-scale pan-european common infrastructures. One key player in this development is the project CLARIN.
+This situation has been identified by the community and multiple standardization initiatives have been undertaken. This process has gained new momentum thanks to large Research Infrastructure Programmes introduced by the European Commission, aimed at fostering research communities developing large-scale pan-European common infrastructures. One key player in this development is the project CLARIN.
 …
 Finally, in a prototypical implementation of the two components we want to deliver a proof of the concept, supported by an evaluation
 in which we apply a set of  test queries and compare a traditional search with a semantically expanded query in terms of recall/precision indicators. A separate evaluation of the usability of the Semantic Search component  is indicated, however this issue can only be tackled marginally and will have to be outsourced into future work.
+in which we apply a set of  test queries and compare a traditional search with a semantically expanded query in terms of recall/precision measures. A separate evaluation of the usability of the Semantic Search component  is indicated; however, this issue can only be tackled marginally and will have to be deferred to future work.
 \begin{itemize}
 …
 This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components and the results and findings of the \emph{evaluation}.
 One promising by-product of the work will be the original dataset expressed as RDF with links into existing external  resources (ontologies, knowledgebases, vocabularies), effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\footnote{\url{http://linkeddata.org/}} in the \emph{Web of Data}.
+One promising by-product of the work will be the original dataset expressed as RDF with links into existing external  resources (ontologies, knowledgebases, vocabularies), effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\furl{http://linkeddata.org/} in the \emph{Web of Data}.

SMC4LRT/chapters/Literature.tex

-                      r2703
+                      r3140
 This is very much in line with the broad \textit{Linked Open Data} effort as proposed by Berners-Lee \cite{TimBL2006} and being pursued across many disciplines. (This topic is supported also by the EU Commission within the FP7.\footnote{\url{http://cordis.europa.eu/fetch?CALLER=PROJ\_ICT&ACTION=D&CAT=PROJ&RCN=95562}}) A very recent comprehensive overview of the principles of Linked Data and current applications is the book by Heath and Bizer \cite{HeathBizer2011}, which shall serve as a practical guide for this specific task.
+Formats:
+Turtle \furl{http://www.w3.org/TeamSubmission/turtle/\#sec-grammar-comments}
+RDFa\furl{http://en.wikipedia.org/wiki/RDFa}
+EDM\furl{http://europeana.ontotext.com/resource/edm/hasType?role=all}
+\todocite{http://ldl2012.lod2.eu/program/proceedings}
+\todoin{check LDpath}\furl{http://code.google.com/p/ldpath/}
 \subsection{Schema / Ontology Mapping}

SMC4LRT/chapters/SMC.tex

-                      r2703
+                      r3140
 \todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/}
+\todocode{check/install: Linked Data browser: LoD p. 81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)}
 defining the Mapping:
 \begin{enumerate}
 …
 AF + DCR + RR
+\section{Summary}

SMC4LRT/chapters/System.tex

-                      r2703
+                      r3140
 \todocode{install Jena +  fuseki}\furl{http://jena.apache.org}\furl{http://jena.apache.org/documentation/serving_data/index.html}\furl{http://csarven.ca/how-to-create-a-linked-data-site}
+\todocode{check install siren}\furl{http://siren.sindice.com/}
+\todocode{check install Virtuoso}\furl{http://ods.openlinksw.com/wiki/ODS/}
+\todocode{check install Neo4J}
+\todocode{check install ontology browser}
+semantic search component in the Linked Media Framework
+\todocode{!!! check install LMF - kiwi - SemanticSearch !!!}\furl{http://code.google.com/p/kiwi/wiki/SemanticSearch}
+\todoin{check SARQ}\furl{http://github.com/castagna/SARQ}
 \todocode{Load data: relcat, clavas, olac-and-dc-providers cmd, lt-world?}
 …
 \section{User Interface?}
 \subsection{Query Input}
+\subsection*{Query Input}
 \subsection{Columns}
+\subsection*{Columns}
 \subsection{Summaries}
+\subsection*{Summaries}
 \subsection{Differential Views}
+\subsection*{Differential Views}
 Visualize impact of given mapping in terms of covered dataset (number of matched records).
 \subsection{Visualization}
+\subsection*{Visualization}
 Landscape, Treemap, SOM
