Changeset 3140
- Timestamp:
- 07/15/13 19:02:41 (11 years ago)
- Location:
- SMC4LRT/chapters
- Files:
-
- 8 edited
SMC4LRT/chapters/Data.tex
r2704 r3140 10 10 \subsection{CMD-Framework} 11 11 12 13 14 \subsubsection{CMD Profiles} 15 In the CR, 124\footnote{All numbers are as of 2013-06 if not stated otherwise} public Profiles and 696 Components are defined. Table \ref{table:dev} shows the development of the CR and DCR population over time. 16 17 Besides the `native' CMD profiles, a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles demonstrate the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles also differ considerably in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 419 components and 1587 elements 18 (when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \textit{Contact}) included by three other components (\textit{Project}, \textit{Institution}, \textit{Access}) will appear three times in the instantiated record.}).
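The element expansion mentioned in the footnote can be illustrated with a minimal sketch. The component registry below is a hypothetical toy example (the element names and the \textit{ResourceProfile} root are assumptions made for illustration, not taken from the CR); it only shows how a component reused three times contributes its elements three times to the expanded count, while being counted once among the distinct elements.

```python
# Hypothetical toy registry: component name -> (own elements, included components).
# Names are illustrative assumptions, not actual Component Registry content.
COMPONENTS = {
    "Contact": (["Name", "Email", "Address"], []),
    "Project": (["Title"], ["Contact"]),
    "Institution": (["Label"], ["Contact"]),
    "Access": (["Availability"], ["Contact"]),
    "ResourceProfile": ([], ["Project", "Institution", "Access"]),
}


def expanded_elements(name, path=()):
    """Return one path per element occurrence in the instantiated record."""
    own, included = COMPONENTS[name]
    here = path + (name,)
    paths = ["/".join(here + (element,)) for element in own]
    for child in included:
        paths.extend(expanded_elements(child, here))
    return paths


def distinct_elements(name, seen=None):
    """Return the set of (component, element) pairs, each counted once."""
    seen = set() if seen is None else seen
    own, included = COMPONENTS[name]
    seen.update((name, element) for element in own)
    for child in included:
        distinct_elements(child, seen)
    return seen


expanded = expanded_elements("ResourceProfile")
distinct = distinct_elements("ResourceProfile")
print(len(distinct), len(expanded))  # prints: 6 12
```

In this toy profile the three elements of \textit{Contact} occur once per including component, so 6 distinct elements expand to 12 element occurrences.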
19 20 21 \begin{table} 22 \caption{The development of defined profiles and DCs over time} 23 \label{table:dev} 24 \begin{tabular}{ l | r | r | r | r } 25 \hline 26 date & 2011-01 & 2012-06 & 2013-01 & 2013-06 \\ 27 \hline 28 Profiles & 40 & 53 & 87 & 124 \\ 29 Distinct Components & 164 & 298 & 542 & 828 \\ 30 Expanded Components & 1055 & 1536 & 2904 & 5757 \\ 31 Distinct Elements & 511 & 893 & 1505 & 2399 \\ 32 Expanded Elements & 1971 & 3030 & 5754 & 13232 \\ 33 Distinct data categories & 203 & 266 & 436 & 499 \\ 34 Data categories in the Metadata profile & 277 & 712 & 774 & 791 \\ 35 Ratio of elements without DCs & 24.7\% & 17.6\% & 21.5\% & 26.5\% \\ 36 Components with DCs & 28 & 67 & 115 & 140 \\ 37 38 \hline 39 \end{tabular} 40 \end{table} 41 42 43 \subsection{Instance Data} 44 45 46 \todoin{probably more numbers about CMD records (collections, used profiles, ...) (in historical perspective?)} 47 48 On the instance level, 60 distinct profiles can be found in the harvested data. 49 50 The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}} 51 collects records from 69 providers on a daily basis. The complete dataset amounts to 540,065 records. 52 16 of the providers offer CMDI records, the other 53 provide OLAC/DC records\label{info:olac-records}, which are converted into the corresponding CMD profile after harvesting. Besides these 81,226 original OLAC records, there are a few providers offering their OLAC or DCMI-terms records already converted into CMDI, so that OLAC and DCMI-terms records together amount to 139,152. 53 On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. the Meertens Institute uses 12 different profiles). So we encounter both situations: one profile being used by many providers and one provider using many profiles. 
54 55 56 \begin{table} 57 \caption{Top 20 profiles, with the respective number of records} 12 58 \begin{center} 13 \begin{tabular}{ l | r } 14 \hline 15 created & 2013-01-26 \\ \hline 16 Profiles & 87 \\ 17 Components & 2904 \\ 18 distinct Components & 542 \\ 19 Elements & 5754 \\ 20 distinct Elements & 1505 \\ 21 distinct DatCats & 436 \\ 22 Elements with DatCats & 1183 \\ 23 Elements without DatCats & 323 \\ 24 ratio of elements without DatCats & 21.46 \% \\ 25 available Concepts & 893 \\ 26 used Concept & 474 \\ 27 blind Concepts (not in public ISOcat) & 190 \\ 28 Concepts not used in CMD & 539 \\ 59 \begin{tabular}{ r | l } 60 \# records & profile \\ 61 \hline 62 155,403 & Song \\ 63 138,257 & Session \\ 64 92,996 & OLAC-DcmiTerms \\ 65 46,156 & DcmiTerms \\ 66 28,448 & SongScan \\ 67 21,256 & SourceScan \\ 68 19,059 & LiteraryCorpusProfile \\ 69 16,519 & Source \\ 70 13,626 & imdi-corpus \\ 71 10,610 & media-session-profile \\ 72 7,961 & SongAudio \\ 73 7,557 & SymbolicMusicNotation \\ 74 4,485 & LCC DataProviderProfile \\ 75 4,485 & SourceProfile \\ 76 4,417 & Text \\ 77 1,982 & Soundbites-recording \\ 78 1,530 & Performer \\ 79 1,475 & ArthurianFiction \\ 80 939 & LrtInventoryResource \\ 81 873 & teiHeader \\ 29 82 \hline 30 83 \end{tabular} 31 84 \end{center} 32 33 \todoin{Collect number about CMD-Framework (profiles, datcats) + historical development} 34 35 \todoin{Collect numbers about CMD records (collections, used profiles, ...) in historical perspective} 85 \end{table} 86 87 We can also observe a large disparity in the number of records between individual providers and profiles. Almost half of all records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from \textit{The Language Archive}). On the other hand there are 25 profiles that have fewer than 10 instances. 
This may be due both to the state of the respective project (resources and records still being prepared) and to the modelled granularity level (collection vs. individual resource). 36 88 37 89 … … 40 92 DC, OLAC 41 93 94 openarchives register: \url{http://www.openarchives.org/Register/BrowseSites} 95 2006 OAI-repositories 96 42 97 DublinCore Resource Types\furl{http://dublincore.org/documents/resource-typelist/} 98 99 DublinCore to RDF mapping\furl{http://dublincore.org/documents/dcq-rdf-xml/} 100 101 \label{def:OLAC} 102 A more specific version of the dublincore terms, adapted to the needs of the linguistic community, is the 103 OLAC\furl{http://www.language-archives.org/} format \cite{Bird2001}. 104 105 OLAC \cite{Simons2003OLAC}. 106 107 \todoin{check http://www.language-archives.org/OLAC/metadata.html} 108 109 \begin{quotation} 110 The OLAC metadata set is the set of metadata elements that participating archives have agreed to use for describing language resources. Uniform description across archives is ensured by limiting the values of certain metadata elements to the use of terms from agreed-upon controlled vocabularies. The OLAC metadata set is equally applicable whether the resources are available online or not. The metadata set consists of the fifteen elements of the Dublin Core Metadata Set, plus the refinements and encoding schemes of the DCMI Metadata Terms -- a widely accepted standard for describing resources of all types. To this general standard, OLAC adds encoding schemes that are designed specifically for describing language resources, such as subject language and linguistic data type. The OLAC Metadata Usage Guidelines describe (with examples) all the elements, refinements, and encoding schemes that may be used in OLAC metadata descriptions. The OLAC Metadata standard defines the XML format that is used for the interchange of metadata descriptions among participating archives. 
111 \end{quotation} 112 113 114 43 115 44 116 \subsection{TEI / teiHeader} … … 50 122 51 123 \subsection{Europeana Data Model - EDM} 124 125 \subsection{META-SHARE} 126 META-SHARE is another multinational project aiming to build an infrastructure for language resources \cite{Piperidis2012meta}, however focusing more on the Human Language Technologies domain.\furl{http://meta-share.eu} 127 128 \begin{quotation} 129 META-NET is designing and implementing META-SHARE, a sustainable network of repositories of language data, tools and related web services documented with high-quality metadata, aggregated in central inventories allowing for uniform search and access to resources. Data and tools can be both open and with restricted access rights, free and for-a-fee. META-SHARE targets existing but also new and emerging language data, tools and systems required for building and evaluating new technologies, products and services. 130 \end{quotation} 131 132 \begin{quotation} 133 META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role. 134 135 META-SHARE is implemented in the framework of the META-NET Network of Excellence. It is designed as a network of distributed repositories of LRs, including language data and basic language processing tools (e.g., morphological analysers, PoS taggers, speech recognisers, etc.). 136 137 \end{quotation} 138 139 The distributed network of repositories consists of a number of member repositories, each offering its own subset of resources. 140 141 A few\footnote{7 as of 2013-07} of the member repositories play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users. 
142 The whole network offers approximately 2,000 resources (the numbers differ even across individual managing nodes). 143 144 145 MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology} 52 146 53 147 … … 84 178 85 179 AAT - international Architecture and Arts Thesaurus 86 GND - Gemeinsame Norm Datei 180 GND - Gemeinsame Norm Datei (GND ontology\furl{http://d-nb.info/standards/elementset/gnd}) 87 181 GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives) 88 182 VIAF - Virtual International Authority File … … 147 241 \section{Other Metadata Catalogs/Collections} 148 242 149 Digital Libraries 150 \subsubsection{(Digital) Libraries} 243 \subsection{(Digital) Libraries} 151 244 152 245 -
SMC4LRT/chapters/Definitions.tex
r2703 r3140 1 1 \chapter{Definitions} 2 3 Meanings of ``mapping'': 4 \begin{itemize} 5 \item transform 6 \item match (schemas) 7 \item overview (browser) 8 \end{itemize} 9 2 10 3 11 \section {Namespaces} … … 11 19 \section {Abbreviations} 12 20 21 \begin{description} 22 \item[CLARIN] \textit{Common Language Resources and Technology Infrastructure} \ref{def:CLARIN} 23 \item[CLAVAS] \textit{Vocabulary Alignment Service for CLARIN} \ref{def:CLAVAS} 24 \item[CMD] \textit{Component Metadata} \ref{def:CMD} 25 \item[CMDI] \textit{Component Metadata Infrastructure} \ref{def:CMDI} 26 \item[ERIC] \textit{European Research Infrastructure Consortium} - a legal entity for long-term research infrastructure initiatives 27 \item[DC] data category 28 \item[DCR] data category registry \cite{ISO12620:2009} 29 \item[OLAC] \textit{Open Language Archives Community}\furl{http://www.language-archives.org/} \ref{def:OLAC} 30 \item[PID] persistent identifier \todocite{PID} 31 \item[PURL] persistent uniform resource locator \todocite{PURL} 32 \item[RDF] \textit{Resource Description Framework} \todocite{RDF} 33 \item[RR] Relation Registry \ref{def:rr} 34 \item[TEI] \textit{Text Encoding Initiative} 35 \end{description} 13 36 14 37 \section {Terms} … … 17 40 18 41 \begin{description} 19 \item[Concept] sense, idea, philosophical problem, which we don't need to discuss here. For our purposes we say:Basic "entity" in an ontology? that of what an ontology is build20 \item[Ontology] ``an explicit specification of a conceptualization'' \todo in{cite!}, but for us mainly a collection of concepts as opposed to lexicon, which is a collection of words.42 \item[Concept] Basic ``entity'' in an ontology? That of which an ontology is built. 43 \item[Ontology] ``an explicit specification of a conceptualization'' \todocite{Ontology!}, but for us mainly a collection of concepts, as opposed to a lexicon, which is a collection of words. 
21 44 \item[Word] a lexical unit, a word in a language, something that has a surface realization (writtenForm) and is a carrier of sense. so a relation holds: hasSense(Word, Concept) 22 45 \item[Lexicon] a collection of words, a (lexical) vocabulary -
SMC4LRT/chapters/Evaluation.tex
r2672 r3140 27 27 Project, Institution, Person, Publisher 28 28 29 \section{Usability} 29 30 \section{Exploring Data Categories} 31 In the ISOcat DCR, 791 DCs are defined in the Metadata thematic profile, out of which 222 were created by the \textit{Athens Core} group. \todoin{need to check, how many of these athens-core data categories are being employed} In the following we describe two showcases -- \textit{Language} and \textit{Name} -- in more detail. 32 33 \subsection{Language} 34 While 69 components and 97 elements containing the substring `language' are defined in the CR, 35 only 19 distinct DCs with a `language' substring are being used\footnote{Here the term `used' means referenced in CMD components and elements.}. The most commonly used ones, 36 \textit{languageID} (\texttt{DC-2482}) and \textit{languageName} (\texttt{DC-2484}), are referenced by more than 80 profiles. 37 Additionally, these two DCs are linked to the Dublin Core term \textit{Language} in the RR. 38 Thus a search engine capable of interpreting RR information could offer the user a simple Dublin Core-based search interface, while -- by expanding the query -- still searching over all available data, and, moreover, offer the user on demand a more fine-grained semantic interpretation of the matches based on the originally assigned DCs. Figure \ref{fig:language_datcats} depicts the relations between the language data categories and their usage in the profiles. We encounter all types of situations: some profiles use only \textit{dc:Language} or \textit{dcterms:Language}, or only \textit{isocat:languageId} or \textit{isocat:languageName}; 39 most profiles use both \textit{isocat:languageId} and \textit{isocat:languageName}; and there are even profiles that refer to both \textit{isocat} and \textit{dublincore} data categories (\textit{data}, \textit{HZSKCorpus}, \textit{ToolService}). 
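The query expansion described in the \textit{Language} subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual search engine: the relation set, the record structure and the helper names (\texttt{RELATIONS}, \texttt{RECORDS}, \texttt{expand}, \texttt{search}) are hypothetical. Only the link between the Dublin Core term \textit{Language} and the two ISOcat DCs (\texttt{DC-2482}, \texttt{DC-2484}) is taken from the text above.

```python
# Assumed relation set, mirroring the RR links described in the text:
# the Dublin Core term Language is related to languageID and languageName.
RELATIONS = {
    "dc:language": {"isocat:DC-2482", "isocat:DC-2484"},
}

# Hypothetical records; each field carries the data category it was bound to.
RECORDS = [
    {"id": "r1", "fields": [("dc:language", "Dutch")]},
    {"id": "r2", "fields": [("isocat:DC-2484", "Dutch")]},
    {"id": "r3", "fields": [("isocat:DC-2482", "nld")]},
    {"id": "r4", "fields": [("isocat:DC-2484", "German")]},
]


def expand(datcat):
    """Return the queried data category together with its related ones."""
    return {datcat} | RELATIONS.get(datcat, set())


def search(records, datcat, value, expand_query=True):
    """Match records on value; report the originally assigned data category."""
    targets = expand(datcat) if expand_query else {datcat}
    hits = []
    for record in records:
        for field_datcat, field_value in record["fields"]:
            if field_datcat in targets and field_value == value:
                hits.append((record["id"], field_datcat))
    return hits


plain = search(RECORDS, "dc:language", "Dutch", expand_query=False)
expanded = search(RECORDS, "dc:language", "Dutch")
print(plain)     # prints: [('r1', 'dc:language')]
print(expanded)  # prints: [('r1', 'dc:language'), ('r2', 'isocat:DC-2484')]
```

The expanded query finds the additional record bound to \textit{languageName}, and each hit retains its original data category, so that the finer-grained interpretation remains available on demand.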
40 41 42 \begin{figure*}[!ht] 43 \begin{center} 44 \includegraphics[width=1\textwidth]{images/SMC-export_language_custom_v2c.pdf} 45 \end{center} 46 \caption{The four main \textit{Language} data categories and the profiles in which they are being used} 47 \label{fig:language_datcats} 48 \end{figure*} 49 50 Whether the other, less frequently used `language' DCs can be treated as equivalent to the above-mentioned ones requires further inspection and, in the end, a case-by-case decision. 51 \textit{languageScript}, \textit{implementationLanguage}, as well as \textit{noLanguages} or \textit{sizePerLanguage} clearly do not belong to the language cluster. 52 But \textit{sourceLanguage}, \textit{languageMother} or \textit{participantDominantLanguage} can at least be expected to share the same value domain (natural languages) and even if they do not describe the language of the resource, they could be considered when one aims at maximizing the recall (i.e., trying to find anything related to a given language). This is exactly the scenario the RR was conceived for -- allowing custom relation sets to be defined based on the specific needs of a project or of a research question. 53 54 55 \subsection{Name} 56 There are as many as 72 CMD elements with the label \texttt{Name}, referring to 12 different DCs. 57 Again, the main DC \textit{resourceName} (\texttt{DC-2544}), used in 74 profiles, together with the semantically close \textit{resourceTitle} (\texttt{DC-2545}), used in 69 profiles, offers good coverage of the available data. 58 59 Some of the DCs referenced by \texttt{Name} elements are \textit{author} (\texttt{DC-4115}), \textit{contact full name} (\texttt{DC-2454}), \textit{dcterms:Contributor}, \textit{project name} (\texttt{DC-2536}), \textit{web service name} (\texttt{DC-4160}) and \textit{language name} (\texttt{DC-2484}). 
This implies that a na\"{i}ve search in a \texttt{Name} element would match semantically very heterogeneous fields, and only applying the semantic information provided by the DCs and/or the context of the element (the enclosing components) makes it possible to disambiguate the meaning of the values. 60 61 \subsection{Resource type} 62 63 \subsection{Subject, Genre, Topic} 64 65 \section{Mapping existing Formats} 66 67 \subsection{dublincore / OLAC} 68 69 Very widely used format 70 \ref{info:olac-records} 71 72 There are 4-5 CMD profiles modelling OLAC/dcmi-terms 73 74 75 76 \subsection{teiHeader} 77 78 TEI is a de-facto standard for encoding any kind of textual resources. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description / metadata, the complex element \code{teiHeader} is provided. 79 TEI does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and the inner structure, making it possible to generate custom schemas adapted to projects' needs. 80 Thus there is also not just one fixed \xne{teiHeader}. 81 82 The widespread use of TEI for encoding textual resources brings about a strong interest of multiple research teams of the CLARIN community to integrate TEI with CMDI. There was a first attempt already in 2010, modelling the recommended \xne{teiHeader}\furl{http://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html\#HD7}, encoding \xne{fileDesc} and \xne{profileDesc} components, leaving out \xne{encodingDesc} and \xne{revisionDesc}. The leaf elements were bound to the most prominent data categories, making it a mixture of both dublincore and isocat. 83 84 The large research project \xne{Deutsches Text Archiv}\furl{http://deutschestextarchiv.de/}\todocite{DTA}, digitizing a host of historical German texts from the period 1650--1900, also uses TEI to encode the material and consequently the teiHeader to hold the metadata information. 
85 \todoin{Why a separate cmd-profile} 86 87 \xne{Nederlab} is another large-scale project concerned with \todoin{dutch? historic texts}, started in 2013 in the Netherlands\todocite{Nederlab}. Within this project another set of CMD profiles was created, however reusing existing components. 88 As seen in figure \ref{fig:teiHeader_DBNL}, components \xne{fileDesc} and \xne{profileDesc} were reused, while the components \xne{encodingDesc} and \xne{revisionDesc}, left out in the original profile, were added. 89 90 Another approach was applied within the context of other CLARIN-NL projects: \todocite{Windhouwer, 2012} generated, based on an ODD-file, a data category for every element of the teiHeader (135 datcats), creating a dedicated data category selection: \xne{TEI Header (2.1.0)}. In a subsequent step, an enriched schema was generated that remodels the original teiHeader-schema, but with the individual elements being annotated with the new data categories (\code{dcr:datcat}-attribute). This schema is now maintained in the SCHEMAcat (cf. \ref{ch:components}). The next step would be to create again a new profile, but with all the components and elements in it bound to the corresponding data categories, moving the semantic linking into the relation registry, where appropriate relations could be defined between the data categories derived from TEI and the \xne{isocat} and/or \xne{dublincore} DCs. 91 This yields a more complex, but also a more systematic and flexible setup, with a clean separation/boundary/interface of the semantic space of TEI and the possibility to map the TEI elements (via their data categories) to multiple and/or different data categories according to the specific needs of a project or research question. 
92 93 \begin{figure*}[!ht] 94 \begin{center} 95 \includegraphics[width=0.75\textwidth]{images/teiHeader_DBNL.png} 96 \end{center} 97 \caption{The reuse of components between the original teiHeader-profile (2010) and the profiles used in the Nederlab project} 98 \label{fig:teiHeader_DBNL} 99 \end{figure*} 100 101 \begin{table} 102 \caption{Overview of TEI-related CMD profiles} 103 \label{table:tei-profiles} 104 \begin{tabular}{ l | r | l | r | r } 105 \hline 106 project, author & created & profile name & comp/elem/datcats & instances \\ 107 \hline 108 Deutsches Text Archiv & 2012 & teiHeader & 56/82/10 & 857 \\ 109 ICLTT, Durco & 2010 & teiHeader & 16/35/13 & 467 \\ 110 Leipzig Corpora, Eckart & 2012 & TEIDocumentDescription & 16/35/13 & ? \\ 111 Nederlab, Zhang & 2013 & DBNL\_Tekst & 20/38/15 & ? \\ 112 & & DBNL\_Tekst\_Onzelfstandig & 20/47/21 & ? \\ 113 \hline 114 \end{tabular} 115 \end{table} 116 117 \todoin{DBNL\_Tekst\_Onzelfstandig - how many instances?} 118 119 DBNL\_Tekst clarin.eu:cr1:p\_1361876010678, 120 clarin.eu:cr1:p 1366279029218 (private) 121 122 \subsection{META-SHARE} 123 124 125 META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a minimal core subset of obligatory metadata elements and with many optional components. 126 %In cooperation between metadata teams from CLARIN and META-SHARE 127 The model has been expressed as 4 CMD profiles for distinct resource types sharing most of the components. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 419 components and 1587 elements (when expanded). Although most of the elements are optional 128 129 resourceInfo 419 1587 72 790 797 50.22 % 130 \todoin{how many distinct components/elements} 131 This? 
shows nicely the trade-off between the two approaches of CMD and META-SHARE: many custom schemas versus one very large one. 132 133 In a parallel effort, LINDAT, the Czech national infrastructure initiative with ties to both CLARIN and META-SHARE, created a CMD profile modelling the minimal obligatory set of META-SHARE, combined with dublincore. 134 So the information is partly duplicated, but with the advantage that minimal information is conveyed in the widely understood format, while retaining the expressivity of the feature-rich schema. 135 136 resourceInfo 65 92 21 82 10 10.87 % 137 138 139 \begin{figure*}[!ht] 140 \begin{center} 141 \includegraphics[width=0.75\textwidth]{images/LINDAT-profile-data.png} 142 \end{center} 143 \caption{Profile by LINDAT combining the META-SHARE \xne{resourceInfo} component with dublincore elements} 144 \label{fig:META-SHARE-LINDAT} 145 \end{figure*} 146 147 \begin{figure*}[!ht] 148 \begin{center} 149 \includegraphics[height=1\textheight]{images/resourceInfoBIG.png} 150 \end{center} 151 \caption{The META-SHARE based profile for describing corpora} 152 \label{fig:META-SHARE-BIG} 153 \end{figure*} 154 155 156 157 %\section{Usability} -
SMC4LRT/chapters/Infrastructure.tex
r2703 r3140 4 4 5 5 \section{CLARIN / CMDI} 6 6 \label{def:CLARIN} 7 7 CLARIN - Common Language Resource and Technology Infrastructure - constituted by over 180 members from round 38 countries. The mission of this project is to 8 8 … … 20 20 \item Data Category Registry 21 21 \item Relation Registry 22 \item Schema Registry 22 \item Schema Registry (SCHEMAcat\furl{http://lux13.mpi.nl/schemacat/site/index.html}) 23 23 \item Component Registry 24 24 \item Vocabulary Alignement Service (OpenSKOS) … … 47 47 The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions. 48 48 However there needs to be an additional means to capture information about relations between data categories. 49 This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. We expect that the RR should be under control of the metadata user whereas the DCR is under control of the metadata modeler.49 This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. We expect that the RR should be under control of the metadata user whereas the DCR is under control of the metadata modeler. 50 50 % These relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. 
That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed. 51 51 … … 68 68 from the traditional methods of schema matching that try to establish pairwise alignments between schemas only after they were created and published. % -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}. 69 69 70 Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this novelmechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules is described in the following chapter \ref{method}.70 Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules is described in the following chapter \ref{method}. 71 71 72 72 \subsection{Vocabulary Service / Reference Data Registry} … … 321 321 The first stable and publicly available application providing access to the collected metadata of CMDI has been the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, being developed within the CLARIN project. This application operates on the same collection of data as is discussed in this work, however it employs a faceted search, mapping manually the appropriate metadata fields from the different schemas to 10? fixed facets. Underlying search engine is the widely used full-text search engine Apache Solr\footnote{\url{http://lucene.apache.org/solr/}}. 
Although this is a very reductionist approach it is certainly a great starting point offering a core set of categories together with an initial set of category mappings. 322 322 323 More recently, the team at Meertens Institute developed a similar application the \texttt{MI Search Engine}\f ootnote{\url{http://www.meertens.knaw.nl/cmdi/search/}}. It too is based on the Apache Solr and provides a faceted search, but with a substantially more sophisticated both indexing process and search interface. \todoin { describe indexing and search}323 More recently, the team at Meertens Institute developed a similar application the \texttt{MI Search Engine}\furl{http://www.meertens.knaw.nl/cmdi/search/}. It too is based on the Apache Solr and provides a faceted search, but with a substantially more sophisticated both indexing process and search interface. \todoin { describe indexing and search} 324 324 \todocite {MI Search Engine} 325 325 … … 330 330 \section{Content Repositories} 331 331 Metadata is only one aspect of the availability of resources. It is the first step to announce and describe the resources. However it is of little value, if the resources themselves are not equally well accessible. Thus another pillar of the CLARIN infrastructure are Content Repositories - centres to ensure availability of resources. 332 333 RDF-stores in Content Repositories (Fedora, ..) 332 334 333 335 The requirements for these repositories: PIDs, CMD, OAI-PMH … … 344 346 \item[OAI-PMH] 345 347 \end{description} 348 349 \section{Summary} -
SMC4LRT/chapters/Introduction.tex
r2703 r3140 19 19 \subsection{Problem statement} 20 20 21 While in the Digital Libraries community a consolidation generally already happened and big federated networks of digital libary repository are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardizing efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming for one from the wide range of resource types additionally complicated by dependence of different schools of thought.21 While in the Digital Libraries community a consolidation generally already happened and big federated networks of digital libary repository are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardizing efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming for one from the wide range of resource types additionally complicated by influence from different schools of thought. (cf. \ref{ch:data}) 22 22 23 23 \todoin{Need some number about the disparity in the field, number of institutes, resources, formats.} 24 24 25 This situation has been identified by the community and multiple standardization initiatives had been conducted/undertaken. This process seems to havegained a new momentum thanks to large Research Infrastructure Programmes introduced by European Commission, aimed at fostering Research communities developing large-scale pan-european common infrastructures. One key player in this development is the project CLARIN.25 This situation has been identified by the community and multiple standardization initiatives had been conducted/undertaken. 
This process has gained a new momentum thanks to large Research Infrastructure Programmes introduced by European Commission, aimed at fostering Research communities developing large-scale pan-european common infrastructures. One key player in this development is the project CLARIN. 26 26 27 27 … … 54 54 55 55 Finally, in a prototypical implementation of the two components we want to deliver a proof of the concept, supported by an evaluation 56 in which we apply a set of test queries and compare a traditional search with a semantically expanded query in terms of recall/precision indicators. A separate evaluation of the usability of the Semantic Search component is indicated, however this issue can only be tackled marginally and will have to be outsourced into future work.56 in which we apply a set of test queries and compare a traditional search with a semantically expanded query in terms of recall/precision measures. A separate evaluation of the usability of the Semantic Search component is indicated, however this issue can only be tackled marginally and will have to be outsourced into future work. 57 57 58 58 \begin{itemize} … … 66 66 This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components and the results and findings of the \emph{evaluation}. 67 67 68 One promising by-product of the work will be the original dataset expressed as RDF with links into existing external resources (ontologies, knowledgebases, vocabularies), effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\f ootnote{\url{http://linkeddata.org/}} in the \emph{Web of Data}.68 One promising by-product of the work will be the original dataset expressed as RDF with links into existing external resources (ontologies, knowledgebases, vocabularies), effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\furl{http://linkeddata.org/} in the \emph{Web of Data}. 69 69 70 70 -
SMC4LRT/chapters/Literature.tex
r2703 r3140 60 60 This is very much in line with the broad \textit{Linked Open Data} effort as proposed by Berners-Lee \cite{TimBL2006} and being pursued across many disciplines. (This topic is also supported by the EU Commission within the FP7.\footnote{\url{http://cordis.europa.eu/fetch?CALLER=PROJ\_ICT&ACTION=D&CAT=PROJ&RCN=95562}}) A very recent comprehensive overview of the principles of Linked Data and current applications is the book by Heath and Bizer \cite{HeathBizer2011}, which shall serve as a practical guide for this specific task. 61 61 62 Formats: 63 Turtle \furl{http://www.w3.org/TeamSubmission/turtle/\#sec-grammar-comments} 64 RDFa\furl{http://en.wikipedia.org/wiki/RDFa} 65 EDM\furl{http://europeana.ontotext.com/resource/edm/hasType?role=all} 66 67 68 \todocite{http://ldl2012.lod2.eu/program/proceedings} 69 \todoin{check LDpath}\furl{http://code.google.com/p/ldpath/} 70 62 71 63 72 \subsection{Schema / Ontology Mapping} -
SMC4LRT/chapters/SMC.tex
r2703 r3140 180 180 \todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/} 181 181 182 \todocode{check/install: Linked Data browser: LoD p. 81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)} 183 182 184 defining the Mapping: 183 185 \begin{enumerate} … … 217 219 AF + DCR + RR 218 220 219 220 221 221 \section{Summary} 222 223 224 -
SMC4LRT/chapters/System.tex
r2703 r3140 62 62 \todocode{install Jena + fuseki}\furl{http://jena.apache.org}\furl{http://jena.apache.org/documentation/serving_data/index.html}\furl{http://csarven.ca/how-to-create-a-linked-data-site} 63 63 64 \todocode{check install siren}\furl{http://siren.sindice.com/} 65 \todocode{check install Virtuoso}\furl{http://ods.openlinksw.com/wiki/ODS/} 66 \todocode{check install Neo4J} 67 \todocode{check install ontology browser} 68 69 semantic search component in the Linked Media Framework 70 \todocode{!!! check install LMF - kiwi - SemanticSearch !!!}\furl{http://code.google.com/p/kiwi/wiki/SemanticSearch} 71 72 \todoin{check SARQ}\furl{http://github.com/castagna/SARQ} 73 64 74 \todocode{Load data: relcat, clavas, olac-and-dc-providers cmd, lt-world?} 65 75 … … 67 77 \section{User Interface?} 68 78 69 \subsection {Query Input}79 \subsection*{Query Input} 70 80 71 \subsection {Columns}81 \subsection*{Columns} 72 82 73 \subsection {Summaries}83 \subsection*{Summaries} 74 84 75 \subsection {Differential Views}85 \subsection*{Differential Views} 76 86 Visualize impact of given mapping in terms of covered dataset (number of matched records). 77 87 78 \subsection {Visualization}88 \subsection*{Visualization} 79 89 Landscape, Treemap, SOM 80 90