Changes between Initial Version and Version 1 of SoftwareDevelopment/Archive/XML database for CMDI


Timestamp: 06/04/12 12:05:57
Author: dietuyt

{{{
#!comment
replace the owner and date with the relevant ones
}}}
[=#topofpage]

''Responsible for this page: [wiki:larlam Lari] [[BR]]
Last content check: 2012-05-03''
{{{
#!html
<h3>Purpose</h3>
}}}

{{{
#!comment
Replace the current purpose with your own (not this line ;))
}}}

Status = draft.

{{{
#!html
<h3>Contents</h3>
}}}

[[PageOutline(1-2, , inline)]]

{{{
#!comment
Obviously, your page starts below this block
}}}

= Questions =

'''Goal:''' implement a sophisticated metadata search for [http://catalog.clarin.eu/oai-harvester/resultsets/ all CMDI files], preferably with a transparent back-end like an XML database

* Do we need indexes for all possible XPaths in each CMDI file, or can we optimize?
  * only index the large collections (like CMDI'fied IMDI)?
  * other strategies?
* How do eXist indexes scale with the number of CMDI files fed to it?
  * currently: about ~~220.000~~ [http://catalog.clarin.eu/ds/vlo/ 420.000] CMDI files
  * what about 500.000? 2 million?
* BaseX: can we import the [http://catalog.clarin.eu/oai-harvester/resultsets/ 420.000 CMDI] files in BaseX?
  * Which version to use? 7.0? 7.2?
  * Experiences from Lexus (Andre)
* If eXist and BaseX are not sufficient, which other options are there?

== Typical Queries ==

The namespace declaration is not repeated everywhere but should be included in all queries:
{{{
declare default element namespace "http://www.clarin.eu/cmd/";
}}}

 * '''Example 1:''' find any XML document that includes the word "reindeer"
  * Basic case (BaseX syntax):
    {{{
for $doc in /CMD[Components contains text 'reindeer']
return $doc/Header/MdSelfLink
    }}}
  * Should be equivalent but doesn't return any results (BaseX syntax):
    {{{
for $doc in /CMD[Components//text() contains text 'reindeer']
return $doc/Header/MdSelfLink
    }}}
  * Example without using the text index (should be eXist and BaseX compatible):
    {{{
for $doc in /CMD[contains(Components//text(), 'reindeer')]
return $doc/Header/MdSelfLink
    }}}
  * Using the text index (eXist syntax):
    {{{
for $doc in ft:search(collection('/db/cmdi/mpi')/CMD, "Components:reindeer")
return $doc/Header/MdSelfLink
    }}}
  * Alternative syntax, using the text index (eXist):
    {{{
for $doc in collection('/db/cmdi/mpi')/CMD[ft:query(.//Components, "reindeer")]
return $doc/Header/MdSelfLink
    }}}


 * '''Example 2:''' find any XML document that has the XPath value /CMD/Components/Session/MDGroup/Content/Content_Languages/Content_Language/Id = ISO639-3:rus
    {{{
for $doc in /CMD[Components/Session/MDGroup/Content/Content_Languages/Content_Language/Id = 'ISO639-3:rus']
return $doc/Header/MdSelfLink
    }}}

 * '''Example 3:''' find any XML document that has /CMD/Components/Session/MDGroup/Actors/Actor/Sex = female and /CMD/Components/Session/MDGroup/Actors/Actor/BirthDate < 1980
    {{{
for $doc in /CMD[Components/Session/MDGroup/Actors/Actor[Sex = 'Female' and (BirthDate castable as xs:date and year-from-date(BirthDate cast as xs:date) lt 1980)]]
return $doc/Header/MdSelfLink
    }}}


= Action list =

 * prepare test machine (done, linux box in Lari's office)
 * install and test:
   * eXist 2.0
   * BaseX 7.2
   * [http://blog.msbbc.co.uk/2010/04/marklogic-x64-install-on-ubuntu-104.html MarkLogic]

----

= Test Results and Notes =

'''Note''' about the numbers of ''reindeer'' matches: the word "reindeer" appears in 41 records when matched case-sensitively. There are 45 records with the ''word'' reindeer in case-insensitive mode. Finally, a case-insensitive grep for ''reindeer'' finds 51 matches (6 of them are not "words", generally appearing as part of a pathname). So all our search results seem to be right even though they differ; keep this in mind when comparing numbers!

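The three totals come from three different matching modes. As a sketch of why they differ, the commands below reproduce the 41/45/51 pattern in miniature on tiny sample files (hypothetical data, not the real CMDI set):

```shell
# Minimal demonstration of why the three counting modes give different totals.
dir=$(mktemp -d)
printf 'herding reindeer\n'       > "$dir/a.cmdi"  # lowercase word
printf 'Reindeer herding\n'       > "$dir/b.cmdi"  # capitalized word
printf '/data/reindeerdata.wav\n' > "$dir/c.cmdi"  # substring inside a pathname
grep -lw  reindeer "$dir"/*.cmdi | wc -l  # case-sensitive word match   -> 1
grep -lwi reindeer "$dir"/*.cmdi | wc -l  # case-insensitive word match -> 2
grep -li  reindeer "$dir"/*.cmdi | wc -l  # case-insensitive substring  -> 3
rm -r "$dir"
```

The same three modes applied to the full dataset yield the 41, 45, and 51 figures respectively.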
The time measurements should be taken as a guideline rather than as exact figures, as no attempt was made to compute means over multiple measurements etc.

== eXist ==

The test version is eXist 2.0 Tech Preview.

I made a tarball before importing any records, and then reverted to it between tests to ensure that the order of tests does not affect the results. At the default maximum heap size of {{{Xmx=512m}}}, eXist ran out of memory in the ingestion stage, so I doubled it to 1 gigabyte (the setting is in wrapper.conf, which is a bit hard to find). The full eXist installation without any data in it takes 184 MB of disk space.

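For reference, the heap limit lives in the Java Service Wrapper configuration; the exact file path varies per eXist installation, but the relevant line looks roughly like this (value in megabytes):

{{{
# tools/wrapper/conf/wrapper.conf (location may differ per installation)
wrapper.java.maxmemory=1024
}}}
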
The dataset was divided into 3 collections, the same way it is in the data directory. These were imported one at a time, but queries were targeted at the entire database. Three configurations were tested:

=== No Text Index ===

No {{{collection.xconf}}} is created, so default settings are used. There is no Lucene index at all.

Importing the collections took 2113+2793+346 seconds, for a total of 1 hour 27 minutes. The eXist directory grew to 5.8 gigabytes. The data itself takes 4.4 gigabytes, so the database is around 1.3 times the size of the data (about 1.4 gigabytes of index overhead).

For example 1, only the variant without the text index can be used (of course). After several minutes of processing at maximum processor load, it runs out of memory and dies. (From previous research it is known that search time grows faster than linearly for this search, so it could be used on subcollections provided they are small enough.)

Example 2 executes in a few seconds and returns 275 records.

Example 3 executes within ten seconds and returns 5583 matches.

=== Full Index ===

Every subtree of the full XML document is indexed as a separate text record. The {{{collection.xconf}}} contains something along these lines:

{{{
    <lucene>
      <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
      <text match="//*"/>
    </lucene>
}}}

Importing took 3799 sec for OLAC/DC and 19413 sec for the MPI archive; after running for just over 8 hours, the import job for the CMDI providers stopped producing output. It was more than 95% done at that point, so I am taking this as a rough indication of performance anyway. 44 gigabytes of disk space were used (39 GB of it by Lucene indices).

Example queries 2 and 3 produced the same results as above and ran at roughly equal speed, except on the first execution, which seemed to take longer. Variance was too high for a definite conclusion (in the range 9-61 sec for example 2).

For example 1, the variant using {{{ft:search}}} ran out of memory after some time, but the variant with {{{ft:query}}} was very fast, with a mean response time of 2 seconds, and returned 45 matches (which is OK: the difference from grep's 41 matches is due to Lucene text search being case insensitive by default).

Removing the OLAC and MPI collections took 30 minutes; I gave up on removing the CMDI collection after 3.5 hours. Removing records caused disk usage to increase from 44 to '''207 gigabytes''', which seems a lot for a journal of removed records. Perhaps the prematurely terminated ingestion left something in an inconsistent state.

=== Document Root Index ===

The root node of each record is indexed, meaning that the entire text content of a record is one Lucene record. Configuration in {{{collection.xconf}}}:

{{{
    <lucene>
      <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
      <text match="/"/>
    </lucene>
}}}

The above causes eXist to throw a !StringIndexOutOfBoundsException.

We could change the syntax, but it is better to use some domain-specific knowledge, namely that the main content is under Components:

{{{
    <lucene>
      <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
      <analyzer id="ws" class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
      <text match="/cmdi:CMD/cmdi:Components"/>
    </lucene>
}}}

With this setting we see long ingestion times of 32727+17817+3889 sec, about 15 hours in total, but this time the import at least completed without errors. Disk usage, 6.2 gigabytes, is quite reasonable.

Examples 2 and 3 return the same number of items and take a broadly similar amount of time as above (again with high variance). For example 1, just like above, the variant using {{{ft:search}}} did not work, but the variant with {{{ft:query}}} was very quick, with a mean response time of ~1 second, making it the quickest of all the queries.

=== Summary: eXist ===

Among these, only the root-index alternative is an efficient configuration for eXist with text search, which highlights the configuration complexity compared to BaseX. Not using Lucene at all is an option only if no text searches are needed, as they are completely impractical without it. Using Lucene increases ingestion time by a factor of ten, but does yield very fast text search. One problem specific to eXist is that queries have to use the full-text search functions explicitly to be efficient, so it is not possible to use one generic query for all configurations.

== !MarkLogic ==

Installed !MarkLogic 5.0-3 x86_64.

=== Importing ===

Importing the collection (441553 documents) took around 1 hour and 50 minutes (6633.46 s). The /lat/MarkLogic/db/Forests/cmdi directory is 5,662,560 KB (about 5.4 GB) large. This is with the default settings for indexes etc.

=== Queries ===

==== Query 1 ====

XQuery Full Text is not supported by !MarkLogic, but the following pure XQuery version works:

{{{
xquery version "1.0-ml";
declare default element namespace "http://www.clarin.eu/cmd/";
for $doc in /CMD[exists(//text()[contains(lower-case(.), 'reindeer')])]
return $doc/Header/MdSelfLink
}}}

It returns 51 results; however, it takes several minutes to finish.

Full text search in !MarkLogic [http://community.marklogic.com/pubs/5.0/books/search-dev-guide.pdf] is not intermingled with XQuery (although there is an XQuery module to trigger the search). However, one can create paths (see 3.1.3). Whether one can then also match on arbitrary descendants is not yet clear to me. For now I tried just the basic full text search and used the supplied match path to trim the results down to matches in /CMD/Components only.

{{{
xquery version "1.0-ml";
declare default element namespace "http://www.clarin.eu/cmd/";
import module namespace search = "http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";

let $res := search:search("reindeer", search:get-default-options(), 0, 100)//search:result[exists(.//search:match[contains(@path,'/*:CMD/*:Components')])]
return <res count="{count($res)}">{for $doc in $res return doc($doc/@uri)/CMD/Header/MdSelfLink}</res>
}}}

This returns 51 matches, in 0.1094399 seconds on average.

{{{
run  1 : 0.107184
run  2 : 0.109612
run  3 : 0.110016
run  4 : 0.109592
run  5 : 0.10963
run  6 : 0.109767
run  7 : 0.109981
run  8 : 0.109469
run  9 : 0.109755
run 10 : 0.109393
average: 0.1094399
}}}

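The averages in these run tables are plain arithmetic means over the ten runs; a one-liner along these lines reproduces them (the file name is hypothetical, one time per line):

```shell
# Mean over one timing column; runs.txt holds the ten per-run times.
printf '%s\n' 0.107184 0.109612 0.110016 0.109592 0.10963 \
              0.109767 0.109981 0.109469 0.109755 0.109393 > runs.txt
awk '{ sum += $1 } END { printf "%.7f\n", sum / NR }' runs.txt  # prints 0.1094399
```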
==== Query 2 ====

{{{
xquery version "1.0-ml";
declare default element namespace "http://www.clarin.eu/cmd/";
let $res := for $doc in /CMD[Components/Session/MDGroup/Content/Content_Languages/Content_Language/Id = 'ISO639-3:rus'] return $doc/Header/MdSelfLink
return <res count="{count($res)}">{$res}</res>
}}}

Works and returns 275 results, in 0.0486831 seconds on average.

{{{
run  1 : 0.049722
run  2 : 0.048433
run  3 : 0.048451
run  4 : 0.048408
run  5 : 0.048517
run  6 : 0.048512
run  7 : 0.048667
run  8 : 0.048971
run  9 : 0.04869
run 10 : 0.04846
average: 0.0486831
}}}

==== Query 3 ====

{{{
xquery version "1.0-ml";
declare default element namespace "http://www.clarin.eu/cmd/";

let $res := for $doc in /CMD[Components/Session/MDGroup/Actors/Actor[Sex = 'Female' and (BirthDate castable as xs:date and year-from-date(BirthDate cast as xs:date) lt 1980)]] return $doc/Header/MdSelfLink
return <res count="{count($res)}">{$res}</res>
}}}

Works and returns 5583 matches, in 14.3633323 seconds on average.

{{{
run  1 : 14.283688
run  2 : 14.259531
run  3 : 14.375429
run  4 : 14.334855
run  5 : 14.368139
run  6 : 14.427776
run  7 : 14.379539
run  8 : 14.410527
run  9 : 14.364832
run 10 : 14.429007
average: 14.3633323
}}}

== BaseX ==

This section is a preliminary analysis of BaseX and shows the results that are currently known. Tests were performed on BaseX 7.2.1 only, with both other database processes (eXist and !MarkLogic) shut down so that BaseX would have the most resources possible.

=== Creating the database ===

Instead of creating a database and importing the documents into it afterwards, I specified the path where the documents are located, so that BaseX ingests and indexes the documents at creation time. The results follow:
{{{
Creating Database              3888975,43 ms
Indexing Text                    98506,15 ms
Indexing Attribute Values         7204,23 ms
Indexing Full-Text              161025,59 ms

Total                          ~ 1h 9min 27s
}}}

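For reproducibility, creation from a directory can be done with a BaseX command script along these lines (the database name CMD matches the queries below; the source path is hypothetical). The SET commands switch on the text, attribute, and full-text indexes before CREATE DB parses the directory:

{{{
SET TEXTINDEX true
SET ATTRINDEX true
SET FTINDEX true
CREATE DB CMD /path/to/cmdi-files
}}}
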
=== Initial disk usage ===
{{{
DB size (including indexes, with whitespace chopping)    4,9 GB

Text index         151 MB
Attribute index     21 MB
Full-text index    505 MB
}}}

=== Query 1 ===

The query needed to be rewritten in order to take advantage of the existing indexes. The executed query was:

{{{
declare default element namespace "http://www.clarin.eu/cmd/";
let $res := for $doc in collection('CMD')/CMD[Components//text() contains text 'reindeer' using case sensitive]
return $doc/Header/MdSelfLink
return <res count="{count($res)}">{$res} </res>
}}}

This query triggers the full-text index and also returns the number of matches as an attribute of the result's root node. Oddly enough, in BaseX the case-sensitive query yields just 39 matches!
Results for the first 10 runs are presented below:
{{{
 196,55 ms
  80,65 ms
  74,49 ms
  99,47 ms
  66,75 ms
  66,71 ms
  95,41 ms
  72,93 ms
 349,41 ms
  64,46 ms

Average 116,683 ms
}}}

=== Query 2 ===

Again the query was rewritten, to:

{{{
declare default element namespace "http://www.clarin.eu/cmd/";
let $res := for $doc in collection('CMD')/CMD[Components/Session/MDGroup/Content/Content_Languages/Content_Language/Id/text() = 'ISO639-3:rus']
return $doc/Header/MdSelfLink
return <res count="{count($res)}">{$res} </res>
}}}

This query triggers the text index and returns 275 matches.
Results for the first 10 runs are presented below:

{{{
 182,46 ms
 107,7  ms
  85,55 ms
  89,71 ms
 106,49 ms
  78,02 ms
 107,3  ms
  77,94 ms
  76,08 ms
  91,44 ms

Average 100,269 ms
}}}

=== Query 3 ===

The XQuery had to be rewritten to:

{{{
declare default element namespace "http://www.clarin.eu/cmd/";
let $res := for $doc in collection('CMD')/CMD/Components/Session/MDGroup/Actors/Actor[Sex/text() = 'Female' and (BirthDate castable as xs:date and year-from-date(BirthDate cast as xs:date) lt 1980)]/../../../../..
return $doc/Header/MdSelfLink
return <res count="{count($res)}">{$res} </res>
}}}

This query triggers the text index for the comparison "Sex = 'Female'" and uses regular tree walking for the "!BirthDate < 1980" condition. (BaseX 7.2.1 supports indexed value-range search based on string comparison, but the use of the "year-from-date()" function makes it impossible for BaseX to use the index, and I was not able to rewrite the query in such a way that I would get the same result set. If I remove the cast and the year extraction, it yields 8550 matches!)
It returns 5583 matches, and the results for the first 10 runs are:

{{{
2164,9  ms
1702,43 ms
1713,03 ms
1988,95 ms
1617,57 ms
1660,19 ms
1633,67 ms
1882,87 ms
1618,74 ms
1611,05 ms

Average 1759,34 ms
}}}

=== Measurements without disk cache ===

The configuration used for the measurements presented below was precisely the same as above. The only difference is that here the disk cache was cleared before every query, by issuing the following command in the terminal:

{{{sync ; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'}}}

The results follow.
==== Query 1 ====

{{{
1283,96 ms
1192,38 ms
1483,72 ms
1516,94 ms
1233,98 ms
1450,41 ms
1233,96 ms
1234,01 ms
1233,99 ms
1233,95 ms

Average 1309,73 ms
}}}

==== Query 2 ====

{{{
7977,63 ms
7852,63 ms
8003,11 ms
7861,77 ms
8061,57 ms
8028,06 ms
7994,73 ms
7961,47 ms
7936,43 ms
8219,51 ms

Average 7989,691 ms
}}}

==== Query 3 ====

{{{
70524,28 ms
70732,7  ms
70484,29 ms
71095,82 ms
71520,79 ms
70461,01 ms
70048,49 ms
70096,56 ms
71845,04 ms
71253,1  ms

Average 70806,208 ms
}}}

=== Other considerations ===

Insertion and deletion times were not measured yet, but it is known (from LEXUS experience) that the current version of BaseX is fairly unstable when incremental indexes are used (the UPDINDEX flag). Previous versions do not support incremental indexes at all, so BaseX with incremental indexes should not be considered for the time being. The alternative is to run a database optimize on a regular basis, or every time the documents change; whether that is feasible depends on how often the database is expected to be written to.
     456