wiki:SoftwareDevelopment/Archive/XML database for CMDI

Responsible for this page: Lari?
Last content check: 2012-06-12

A copy of this page has been placed at http://trac.clarin.eu/wiki/XML%20database%20for%20CMDI

Contents

  1. Table of Results
  2. Questions
    1. Typical Queries
  3. Action list
  4. Test Results and Notes
    1. eXist
    2. MarkLogic
    3. BaseX

Table of Results

This is just a summary of the results. See below for details.

A single number indicates a single measurement; otherwise the number shown is the average and n is the number of measurements. A 95% confidence interval for the mean is shown where there are enough measurements to support the calculation. An asterisk marks the best result in a row; for query 1 there is no statistically significant "best" result.

                    | eXist (root index) | BaseX                | MarkLogic           | Notes
Importing time (s)  | 52909 (n=2)        | 4167 *               | 6633                |
Disk usage (GiB)    | 6.0                | 4.9 *                | 5.4                 | Includes database and indices
Query 1 (ms)        | 226 ± 104 (n=10)   | 116.7 ± 177.7 (n=10) | 109.4 ± 1.6 (n=10)  | BaseX variance is so high that with n=10 the result is statistically useless
Query 2 (ms)        | 3876 ± 818 (n=10)  | 100.3 ± 61.6 (n=10)  | 48.7 ± 0.8 (n=10) * |
Query 3 (ms)        | 11347 ± 669 (n=10) | 1759 ± 373 (n=10) *  | 14363 ± 111 (n=10)  |

Many caveats apply, including but not limited to these:

  • The numbers for eXist are collected from the XQuery IDE (eXide), so they likely overstate the time taken compared to the same query via direct API access.
  • According to Leif-Jöran Olsson of the eXist project, having the journal and database files on different filesystems (drives) could make a substantial performance difference for eXist. This has not been tested.
  • Mr. Olsson contends that the importing time we measured is an order of magnitude higher than normal. However, the time was stable in a repeated measurement. There is no obvious explanation for the discrepancy.
  • Limitations apply to using text indices in XQuery statements in MarkLogic; see more below.

Questions

Goal: implement a sophisticated metadata search across all CMDI files, preferably with a transparent back-end such as an XML database

  • Do we need indexes for all possible XPaths in each CMDI file, or can we optimize?
    • only index the large collections (like CMDI'fied IMDI)?
    • other strategies?
  • How do eXist indexes scale with the number of CMDI files fed to it?
    • currently: about 420,000 CMDI files
    • what about 500,000? 2 million?
  • BaseX: can we import the 420,000 CMDI files into BaseX?
    • Which version to use? 7.0? 7.2?
    • Experiences from Lexus (Andre)
  • If eXist and BaseX are not sufficient, which other options are there?

Typical Queries

The namespace declaration is not repeated for every query below, but it should be included in all of them:

declare default element namespace "http://www.clarin.eu/cmd/";
  • Example 1: find any XML document that includes the word "reindeer"
    • Basic case (BaseX syntax):
      for $doc in /CMD[Components contains text 'reindeer']
      return $doc/Header/MdSelfLink
      
    • Should be equivalent but doesn't return any results (BaseX syntax):
      for $doc in /CMD[Components//text() contains text 'reindeer']
      return $doc/Header/MdSelfLink
      
    • Example without using text index (should be eXist and BaseX compatible):
      for $doc in /CMD[contains(Components//text(), 'reindeer')]
      return $doc/Header/MdSelfLink
      
    • Using text index (eXist syntax):
      for $doc in ft:search(collection('/db/cmdi/mpi')/CMD, "Components:reindeer")
      return $doc/Header/MdSelfLink
      
    • Alternative syntax, using text index (eXist):
      for $doc in collection('/db/cmdi/mpi')/CMD[ft:query(.//Components, "reindeer")]
      return $doc/Header/MdSelfLink
      
  • Example 2: find any XML document that has the XPath value /CMD/Components/Session/MDGroup/Content/Content_Languages/Content_Language/Id = ISO639-3:rus
    for $doc in /CMD[Components/Session/MDGroup/Content/Content_Languages/Content_Language/Id = 'ISO639-3:rus']
    return $doc/Header/MdSelfLink
    
  • Example 3: find any XML document that has the XPath value /CMD/Components/Session/MDGroup/Actors/Actor/Sex = female and /CMD/Components/Session/MDGroup/Actors/Actor/BirthDate < 1980
    for $doc in /CMD[Components/Session/MDGroup/Actors/Actor[Sex = 'Female' and (BirthDate castable as xs:date and year-from-date(BirthDate cast as xs:date) lt 1980)]]
    return $doc/Header/MdSelfLink
    

Action list

  • prepare test machine (done, linux box in Lari's office)
  • install and test: eXist, MarkLogic, BaseX (see Test Results and Notes below)

Test Results and Notes

Note about the numbers of reindeer matches: the word "reindeer" appears in 41 records when matched case-sensitively, and in 45 records when matched case-insensitively. Finally, a case-insensitive grep for reindeer finds 51 matches (6 of them are not "words", generally appearing as part of a pathname). So all our search results seem to be right even though they differ. Keep this in mind when comparing numbers!
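To make the distinction concrete, the two tokenized counts can be reproduced with full-text queries along these lines (a sketch in BaseX syntax; 'CMD' is the database name used in the BaseX section below, and the exact counts may differ slightly between engines and tokenizers, as the BaseX results below show):

    declare default element namespace "http://www.clarin.eu/cmd/";
    (: compare case-sensitive vs. case-insensitive full-text matches :)
    <counts
      sensitive="{count(collection('CMD')/CMD[Components contains text 'reindeer' using case sensitive])}"
      insensitive="{count(collection('CMD')/CMD[Components contains text 'reindeer' using case insensitive])}"/>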

eXist

The test version is eXist 2.0 Tech Preview.

I made a tarball before importing any records and reverted to it between tests, to ensure that the order of the tests does not affect the results. At the default heap limit of Xmx=512m, eXist ran out of memory during ingestion, so I doubled the maximum heap size to 1 gigabyte (the setting is in wrapper.conf -- a bit hard to find). The full eXist installation without any data in it takes 184 MB of disk space.
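For reference, the relevant line (assuming eXist 2.0's bundled Java Service Wrapper configuration; the value is in megabytes) looks like this:

    wrapper.java.maxmemory=1024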

The dataset was divided into 3 collections, the same way it is in the data directory. These were imported one at a time, but queries were targeted at the entire database. Three configurations were tested:

No Text Index

No collection.xconf is created, so default settings are used. There is no Lucene index at all.

Importing the collections took 2113+2793+346 seconds, a total of about 1 hour 27 minutes. The eXist directory grew to 5.8 gigabytes. The data itself takes 4.4 gigabytes, so the database is around 1.3 times the size of the raw data.

For example 1, only the variant without a text index can be used (of course). After several minutes of processing at maximum processor load, it runs out of memory and dies. (From previous research it is known that search time grows faster than linearly for this search, so it could still be used on subcollections, provided they are small enough.)

Example 2 is executed in a few seconds and returns 275 records.

Example 3 is executed within ten seconds and returns 5583 matches.

Full Index

Every subtree of the full XML document is indexed as a separate text record. The collection.xconf contains something along these lines:

    <lucene>
      <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
      <text match="//*"/>
    </lucene>

Importing took 3799 sec for OLAC/DC and 19413 sec for the MPI archive; after running for just over 8 hours, the import job for the CMDI providers stopped producing output. It was more than 95% done at that point, so I am taking the run as a rough indication of performance anyway. 44 gigabytes of disk space were used (39 GB of that by Lucene indices).

Example queries 2 and 3 produced the same results as above and ran at roughly the same speed, except on the first execution, which seemed to take longer. Variance was too high for a definite conclusion (in the range 9-61 sec for example 2).

For example 1, the variant using ft:search ran out of memory after some time, but the variant with ft:query was very fast, with a mean response time of 2 seconds, and returned 45 matches (which is OK -- the difference from the 41 case-sensitive matches is due to Lucene text search being case-insensitive by default).

Removing the OLAC and MPI collections took 30 minutes; I gave up on removing the CMDI collection after 3.5 hours. Removing records caused disk usage to increase from 44 to 207 gigabytes, which seems a lot for a journal of removed records. Perhaps the prematurely terminated ingestion left something in an inconsistent state.
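How the removal was issued is not recorded above; one way in eXist is the xmldb:remove function, e.g. (collection path is illustrative):

    xquery version "1.0";
    (: removes the named collection and everything below it :)
    xmldb:remove('/db/cmdi/olac')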

Document Root Index

The root node of each record is indexed, meaning that the entire text content of a record becomes a single Lucene record. Configuration in collection.xconf:

    <lucene>
      <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
      <text match="/"/>
    </lucene>

The above causes eXist to throw a StringIndexOutOfBoundsException.

We could change the match syntax, but a better option is to apply some domain-specific knowledge, namely that the main content is under Components:

    <lucene>
      <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
      <analyzer id="ws" class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
      <text match="/cmdi:CMD/cmdi:Components"/>
    </lucene>
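Note that the cmdi prefix must be bound to the CMD namespace for this match expression to work; in eXist this is normally declared on the index element of collection.xconf. A sketch of the complete file:

    <collection xmlns="http://exist-db.org/collection-config/1.0">
      <index xmlns:cmdi="http://www.clarin.eu/cmd/">
        <lucene>
          <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
          <analyzer id="ws" class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
          <text match="/cmdi:CMD/cmdi:Components"/>
        </lucene>
      </index>
    </collection>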

With this setting, ingestion times are long, 32727+17817+3889 sec or about 15 hours in total, but at least the import completed without errors. Disk usage, 6.2 gigabytes, is quite reasonable.

Examples 2 and 3 return the same number of items and take broadly the same amount of time as above (again with high variance). For example 1, just as above, the variant using ft:search did not work, but the variant with ft:query was very quick, with a mean response time of about 1 second, making it the quickest of all the queries.

Summary: eXist

Obviously, among these, only the root-index alternative is an efficient configuration for eXist with text search; this highlights eXist's configuration complexity compared to BaseX. Not using Lucene at all is an option only if no text searches are needed, as they are completely impractical without it. Using Lucene increases ingestion time by a factor of ten, but does yield very fast text search. One problem specific to eXist is that queries have to use the full-text search functions explicitly to be efficient, so it is not possible to use one generic query for all configurations.

MarkLogic

Installed MarkLogic 5.0-3 x86_64.

Importing

Importing the collection (441553 documents) took around 1 hour and 50 minutes (6633.46 s). The /lat/MarkLogic/db/Forests/cmdi directory is 5,662,560 KB (about 5.4 GiB). This is with the default settings for indexes etc.
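How the bulk load was scripted is not recorded here; for a single document, a MarkLogic load looks roughly like this (xdmp:document-load, with a hypothetical file path and document URI):

    xquery version "1.0-ml";
    (: load one file from the server's filesystem under the given URI; paths are illustrative :)
    xdmp:document-load("/data/cmdi/example.cmdi",
      <options xmlns="xdmp:document-load">
        <uri>/cmdi/example.cmdi</uri>
      </options>)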

Queries

Query 1

MarkLogic does not support the XQuery Full Text extension, but the following pure XQuery version works.

xquery version "1.0-ml";
declare default element namespace "http://www.clarin.eu/cmd/";
for $doc in /CMD[exists(//text()[contains(lower-case(.), 'reindeer')])]
return $doc/Header/MdSelfLink

Returns 51 results. However, it takes several minutes to finish.

Full-text search in MarkLogic (see http://community.marklogic.com/pubs/5.0/books/search-dev-guide.pdf) is not intermingled with XQuery, although there is an XQuery module to trigger searches. One can define paths (see section 3.1.3 of that guide), but whether matches on arbitrary descendants are then possible is not yet clear to me. For now I tried just the basic full-text search and used the supplied match path to trim the results down to matches inside /CMD/Components only.

xquery version "1.0-ml";
declare default element namespace "http://www.clarin.eu/cmd/";
import module namespace search = "http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";

let $res := search:search("reindeer", search:get-default-options(), 0, 100)//search:result[exists(.//search:match[contains(@path,'/*:CMD/*:Components')])]
return <res count="{count($res)}">{for $doc in $res return doc($doc/@uri)/CMD/Header/MdSelfLink}</res>

This returns 51 matches in 0.1094 seconds on average.

run  1 : 0.107184
run  2 : 0.109612
run  3 : 0.110016
run  4 : 0.109592
run  5 : 0.10963
run  6 : 0.109767
run  7 : 0.109981
run  8 : 0.109469
run  9 : 0.109755
run 10 : 0.109393
average: 0.1094399

Query 2

xquery version "1.0-ml";
declare default element namespace "http://www.clarin.eu/cmd/";
let $res := for $doc in /CMD[Components/Session/MDGroup/Content/Content_Languages/Content_Language/Id = 'ISO639-3:rus'] return $doc/Header/MdSelfLink
return <res count="{count($res)}">{$res}</res>

This works and returns 275 results in 0.0487 seconds on average.

run  1 : 0.049722
run  2 : 0.048433
run  3 : 0.048451
run  4 : 0.048408
run  5 : 0.048517
run  6 : 0.048512
run  7 : 0.048667
run  8 : 0.048971
run  9 : 0.04869
run 10 : 0.04846
average: 0.0486831

Query 3

xquery version "1.0-ml";
declare default element namespace "http://www.clarin.eu/cmd/";

let $res := for $doc in /CMD[Components/Session/MDGroup/Actors/Actor[Sex = 'Female' and (BirthDate castable as xs:date and year-from-date(BirthDate cast as xs:date) lt 1980)]] return $doc/Header/MdSelfLink
return <res count="{count($res)}">{$res}</res>

This works and returns 5583 matches in 14.36 seconds on average.

run  1 : 14.283688
run  2 : 14.259531
run  3 : 14.375429
run  4 : 14.334855
run  5 : 14.368139
run  6 : 14.427776
run  7 : 14.379539
run  8 : 14.410527
run  9 : 14.364832
run 10 : 14.429007
average: 14.3633323

BaseX

This section is a preliminary analysis of BaseX and shows the results known so far. Tests were performed on BaseX 7.2.1 only, after shutting down the other two database processes (eXist and MarkLogic) so that BaseX would have as many resources as possible.

Creating the database

Instead of creating a database and importing the documents into it later, I specified the path where the documents are located, so that BaseX ingests and indexes the documents at creation time (a sketch of the corresponding commands follows the timings). The results follow:

Creating database            3888975.43 ms
Indexing text                  98506.15 ms
Indexing attribute values       7204.23 ms
Indexing full-text            161025.59 ms

Total                        ~1 h 9 min 27 s
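The setup above corresponds to a command sequence roughly like this (the source path is hypothetical; the SET lines switch on whitespace chopping and the three indexes before the database named CMD is created):

    SET CHOP true
    SET TEXTINDEX true
    SET ATTRINDEX true
    SET FTINDEX true
    CREATE DB CMD /data/cmdi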

Initial disk usage

DB size (including indexes, with whitespace chopping)   4.9 GB

Text index        151 MB
Attribute index    21 MB
Full-text index   505 MB

Query 1

The query needed to be rewritten in order to take advantage of the existing indexes. The executed query was:

declare default element namespace "http://www.clarin.eu/cmd/";
let $res := for $doc in collection('CMD')/CMD[Components//text() contains text 'reindeer' using case sensitive]
return $doc/Header/MdSelfLink
return <res count="{count($res)}">{$res} </res>

This query triggers the full-text index and also returns the number of matches as an attribute on the root node of the result. Oddly enough, when the query is run case-sensitively in BaseX, the number of matches is just 39! Results for the first 10 runs are presented below:

    196.55 ms
     80.65 ms
     74.49 ms
     99.47 ms
     66.75 ms
     66.71 ms
     95.41 ms
     72.93 ms
    349.41 ms
     64.46 ms

Average: 116.68 ms

Query 2

Again, the query was rewritten:

declare default element namespace "http://www.clarin.eu/cmd/";
let $res := for $doc in collection('CMD')/CMD[Components/Session/MDGroup/Content/Content_Languages/Content_Language/Id/text() = 'ISO639-3:rus']
return $doc/Header/MdSelfLink
return <res count="{count($res)}">{$res} </res>

This query triggers the text index and returns 275 matches. Results for the first 10 runs are presented below:

    182.46 ms
    107.70 ms
     85.55 ms
     89.71 ms
    106.49 ms
     78.02 ms
    107.30 ms
     77.94 ms
     76.08 ms
     91.44 ms

Average: 100.27 ms

Query 3

The query had to be rewritten to:

declare default element namespace "http://www.clarin.eu/cmd/";
let $res := for $doc in collection('CMD')/CMD/Components/Session/MDGroup/Actors/Actor[Sex/text() = 'Female' and (BirthDate castable as xs:date and year-from-date(BirthDate cast as xs:date) lt 1980)]/../../../../..
return $doc/Header/MdSelfLink
return <res count="{count($res)}">{$res} </res>

This query triggers the text index for the comparison "Sex = 'Female'" and uses regular tree traversal for the "BirthDate < 1980" condition. (BaseX 7.2.1 supports indexed value-range searches based on string comparison, but the use of the year-from-date() function makes it impossible for BaseX to use the index, and I was not able to rewrite the query in a way that uses the index and returns the same result set. If the cast and the year extraction are removed, the query yields 8550 matches instead -- see the sketch after the timings.) It returns 5583 matches, and the results for the first 10 runs are:

    2164.90 ms
    1702.43 ms
    1713.03 ms
    1988.95 ms
    1617.57 ms
    1660.19 ms
    1633.67 ms
    1882.87 ms
    1618.74 ms
    1611.05 ms

Average: 1759.34 ms
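For reference, the index-friendly but non-equivalent variant mentioned above was presumably along these lines (a sketch: comparing BirthDate as a plain string also matches values that merely sort below '1980', which likely explains the 8550 matches reported above):

    declare default element namespace "http://www.clarin.eu/cmd/";
    let $res := for $doc in collection('CMD')/CMD[Components/Session/MDGroup/Actors/Actor[Sex/text() = 'Female' and BirthDate/text() < '1980']]
    return $doc/Header/MdSelfLink
    return <res count="{count($res)}">{$res}</res>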

Measurements without disk cache

The configuration used in the measurements presented below was exactly the same as above. The only difference is that here the disk cache was cleared before every query, by issuing the following command in a terminal:

sync ; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

The results follow.

Query 1

    1283.96 ms
    1192.38 ms
    1483.72 ms
    1516.94 ms
    1233.98 ms
    1450.41 ms
    1233.96 ms
    1234.01 ms
    1233.99 ms
    1233.95 ms

Average: 1309.73 ms

Query 2

    7977.63 ms
    7852.63 ms
    8003.11 ms
    7861.77 ms
    8061.57 ms
    8028.06 ms
    7994.73 ms
    7961.47 ms
    7936.43 ms
    8219.51 ms

Average: 7989.69 ms

Query 3

    70524.28 ms
    70732.70 ms
    70484.29 ms
    71095.82 ms
    71520.79 ms
    70461.01 ms
    70048.49 ms
    70096.56 ms
    71845.04 ms
    71253.10 ms

Average: 70806.21 ms

Other considerations

Insertion and deletion times have not been measured yet, but it is known from LEXUS experience that the current version of BaseX is fairly unstable when incremental indexes (the UPDINDEX flag) are used. Previous versions do not support incremental indexes at all, so using BaseX with incremental indexes should not be considered for the time being. The alternative is to run an optimize command on a regular basis, or every time the documents change; whether that is feasible depends on how often the database is expected to be written to.
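A minimal sketch of such a periodic re-optimization, assuming the database is named CMD as above (the OPTIMIZE command rebuilds outdated index structures):

    OPEN CMD
    OPTIMIZE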
