Changes between Initial Version and Version 1 of SoftwareDevelopment/Archive/XML database for CMDI


Timestamp: 06/04/12 12:05:57
Author: dietuyt

{{{
#!comment
replace the owner and date with the relevant ones
}}}
[=#topofpage]

''Responsible for this page: [wiki:larlam Lari] [[BR]]
Last content check: 2012-05-03''
{{{
#!html
<h3>Purpose</h3>
}}}

{{{
#!comment
Replace the current purpose with your own (not this line ;))
}}}

Status = draft.

{{{
#!html
<h3>Contents</h3>
}}}

[[PageOutline(1-2, , inline)]]

{{{
#!comment
Obviously, your page starts below this block
}}}

= Questions =

'''Goal:''' implement a sophisticated metadata search for [http://catalog.clarin.eu/oai-harvester/resultsets/ all CMDI files], preferably with a transparent back-end like an XML database

* Do we need indexes for all possible XPaths in each CMDI file, or can we optimize?
  * only index the large collections (like CMDI'fied IMDI)?
  * other strategies?
* How do eXist indexes scale with the number of CMDI files fed to it?
  * currently: about ~~220.000~~ [http://catalog.clarin.eu/ds/vlo/ 420.000] CMDI files
  * what about 500.000? 2 million?
* BaseX: can we import the [http://catalog.clarin.eu/oai-harvester/resultsets/ 420.000 CMDI] files in BaseX?
  * Which version to use? 7.0? 7.2?
  * Experiences from Lexus (Andre)
* If eXist and BaseX are not sufficient, which other options are there?

== Typical Queries ==

The namespace declaration is not repeated everywhere but should be included in all queries:
{{{
declare default element namespace "http://www.clarin.eu/cmd/";
}}}

 * '''Example 1:''' find any XML document that includes the word "reindeer"
  * Basic case (BaseX syntax):
    {{{
for $doc in /CMD[Components contains text 'reindeer']
return $doc/Header/MdSelfLink
    }}}
  * Should be equivalent but doesn't return any results (BaseX syntax):
    {{{
for $doc in /CMD[Components//text() contains text 'reindeer']
return $doc/Header/MdSelfLink
    }}}
  * Example without using the text index (should be eXist and BaseX compatible):
    {{{
for $doc in /CMD[contains(Components//text(), 'reindeer')]
return $doc/Header/MdSelfLink
    }}}
  * Using the text index (eXist syntax):
    {{{
for $doc in ft:search(collection('/db/cmdi/mpi')/CMD, "Components:reindeer")
return $doc/Header/MdSelfLink
    }}}
  * Alternative syntax, using the text index (eXist):
    {{{
for $doc in collection('/db/cmdi/mpi')/CMD[ft:query(.//Components, "reindeer")]
return $doc/Header/MdSelfLink
    }}}


 * '''Example 2:''' find any XML document that has the XPath value /CMD/Components/Session/MDGroup/Content/Content_Languages/Content_Language/Id = ISO639-3:rus
    {{{
for $doc in /CMD[Components/Session/MDGroup/Content/Content_Languages/Content_Language/Id = 'ISO639-3:rus']
return $doc/Header/MdSelfLink
    }}}

 * '''Example 3:''' find any XML document that has /CMD/Components/Session/MDGroup/Actors/Actor/Sex = female and /CMD/Components/Session/MDGroup/Actors/Actor/BirthDate < 1980
    {{{
for $doc in /CMD[Components/Session/MDGroup/Actors/Actor[Sex = 'Female' and (BirthDate castable as xs:date and year-from-date(BirthDate cast as xs:date) lt 1980)]]
return $doc/Header/MdSelfLink
    }}}


= Action list =

 * prepare test machine (done, linux box in Lari's office)
 * install and test:
   * eXist 2.0
   * BaseX 7.2
   * [http://blog.msbbc.co.uk/2010/04/marklogic-x64-install-on-ubuntu-104.html MarkLogic]

----

= Test Results and Notes =

'''Note''' about the numbers of ''reindeer'' matches: the word "reindeer" appears in 41 records when matched case-sensitively. There are 45 records with the ''word'' reindeer in case-insensitive mode. Finally, a case-insensitive grep for ''reindeer'' finds 51 matches (6 of them are not "words", generally appearing as part of a pathname). So all our search results seem to be right even though they differ; keep this in mind when comparing numbers!

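The three totals come from three different matching modes. As a sketch of why they differ, the commands below reproduce the 41/45/51 pattern in miniature on tiny sample files (hypothetical data, not the real CMDI set):

```shell
# Minimal demonstration of why the three counting modes give different totals.
dir=$(mktemp -d)
printf 'herding reindeer\n'       > "$dir/a.cmdi"  # lowercase word
printf 'Reindeer herding\n'       > "$dir/b.cmdi"  # capitalized word
printf '/data/reindeerdata.wav\n' > "$dir/c.cmdi"  # substring inside a pathname
grep -lw  reindeer "$dir"/*.cmdi | wc -l  # case-sensitive word match   -> 1
grep -lwi reindeer "$dir"/*.cmdi | wc -l  # case-insensitive word match -> 2
grep -li  reindeer "$dir"/*.cmdi | wc -l  # case-insensitive substring  -> 3
rm -r "$dir"
```

The same three modes applied to the full dataset yield the 41, 45, and 51 figures respectively.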
The time measurements should be taken as a guideline rather than as exact figures, as no attempt was made to compute means over multiple measurements etc.

== eXist ==

The test version is eXist 2.0 Tech Preview.

I made a tarball before importing any records, and then reverted to it between tests to ensure that the order of tests does not affect the results. At the default maximum heap size of {{{Xmx=512m}}}, eXist ran out of memory in the ingestion stage, so I doubled it to 1 gigabyte (the setting is in wrapper.conf, which is a bit hard to find). The full eXist installation without any data in it takes 184 MB of disk space.

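For reference, the heap limit lives in the Java Service Wrapper configuration; the exact file path varies per eXist installation, but the relevant line looks roughly like this (value in megabytes):

{{{
# tools/wrapper/conf/wrapper.conf (location may differ per installation)
wrapper.java.maxmemory=1024
}}}
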
The dataset was divided into 3 collections, the same way it is in the data directory. These were imported one at a time, but queries were targeted at the entire database. Three configurations were tested:

=== No Text Index ===

No {{{collection.xconf}}} is created, so default settings are used. There is no Lucene index at all.

Importing the collections took 2113+2793+346 seconds, for a total of 1 hour 27 minutes. The eXist directory grew to 5.8 gigabytes. The data itself takes 4.4 gigabytes, so the database is around 1.3 times the size of the data (about 1.4 gigabytes of index overhead).

For example 1, only the variant without the text index can be used (of course). After several minutes of processing at maximum processor load, it runs out of memory and dies. (From previous research it is known that search time grows faster than linearly for this search, so it could be used on subcollections provided they are small enough.)

Example 2 executes in a few seconds and returns 275 records.

Example 3 executes within ten seconds and returns 5583 matches.

=== Full Index ===

Every subtree of the full XML document is indexed as a separate text record. The {{{collection.xconf}}} contains something along these lines:

{{{
    <lucene>
      <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
      <text match="//*"/>
    </lucene>
}}}

Importing took 3799 sec for OLAC/DC and 19413 sec for the MPI archive; after running for just over 8 hours, the import job for the CMDI providers stopped producing output. It was more than 95% done at that point, so I am taking this as a rough indication of performance anyway. 44 gigabytes of disk space were used (39 GB of it by Lucene indices).

Example queries 2 and 3 produced the same results as above and ran at roughly equal speed, except on the first execution, which seemed to take longer. Variance was too high for a definite conclusion (in the range 9-61 sec for example 2).

For example 1, the variant using {{{ft:search}}} ran out of memory after some time, but the variant with {{{ft:query}}} was very fast, with a mean response time of 2 seconds, and returned 45 matches (which is OK: the difference from grep's 41 matches is due to Lucene text search being case insensitive by default).

Removing the OLAC and MPI collections took 30 minutes; I gave up on removing the CMDI collection after 3.5 hours. Removing records caused disk usage to increase from 44 to '''207 gigabytes''', which seems a lot for a journal of removed records. Perhaps the prematurely terminated ingestion left something in an inconsistent state.

=== Document Root Index ===

The root node of each record is indexed, meaning that the entire text content of a record is one Lucene record. Configuration in {{{collection.xconf}}}:

{{{
    <lucene>
      <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
      <text match="/"/>
    </lucene>
}}}

The above causes eXist to throw a !StringIndexOutOfBoundsException.

We could change the syntax, but it is better to use some domain-specific knowledge, namely that the main content is under Components:

{{{
    <lucene>
      <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
      <analyzer id="ws" class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
      <text match="/cmdi:CMD/cmdi:Components"/>
    </lucene>
}}}

With this setting we see long ingestion times of 32727+17817+3889 sec, about 15 hours in total, but this time the import at least completed without errors. Disk usage, 6.2 gigabytes, is quite reasonable.

Examples 2 and 3 return the same number of items and take a broadly similar amount of time as above (again with high variance). For example 1, just like above, the variant using {{{ft:search}}} did not work, but the variant with {{{ft:query}}} was very quick, with a mean response time of ~1 second, making it the quickest of all the queries.

=== Summary: eXist ===

Among these, only the root-index alternative is an efficient configuration for eXist with text search, which highlights the configuration complexity compared to BaseX. Not using Lucene at all is an option only if no text searches are needed, as they are completely impractical without it. Using Lucene increases ingestion time by a factor of ten, but does yield very fast text search. One problem specific to eXist is that queries have to use the full-text search functions explicitly to be efficient, so it is not possible to use one generic query for all configurations.

== !MarkLogic ==

Installed !MarkLogic 5.0-3 x86_64.

=== Importing ===

Importing the collection (441553 documents) took around 1 hour and 50 minutes (6633.46 s). The /lat/MarkLogic/db/Forests/cmdi directory is 5,662,560 KB (about 5.4 GB) large. This is with the default settings for indexes etc.

=== Queries ===

==== Query 1 ====

XQuery Full Text is not supported by !MarkLogic, but the following pure XQuery version works:

{{{
xquery version "1.0-ml";
declare default element namespace "http://www.clarin.eu/cmd/";
for $doc in /CMD[exists(//text()[contains(lower-case(.), 'reindeer')])]
return $doc/Header/MdSelfLink
}}}

It returns 51 results; however, it takes several minutes to finish.

Full text search in !MarkLogic [http://community.marklogic.com/pubs/5.0/books/search-dev-guide.pdf] is not intermingled with XQuery (although there is an XQuery module to trigger the search). However, one can create paths (see 3.1.3). Whether one can then also match on arbitrary descendants is not yet clear to me. For now I tried just the basic full text search and used the supplied match path to trim the results down to matches in /CMD/Components only.

{{{
xquery version "1.0-ml";
declare default element namespace "http://www.clarin.eu/cmd/";
import module namespace search = "http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";

let $res := search:search("reindeer", search:get-default-options(), 0, 100)//search:result[exists(.//search:match[contains(@path,'/*:CMD/*:Components')])]
return <res count="{count($res)}">{for $doc in $res return doc($doc/@uri)/CMD/Header/MdSelfLink}</res>
}}}

This returns 51 matches, in 0.1094399 seconds on average.

{{{
run  1 : 0.107184
run  2 : 0.109612
run  3 : 0.110016
run  4 : 0.109592
run  5 : 0.10963
run  6 : 0.109767
run  7 : 0.109981
run  8 : 0.109469
run  9 : 0.109755
run 10 : 0.109393
average: 0.1094399
}}}

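The averages in these run tables are plain arithmetic means over the ten runs; a one-liner along these lines reproduces them (the file name is hypothetical, one time per line):

```shell
# Mean over one timing column; runs.txt holds the ten per-run times.
printf '%s\n' 0.107184 0.109612 0.110016 0.109592 0.10963 \
              0.109767 0.109981 0.109469 0.109755 0.109393 > runs.txt
awk '{ sum += $1 } END { printf "%.7f\n", sum / NR }' runs.txt  # prints 0.1094399
```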
==== Query 2 ====

{{{
xquery version "1.0-ml";
declare default element namespace "http://www.clarin.eu/cmd/";
let $res := for $doc in /CMD[Components/Session/MDGroup/Content/Content_Languages/Content_Language/Id = 'ISO639-3:rus'] return $doc/Header/MdSelfLink
return <res count="{count($res)}">{$res}</res>
}}}

Works and returns 275 results, in 0.0486831 seconds on average.

{{{
run  1 : 0.049722
run  2 : 0.048433
run  3 : 0.048451
run  4 : 0.048408
run  5 : 0.048517
run  6 : 0.048512
run  7 : 0.048667
run  8 : 0.048971
run  9 : 0.04869
run 10 : 0.04846
average: 0.0486831
}}}

==== Query 3 ====

{{{
xquery version "1.0-ml";
declare default element namespace "http://www.clarin.eu/cmd/";

let $res := for $doc in /CMD[Components/Session/MDGroup/Actors/Actor[Sex = 'Female' and (BirthDate castable as xs:date and year-from-date(BirthDate cast as xs:date) lt 1980)]] return $doc/Header/MdSelfLink
return <res count="{count($res)}">{$res}</res>
}}}

Works and returns 5583 matches, in 14.3633323 seconds on average.

{{{
run  1 : 14.283688
run  2 : 14.259531
run  3 : 14.375429
run  4 : 14.334855
run  5 : 14.368139
run  6 : 14.427776
run  7 : 14.379539
run  8 : 14.410527
run  9 : 14.364832
run 10 : 14.429007
average: 14.3633323
}}}

== BaseX ==

This section is a preliminary analysis of BaseX and shows the results that are currently known. Tests were performed on BaseX 7.2.1 only, with both other database processes (eXist and !MarkLogic) shut down so that BaseX would have the most resources possible.

=== Creating the database ===

Instead of creating a database and importing the documents into it afterwards, I specified the path where the documents are located, so that BaseX ingests and indexes the documents at creation time. The results follow:
{{{
Creating Database              3888975,43 ms
Indexing Text                    98506,15 ms
Indexing Attribute Values         7204,23 ms
Indexing Full-Text              161025,59 ms

Total                          ~ 1h 9min 27s
}}}

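For reproducibility, creation from a directory can be done with a BaseX command script along these lines (the database name CMD matches the queries below; the source path is hypothetical). The SET commands switch on the text, attribute, and full-text indexes before CREATE DB parses the directory:

{{{
SET TEXTINDEX true
SET ATTRINDEX true
SET FTINDEX true
CREATE DB CMD /path/to/cmdi-files
}}}
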
=== Initial disk usage ===
{{{
DB size (including indexes, with whitespace chopping)    4,9 GB

Text index         151 MB
Attribute index     21 MB
Full-text index    505 MB
}}}

=== Query 1 ===

The query needed to be rewritten in order to take advantage of the existing indexes. The executed query was:

{{{
declare default element namespace "http://www.clarin.eu/cmd/";
let $res := for $doc in collection('CMD')/CMD[Components//text() contains text 'reindeer' using case sensitive]
return $doc/Header/MdSelfLink
return <res count="{count($res)}">{$res} </res>
}}}

This query triggers the full-text index and also returns the number of matches as an attribute of the result's root node. Oddly enough, in BaseX the case-sensitive query yields just 39 matches!
Results for the first 10 runs are presented below:
{{{
 196,55 ms
  80,65 ms
  74,49 ms
  99,47 ms
  66,75 ms
  66,71 ms
  95,41 ms
  72,93 ms
 349,41 ms
  64,46 ms

Average 116,683 ms
}}}

=== Query 2 ===

Again the query was rewritten, to:

{{{
declare default element namespace "http://www.clarin.eu/cmd/";
let $res := for $doc in collection('CMD')/CMD[Components/Session/MDGroup/Content/Content_Languages/Content_Language/Id/text() = 'ISO639-3:rus']
return $doc/Header/MdSelfLink
return <res count="{count($res)}">{$res} </res>
}}}

This query triggers the text index and returns 275 matches.
Results for the first 10 runs are presented below:

{{{
 182,46 ms
 107,7  ms
  85,55 ms
  89,71 ms
 106,49 ms
  78,02 ms
 107,3  ms
  77,94 ms
  76,08 ms
  91,44 ms

Average 100,269 ms
}}}

=== Query 3 ===

The XQuery had to be rewritten to:

{{{
declare default element namespace "http://www.clarin.eu/cmd/";
let $res := for $doc in collection('CMD')/CMD/Components/Session/MDGroup/Actors/Actor[Sex/text() = 'Female' and (BirthDate castable as xs:date and year-from-date(BirthDate cast as xs:date) lt 1980)]/../../../../..
return $doc/Header/MdSelfLink
return <res count="{count($res)}">{$res} </res>
}}}

This query triggers the text index for the comparison "Sex = 'Female'" and uses regular tree walking for the "!BirthDate < 1980" condition. (BaseX 7.2.1 supports indexed value-range search based on string comparison, but the use of the "year-from-date()" function makes it impossible for BaseX to use the index, and I was not able to rewrite the query in such a way that I would get the same result set. If I remove the cast and the year extraction, it yields 8550 matches!)
It returns 5583 matches, and the results for the first 10 runs are:

{{{
2164,9  ms
1702,43 ms
1713,03 ms
1988,95 ms
1617,57 ms
1660,19 ms
1633,67 ms
1882,87 ms
1618,74 ms
1611,05 ms

Average 1759,34 ms
}}}

=== Measurements without disk cache ===

The configuration used for the measurements presented below was precisely the same as above. The only difference is that here the disk cache was cleared before every query, by issuing the following command in the terminal:

{{{sync ; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'}}}

The results follow.
==== Query 1 ====

{{{
1283,96 ms
1192,38 ms
1483,72 ms
1516,94 ms
1233,98 ms
1450,41 ms
1233,96 ms
1234,01 ms
1233,99 ms
1233,95 ms

Average 1309,73 ms
}}}

==== Query 2 ====

{{{
7977,63 ms
7852,63 ms
8003,11 ms
7861,77 ms
8061,57 ms
8028,06 ms
7994,73 ms
7961,47 ms
7936,43 ms
8219,51 ms

Average 7989,691 ms
}}}

==== Query 3 ====

{{{
70524,28 ms
70732,7  ms
70484,29 ms
71095,82 ms
71520,79 ms
70461,01 ms
70048,49 ms
70096,56 ms
71845,04 ms
71253,1  ms

Average 70806,208 ms
}}}

=== Other considerations ===

Insertion and deletion times were not measured yet, but it is known (from LEXUS experience) that the current version of BaseX is fairly unstable when incremental indexes are used (the UPDINDEX flag). Previous versions do not support incremental indexes at all, so BaseX with incremental indexes should not be considered for the time being. The alternative is to run a database optimize on a regular basis, or every time the documents change; whether that is feasible depends on how often the database is expected to be written to.
     456