Changes between Version 1 and Version 2 of SoftwareDevelopment/Archive/XML database for CMDI


Timestamp:
06/12/12 13:16:21 (12 years ago)
Author:
dietuyt
Comment:

--

  • SoftwareDevelopment/Archive/XML database for CMDI

    v1 v2  
    66
    77''Responsible for this page: [wiki:larlam Lari] [[BR]]
    8 Last content check: 2012-05-03''
    9 {{{
    10 #!html
    11 <h3>Purpose</h3>
    12 }}}
    13 
    14 {{{
    15 #!comment
    16 Replace the current purpose with your own (not this line ;))
    17 }}}
    18 
    19 Status = draft.
     8Last content check: 2012-06-12''
     9
     10A copy of this page has been placed at http://trac.clarin.eu/wiki/XML%20database%20for%20CMDI
     11
    2012
    2113{{{
     
    2618[[PageOutline(1-2, , inline)]]
    2719
    28 {{{
    29 #!comment
    30 Obviously, your page starts below this block
    31 }}}
     20
     21
     22= Table of Results =
     23
     24This is just a summary of the results. See below for details.
     25
     26A single number indicates a single measurement; otherwise the number shown is the average and ''n'' is the number of measurements. A 95% confidence interval for the mean is shown in those cases where there are enough measurements to support the calculation. A number in '''bold''' indicates the best result in that row. For query 1 there is no statistically significant "best" result.
     27
     28|| ||= eXist (root index) =||= BaseX =||= !MarkLogic =||= Notes =||
     29||= Importing Time (s) =|| 52909 (''n=2'') || '''4167''' || 6633 ||||
     30||= Disk usage (GiB) =|| 6.0 || '''4.9''' || 5.4 || Includes database and indices ||
     31||= Query 1 (ms) =|| 226 ± 104 (''n=10'') || 116.7 ± 177.7 (''n=10'') || 109.4 ± 1.6 (''n=10'') ||BaseX variance is so high that with ''n=10'' the result is statistically useless||
     32||= Query 2 (ms) =|| 3876 ± 818 (''n=10'') || 100.3 ± 61.6 (''n=10'') || '''48.7 ± 0.8''' (''n=10'') ||||
     33||= Query 3 (ms) =|| 11347 ± 669 (''n=10'') || '''1759 ± 373''' (''n=10'') || 14363 ± 111 (''n=10'') ||||
     34
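The "mean ± interval" figures in the table above can be reproduced along the following lines. This is only a minimal sketch: the page does not say how its intervals were computed, so the use of Student's t distribution here is an assumption, and the function name `mean_with_ci95` and the sample timings in the usage note are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, indexed by degrees of freedom.
# Only small sample sizes are covered (n = 10 corresponds to 9 degrees of freedom).
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def mean_with_ci95(samples):
    """Return (mean, half_width) of a 95% confidence interval for the mean.

    Assumes independent, roughly normally distributed measurements.
    """
    n = len(samples)
    if n < 2 or (n - 1) not in T_95:
        raise ValueError("need between 2 and 10 measurements for this sketch")
    mean = statistics.mean(samples)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(n)
    return mean, T_95[n - 1] * sem
```

For example, `mean_with_ci95([110, 108, 111, 109, 110, 107, 112, 109, 110, 108])` (made-up timings in ms) returns the mean and the half-width that would be written as "mean ± half-width (''n=10'')". A half-width larger than the mean, as in the BaseX figure for query 1, signals that more measurements are needed before the result can be compared.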
     35'''Many caveats apply''', including but not limited to these:
     36 * The numbers for eXist are collected from the XQuery IDE (eXide), so they likely overstate the time taken compared to the same query via direct API access.
     37 * According to Leif-Jöran Olsson of the eXist project, having the journal and database files on different filesystems (drives) could make a substantial performance difference for eXist. This has not been tested.
     38 * Mr. Olsson contends that the importing time we measured is an order of magnitude higher than normal. However, the time was stable in a repeated measurement. There is no obvious explanation for the discrepancy.
     39 * Limitations apply to using text indices in XQuery statements in !MarkLogic; see more below.
     40
     41----
    3242
    3343= Questions =
     
    3949  * other strategies?
    4050* How do eXist indexes scale with the amount of CMDI files fed to it?
    41   * currently: about ~~220.000~~ [http://catalog.clarin.eu/ds/vlo/ 420.000] CMDI files
     51  * currently: about [http://catalog.clarin.eu/ds/vlo/ 420.000] CMDI files
    4252  * what about 500.000? 2 million?
    4353* BaseX: can we import the [http://catalog.clarin.eu/oai-harvester/resultsets/ 420.000 CMDI] files in BaseX?
     
    4656* If eXist and BaseX are not sufficient, which other options are there?
    4757
     58
    4859== Typical Queries ==
    4960
     
    108119
    109120'''Note''' about numbers of ''reindeer'' matches: The word "reindeer" appears in 41 records. There are 45 records with the ''word'' reindeer in case-insensitive mode. And finally, a case insensitive grep of ''reindeer'' finds 51 matches (6 of them are not "words" -- generally appearing as part of a pathname). So all our search results seem to be right even though they differ. Keep this in mind when comparing numbers!
    110 
    111 The time measurements should only be taken as a guideline rather than as exact figures, as there was no attempt to collect means of multiple measurements etc.
    112121
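The three different counts in the note on ''reindeer'' matches come from three different matching rules. The sketch below illustrates the distinction on a handful of made-up records; the record texts and the helper `count_matching` are hypothetical, and the regular expressions are an assumption about what each search mode amounts to, not the queries actually used.

```python
import re

# Hypothetical record texts, invented for illustration only.
RECORDS = [
    "A study of reindeer herding",                 # lowercase word
    "Reindeer migration recordings",               # capitalised word
    "Media file: corpus/reindeerherding_01.wav",   # substring inside a pathname
    "Unrelated record about salmon",
]


def count_matching(records, pattern, flags=0):
    """Count the records in which the pattern occurs at least once."""
    rx = re.compile(pattern, flags)
    return sum(1 for record in records if rx.search(record))


# Case-sensitive whole-word match.
word_cs = count_matching(RECORDS, r"\breindeer\b")
# Case-insensitive whole-word match.
word_ci = count_matching(RECORDS, r"\breindeer\b", re.IGNORECASE)
# Case-insensitive substring match, comparable to a case-insensitive grep.
substr_ci = count_matching(RECORDS, r"reindeer", re.IGNORECASE)
```

Each rule is strictly more permissive than the previous one, so the counts can only grow from `word_cs` to `substr_ci`, which matches the pattern of 41, 45 and 51 reported in the note.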
    113122== eXist ==
     
    454463
    455464Insertion and deletion times have not been measured yet, but it is known (from LEXUS experience) that the current version of BaseX is fairly unstable when incremental indexes are used (UPDINDEX flag). Previous versions do not support incremental indexes at all, so BaseX with incremental indexes should not be considered for the time being. The alternative is to run a db-optimize command on a regular basis, or every time the documents change. Whether that is feasible depends on how often the database is expected to be written.
    456