Changeset 1045 for MDRepository


Timestamp:
01/09/11 20:28:15 (13 years ago)
Author:
vronk
Message:

mainly added new API-function: scanIndex
(group.xsl used for aggregating)

Location:
MDRepository/trunk/xquery
Files:
2 added
6 edited

  • MDRepository/trunk/xquery/README

 r958 r1045
   12    12          c) set admin pwd
   13    13
   14                ...?
         14          d) you may want to add memory to the JVM
         15             under bin/functions.d/eXist-settings.sh#set_java_options()
         16
         17          e) you may also want to grow the cache in conf.xml
         18                   <db-connection cacheSize="48M" collectionCache="24M" database="native"
         19                    where @cacheSize could be around 512M
         20        and @collectionCache should be around one third of the @cacheSize
         21
   15    22
   16    23  1. add scripts to: /db/clarin
    …     …
   18    25                  + cmd-model.xqm has all the logic
   19    26                  + cmd-model.xql is the script being called as the interface
   20                        + cmd-stats.xql is meant for testing purposes, but not integrated yet
   21                        + init-cache.xql is meant for refreshing the cache with some long-running (resource-intensive) queries, meant to run once upon dataset change
         27           (+) cmd-stats.xql is meant for testing purposes, but not integrated yet
         28           (+) init-cache.xql is meant for refreshing the cache with some long-running (resource-intensive) queries, meant to run once upon dataset change
   22    29
   23        2. add data to  /db/cmdi-mirror
   24                 (the file-system structure will be reflected in the "collection"-structure within exist,
   25                 however this is irrelevant for the MDRepository methods.
   26                 Those rely on the linking via handles in MdSelfLink/ResourceRef and IsPartOf elements of the MDRecords.
   27                 The handles in IsPartOf are redundant (necessary for faster collection-constraint search)
   28                 and can be derived from the ResourceRef/MdSelfLink link.
   29                 This can be done before storing the data in the repository,
   30                 or after the import directly in the repository (scripts for this will be available soon)
   31    30
   32        3. add a clarin-user in /db/system/users.xml
         31  2. add a clarin-user in /db/system/users.xml
   33    32     (needed for writing into the cache)
   34    33     + /db/clarin/writer.xml with given user, like this:
    …     …
   39    38
   40    39
   41        4. create a collection for caching,
   42                eg: /db/common/clarin/freqs
         40  3. create a collection for caching,
         41          eg: /db/cache
   43    42          this has to correspond to the entry in cmd-model.xqm:
   44                declare variable $cmd-model:commonFreqsPath as xs:string := "/db/common/clarin/freqs";
         43          declare variable $cmd-model:commonFreqsPath as xs:string := "/db/cache";
   45    44
   46    45          If you change something, you have to manually clear the cache-collection.
    …     …
   50    49            for getCollections: collection{maxdepth}-{hash({collection-handle})}
   51    50            for queryModel:   values{maxdepth}-{hash({simple xpath from q-param})}
         51
   52    52
   53        5. depending on your server-setup you should be able to get your first query under somewhere like:
         53  4. define indices
         54           copy cmdi-mirror.xconf into /db/system/config/db/cmdi-mirror
         55
         56
         57  5. add data to  /db/cmdi-mirror
         58           (the file-system structure will be reflected in the "collection"-structure within exist,
         59           however this is irrelevant for the MDRepository methods.
         60           Those rely on the linking via handles in MdSelfLink/ResourceRef and <IsPartOf> elements of the MDRecords.
         61           The handles in <IsPartOf> are redundant (necessary for faster collection-constraint search)
         62           and can be derived from the ResourceRef/MdSelfLink link.
         63           This can be done before storing the data in the repository,
         64           or after the import directly in the repository (XUpdate-scripts for this will be available soon)
         65
         66           The top level collection record is by convention called colleciton_root.cmdi
         67           and is marked with: <IsPartOf>root</IsPartOf>
         68           (So every dataset (olac, lrt, imdi) has one such MDRecord.)
         69
         70  6. depending on your server-setup (port) you should be able to get your first query under somewhere like:
   54    71
   55    72          http://localhost:8680/exist/rest/db/clarin/cmd-model.xql?q=Components
    …     …
   61    78          Avoid starting multiple times.
   62    79          You can see in the cache-collection, if the results are ready.
   63
         80
         81
   64    82
   65    83  == test suite ==
  • MDRepository/trunk/xquery/build-tests.properties

 r827 r1045
    2     2  test.file=${cmdi-tests.basedir}/cmd-test.xml
    3     3  tests.output = ${cmdi-tests.basedir}/output
          4  //test.group = cmdi-tests-searchRetrieve
          5  test.group = cmdi-tests-cc
  • MDRepository/trunk/xquery/build-tests.xml

 r827 r1045
   43    43          </typedef>
   44    44
   45                                        <echo message="queries-file:${test.file}"/>
         45                                  <echo message="queries-file#group:${test.file}#${test.group}"/>
   46    46          <delete dir="${tests.output}/temp" failonerror="false"/>
   47    47          <mkdir dir="${tests.output}"/>
    …     …
   50    50         <test:benchmark outputFile="${tests.output}/cmdi-result.xml"
   51    51                  source="${test.file}"
   52                        group="cmdi-tests-cc"/>
         52                  group="${test.group}"/>
   53    53      </target>
   54    54
  • MDRepository/trunk/xquery/cmd-model.xql

 r727 r1045
   29    29      return
   30    30        if ($operation eq $cmd-model:getCollections) then
   31                cmd-model:get-collections($query-collections, $format, $max-depth)
         31                  cmd-model:get-collections($query-collections, $format, $max-depth)
   32    32        else if ($operation eq $cmd-model:queryModel) then
   33                cmd-model:query-model($cmd-index-path, $query-collections, $format, $max-depth)
   34            else if ($operation eq $cmd-model:searchRetrieve) then
   35              let $cql-query := request:get-parameter("query", "MDGroup/Actors/Actor"),
   36                $start-item := request:get-parameter("startRecord", 1),
   37                $end-item := request:get-parameter("iend", 50)
         33                  cmd-model:query-model($cmd-index-path, $query-collections, $format, $max-depth)
         34        else if ($operation eq $cmd-model:scanIndex) then
         35                  let $filter := request:get-parameter("filter", ""),
         36                  $start-item := request:get-parameter("startItem", 1),
         37                  $max-items := request:get-parameter("maxItems", 50),
         38                  $sort := request:get-parameter("sort", 'text')
         39                  return cmd-model:scan-index($cmd-index-path, $query-collections, $format, $start-item, $max-items, $sort)
         40            else if ($operation eq $cmd-model:searchRetrieve) then
         41          let $cql-query := request:get-parameter("query", "MDGroup/Actors/Actor"),
         42                          $start-item := request:get-parameter("startItem", 1),
         43                          $max-items := request:get-parameter("maxItems", 50)
   38    44
   39              return cmd-model:search-retrieve($cql-query, $query-collections, $format, xs:integer($start-item), xs:integer($end-item))
         45        return cmd-model:search-retrieve($cql-query, $query-collections, $format, xs:integer($start-item), xs:integer($max-items))
   40    46      else
   41    47        <error>Unknown operation</error>
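
The scanIndex branch of cmd-model.xql reads startItem, maxItems and sort from the request, and cmd-model:scan-index splits its first argument on '=' into an index path and a filter. The following is a minimal Python sketch of a client that builds such a request. It rests on assumptions this diff does not confirm: that the index path travels in the q parameter (as in the README example URL), and that the operation is selected by a request parameter named "operation". The helper name scan_index_url is hypothetical.

```python
from urllib.parse import urlencode

# Base URL taken from the README example in this changeset; adjust host and port.
BASE_URL = "http://localhost:8680/exist/rest/db/clarin/cmd-model.xql"


def scan_index_url(index_path, filter_text="", start_item=1, max_items=50,
                   sort="text", operation_param="operation"):
    """Build a scanIndex request URL (hypothetical helper).

    ASSUMPTION: the operation parameter is called "operation"; its real
    name is not visible in this diff.
    ASSUMPTION: the index path is sent in the q parameter.
    """
    # Mirror the fallback in cmd-model:scan-index: unknown sort values become "text".
    if sort not in ("text", "size"):
        sort = "text"
    # scan-index tokenizes its first argument on '=' into path and filter.
    q = f"{index_path}={filter_text}" if filter_text else index_path
    params = {
        operation_param: "scanIndex",
        "q": q,
        "startItem": start_item,
        "maxItems": max_items,
        "sort": sort,
    }
    return f"{BASE_URL}?{urlencode(params)}"
```

In this revision the separate filter request parameter is read in the scanIndex branch but not passed on to cmd-model:scan-index, which is why the sketch places the filter inside q.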
  • MDRepository/trunk/xquery/cmd-model.xqm

 r959 r1045
   12    12  declare variable $cmd-model:cmdiMirrorPath as xs:string := "/db/cmdi-mirror";
   13    13  declare variable $cmd-model:cachePath as xs:string := "/db/cache";
   14
         14  declare variable $cmd-model:groupXsl := doc('/db/clarin/group.xsl');
   15    15  declare variable $cmd-model:getCollections as xs:string := "getCollections";
   16    16  declare variable $cmd-model:queryModel as xs:string := "queryModel";
         17  declare variable $cmd-model:scanIndex as xs:string := "scanIndex";
   17    18  declare variable $cmd-model:searchRetrieve as xs:string := "searchRetrieve";
   18    19
    …     …
   28    29  declare variable $cmd-model:responseFormatText as xs:string := "text";
   29    30
         31  declare variable $cmd-model:scanSortText as xs:string := "text";
         32  declare variable $cmd-model:scanSortSize as xs:string := "size";
         33
   30    34  declare variable $cmd-model:collectionDocName as xs:string := "collection.xml";
   31    35
    …     …
   34    38  declare variable $cmd-model:xmlExt as xs:string := ".xml";
   35    39
         40  declare variable $cmd-model:maxDepth as xs:integer := 8;
   36    41  declare variable $cmd-model:valuesLimit as xs:integer := 100;
   37    42
         43
         44  (:~
         45    API function getCollections.
         46  :)
         47  declare function cmd-model:get-collections($collections as xs:string+, $format as xs:string, $max-depth as xs:integer) as item() {
         48    let $name := cmd-model:gen-cache-id("collection", $collections, xs:string($max-depth)),
         49      $doc :=
         50      if (cmd-model:is-in-cache($name)) then
         51         cmd-model:get-from-cache($name)
         52      else
         53        let $data := cmd-model:colls($collections, $max-depth)
         54          return cmd-model:store-in-cache($name, $data)
         55    return
         56      cmd-model:serialise-as($doc, $format)
         57  };
   38    58
   39    59  (:~
    …     …
   54    74
   55    75  (:~
   56          API function getCollections.
   57        :)
   58        declare function cmd-model:get-collections($collections as xs:string+, $format as xs:string, $max-depth as xs:integer) as item() {
   59          let $name := cmd-model:gen-cache-id("collection", $collections, xs:string($max-depth)),
   60            $doc :=
   61            if (cmd-model:is-in-cache($name)) then
   62               cmd-model:get-from-cache($name)
   63            else
   64              let $data := cmd-model:colls($collections, $max-depth)
   65                return cmd-model:store-in-cache($name, $data)
         76    API function scanIndex.
         77  two phases:
         78          1.one create full index for given path/element (and cache)
         79          2. select wished subsequence (on second call, only the second step is performed)
         80  :)
         81  declare function cmd-model:scan-index($q as xs:string, $collection as xs:string+, $format as xs:string, $start-item as xs:integer, $max-items as xs:integer, $p-sort as xs:string?) as item()? {
         82
         83    let $qa := tokenize($q,'='),
         84           $cmd-index-path := $qa[1],
         85           $filter := ($qa[2],'')[1],
         86           $sort := if ($p-sort eq $cmd-model:scanSortText or $p-sort eq $cmd-model:scanSortSize) then $p-sort else $cmd-model:scanSortText,
         87            $name := cmd-model:gen-cache-id("index", ($collection, $cmd-index-path),"1"),
         88      (: skip cache $doc := cmd-model:values($cmd-index-path, $collection) :)
         89      $doc := if (cmd-model:is-in-cache($name)) then
         90        cmd-model:get-from-cache($name)
         91      else
         92        let  $data := cmd-model:values($cmd-index-path, $collection)
         93          return cmd-model:store-in-cache($name, $data)
         94
         95          (: extract the required subsequence (according to given sort) :)
         96          let $res-term := transform:transform($doc,$cmd-model:groupXsl,
         97                          <parameters><param name="mode" value="subsequence"/>
         98                                                  <param name="sort" value="{$sort}"/>
         99                                                  <param name="filter" value="{$filter}"/>
        100                                                  <param name="start-item" value="{$start-item}"/>
        101                                                  <param name="max-items" value="{$max-items}"/>
        102                          </parameters>),
        103                  $count-items := count($res-term/v),
        104                  $colls := if (fn:empty($collection)) then '' else fn:string-join($collection, ","),
        105                  $created := fn:current-dateTime(),
        106                  $scan-clause := concat($cmd-index-path, '=', $filter),
        107                  $res := <Terms colls="{$colls}" created="{$created}" count_items="{$count-items}"
        108                                          start-item="{$start-item}" max-items="{$max-items}" sort="{$sort}" scanClause="{$scan-clause}"  >{$res-term}</Terms>
        109
        110  (:      let     $result-count := $doc/Term/@count,
        111      $result-seq := fn:subsequence($doc/Term/v, $start-item, $end-item),
        112          $result-frag := ($doc/Term, $result-seq),
        113      $seq-count := fn:count($result-seq) :)
        114
   66   115    return
   67            cmd-model:serialise-as($doc, $format)
        116      cmd-model:serialise-as($res, $format)
   68   117  };
   69   118
    …     …
   82   131        for $coll in $collections return util:eval(fn:concat("$collection/ft:query(descendant::IsPartOf, <term>", xdb:decode($coll) ,"</term>)/ancestor-or-self::CMD", $sanitized-query))
   83   132
   84            let $result-count := fn:count($results),
        133          let     $result-count := fn:count($results),
   85   134      $result-seq := fn:subsequence($results, $start-item, $end-item),
   86   135      $seq-count := fn:count($result-seq),
   87            $end-time := util:system-dateTime(),
   88            $result-fragment :=
        136          $end-time := util:system-dateTime()
        137
        138          let $summary-fragment :=
        139                  if (contains($format,'withSummary')) then
        140                          let $used-profiles := for $profile in distinct-values($results//Components/concat(child::element()/name(),'##',../Header/MdProfile))
        141                                                          let $profile-id := substring-after($profile,'##'), $profile-name := substring-before($profile,'##')
        142                                                          return <profile id="{$profile-id}" name="{$profile-name}" count="{count($results//Components[concat(child::element()/name(),'##',../Header/MdProfile) eq $profile])}" />,
        143                                  $end-time2 := util:system-dateTime(),
        144                                  $result-summary := cmd-model:elem-r($result-seq//Components, "Components", $cmd-model:maxDepth, $cmd-model:maxDepth),
        145                          $end-time3 := util:system-dateTime(),
        146                                  $duration :=  concat(($end-time - $start-time),", ", ($end-time2 - $start-time),", ", ($end-time3 - $start-time))
        147                          return (<duration>{$duration}</duration>, <usedProfiles>{$used-profiles}</usedProfiles>,<resultSummary>{$result-summary}</resultSummary>)
        148                  else <duration>{$end-time - $start-time}</duration>
        149
        150      let $result-fragment :=
   89   151      <searchRetrieveResponse>
   90   152        <numberOfRecords>{$result-count}</numberOfRecords>
    …     …
   92   154        <extraResponseData>
   93   155          <returnedRecords>{$seq-count}</returnedRecords>
   94                <duration>{$end-time - $start-time}</duration>
        156                  {$summary-fragment}
   95   157        </extraResponseData>
   96   158        <records>
    …     …
  106   168  (:
  107   169    **********************
  108          queryModel - subfunctions
        170    queryModel, scanIndex - subfunctions
  109   171  :)
  110   172
    …     …
  154   216              (for $elname in $subs[. != '']
  155   217              return
  156                      cmd-model:elem-r(util:eval(concat("$path-nodes/", $elname)), concat($path, '/', $elname), $max-depth, $depth - 1),
  157                      if ($max-depth eq 1 and $text-count gt 0) then cmd-model:values($path-nodes) else ())
        218                cmd-model:elem-r(util:eval(concat("$path-nodes/", $elname)), concat($path, '/', $elname), $max-depth, $depth - 1)
        219                          (: values moved to own function: scanIndex
        220                        if ($max-depth eq 1 and $text-count gt 0) then cmd-model:values($path-nodes) else ()) :)
        221                                                  )
  158   222            else 'maxdepth'
  159   223          }</Term>
  160        };
  161
  162        declare function cmd-model:values($nodes as node()*) as node()* {
  163        let $keys := distinct-values($nodes/text())
  164        let $values := for $key at $pos in $keys
  165          let $kcount := count($nodes[. eq $key])
  166            order by lower-case($key) ascending
  167            return <v key="{$key}" cnt="{$kcount}" />
  168        return
  169          if ($cmd-model:valuesLimit eq 0) then $values
  170          else
  171          subsequence($values, 1, $cmd-model:valuesLimit)
  172   224  };
  173   225
    …     …
  178   230          return util:node-xpath($anc)
  179   231          }</Term>
        232  };
        233
        234  declare function cmd-model:collect-nodes($collections as xs:string+, $path as xs:string) as element()* {
        235    let $collection := collection($cmd-model:cmdiMirrorPath),
        236      $path-nodes :=
        237      if ($collections[1] eq $cmd-model:collectionRoot) then
        238        util:eval(fn:concat("$collection/descendant-or-self::", $path))
        239      else
        240        for $coll in $collections
        241        return
        242          util:eval(fn:concat("$collection/ft:query(descendant::IsPartOf, <query><term>", xdb:decode($coll), "</term></query>)/ancestor-or-self::CMD/descendant-or-self::", $path))
        243
        244          return $path-nodes
        245  };
        246
        247  declare function cmd-model:values($path as xs:string,$collections as xs:string+) as element() {
        248
        249          let $nodes := cmd-model:collect-nodes($collections, $path),
        250  (:              $term := <Term path="{fn:concat("//", $path)}" name="{(text:groups($path, "/([^/]+)$")[last()],$path)[1] }" >{$nodes}</Term>
        251                  @name is added in xslt:)
        252                  $term := <Term path="{fn:concat("//", $path)}"  >{$nodes}</Term>
        253
        254          (: use XSLT-2.0 for-each-group functionality to aggregate the values of a node - much, much faster, than XQuery :)
        255          return transform:transform($term,$cmd-model:groupXsl, ())
        256
  180   257  };
  181   258
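
The comment on cmd-model:scan-index describes two phases: build and cache the full index for a path, then select the requested subsequence on each call. The Python sketch below illustrates that pattern only; it is not a port of the XQuery or of group.xsl, which is not part of this diff. The in-memory dictionary stands in for the cache collection, and two behaviours are assumptions: the filter is treated as a case-insensitive substring match, and sort="size" is treated as most frequent first. All function names are hypothetical.

```python
from collections import Counter

# Stands in for the cache collection (/db/cache in the README).
_cache = {}


def build_index(values):
    """Phase 1: aggregate raw values into (key, count) pairs, ordered by key."""
    return sorted(Counter(values).items(), key=lambda kv: kv[0].lower())


def scan_index(cache_id, load_values, filter_text="", start_item=1,
               max_items=50, sort="text"):
    """Phase 2: return the requested slice of the cached index."""
    if cache_id not in _cache:
        # Only the first call for a given id pays for the aggregation.
        _cache[cache_id] = build_index(load_values())
    terms = _cache[cache_id]
    if filter_text:
        # ASSUMPTION: case-insensitive substring match; group.xsl may differ.
        needle = filter_text.lower()
        terms = [t for t in terms if needle in t[0].lower()]
    if sort == "size":
        # ASSUMPTION: most frequent first; ties keep their text order.
        terms = sorted(terms, key=lambda kv: -kv[1])
    # start_item is 1-based, like the startItem request parameter.
    first = max(start_item, 1) - 1
    return terms[first:first + max_items]
```

A second call with the same cache id skips the aggregation and only filters, sorts and slices, which matches the "on second call, only the second step is performed" remark in the function comment.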
  • MDRepository/trunk/xquery/cmd-test.xml

 r830 r1045
    3     3      <configuration>
    4     4          <!--  <connection id="con" user="admin" password="" base="xmldb:exist://embedded-eXist-server"/> -->
    5                <connection id="con" user="admin" password="pwd" base="xmldb:exist://localhost:8680/exist/xmlrpc/"/>
          5          <!-- <connection id="con" user="admin" password="M0dest7db" base="xmldb:exist://localhost:8680/exist/xmlrpc"/> -->
          6          <connection id="con" user="admin" password="sen71blE8dba" base="xmldb:exist://clarin.aac.ac.at/exist/xmlrpc"/>
    6     7          <action name="sequence" class="org.exist.performance.ActionSequence"/>
    7     8          <action name="create-collection" class="org.exist.performance.actions.CreateCollection"/>
    …     …
   40    41                                  <thread name="thread1" connection="con">
   41    42              <sequence repeat="20" description="actual queries on the mdrecords">
   42                                                <xquery collection="/db/cmdi-mirror" query="">
         43                                          <xquery collection="db/cmdi-mirror" query="">
   43    44                                                  import module namespace cmd-model = "http://spraakbanken.gu.se/clarin/xquery/model"
   44    45                                                                          at "xmldb:exist:///db/clarin/cmd-model.xqm";
   45    46                                                   cmd-model:search-retrieve("//title[contains(.,'a')]", 'root', 'xml', 1,10)
   46    47                                          </xquery>
   47                                                <xquery collection="/db/cmdi-mirror" query="//ft:query(*, <query><term>Language</term></query>)" />
   48                                                <xquery collection="/db/cmdi-mirror" query="ft:query(descendant::IsPartOf, <query><term>clarin-at:aac-test-corpus</term></query>)" />
         48                                          <xquery collection="/db/cmdi-mirror" query="//Components/ft:query(., 'Language')" />
         49                                          <xquery collection="/db/cmdi-mirror" query="ft:query(descendant::IsPartOf, 'clarin-at:aac-test-corpus')" />
   49    50                                          <xquery collection="/db/cmdi-mirror" query="//idno[.='4197']" />
   50    51                                          <xquery collection="/db/cmdi-mirror" query="//date[.>'1920']" />
   51    52                                          <xquery collection="/db/cmdi-mirror" query="//ResourceType[.='Metadata']" />
   52    53                                          <xquery collection="/db/cmdi-mirror" query="//sourceDesc/biblStruct/monogr/title[contains(.,'rat')]" />
   53                                                <xquery collection="/db/cmdi-mirror" query="//CMD/*[.//teiHeader//monogr/title[contains(.,'t')]][.//teiHeader//imprint/date[.>'1930']][.//teiHeader//imprint/date[.<'1935']]" />
         54                                          <xquery collection="/db/cmdi-mirror" query="//CMD/*[.//teiHeader//monogr/title[contains(.,'t')]][.//teiHeader//imprint/date[.&gt;'1930']][.//teiHeader//imprint/date[.&lt;'1935']]" />
   54    55                          </sequence>
   55    56                                  </thread>