1 | #################################################### |
---|
2 | THIS CODE IS OBSOLETE!! |
---|
3 | it has been integrated and further developed within: |
---|
4 | https://github.com/vronk/corpus_shell |
---|
5 | and |
---|
6 | https://github.com/vronk/SADE |
---|
7 | #################################################### |
---|
8 | |
---|
9 | == CLARIN MDRepository == |
---|
10 | Steps to set up and run the repository |
---|
11 | |
---|
12 | 0. prerequisites + install |
---|
13 | a) be sure to use Java JDK 1.6 (we experienced strange Java errors with 1.5) |
---|
14 | |
---|
15 | b) install: |
---|
16 | http://exist-db.org/quickstart.html#sect2 |
---|
17 | |
---|
18 | java -jar eXist-{version}.jar -p {install-dir} |
---|
19 | |
---|
20 | c) set the admin password |
---|
21 | |
---|
22 | d) you may want to give the JVM more memory |
---|
23 | in set_java_options() in bin/functions.d/eXist-settings.sh |
---|
24 | |
---|
25 | e) you may also want to grow the cache in conf.xml |
---|
26 | <db-connection cacheSize="48M" collectionCache="24M" database="native" |
---|
27 | where @cacheSize could be around 512M |
---|
28 | and @collectionCache should be around one third of the @cacheSize |
---|
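The one-third rule above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this README, and the integer rounding is an assumption; eXist simply reads whatever values you put into conf.xml.

```python
def collection_cache_for(cache_size_mb: int) -> str:
    """Return a collectionCache value of roughly one third of cacheSize.

    Hypothetical helper; integer division is an arbitrary rounding choice.
    """
    return f"{cache_size_mb // 3}M"


def db_connection_attrs(cache_size_mb: int) -> str:
    """Format the two cache attributes as they would appear in conf.xml."""
    return (
        f'cacheSize="{cache_size_mb}M" '
        f'collectionCache="{collection_cache_for(cache_size_mb)}"'
    )


# prints: cacheSize="512M" collectionCache="170M"
print(db_connection_attrs(512))
```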
29 | |
---|
30 | |
---|
31 | 1. add scripts to: /db/clarin |
---|
32 | |
---|
33 | + cmd-model.xqm has all the logic |
---|
34 | + cmd-model.xql is the script being called as the interface |
---|
35 | + groups.xsl |
---|
36 | (+) cmd-stats.xql is meant for testing purposes, but not integrated yet |
---|
37 | (+) init-cache.xql refreshes the cache by running some long-running (resource-intensive) queries; it is meant to be run once after each dataset change |
---|
38 | |
---|
39 | |
---|
40 | 2. add a clarin-user in /db/system/users.xml |
---|
41 | (needed for writing into the cache) |
---|
42 | + add /db/clarin/writer.xml with that user's credentials, like this: |
---|
43 | <write> |
---|
44 | <write-user>clarin</write-user> |
---|
45 | <write-user-cred>{PASSWORD}</write-user-cred> |
---|
46 | </write> |
---|
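If you prefer to generate writer.xml rather than type it by hand, a minimal Python sketch could look like this. The function name and the example password are placeholders, not part of the repository; the element names are the ones shown above.

```python
import xml.etree.ElementTree as ET


def build_writer_xml(user: str, password: str) -> str:
    """Build the content of writer.xml for the given cache-writing user.

    Hypothetical helper; special characters in the password are
    XML-escaped by ElementTree.
    """
    root = ET.Element("write")
    ET.SubElement(root, "write-user").text = user
    ET.SubElement(root, "write-user-cred").text = password
    return ET.tostring(root, encoding="unicode")


# "change-me" is a placeholder, use the password of your clarin user
print(build_writer_xml("clarin", "change-me"))
```

The resulting string still has to be stored as /db/clarin/writer.xml by whatever means you normally use to upload files into eXist.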
47 | |
---|
48 | |
---|
49 | 3. create a collection for caching, |
---|
50 | e.g. /db/cache |
---|
51 | this has to correspond to the entry in cmd-model.xqm: |
---|
52 | declare variable $cmd-model:commonFreqsPath as xs:string := "/db/cache"; |
---|
53 | |
---|
54 | If you change something, you have to manually clear the cache-collection. |
---|
55 | |
---|
56 | Queries on the queryModel and getCollections interfaces are cached. |
---|
57 | The cache keys are: |
---|
58 | for getCollections: collection{maxdepth}-{hash({collection-handle})} |
---|
59 | for queryModel: values{maxdepth}-{hash({simple xpath from q-param})} |
---|
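To make the key scheme above concrete, here is a rough Python sketch. Note the assumption: this README does not say which hash function cmd-model.xqm uses, so MD5 is only a stand-in, and keys produced here will not necessarily match the names in your cache collection. The function names are hypothetical.

```python
import hashlib


def _hash(value: str) -> str:
    # ASSUMPTION: MD5 is a stand-in; the real hash used by cmd-model.xqm
    # is not documented here.
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def collections_cache_key(collection_handle: str, max_depth: int) -> str:
    """Key for getCollections: collection{maxdepth}-{hash(handle)}."""
    return f"collection{max_depth}-{_hash(collection_handle)}"


def query_model_cache_key(xpath: str, max_depth: int) -> str:
    """Key for queryModel: values{maxdepth}-{hash(simple xpath)}."""
    return f"values{max_depth}-{_hash(xpath)}"


# "hdl:1234/abc" is a made-up handle for illustration
print(collections_cache_key("hdl:1234/abc", 2))
print(query_model_cache_key("Components", 1))
```

The point of the scheme is that the same request (same handle or xpath, same maxdepth) always maps to the same cache entry, which is why the cache has to be cleared manually after changes.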
60 | |
---|
61 | |
---|
62 | 4. define indices |
---|
63 | copy cmdi-mirror.xconf into /db/system/config/db/cmdi-mirror |
---|
64 | |
---|
65 | |
---|
66 | 5. add data to /db/cmdi-mirror |
---|
67 | (The file-system structure will be reflected in the "collection" structure within eXist; |
---|
68 | however, this is irrelevant for the MDRepository methods. |
---|
69 | Those rely on the linking via handles in the MdSelfLink/ResourceRef and <IsPartOf> elements of the MDRecords. |
---|
70 | The handles in <IsPartOf> are redundant (necessary for faster collection-constraint search) |
---|
71 | and can be derived from the ResourceRef/MdSelfLink link. |
---|
72 | This can be done before storing the data in the repository, |
---|
73 | or after the import, directly in the repository; XUpdate scripts for this will be available soon.) |
---|
74 | |
---|
75 | The top-level collection record is by convention called collection_root.cmdi |
---|
76 | and is marked with: <IsPartOf>root</IsPartOf> |
---|
77 | (So every dataset (olac, lrt, imdi) has one such MDRecord.) |
---|
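Before importing, you may want to check which of your records are marked as root. The following Python sketch does that for a single record. It matches elements by local name so that it does not depend on a particular CMDI namespace; the sample records are simplified, made-up fragments, not real MDRecords.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def is_root_record(cmdi_xml: str) -> bool:
    """Return True if the record contains <IsPartOf>root</IsPartOf>."""
    root = ET.fromstring(cmdi_xml)
    return any(
        local_name(el.tag) == "IsPartOf" and (el.text or "").strip() == "root"
        for el in root.iter()
    )


# Simplified, hypothetical records for illustration only
ROOT_SAMPLE = "<CMD><Resources><IsPartOf>root</IsPartOf></Resources></CMD>"
CHILD_SAMPLE = "<CMD><Resources><IsPartOf>hdl:1234/parent</IsPartOf></Resources></CMD>"

print(is_root_record(ROOT_SAMPLE))
print(is_root_record(CHILD_SAMPLE))
```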
78 | |
---|
79 | 6. depending on your server setup (port), you should be able to run your first query at a URL like: |
---|
80 | |
---|
81 | http://localhost:8680/exist/rest/db/clarin/cmd-model.xql?q=Components |
---|
82 | (queryModel is the default operation) |
---|
83 | |
---|
84 | http://localhost:8680/exist/rest/db/clarin/cmd-model.xql?operation=getCollections&collection= |
---|
85 | |
---|
86 | These queries may take some time when run for the first time, so be patient. |
---|
87 | Avoid starting the same query multiple times. |
---|
88 | You can check the cache collection to see whether the results are ready. |
---|
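The two request URLs above can also be built from a script. This Python sketch only assembles the URLs; the base URL is the example from this README and has to be adapted to your host and port, and the function names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Example base URL from this README; adjust host and port to your setup
BASE_URL = "http://localhost:8680/exist/rest/db/clarin/cmd-model.xql"


def query_model_url(q: str, base_url: str = BASE_URL) -> str:
    """URL for a queryModel request (queryModel is the default operation)."""
    return f"{base_url}?{urlencode({'q': q})}"


def get_collections_url(collection: str = "", base_url: str = BASE_URL) -> str:
    """URL for a getCollections request; an empty handle is sent as-is."""
    params = {"operation": "getCollections", "collection": collection}
    return f"{base_url}?{urlencode(params)}"


def fetch(url: str, timeout: float = 600.0) -> bytes:
    """Fetch a result; the generous timeout is an arbitrary choice, since
    the first (uncached) run of a query can take a long time."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


print(query_model_url("Components"))
print(get_collections_url())
```

fetch() is not called here because it needs a running server. Call it once per URL and wait for it to finish rather than firing the same request repeatedly.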
89 | |
---|
90 | |
---|
91 | |
---|
92 | == test suite == |
---|
93 | THIS IS CURRENTLY BEING DEVELOPED! NOT SAFELY USABLE YET! |
---|
94 | |
---|
95 | The test suite has its own build file: build-tests.xml. |
---|
96 | It is based on eXist's performance.xml sub-build file |
---|
97 | and imports the main eXist build file. |
---|
98 | This causes problems with basedir for the imported build files. |
---|
99 | |
---|
100 | The simplest solution I could find is to set the basedir as property on command line: |
---|
101 | |
---|
102 | ant -f build-tests.xml -Dbasedir=C:/apps/exist benchmark |
---|
103 | |
---|
104 | The other options are set in build-tests.properties. |
---|
105 | |
---|
106 | The actual queries for testing/benchmarking are written in cmd-test.xml. |
---|