
LRS Hackathon @ Centre meeting 2017

Planning for the Language Resource Switchboard hackathon at the Centre Meeting 2017 on Wednesday 17 May 2017.

More details can be found in the dedicated LRS-Hackathon GitHub repository.

Goal

Help developers of tools that process language resources (text, audio, video, annotations, ...) get their tool(s) connected to the LRS.

Task description

The Language Resource Switchboard (LRS) gives users who have a language resource easy access to tools that can process it. The LRS is more than a Yellow Pages directory, however. Given a resource's language (e.g., English, Dutch) and its MIME type (e.g., text/plain), it lists all tools capable of processing the resource and groups them by the task they promise to perform (e.g., constituency parsing, named entity recognition). Once the user selects a tool from the list, the LRS starts it. The aim is that the user, once directed to the tool, does not need to re-enter any information already known to the switchboard (language, MIME type, task). Instead, the tool starts with the given context, and users remain free to enter additional, tool-specific configuration parameters. To make this possible, the LRS invokes the tool via an HTTP request in which all relevant parameters are URL-encoded.
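As a rough illustration of such an invocation, the sketch below builds a tool URL from a resource's context. The tool address, the parameter names (input, lang, type) and the helper build_invocation_url are assumptions made for this example, not the LRS's actual API; in practice the parameter names would come from the tool's metadata.

```python
from urllib.parse import urlencode


def build_invocation_url(tool_url, param_names, context):
    """Append the switchboard's context to tool_url as a URL-encoded query.

    param_names maps a switchboard key (e.g. "language") to the parameter
    name the tool expects; context keys the tool does not declare are skipped.
    """
    query = {
        param_names[key]: value
        for key, value in context.items()
        if key in param_names
    }
    # Respect a query string that may already be part of the tool URL.
    separator = "&" if "?" in tool_url else "?"
    return tool_url + separator + urlencode(query)


# Hypothetical tool and resource, for illustration only.
url = build_invocation_url(
    "https://tool.example.org/run",
    {"input": "input", "language": "lang", "mimetype": "type"},
    {
        "input": "https://example.org/sample.txt",
        "language": "en",
        "mimetype": "text/plain",
    },
)
print(url)
```

Keeping the mapping from switchboard keys to tool parameter names outside the function means each tool can keep its own parameter names without any change to the calling code.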

To integrate a tool with the LRS, the following tasks need to be tackled:

  • give the LRS metadata about the tool
    • language(s) of the resource it can process
    • MIME type(s) of the resource it can process
    • the task it promises to achieve
    • the URL where the tool lives
    • the parameter names the switchboard should use to invoke the tool
  • modify the tool
    • parse the URL used for tool invocation (read the parameters and their values)
    • advance the state of the tool by taking the parameters' values into account
      • e.g., if the tool's UI has a pull-down menu for language selection, it should now show the language passed in the URL
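The tool-side changes in the second part can be sketched as follows. This is a minimal illustration, assuming a tool that is invoked with the hypothetical parameters input, lang and type and that offers a small language menu; the actual parameter names are whatever the tool declares in its metadata.

```python
from urllib.parse import parse_qs, urlsplit

# Languages offered in the tool's (hypothetical) pull-down menu.
SUPPORTED_LANGUAGES = {"en": "English", "nl": "Dutch"}


def initial_state(invocation_url, default_language="en"):
    """Derive the tool's start-up state from the URL it was invoked with."""
    params = parse_qs(urlsplit(invocation_url).query)

    def first(name):
        # parse_qs returns a list per parameter; take the first value, if any.
        values = params.get(name)
        return values[0] if values else None

    language = first("lang")
    if language not in SUPPORTED_LANGUAGES:
        # Unknown or missing language: fall back so the user can still choose.
        language = default_language

    return {
        "input": first("input"),
        "mimetype": first("type"),
        "selected_language": language,
    }


state = initial_state(
    "https://tool.example.org/run"
    "?input=https%3A%2F%2Fexample.org%2Fsample.txt&lang=nl&type=text%2Fplain"
)
print(state)
```

The returned dictionary would then be used to pre-select the menu entry and load the resource, so that the user does not have to re-enter anything the switchboard already knows.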

The first part is easy: you can send me the information via email. It is a nested JSON structure; more on this during the hackathon, or on the LRS homepage (http://weblicht.sfs.uni-tuebingen.de/clrs) under Developer. The second part can only be done by the tool developer and, of course, depends on the tool's design and implementation language.
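To give an idea of what such a nested structure might look like, the sketch below collects the metadata items listed above for a made-up tool. All field names and values are hypothetical and do not reproduce the LRS's actual schema, which is documented on the LRS homepage.

```python
import json

# Hypothetical metadata for a made-up tool; field names are illustrative only.
tool_metadata = {
    "name": "Example Tokenizer",
    "task": "tokenization",
    "url": "https://tool.example.org/run",
    "languages": ["en", "nl"],
    "mimetypes": ["text/plain"],
    # Switchboard key -> parameter name the tool expects in the URL.
    "parameters": {
        "input": "input",
        "language": "lang",
        "mimetype": "type",
    },
}


def missing_fields(metadata):
    """Return the items from the checklist that are absent or empty."""
    required = ("task", "url", "languages", "mimetypes", "parameters")
    return [field for field in required if not metadata.get(field)]


print(json.dumps(tool_metadata, indent=2))
print("missing:", missing_fields(tool_metadata))
```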

Interested tool developers are asked to send an email to claus.zinn@uni-tuebingen.de with their tool's metadata (no JSON required yet).

Schedule

(Also see Centre meeting agenda)

  • 9:00 Introduce LRS [CZ]
    • Demo VLO & b2drop bridges
    • Explain the API
    • Examples of adapted tools
    • Adaptation scenarios
  • 9:15 Present platform/fixture [TG]
    • Present sample resources
  • 9:25 Round of participants & tools
    • ~1 minute per participant: describe tool, input/output, programming language
  • 9:30 Coding
  • 10:30 Coffee break
  • 11:00 Coding (cont'd)
  • 12:00 Presentations & discussion (15m)
  • 12:15 Discuss next steps [TG/CZ] (15m)
  • 12:30 End of hackathon

Development 'fixture'

We need the following setup in place for the day of the hackathon and ideally for a while before and after that:

  • A test instance of the LRS (can be hosted by CLARIN) that is publicly accessible and can easily be reconfigured (metadata) by Claus
    • Ideally, also a package that participants can use to run the LRS locally and connect it to their own locally running tool, in case they have no server to deploy changes to.
  • A VLO test instance that is connected to this LRS instance and publicly accessible
  • (Optionally) a b2drop instance that is connected to this LRS instance and publicly accessible (ideally also for uploads)
  • A set of sample resources from Europeana and/or CLARIN

Follow up

In the days/weeks/months after the hackathon, the following could or should happen:

  • Opportunity for (code) review of implemented integrations
  • Opportunity for beta testing
  • An online follow-up meeting
    • A brief meeting where participants can report on their progress, ask questions, and discuss open issues. Its main purpose is to give participants a target to work towards for completing the work started at the hackathon.

TODOs

  • [x] Write a general task description (CZ)
  • [x] Collect and publish sample resources (TG)
  • [x] Set up and configure an LRS instance (CZ/TG)
  • [x] (Set up and) configure a VLO instance, import sample resources (TG)
  • [x] See if an (existing) b2drop instance could be used as a source (and/or drop target) for the resources (CZ)
  • [x] Prepare introduction (CZ, TG)
  • [x] Create a hackathon registration form (TG)
  • [x] Distribute the hackathon registration form (DvU/CZ)
  • [x] Create and print instructions (URLs, schedule) for participants

Meetings

  • Video conference: 11 April 2017, 14:00 CEST
  • Video conference: 26 April 2017, 14:00 CEST
  • Video conference: 15 May 2017, 10:00 CEST

Hackathon report

Participants

  1. Pavel Stranak (UFAL, Prague) stranak@ufal.mff.cuni.cz
  2. Amir Kamran (UFAL, Prague) kamran@ufal.mff.cuni.cz
  3. Bart Jongejan (Copenhagen) bartj@hum.ku.dk
  4. Martin Matthiesen (CSC Finland)
  5. Tero Aalto (CSC Finland) tero.aalto@csc.fi
  6. Matej Durco (ACDH Austria) Matej.Durco@oeaw.ac.at
  7. Wolfgang Sauer (ACDH Vienna) wolfgang.sauer@oeaw.ac.at
  8. Tommi Pirinen (HZSK Hamburg)
  9. Riccardo Del Gratta (ILC/CNR Italy) riccardo.delgratta@ilc.cnr.it
  10. Krista Liin (Center of Estonian Language Resources)

Results

Three web services were connected to the dev switchboard to some degree during and after the session:

  1. UDPipe (web service version), which carries out tokenization, morphological analysis, tagging, lemmatization, and dependency parsing. This required some ad-hoc mapping from language codes to models; a more sustainable solution is to be implemented either in the switchboard or on the tool side. The version with a user-friendly front end is to be connected as well.
  2. ILC's tokenizer for various languages was connected as a proof of concept, but the service is not publicly available yet. Some complications with the request were encountered but resolved.
  3. HTML to plain text conversion (by Bart Jongejan), a service provided by CLARIN-DK.

There is good potential for integrating tools from HZSK (conversion service for transcriptions) and ACDH (REST API + web views for named entity recognition, entity linking), but some adaptations to the applications themselves are required. CSC and the Center of Estonian Language Resources also see possibilities (e.g. Keeleliin) but have not taken any concrete steps yet.

Many tools do not (yet) support processing a resource on the basis of a URI passed as a query parameter of a GET request, but do accept content to be processed via POST. We have to decide whether this should be resolved generically in the LRS, by the individual tools, or through some generic wrapper service.
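One possible shape for the wrapper option is sketched below: fetch the resource from its URI, then forward the content to the tool via POST. This is only an illustration of the idea, not something that was implemented at the hackathon; the function name relay, its parameters, and the assumption that the tool accepts the raw content as the POST body are all hypothetical.

```python
import urllib.request


def relay(resource_url, tool_url, mimetype="text/plain",
          urlopen=urllib.request.urlopen):
    """Fetch the resource at resource_url and POST its bytes to tool_url.

    urlopen is injectable so the function can be exercised without
    network access; by default the standard library opener is used.
    """
    # Step 1: resolve the URI that the switchboard would pass along.
    with urlopen(resource_url) as response:
        content = response.read()

    # Step 2: hand the content to a tool that only accepts POSTed data.
    request = urllib.request.Request(
        tool_url,
        data=content,
        headers={"Content-Type": mimetype},
        method="POST",
    )
    with urlopen(request) as result:
        return result.read()
```

A call such as relay("https://example.org/sample.txt", "https://tool.example.org/run") would return the tool's response body as bytes. Error handling, size limits and authentication are deliberately left out of this sketch.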

A virtual follow-up meeting is to be scheduled about 4 weeks after the event (via Doodle).

Last modified on 05/18/17 14:07:40