DASISH WEB-ANNOTATOR

TLA

This document specifies a browser extension for annotating web-documents. We present the class structure of the implementation, describe the functionality from the user perspective and define the REST API.

Document version: 1.0

Date: 4 April 2013

Authors: Olha Shkaravska, Przemek Lenkiewicz, Menzo Windhouwer, Twan Goosen, Daan Broeder

Technical Summary.

The aim of this document is to specify a web-annotation tool to be developed within the DASISH project. The tool is a browser extension that allows users to annotate fragments of web documents with tags, colors and text notes. Annotatable fragments are text and, at later stages of development, graphical objects as well.

Initially the tool will support annotating web pages only. Later we plan to extend it to web documents generated by linguistic software, e.g. EAF files created by ELAN (MPI Nijmegen) or lexical entries created by LEXUS (MPI Nijmegen). We do not want to limit annotatable objects to those generated by DASISH participants, and we plan to include external linguistic software in our case study.

The heart of the class schema is the class “Annotation”. An object of class Annotation is in the “target” relation with one or more objects of class “Source”. The semantics of an Annotation object is defined by its attribute “Body”. There are several types of annotation bodies, which express the variety of ways to annotate documents, from marking fragments with simple text tags or colors to attaching arbitrary text notes.

https://trac.clarin.eu/raw-attachment/wiki/DASISH/SpecificationDocument/Table1.png

An example of <sid> is given by the URI

http://tla.mpi.nl/#xpointer(//div[id='post-1157']/p/substring(.,33,3))

Here the part http://tla.mpi.nl/ is an <aoid> and the part xpointer(//div[id='post-1157']/p/substring(.,33,3)) is a <fid>. Since <vid> is not given, the <sid> refers to the latest version of the resource located at http://tla.mpi.nl/.
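The decomposition of a <sid> can be sketched in a few lines of Python. This is only an illustration: it assumes the layout <aoid>@<vid>#<fid> that this document uses for versioned sources, and it further assumes that the last “@” before the “#” is the version separator. The names SourceId and parse_sid are hypothetical and not part of the specification.

```python
from typing import NamedTuple, Optional


class SourceId(NamedTuple):
    aoid: str            # URI of the annotatable object
    vid: Optional[str]   # version identifier; None means "latest version"
    fid: str             # fragment descriptor; "" means the whole document


def parse_sid(sid: str) -> SourceId:
    """Split a <sid> of the assumed form <aoid>[@<vid>]#<fid> into its parts."""
    head, hash_sign, fid = sid.partition("#")
    if not hash_sign:
        raise ValueError("a <sid> must contain '#', even if the fragment is empty")
    # Assumption: the last '@' in front of '#' separates <aoid> from <vid>.
    aoid, at_sign, vid = head.rpartition("@")
    if not at_sign:
        return SourceId(head, None, fid)
    return SourceId(aoid, vid, fid)


example = parse_sid(
    "http://tla.mpi.nl/#xpointer(//div[id='post-1157']/p/substring(.,33,3))"
)
```

For the example URI above, the result has aoid http://tla.mpi.nl/, no vid, and the xpointer expression as fid.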

<uid> is not mentioned explicitly below as a parameter in the description of the REST service, because it is known from the session via the “Shibboleth” identification procedure.

An owner is either the principal who has created the annotation or a principal to whom the ownership has been assigned.

Class Schema

The schema is based on the following interfaces and classes:

  • class Source represents (a specific fragment of) a specific version of an annotatable object; it contains information about this version, such as a time stamp and the list of references to cached representations;
  • class Annotation contains the reference to the annotation’s body (which in turn contains the list of sources that it annotates), the name of the owner, and the lists of “readers” and “writers”;
  • interface Cached representation is a generic interface to be implemented by different representations of annotatable resources, such as serialized ones (e.g. XML), media files and screenshots;
  • interface Body (of an annotation) can be text, “like”, color, relation, etc.; it contains the reference to the annotation.
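To make the relations between these classes concrete, the schema can be sketched in Python as follows. This is a minimal sketch under our own assumptions, not the implementation: the attribute names, the concrete classes Screenshot and NoteBody, and the choice of dataclasses are hypothetical.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


class CachedRepresentation(ABC):
    """Generic interface for stored representations of an annotatable resource."""

    @abstractmethod
    def mime_type(self) -> str: ...


@dataclass
class Screenshot(CachedRepresentation):
    png_bytes: bytes

    def mime_type(self) -> str:
        return "image/png"


@dataclass
class Source:
    sid: str                      # identifies (a fragment of) one version
    time_stamp: datetime
    cached: List[CachedRepresentation] = field(default_factory=list)


@dataclass
class Body(ABC):
    sources: List[Source]         # the sources this body annotates
    # Back-reference to the owning annotation; excluded from repr and
    # comparison so that the circular link cannot cause infinite recursion.
    annotation: Optional[Annotation] = field(default=None, repr=False, compare=False)


@dataclass
class NoteBody(Body):
    text: str = ""


@dataclass
class Annotation:
    owner: str
    body: Body
    readers: List[str] = field(default_factory=list)
    writers: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.body.annotation = self   # keep both directions of the link in sync


page = Source("http://tla.mpi.nl/#", datetime(2013, 4, 4))
note = Annotation(owner="olha", body=NoteBody(sources=[page], text="Check this."))
```

The back-reference from Body to Annotation is set in one place only, so the two directions of the link cannot drift apart.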

https://trac.clarin.eu/raw-attachment/wiki/DASISH/SpecificationDocument/Screen%20shot%202013-03-26%20at%203.18.43%20PM.png

We propose the following XML-serializations.

https://trac.clarin.eu/raw-attachment/wiki/DASISH/SpecificationDocument/XML1.png

Note that the MIME type for MHTML is taken from Wikipedia, but there seems to be some discussion about which MIME type is correct:

http://en.wikipedia.org/wiki/MHTML http://stackoverflow.com/questions/31250/content-type-for-mht-files

An annotation whose body is a binary relation (in this example “implies”). The intended meaning of the following example is that source1 implies source2.

https://trac.clarin.eu/raw-attachment/wiki/DASISH/SpecificationDocument/XML2.png

An annotation whose body is “Note” (see the section about the types of annotations)

https://trac.clarin.eu/raw-attachment/wiki/DASISH/SpecificationDocument/XML3.png

Note that “full” XML representations as above may be returned by the corresponding GET methods. When we POST a new annotation, less is known about it: for instance, it does not have an assigned identifier yet. We propose the following serialization of a new annotation:

https://trac.clarin.eu/raw-attachment/wiki/DASISH/SpecificationDocument/XML4.png
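As a rough illustration of how a client might build such a request body, consider the sketch below. All element and attribute names in it (annotation, body, type, targetSources, source, sid) are hypothetical placeholders, since the proposed serialization is given only in the figure; the point is merely that the new annotation carries no identifier.

```python
import xml.etree.ElementTree as ET
from typing import List


def serialize_new_annotation(note_text: str, target_sids: List[str]) -> str:
    """Build the XML body for POSTing a new "Note" annotation.

    Element and attribute names are illustrative assumptions only.
    """
    root = ET.Element("annotation")   # no identifier: the server assigns it
    body = ET.SubElement(root, "body", {"type": "Note"})
    body.text = note_text
    targets = ET.SubElement(root, "targetSources")
    for sid in target_sids:
        ET.SubElement(targets, "source", {"sid": sid})
    return ET.tostring(root, encoding="unicode")


request_body = serialize_new_annotation(
    "Interesting paragraph.",
    ["http://tla.mpi.nl/#xpointer(//div[id='post-1157']/p/substring(.,33,3))"],
)
```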

Initial Annotation-Body Types

In the first prototype we plan to implement only single-target annotations with the body type “Note”. From the user perspective they are just text notes about fragments of the document, à la comments in Word documents, but displayed only in a list or as a tooltip (as Wired-Marker currently does). Balloon display as in MS Word can be implemented at a later stage.

In general we plan to implement the body types following the class diagram above. Recall that these body types, besides “Note”, are: color, tag (a unary relation), labeled tag (a unary relation with parameters) and binary relation. Below we present a series of instances of these body types. Implementing these instances within our tool will have a twofold effect:

  • first, it will serve the user’s convenience by providing a drop-down menu of annotations once a fragment to be annotated is selected;
  • second, it will show that within the proposed class schema it is possible to create reasonable types of annotations.

To create an annotation, the user highlights the text and right-clicks. The creation menu should appear near the highlighted text (or in the right sub-panel of the whole panel). There the user can select the type of annotation and add other parameters when necessary. It may be possible to highlight the second fragment for binary relations using the Shift key(s).

For existing annotations, a left click on the highlighted text triggers a “callout” (a rectangular box connected to the text fragment) with a short annotation description. This applies to tags and relations (see below). A right click on the highlighted text triggers the context menu, which contains the complete information about the annotation: its author, date and URI.

User Interface prototype

Main window view:

https://trac.clarin.eu/raw-attachment/wiki/DASISH/SpecificationDocument/UI.png

Context menu:

https://trac.clarin.eu/raw-attachment/wiki/DASISH/SpecificationDocument/MENU.png

REST API

Remark on document versioning. Web documents exist in time; that is, different versions of a document may exist under the same URI (<aoid>) at different moments in time. In the first prototype we implement only the simplest necessary handling of web-document versions: we omit REST requests concerning versions and rely on local caching of old versions of annotated sources (which already exists as a feature in Wired-Marker).

All information necessary to fulfill a PUT, POST or DELETE request, such as the URI of an annotated object, is given serialized in the request body, not as request parameters in the request’s URI. If a POST (PUT, DELETE) method succeeds, it returns serialized information about the added (resp. updated, removed) resource together with a standard HTTP response code. The information includes the resource ID, the owner’s ID, the time stamp and (possibly) the list of the <sid>s of the target sources. For the full information the user will use GET on the just created or updated annotation, already knowing its ID. In the case of failure the corresponding error message and error status are returned, e.g. 401 Unauthorized. Only the “owner” has DELETE rights.
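The response convention described here can be sketched with a small in-memory stand-in for the service. This is an illustration under our own assumptions: the function names, the dictionary keys and the success and not-found codes (201, 200, 404) are hypothetical; only the 401 status for unauthorized access and the contents of the summary come from this document.

```python
import itertools
from datetime import datetime, timezone
from typing import Dict, List, Tuple

_next_id = itertools.count(1)
_annotations: Dict[str, dict] = {}   # in-memory stand-in for the database

SUMMARY_KEYS = ("id", "owner", "timeStamp", "targets")


def _summary(record: dict) -> dict:
    """Short description returned by POST, PUT and DELETE (not the full body)."""
    return {key: record[key] for key in SUMMARY_KEYS}


def post_annotation(uid: str, note: str, target_sids: List[str]) -> Tuple[int, dict]:
    aid = f"a{next(_next_id)}"
    record = {
        "id": aid,
        "owner": uid,                 # the creator becomes the owner
        "timeStamp": datetime.now(timezone.utc).isoformat(),
        "targets": list(target_sids),
        "body": note,
    }
    _annotations[aid] = record
    return 201, _summary(record)      # 201 is an assumed success code


def delete_annotation(uid: str, aid: str) -> Tuple[int, dict]:
    record = _annotations.get(aid)
    if record is None:
        return 404, {"error": "Not found"}            # assumed code
    if record["owner"] != uid:
        return 401, {"error": "Unauthorized access"}  # only the owner may delete
    del _annotations[aid]
    return 200, _summary(record)      # 200 is an assumed success code
```

A client that needs the full annotation, including its body, would follow up with a GET using the id from the returned summary.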

Annotations

api/annotations

https://trac.clarin.eu/raw-attachment/wiki/DASISH/SpecificationDocument/REST1.png

api/annotations/<aid>

It is assumed that if the logged-in user <uid> has no “read” access to <aid>, then GET methods on URIs of the form api/annotations/<aid> return the error status 401 Unauthorized, or similar. The same happens if the logged-in user <uid> has no “write” access to <aid> and uses PUT, POST or DELETE methods on URIs of the form api/annotations/<aid>.

The table below describes the behavior of the pair (method, URI) when user <uid> has authorized access to <aid>. Here “authorized access” means that <uid> has “read” access for GET methods and “write” access for PUT, POST and DELETE methods.

https://trac.clarin.eu/raw-attachment/wiki/DASISH/SpecificationDocument/REST2.png
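The access rule for api/annotations/<aid> can be summarized in a small sketch. It assumes that the owner implicitly holds both rights and that 200 stands for any success status; both points, as well as the function names, are our assumptions. The stricter owner-only rule for DELETE stated earlier in this document is not modelled here.

```python
from typing import Collection

# Right required by each HTTP method on api/annotations/<aid>.
REQUIRED_RIGHT = {"GET": "read", "PUT": "write", "POST": "write", "DELETE": "write"}


def has_right(uid: str, right: str, owner: str,
              readers: Collection[str], writers: Collection[str]) -> bool:
    if uid == owner:          # assumption: the owner may both read and write
        return True
    allowed = readers if right == "read" else writers
    return uid in allowed


def status_for(method: str, uid: str, owner: str,
               readers: Collection[str], writers: Collection[str]) -> int:
    """Return 401 when the required right is missing, otherwise 200."""
    right = REQUIRED_RIGHT[method]
    return 200 if has_right(uid, right, owner, readers, writers) else 401
```

For instance, a user listed only among the readers gets 200 for GET and 401 for PUT.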

Sources

A source represents (a specific fragment of) a specific version of an annotatable object. For instance, if an annotatable object is a web page that has 3 versions and users have annotated versions 1 and 3, then there are 2 sources in the database that correspond to the web page. Naturally, these sources represent versions 1 and 3.

Note that access to the whole document with <aoid> is possible via its <sid>=<aoid>#, with an empty fragment descriptor.

Adding sources to the database and removing them is the responsibility of the database management system. In fact, adding a source is a “side effect” of creating an annotation on a certain URI. Moreover, if the source with <sid>=<aoid>@<vid>#XXX is added to the DB, then the source <sid>=<aoid>@<vid># must be added as well, unless it is already in the DB.

If all the annotations that refer to a certain source are deleted, then the database management system deletes this source from the DB. A read-only REST API for inspecting sources (including fragments) is needed.

Cached representations are managed by the client; therefore a creation and deletion API is necessary. It is possible to store the cached representation not only of the fragment precisely corresponding to an annotation target source, but also of a larger fragment or even of the entire annotatable object.
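The life cycle of sources described above can be sketched as follows. The class SourceRegistry and its method names are hypothetical. We also assume, beyond what is stated above, that a whole-document source added as a side effect is kept as long as at least one of its fragment sources remains in the DB.

```python
from typing import Dict, List, Set


class SourceRegistry:
    """Tracks which annotations refer to which sources (in-memory sketch)."""

    def __init__(self) -> None:
        self._refs: Dict[str, Set[str]] = {}   # sid -> ids of referring annotations

    @staticmethod
    def whole_document_sid(sid: str) -> str:
        """<aoid>@<vid>#XXX -> <aoid>@<vid>#  (empty fragment descriptor)."""
        return sid.partition("#")[0] + "#"

    def on_annotation_created(self, aid: str, sid: str) -> None:
        # Adding the source is a side effect of creating the annotation.
        self._refs.setdefault(sid, set()).add(aid)
        # The whole-document source is added too, unless already present.
        self._refs.setdefault(self.whole_document_sid(sid), set())

    def on_annotation_deleted(self, aid: str) -> None:
        for refs in self._refs.values():
            refs.discard(aid)
        # Delete fragment sources that no annotation refers to any more.
        for sid in [s for s, r in self._refs.items()
                    if not r and s != self.whole_document_sid(s)]:
            del self._refs[sid]
        # Assumption: delete an unreferenced whole-document source only
        # when none of its fragment sources is left.
        for sid in [s for s, r in self._refs.items()
                    if not r and s == self.whole_document_sid(s)]:
            if not any(o != sid and o.startswith(sid) for o in self._refs):
                del self._refs[sid]

    def sids(self) -> List[str]:
        return sorted(self._refs)
```

Creating an annotation on a fragment therefore yields two sources, and deleting the last annotation that refers to them removes both.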

api/sources

https://trac.clarin.eu/raw-attachment/wiki/DASISH/SpecificationDocument/REST3.png

api/notebooks

https://trac.clarin.eu/raw-attachment/wiki/DASISH/SpecificationDocument/REST4.png
