wiki:Cmdi/QualityCriteria

This page collects quality criteria for metadata records (and, by extension, profiles).

A number of such lists have been proposed; we aim to gather them here in one place.

Related work

  • CmdiBestPracticeGuide#Qualitycheck-list (criteria from there moved to this list)
  • MDQAS: older draft specifying a MD quality assessment service (criteria from there moved to this list)
  • Curation Module: will operationalize the defined quality criteria and allow checking whether metadata records adhere to them
  • LREC2014 paper by Thorsten Trippel et al.: Towards automatic quality assessment of component metadata
  • Technical report by Marc Kemps-Snijders (2014): Metadata Quality Assurance

Criteria

(Aspects marked with a question mark are subject to discussion)

Schema level

strict criteria:

  • ratio of elements with data categories
  • presence of "required" data categories
  • coverage of mapping to VLO facets (possibly weighting of facets)
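
The first strict criterion can be illustrated with a minimal sketch. It assumes that the profile specification marks elements as CMD_Element and references data categories through a ConceptLink attribute; both names are passed as parameters because they may differ between CMDI versions. The profile fragment and its example.org links are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified profile fragment (not a real registry profile).
PROFILE_XML = """
<CMD_ComponentSpec isProfile="true">
  <CMD_Component name="Example">
    <CMD_Element name="title" ConceptLink="http://example.org/concept/title"/>
    <CMD_Element name="language" ConceptLink="http://example.org/concept/language"/>
    <CMD_Element name="note"/>
    <CMD_Element name="internalId" ConceptLink=""/>
  </CMD_Component>
</CMD_ComponentSpec>
"""


def concept_link_ratio(profile_xml, element_tag="CMD_Element", link_attr="ConceptLink"):
    """Return the share of elements that carry a non-empty data category reference."""
    root = ET.fromstring(profile_xml)
    elements = list(root.iter(element_tag))
    if not elements:
        return 0.0
    # An empty or whitespace-only attribute is treated as "no data category".
    linked = sum(1 for el in elements if (el.get(link_attr) or "").strip())
    return linked / len(elements)


print(concept_link_ratio(PROFILE_XML))
```

Whether a given ratio counts as acceptable is a policy decision that this sketch does not make.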

soft:

  • public accessibility of the profile
  • status of the concept (approved/candidate/deprecated/...)
  • number of defined elements (identify a range - too many, too few)
  • number of unique components used
  • total number of components used
  • number of references to distinct data categories defined externally
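
The counting criteria above could be collected in one pass over a profile specification, roughly as sketched below. The tag and attribute names (CMD_Component, CMD_Element, ComponentId, ConceptLink) are assumptions about the specification format, and the profile fragment is invented. Components are identified by ComponentId when present and by name otherwise, which is only one possible definition of "unique".

```python
import xml.etree.ElementTree as ET

# Hypothetical profile fragment: the "Actor" component is used twice.
PROFILE_XML = """
<CMD_ComponentSpec isProfile="true">
  <CMD_Component name="Example">
    <CMD_Element name="title" ConceptLink="http://example.org/concept/title"/>
    <CMD_Component name="Actor" ComponentId="c_1">
      <CMD_Element name="name" ConceptLink="http://example.org/concept/name"/>
    </CMD_Component>
    <CMD_Component name="Actor" ComponentId="c_1">
      <CMD_Element name="name" ConceptLink="http://example.org/concept/name"/>
    </CMD_Component>
  </CMD_Component>
</CMD_ComponentSpec>
"""


def profile_counts(profile_xml):
    """Count elements, components (total and unique) and distinct data category links."""
    root = ET.fromstring(profile_xml)
    components = [c.get("ComponentId") or c.get("name") for c in root.iter("CMD_Component")]
    links = {e.get("ConceptLink") for e in root.iter("CMD_Element") if e.get("ConceptLink")}
    return {
        "elements": sum(1 for _ in root.iter("CMD_Element")),
        "components_total": len(components),
        "components_unique": len(set(components)),
        "distinct_concept_links": len(links),
    }


print(profile_counts(PROFILE_XML))
```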

Instance level

strict:

  • availability of the (reference to) schema (i.e. both there is a reference to schema and the schema can be retrieved)
  • does the schema reside in the Component Registry?
  • validity of the record with respect to the schema
  • are the header fields complete?
    • is there a unique MdSelfLink?
    • is there an MdCollectionDisplayName?
    • is there an MdProfile?
  • does it contain ResourceProxy elements?
    • link to a LandingPage, when available?
    • is there an indication of the mime type?
    • ResourceProxy with @type=Resource or @type=Metadata available?
  • URL-inspection - broken links
  • filled-in ratio / sparseness?
    • how many of the elements defined by schema are actually populated with information
    • what about files that hardly contain any information?
    • completeness concerning relative importance
    • coverage of the VLO facets (instance level)
  • size - e.g. overall size (measured in MB)
    • is the file too big to be useful? (VLO has a strict rule on max size; if so, suggest higher granularity)
    • max number of ResourceProxy per record
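
Several of the strict instance-level checks can be run on a single parsed record. The sketch below covers header completeness, a listing of resource proxies with their MIME types, and the filled-in ratio. The record, its namespace, the placeholder profile id and the list of schema element names are all invented for illustration; in practice the element list would be derived from the profile. Uniqueness of MdSelfLink and broken links are not covered, since they need other records or network access.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; namespace, layout and identifiers are assumptions.
RECORD_XML = """
<CMD xmlns="http://www.clarin.eu/cmd/">
  <Header>
    <MdSelfLink>http://example.org/record/1</MdSelfLink>
    <MdProfile>clarin.eu:cr1:p_0000000000000</MdProfile>
    <MdCollectionDisplayName></MdCollectionDisplayName>
  </Header>
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="r1">
        <ResourceType mimetype="text/plain">Resource</ResourceType>
        <ResourceRef>http://example.org/data/1.txt</ResourceRef>
      </ResourceProxy>
      <ResourceProxy id="r2">
        <ResourceType>LandingPage</ResourceType>
        <ResourceRef>http://example.org/landing/1</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
  <Components>
    <Example>
      <title>Sample corpus</title>
      <language></language>
      <note>   </note>
    </Example>
  </Components>
</CMD>
"""

REQUIRED_HEADER_FIELDS = ("MdSelfLink", "MdCollectionDisplayName", "MdProfile")


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def find_first(root, name):
    """Return the first element with the given local name, or None."""
    return next((el for el in root.iter() if local_name(el) == name), None)


def missing_header_fields(root):
    """List required header fields that are absent or blank."""
    header = find_first(root, "Header")
    present = set()
    if header is not None:
        present = {local_name(el) for el in header if (el.text or "").strip()}
    return [name for name in REQUIRED_HEADER_FIELDS if name not in present]


def resource_proxies(root):
    """Return (resource type, MIME type or None) for every ResourceType element."""
    return [((el.text or "").strip(), el.get("mimetype"))
            for el in root.iter() if local_name(el) == "ResourceType"]


def filled_in_ratio(root, schema_elements):
    """Share of the schema's element names that occur with non-blank text."""
    if not schema_elements:
        return 0.0
    components = find_first(root, "Components")
    populated = set()
    if components is not None:
        populated = {local_name(el) for el in components.iter()
                     if len(el) == 0 and (el.text or "").strip()}
    return len(populated & set(schema_elements)) / len(schema_elements)


root = ET.fromstring(RECORD_XML)
print(missing_header_fields(root))
print(resource_proxies(root))
print(filled_in_ratio(root, ["title", "language", "note", "internalId"]))
```

Matching on local names keeps the sketch independent of the exact namespace, at the cost of ignoring namespace errors that schema validation would catch.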

soft:

  • values conform to a controlled vocabulary (where applicable)
  • number of elements
  • length of the strings in description fields
  • what is the information entropy? (lots of very similar files might be an indication of a suboptimal modelling)
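
One possible reading of the entropy criterion is the Shannon entropy of the values that a given field takes across the records of a collection. The sketch below takes that reading as an assumption; the sample values are invented. A result near zero means that almost all records carry the same value.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of the given values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p) summed over all distinct values
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical "title" values taken from four records of one collection.
identical_titles = ["Recording", "Recording", "Recording", "Recording"]
mixed_titles = ["Interview A", "Interview A", "Interview B", "Interview B"]

print(shannon_entropy(identical_titles))
print(shannon_entropy(mixed_titles))
```

Comparing values by exact string equality is a simplification: near-duplicates such as "Recording 1" and "Recording 2" would count as fully distinct.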

problematic:

  • ? publication before creation date
  • ? editing quality: typos, expected format (date, capitalization, plural vs. singular)
  • the combination of values in related categorical fields is inconsistent (if field A has value x, field B should have y or z)
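
Two of these checks lend themselves to a small rule-based sketch: the date order and the consistency of related categorical fields. The sketch assumes that a record has already been flattened into a dictionary and that dates are ISO formatted (YYYY-MM-DD). The field names and the rule table are hypothetical.

```python
from datetime import date


def published_before_created(record):
    """True if both dates are present and publication precedes creation."""
    created = record.get("creationDate")
    published = record.get("publicationDate")
    if not created or not published:
        return False  # cannot be judged without both dates
    return date.fromisoformat(published) < date.fromisoformat(created)


# Hypothetical rules: if field A has value x, field B should be one of the listed values.
RULES = [
    ("mediaType", "audio", "mimeType", {"audio/wav", "audio/mpeg"}),
]


def inconsistent_combinations(record, rules=RULES):
    """Return (field A, value A, field B, value B) for every violated rule."""
    problems = []
    for field_a, value_a, field_b, allowed_b in rules:
        if record.get(field_a) == value_a and record.get(field_b) not in allowed_b:
            problems.append((field_a, value_a, field_b, record.get(field_b)))
    return problems


record = {
    "creationDate": "2014-05-01",
    "publicationDate": "2013-01-01",
    "mediaType": "audio",
    "mimeType": "text/plain",
}
print(published_before_created(record))
print(inconsistent_combinations(record))
```

Note that a missing field B is reported as a violation here; a more lenient policy would skip such records.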

interesting, but difficult:

  • accuracy - semantic distance between data and md?
  • conformance to expectation metrics?
  • coherence - different metadata fields describe the same object in the same way?
  • accessibility metrics (retrieval and cognitive)? check whether the data can be retrieved or whether a login (and shib-login) is required
  • timeliness metrics - how quality changes in time?
Last modified on 11/11/15 15:05:07
