This page collects quality criteria for metadata records (and, by extension, profiles).
A number of such lists have been proposed; we aim to gather them here in one place.
Related work
- CmdiBestPracticeGuide#Qualitycheck-list (criteria from there moved to this list)
- MDQAS, an older draft specification of a metadata quality assessment service (criteria from there moved to this list)
- The Curation Module will operationalize the defined quality criteria and allow checking whether metadata records adhere to them
- LREC 2014 paper by Thorsten Trippel et al.: Towards automatic quality assessment of component metadata
- Technical report by Marc Kemps-Snijders (2014): Metadata Quality Assurance
Criteria
(Aspects marked with a question mark are subject to discussion)
Schema level
strict criteria:
- ratio of elements linked to data categories
- presence of "required" data categories
- coverage of the mapping to VLO facets (possibly with weighting of facets); a minimal check along these lines is sketched at the end of this section
soft:
- public accessibility of the profile
- status of the concept (approved/candidate/deprecated/...)
- number of defined elements (identify a range: too many, too few)
- number of unique components used
- total number of components used
- number of references to distinct data categories defined externally
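To illustrate how such schema-level checks could be operationalized (e.g. by the Curation Module), here is a minimal sketch. It assumes a component specification whose elements are not namespace-qualified and whose data category links sit in a ConceptLink attribute (CMDI 1.2 style); the data-category-to-facet mapping is a hand-made placeholder, not the mapping actually used by the VLO.

```python
# Sketch: schema-level checks on a CMDI profile / component specification (see assumptions above).
import xml.etree.ElementTree as ET

def datcat_ratio(spec_path):
    """Ratio of CMD_Element declarations that carry a data category link."""
    root = ET.parse(spec_path).getroot()
    elements = list(root.iter("CMD_Element"))
    if not elements:
        return 0.0
    linked = [e for e in elements if e.get("ConceptLink")]
    return len(linked) / len(elements)

# Hypothetical data-category-to-facet mapping; a real check would load the
# mapping maintained for the VLO instead of this placeholder.
FACET_MAP = {
    "http://example.org/datcat/languageCode": "languageCode",
    "http://example.org/datcat/resourceName": "name",
}

def facet_coverage(spec_path, facet_map=FACET_MAP):
    """Fraction of VLO facets that can be filled from the profile's data categories."""
    root = ET.parse(spec_path).getroot()
    datcats = {e.get("ConceptLink") for e in root.iter("CMD_Element") if e.get("ConceptLink")}
    facets = set(facet_map.values())
    covered = {facet for dc, facet in facet_map.items() if dc in datcats}
    return len(covered) / len(facets) if facets else 0.0
```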
Instance level
strict:
- availability of the (reference to the) schema, i.e. there is a reference to a schema and that schema can actually be retrieved
  - does the schema reside in the Component Registry?
- validity of the record with respect to the schema
- are the header fields complete? (some of these checks are sketched after this list)
  - is there a unique MdSelfLink?
  - is there an MdCollectionDisplayName?
  - is there an MdProfile?
- does it contain ResourceProxy elements?
  - is there a link to a LandingPage, when available?
  - is there an indication of the MIME type?
  - is a ResourceProxy with @type=Resource or @type=Metadata available?
- URL inspection: broken links
  - in MdSelfLink
  - in ResourceProxy
  - in the CMD_elements (less important)
- filled-in ratio / sparseness?
  - how many of the elements defined by the schema are actually populated with information?
  - what about files that hardly contain any information?
- completeness concerning relative importance
- coverage of the VLO facets (instance level)
- size, e.g. overall size (measured in MB)
  - is the file too big to be useful? (the VLO has a strict rule on the maximum size -> suggest higher granularity)
- maximum number of ResourceProxy elements per record
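A minimal sketch of some of these strict instance-level checks follows. It assumes a CMDI 1.1 record with the http://www.clarin.eu/cmd/ envelope namespace (CMDI 1.2 records use different namespaces) and a naive HTTP HEAD request for the link check; the filled-in ratio here only counts elements present in the instance rather than all elements defined by the schema.

```python
# Sketch: strict instance-level checks on a single CMDI record (see assumptions above).
import urllib.request
import xml.etree.ElementTree as ET

CMD_NS = {"cmd": "http://www.clarin.eu/cmd/"}  # assumed CMDI 1.1 envelope namespace

def check_header(root):
    """Are MdSelfLink, MdCollectionDisplayName and MdProfile present and non-empty?"""
    header = root.find("cmd:Header", CMD_NS)
    result = {}
    for field in ("MdSelfLink", "MdCollectionDisplayName", "MdProfile"):
        text = header.findtext(f"cmd:{field}", default="", namespaces=CMD_NS) if header is not None else ""
        result[field] = bool(text.strip())
    return result

def check_resource_proxies(root):
    """ResourceProxy presence, proxy types, and MIME type indication."""
    proxies = root.findall(".//cmd:ResourceProxy", CMD_NS)
    type_elems = [p.find("cmd:ResourceType", CMD_NS) for p in proxies]
    types = [(t.text or "").strip() for t in type_elems if t is not None]
    return {
        "has_proxy": bool(proxies),
        "has_resource_or_metadata": any(t in ("Resource", "Metadata") for t in types),
        "has_landing_page": "LandingPage" in types,
        "has_mimetype": any(t is not None and t.get("mimetype") for t in type_elems),
    }

def link_ok(url, timeout=10):
    """Naive broken-link check via HTTP HEAD; a real curation service would retry, throttle, etc."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except Exception:
        return False

def filled_in_ratio(root):
    """Fraction of elements in the Components section that carry non-empty text.
    (The criterion above compares against the elements defined by the schema;
    this simplification only looks at elements present in the instance.)"""
    components = root.find("cmd:Components", CMD_NS)
    if components is None:
        return 0.0
    elements = list(components.iter())
    filled = [e for e in elements if (e.text or "").strip()]
    return len(filled) / len(elements)
```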
soft:
- values conform to a controlled vocabulary (where applicable)
- number of elements
- length of the strings in description fields
- what is the information entropy? (many very similar files might indicate suboptimal modelling; see the sketch after this list)
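These soft criteria could be approximated along the following lines; the vocabulary, the length band and the entropy measure (plain Shannon entropy over distinct record serializations in a collection) are all illustrative assumptions, not fixed definitions.

```python
# Sketch: soft instance-level checks (placeholder vocabulary and thresholds).
import math
from collections import Counter

LANGUAGE_VOCAB = {"eng", "deu", "nld"}  # placeholder controlled vocabulary (ISO 639-3 codes)

def vocabulary_conformance(values, vocabulary=LANGUAGE_VOCAB):
    """Fraction of non-empty values that are members of the controlled vocabulary."""
    values = [v.strip() for v in values if v and v.strip()]
    if not values:
        return None  # nothing to judge
    return sum(v in vocabulary for v in values) / len(values)

def description_length_ok(description, min_chars=30, max_chars=2000):
    """Rough plausibility band for free-text description fields (thresholds are arbitrary)."""
    return min_chars <= len(description.strip()) <= max_chars

def collection_entropy(serialized_records):
    """Shannon entropy (in bits) over distinct record serializations in a collection.
    A value near 0 means the records are (nearly) identical, which may indicate
    suboptimal modelling of the collection."""
    counts = Counter(serialized_records)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```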
problematic:
- ? publication date before creation date
- ? editing quality: typos, expected format (dates, capitalization, plural vs. singular)
- the combination of values in related categorical fields is not consistent (if field A has value x, field B should have value y or z); see the sketch after this list
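These checks are hard to define in general, but for individual cases they reduce to simple rules. The sketch below shows a date-order check and a table-driven cross-field constraint; the field names and the rule table are invented for illustration.

```python
# Sketch: consistency checks with illustrative field names and rules.
from datetime import date

def publication_not_before_creation(created: date, published: date) -> bool:
    """Flag records whose publication date precedes the creation date."""
    return published >= created

# Hypothetical cross-field constraint table:
# if the first field has the given value, the second field must take one of the allowed values.
CROSS_FIELD_RULES = [
    ("resourceClass", "corpus", "mediaType", {"text", "audio", "video"}),
]

def cross_field_consistent(record):
    """Check a flat field->value record (dict) against the constraint table above."""
    for field_a, value_a, field_b, allowed in CROSS_FIELD_RULES:
        if record.get(field_a) == value_a and record.get(field_b) not in allowed:
            return False
    return True

# Example: publication_not_before_creation(date(2014, 1, 1), date(2013, 6, 1)) -> False
```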
interesting, but difficult:
- accuracy: semantic distance between the data and the metadata?
- conformance-to-expectation metrics?
- coherence: do different metadata fields describe the same object in the same way?
- accessibility metrics (retrieval and cognitive): check whether the data can actually be retrieved, or whether a login (possibly a Shibboleth login) is required; the retrieval part is sketched after this list
- timeliness metrics: how does quality change over time?
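Most of these remain open questions, but the retrieval side of the accessibility metric can at least be approximated. In the sketch below, the login-detection heuristics (markers in the final URL, 401/403 status codes) are assumptions, not an established method.

```python
# Sketch: crude accessibility probe for a resource URL (heuristics are assumptions).
import urllib.error
import urllib.request

LOGIN_HINTS = ("login", "shibboleth", "sso", "wayf", "auth")  # markers of an authentication page

def probe_accessibility(url, timeout=10):
    """Classify a resource URL as 'open', 'login-required' or 'unreachable'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            final_url = response.geturl().lower()
            if any(hint in final_url for hint in LOGIN_HINTS):
                return "login-required"  # redirected to a login / WAYF page
            return "open"
    except urllib.error.HTTPError as error:
        return "login-required" if error.code in (401, 403) else "unreachable"
    except Exception:
        return "unreachable"
```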
Attachments
- The Metadata Quality Assurance-final_MKS.docx (5.2 MB): Marc Kemps-Snijders (2014), Metadata Quality Assurance, technical report