This page collects quality criteria for metadata records (and, by extension, profiles). A number of such lists have been proposed; we aim to gather them here in one place.

== Related work
 * [[CmdiBestPracticeGuide#Qualitycheck-list]] (criteria from there moved to this list)
 * [[MDQAS]]: older draft for specifying an MD quality assessment service (criteria from there moved to this list)
 * [[Curation Module]] will operationalize the defined quality criteria and allow checking whether metadata records adhere to these criteria
 * [http://www.lrec-conf.org/proceedings/lrec2014/pdf/1011_Paper.pdf LREC2014 paper] by Thorsten Trippel et al.: Towards automatic quality assessment of component metadata
 * Technical report by Marc Kemps-Snijders (2014): Metadata Quality Assurance

== Criteria
(Aspects marked with a question mark are subject to discussion)

=== Schema level
strict criteria:
 * ratio of elements with data categories
 * presence of "required" data categories
 * coverage of mapping to VLO facets (possibly weighting of facets)

soft:
 * public accessibility of the profile
 * status of the concept (approved/candidate/deprecated/...)
 * number of defined elements (identify a range: too many, too few)
 * number of unique components used
 * total number of components used
 * number of references to distinct data categories defined externally

=== Instance level
strict:
 * availability of the (reference to) schema (i.e. there is a reference to the schema and the schema can be retrieved)
 * does the schema reside in the Component Registry?
 * validity of the record with respect to the schema
 * are the header fields complete?
   * is there a unique !MdSelfLink?
   * is there an !MdCollectionDisplayName?
   * is there an !MdProfile?
 * does it contain !ResourceProxy elements?
   * link to a !LandingPage when available?
   * is there an indication of the mime type?
   * !ResourceProxy with `@type=Resource or @type=Metadata` available?
 * URL-inspection - broken links
   * in !MdSelfLink
   * in !ResourceProxy
   * in the CMD_elements (less important)
 * filled-in ratio / sparseness?
   - how many of the elements defined by the schema are actually populated with information
   - what about files that hardly contain any information?
   - completeness concerning relative importance
   - coverage of the VLO facets (instance level)
 * size - e.g. overall size (measured in MB)
   * is the file too big to be useful? (VLO has a strict rule on max size)[[BR]] -> suggest higher granularity
   * max number of !ResourceProxy per record

soft:
 * values conform to a controlled vocabulary (where applicable)
 * number of elements
 * length of the strings in description fields
 * what is the information entropy? (lots of very similar files might be an indication of suboptimal modelling)

problematic:
 * ? publication before creation date
 * ? editing quality: typos, expected format (date, capitalization, plural vs. singular)
 * the combination of values in related categorical fields is not consistent (if field A has value x, field B should have y or z)

interesting, but difficult:
 * accuracy - semantic distance between data and metadata?
 * conformance to expectation metrics?
 * coherence - do different metadata fields describe the same object in the same way?
 * accessibility metrics (retrieval and cognitive)? check whether the data can be retrieved or whether a login (including Shibboleth login) is required
 * timeliness metrics - how does quality change over time?
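
A few of the strict instance-level criteria above (header completeness, presence of !ResourceProxy elements, mime type indication, filled-in ratio) can be sketched in Python as below. This is only an illustration, not the Curation Module's implementation. The function name `check_record`, the result keys, and the sample record are hypothetical. It also simplifies in two ways: elements are matched by local name regardless of namespace, and the filled-in ratio is computed over the leaf elements present in the record rather than over all elements defined by the schema.

```python
import xml.etree.ElementTree as ET

# Header fields named in the criteria list above.
HEADER_FIELDS = ("MdSelfLink", "MdProfile", "MdCollectionDisplayName")

# Made-up sample record for illustration; URLs and profile ID are placeholders.
SAMPLE = """<CMD xmlns="http://www.clarin.eu/cmd/">
  <Header>
    <MdSelfLink>http://example.org/record/1</MdSelfLink>
    <MdProfile>clarin.eu:cr1:p_0000000000000</MdProfile>
    <MdCollectionDisplayName></MdCollectionDisplayName>
  </Header>
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="r1">
        <ResourceType mimetype="text/plain">Resource</ResourceType>
        <ResourceRef>http://example.org/data/1.txt</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
  <Components>
    <Example>
      <Title>Sample</Title>
      <Description/>
    </Example>
  </Components>
</CMD>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def has_text(element):
    """True if the element carries non-whitespace text."""
    return bool((element.text or "").strip())


def check_record(xml_text):
    """Compute a few simple quality indicators for one metadata record."""
    elements = list(ET.fromstring(xml_text).iter())

    # Header completeness: expected fields that are present and non-empty.
    present = {local_name(el.tag) for el in elements
               if local_name(el.tag) in HEADER_FIELDS and has_text(el)}
    missing = [name for name in HEADER_FIELDS if name not in present]

    # Number of ResourceProxy elements in the record.
    proxies = [el for el in elements if local_name(el.tag) == "ResourceProxy"]

    # ResourceType elements that indicate a mime type.
    with_mime = [el for el in elements
                 if local_name(el.tag) == "ResourceType" and el.get("mimetype")]

    # Simplified filled-in ratio: share of leaf elements with text content.
    leaves = [el for el in elements if len(el) == 0]
    filled = [el for el in leaves if has_text(el)]
    ratio = len(filled) / len(leaves) if leaves else 0.0

    return {
        "missing_header_fields": missing,
        "resource_proxy_count": len(proxies),
        "mimetype_count": len(with_mime),
        "filled_in_ratio": ratio,
    }


if __name__ == "__main__":
    print(check_record(SAMPLE))
```

A schema-aware variant would take the denominator of the ratio from the profile definition instead of the record itself, which is closer to the criterion as worded above.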