
Describing quality of a corpus

Research in Computational Music Analysis and related fields relies on high-quality musical corpora to model music concepts and to develop and evaluate computational models, algorithms, and methodologies. Adhering to the FAIR principles, the community aims to share these corpora in a manner that is maximally reproducible.

However, the availability of musical data from various sources introduces inherent variations in quality, making it crucial to establish criteria for assessing and describing the quality of these corpora across multiple dimensions. These dimensions include the completeness and accuracy of scores, audio/video or other accompanying documents, synchronization between different media types, availability and quality of annotations, as well as piece-level and corpus-level metadata.

Through the evaluation and documentation of the quality of musical corpora, researchers can effectively navigate the complexities associated with working with heterogeneous data sources. Furthermore, this process facilitates the promotion of reliable and well-constructed corpora, ensuring their optimal utilization in the field of computational music analysis.

Below, each criterion is assigned a score ranging from 0 to 5, indicating the level of quality and suitability for publication and research purposes.

Quality ratings of 3 and 4 are already (very) good. A quality rating of 5 should be reserved for very good or exceptional materials, particularly those that are fully reproducible, open, and published.

This document was created as part of the data preparation process for the Dezrann platform. Some of the criteria below, indicated by italics, are specifically relevant to the integration of corpora into Dezrann. Evaluating these quality criteria for the Dezrann corpora indicates that the sources, annotations, and metadata are available in formats compatible with Dezrann.

However, these quality assessments can be applied in a broader context beyond the Dezrann platform. We welcome input from other teams or platforms on the criteria for evaluating quality elements. This collaborative approach ensures that we consider diverse perspectives and factors relevant to assessing the quality of musical corpora.

Example of .json file with quality information

We describe quality criteria for the corpus, as a whole, and for each piece.

{
    "corpus": {
        "title": "Fanny Mendelssohn Songs",
        (...)
        "quality:corpus": "2",
        "quality:corpus:metadata": "3"
    },

    "pieces": {
        "gondelied": {
            "opus": {
                "title": "Gondelied"
                (...)
            },

            "quality:annotation": "2",
            "quality:audio:synchro": "3",
            "quality:audio": "1",
            "quality:metadata": "3",
            "quality:musical-time": "3",
            "quality:score": "4",

            "sources": [
                (...)
            ]
        }
    }
}

As of 2024 Q4, we store all piece-related quality information directly in the piece dictionary, even when fields relate to a particular source (such as quality:audio).
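Because the quality fields share the `quality:` prefix and sit directly in the corpus and piece dictionaries, they can be collected with a simple key filter. The following Python sketch assumes the layout of the example above, with scores stored as strings from "0" to "5". The helper name `quality_scores` is illustrative and not part of Dezrann.

```python
def quality_scores(entry):
    """Return the quality:* fields of a corpus or piece dictionary as integers.

    Assumes scores are stored as strings "0".."5", as in the example above.
    """
    scores = {}
    for key, value in entry.items():
        if not key.startswith("quality:"):
            continue  # skip non-quality fields such as "opus" or "sources"
        score = int(value)
        if not 0 <= score <= 5:
            raise ValueError(f"{key}: score {score} is outside the 0-5 range")
        scores[key] = score
    return scores


# Piece dictionary shaped like the "gondelied" entry above (elided parts omitted).
piece = {
    "opus": {"title": "Gondelied"},
    "quality:annotation": "2",
    "quality:audio:synchro": "3",
    "quality:audio": "1",
    "quality:metadata": "3",
    "quality:musical-time": "3",
    "quality:score": "4",
}

print(quality_scores(piece))
```

The same helper works on the "corpus" dictionary, since its quality fields use the same prefix.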

Quality criteria for the corpus

quality:corpus

“Sources” can be scores, audio/video with good synchronization, images, or an annotation set.

quality:corpus:metadata

See metadata.md#corpus-metadata.

Quality criteria for each piece in the corpus

quality:score

quality:musical-time

The musical time (time signatures, measure numbers, repeat structure) should be indicated through a measure map (currently under discussion with M. Gotham and J. Hentschel).

quality:audio:synchro

As of 2025, the editor for synchronizing repeats is still a prototype, so the maximum score here is 3 when the piece contains repeats.
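The cap on repeats can be expressed as a small check. This is a minimal sketch, not Dezrann code: the function name and the `has_repeats` flag are hypothetical, and how a piece is detected as containing repeats is left to the caller.

```python
# Cap stated above while the repeat synchronization editor is a prototype.
MAX_SYNCHRO_WITH_REPEATS = 3


def capped_synchro_score(score, has_repeats):
    """Return the quality:audio:synchro score, capped when the piece has repeats.

    `has_repeats` is a hypothetical flag supplied by the caller.
    """
    if not 0 <= score <= 5:
        raise ValueError(f"score {score} is outside the 0-5 range")
    if has_repeats:
        return min(score, MAX_SYNCHRO_WITH_REPEATS)
    return score


print(capped_synchro_score(5, has_repeats=True))
print(capped_synchro_score(5, has_repeats=False))
```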

quality:audio

Whether a recording is of low or high value is subjective: it is linked to audio quality, but even more to interpretation quality and/or musicological/historical value. A noisy “Gershwin by Gershwin” is probably more valuable than a 24-bit/96 kHz recording by a beginner pianist.

Audio (and video) content is a special case, as some materials are only accessible via external sites, such as the YouTube player. Licenses are described elsewhere (one license for each source). However, content with a quality:audio of 5 must be available under an open license.
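The license requirement for the top audio score can be checked mechanically. The sketch below is an illustration under stated assumptions: the function name and the `has_open_license` flag are hypothetical, and deciding whether a given source license counts as open is left to the caller, since licenses are described elsewhere.

```python
def audio_score_is_consistent(audio_quality, has_open_license):
    """Check the rule that a quality:audio of 5 requires an open license.

    `has_open_license` is a hypothetical flag computed by the caller
    from the license of the audio source.
    """
    if not 0 <= audio_quality <= 5:
        raise ValueError(f"score {audio_quality} is outside the 0-5 range")
    if audio_quality == 5 and not has_open_license:
        return False  # top score claimed without an open license
    return True


print(audio_score_is_consistent(5, has_open_license=False))
print(audio_score_is_consistent(4, has_open_license=False))
```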

quality:annotation

quality:metadata

See metadata.md#opus-piece-metadata