
Adding and maintaining a new corpus on Dezrann

A corpus on Dezrann should contain scores, sets of analyses/annotations, and synchronized audio (at least two of the three), together with appropriate metadata. As far as possible, these files should be available under open-data licenses, as described on Open Science and Licenses. We try to add corpora in a reproducible way, as is done for the public corpora available on the platform. The data you use or create (scores, annotations, audio, metadata) must be available in a git repository or through a stable URL.

As a corpus curator/maintainer, your responsibility is mostly to prepare, update, and maintain a corpus description file such as metadata/my-corpus.json, which gives all the information and points to the sources. This involves the following steps.

Tutorial: follow the 🏝️ points

Step 1. Data/metadata preparation

(Sources data)

Prepare scores (preferably .mei, but also .musicxml or other symbolic formats, processed with Verovio). See Scores preparation;
Prepare annotation data. This may either come from external data converted to the JSON .dez format representing labels, or be done later, once the scores are on Dezrann;
Prepare audio/video and synchronization (preferably open-data audio/video, or possibly YouTube links). The synchronization can either come from external data, or can be done later, once the scores are on Dezrann;

(Metadata)

Prepare the corpus description file metadata/my-corpus.json combining corpus and piece metadata, as described on Specifying corpus and piece data and metadata.
🏝️ You can start from the metadata/template-one-piece.json template, a fairly short example with one piece and four sources (score, audio, YouTube video, analysis). These sources are currently unrelated to one another; pick the ones you want for your piece.
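
Before sending the description file anywhere, it can help to confirm that it is well-formed JSON and to list the URLs it points to, since every source is expected to live at a stable location. The sketch below makes no assumption about the Dezrann metadata schema: it walks the parsed JSON and reports every string value that starts with http:// or https://. The function names collect_urls and check_description are hypothetical helpers, not part of the Dezrann tools.

```python
import json
from pathlib import Path


def collect_urls(node, path=""):
    """Recursively yield (json_path, url) for every string value that looks like a URL."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from collect_urls(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from collect_urls(value, f"{path}/{index}")
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        yield path, node


def check_description(filename):
    """Parse a corpus description file and return the URLs found in it.

    json.loads raises json.JSONDecodeError if the file is not valid JSON,
    which is the first thing worth catching before a build.
    """
    text = Path(filename).read_text(encoding="utf-8")
    data = json.loads(text)
    return list(collect_urls(data))
```

Calling check_description("metadata/my-corpus.json") returns a list of (location, URL) pairs that can then be checked by hand. This only checks syntax and lists links; it does not validate the file against the format described on Specifying corpus and piece data and metadata.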

Step 2. First build of the corpus

When the metadata/my-corpus.json file is ready:

Either contact us with the corpus description file and/or open an MR with this file in the /metadata directory. As of Q2 2025, uploading the corpus on Dezrann involves manual steps on the server, so this is the preferred route. Better tools are scheduled for Q4 2025.
🏝️ New (Q3 2025)! And/or test your corpus directly on the Dezrann test server with:

curl -sS --request POST --url https://test-ws.dezrann.net/corpus --header 'Content-Type: multipart/form-data' --form metadata=@my-corpus.json

And/or, following the examples on how to rebuild public corpora, build the corpus with the tools/dezrann-corpus.py script on a local or a public Dezrann installation. Note that the corpus has to be built and checked on a local installation of Dezrann (or on the test server; contact us to get an account) before it is uploaded to the public production server.
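
The curl call to the test server shown above can also be issued from a Python script, for instance as part of a local build routine. The following is a minimal sketch using only the standard library. The endpoint and the form field name metadata are taken from the curl command; the per-part Content-Type (application/json), the helper names build_multipart and post_corpus, and the way the response is read are assumptions, since the server's response format is not documented here.

```python
import os
import urllib.request
import uuid

# Endpoint taken from the curl command in this page.
TEST_SERVER_URL = "https://test-ws.dezrann.net/corpus"


def build_multipart(field, filename, content):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header). The part is labelled
    application/json, which is an assumption about what the server expects.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/json\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def post_corpus(path, url=TEST_SERVER_URL):
    """POST the corpus description file; return (HTTP status, response text)."""
    with open(path, "rb") as handle:
        body, content_type = build_multipart(
            "metadata", os.path.basename(path), handle.read()
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8", errors="replace")
```

A call such as post_corpus("my-corpus.json") sends the same kind of request as the curl command. Unlike curl, this sketch generates its own multipart boundary and sets it in the Content-Type header, which is what a server needs to parse the form.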

Step 3. Check/curation

🏝️ Once the corpus is on Dezrann, in the sandbox (for example at https://test.dezrann.net/~/salperwick-sandbox/piece-yourname):

Check every score, browsing each score until its end;
If this was not done before, synchronize the audio files (see synchro);
Update data/metadata in the metadata/my-corpus.json file, and rebuild the corpus;
In particular, fill in/update the quality values for each piece, according to quality;
List the problems encountered. If we do not manage to fix them, we may hide some scores of the corpus. This has to be documented in a text file or, better, for a corpus intended to be public, filed as issues on the dezrann-corpus gitlab.

Step 4. Publication and long-term maintenance

(Communication, maintenance)

Check the corpus metadata again, in particular the presentation material (text, motto, availability, status), with a few lines presenting the corpus;
Contact us to finalize the publication/release;
To further improve reproducibility, prepare a long-term archive with tools/archive.py (to be documented) and upload it to an institutional repository;
After the corpus is published, update these data or metadata when needed.