CSV Schema

CSV Schema is a text-based schema language for describing the data in CSV files for the purposes of validation. It is released as Open Source under the Mozilla Public Licence version 2.0.

Overview

Firstly, we defined a Grammar which describes a language for expressing rules to validate a CSV file. We call an expression in this language a CSV Schema. The grammar is described more formally, in EBNF, in the CSV Schema Specification.

Secondly, we created a reference implementation, in the form of a Validator Tool and API (CSV Validator), that takes a CSV Schema file and a CSV file, verifies that the CSV Schema itself is syntactically correct, and then asserts that each rule in the CSV Schema holds true for the CSV file.

The Schema and Validator can be considered separately: you do not need to be aware of the validation tool or API to author a CSV Schema.

Background

The National Archives receive Metadata along with Digitised or Born-Digital Collections. Whilst The National Archives typically process Metadata in XML and RDF, it was recognised that producing the desired metadata in XML and/or RDF was too difficult and/or expensive for many suppliers. It was therefore decided that Metadata would be received in CSV format.

Our experience shows that when suppliers are asked to produce metadata in XML or RDF there are several possible barriers:

Many content/document repository systems only export metadata in CSV, or generate XML or RDF in a non-desirable format which would then have to be transformed (at further cost).
Lack of technical knowledge in either XML or RDF.
Lack of experience of tools for producing and validating XML or RDF.
Cost. Installing new software tools comes at a severe cost for those Other Government Departments (OGDs) that have outsourced their IT support.
Best/Worst case, most suppliers already have Microsoft Excel (or an equivalent) installed which they know how to use to produce a CSV file.

The National Archives set exacting requirements on the Metadata that they expect and on the format of that Metadata. Such constraints enable them to process it automatically, as the semantics of the metadata are already defined. Whilst bespoke tools have previously been developed for validating data in various CSV files, it was felt that a generic open tool which could be shared with suppliers would offer several benefits:

A common CSV Schema language would enable The National Archives to define required Metadata formats precisely.
Developed CSV Schemas could be shared with suppliers and other archival sector organisations.
Suppliers could validate Metadata before sending it to The National Archives, by means of our CSV Validator tool, which should reduce mistakes and therefore costs to both parties.
The National Archives could use the same tool to ensure Metadata compliance automatically.
Although not of primary concern, it was recognised that this tool would also have value for anyone working with CSV as a data/metadata transfer medium.

CSV Schema Language

The CSV Schema Language is defined in the CSV Schema Language 1.1 specification (this supersedes the original CSV Schema Language 1.0 specification as of 25 January 2016). It is suggested that the extension .csvs be used for CSV Schema Language files. There is also a working draft of CSV Schema Language 1.2, which adds a few new features.
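
To give a feel for the language, here is a minimal sketch of a CSV Schema, modelled on the style of the introductory example in the specification. The column names (name, age, gender) and the permitted values are hypothetical and chosen purely for illustration; consult the specification for the authoritative syntax and the full set of rules.

```
version 1.1
@totalColumns 3
// Hypothetical columns, for illustration only
name: notEmpty
age: range(0, 120)
gender: is("m") or is("f") or is("t") or is("n")
```

Read top to bottom, this declares the schema language version, states that each row must have exactly three columns, and then gives one rule per column: name must not be empty, age must lie between 0 and 120, and gender must be one of four literal values. Under these assumptions, a CSV file with the header row name,age,gender followed by a row such as James,21,m would be expected to pass.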

Reference Implementation

For details of the CSV Validator tool and API see https://github.com/digital-preservation/csv-validator.

Example CSV Schemas

To understand how to write CSV Schemas in practice, see the example CSV Schema file, digitised_surrogate_tech_acq_metadata_v1.1_TESTBATCH000.csvs, in the GitHub repository digital-preservation/csv-schema/example-schemas. In the example-data subfolder you will find a CSV file, digitised_surrogate_tech_acq_metadata_v1_TESTBATCH000.csv, which complies with the schema. This CSV file refers to XML files in the folder structure below TEST_1.

For Software Developers

See https://github.com/digital-preservation/csv-schemas.