A text-based schema language (CSV Schema) for describing data in CSV files for the purposes of validation. Released as Open Source under the Mozilla Public Licence version 2.0.
Firstly, we defined a Grammar which describes a language for expressing rules to validate a CSV file. We call such an expression of this language a CSV Schema. The grammar itself is more formally described in EBNF and is available in the CSV Schema Specification.
Secondly, we created a reference implementation, in the form of a Validator Tool and API (CSV Validator), that will take a CSV Schema file and a CSV file, verify that the CSV Schema itself is syntactically correct, and then assert that each rule in the CSV Schema holds true for the CSV file.
The Schema and Validator can be considered separately: you do not need to be aware of the validation tool or API to author a CSV Schema.
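The two-step process described above (first check that the rules themselves are well formed, then assert that every rule holds for every row) can be illustrated with a small Python sketch. This is not the CSV Schema language and not the CSV Validator API; the names `Rule`, `not_empty`, `in_range`, `one_of`, `check_rules` and `validate`, as well as the sample columns, are hypothetical and exist only for this illustration.

```python
import csv
import io
from typing import Callable, Dict, List

# Hypothetical rule type for this sketch: a predicate over one cell value.
Rule = Callable[[str], bool]


def not_empty(value: str) -> bool:
    """Rule: the cell must contain something other than whitespace."""
    return value.strip() != ""


def in_range(low: int, high: int) -> Rule:
    """Rule factory: the cell must be an integer within [low, high]."""
    def check(value: str) -> bool:
        try:
            return low <= int(value) <= high
        except ValueError:
            return False
    return check


def one_of(*allowed: str) -> Rule:
    """Rule factory: the cell must equal one of the allowed strings."""
    return lambda value: value in allowed


def check_rules(rules: Dict[str, Rule]) -> None:
    """Step 1 (toy stand-in for a schema syntax check): reject a malformed rule set."""
    if not rules:
        raise ValueError("rule set is empty")
    for column, rule in rules.items():
        if not column or not callable(rule):
            raise ValueError(f"malformed rule for column {column!r}")


def validate(csv_text: str, rules: Dict[str, Rule]) -> List[str]:
    """Step 2: apply each column rule to each row and collect error messages."""
    check_rules(rules)
    errors: List[str] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        for column, rule in rules.items():
            value = row.get(column) or ""  # treat a missing cell as empty
            if not rule(value):
                errors.append(f"line {line_no}, column {column}: invalid value {value!r}")
    return errors


# Made-up rules and data, for illustration only.
rules = {
    "name": not_empty,
    "age": in_range(0, 120),
    "gender": one_of("m", "f", "t", "n"),
}
sample = "name,age,gender\nAlice,34,f\n,130,x\n"
errors = validate(sample, rules)
for message in errors:
    print(message)
```

In this sketch the first data row passes every rule and the second row fails all three. A real CSV Schema expresses such rules declaratively in a `.csvs` file rather than as Python functions, which is why a schema can be authored without any knowledge of the tool that checks it.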
The National Archives receive Metadata along with Digitised or Born-Digital Collections. Whilst The National Archives typically process Metadata in XML and RDF, it was recognised that producing the desired metadata in XML and/or RDF was too difficult and/or expensive for many suppliers. It was therefore decided that Metadata would be received in CSV format.
Our experience shows that when suppliers are asked to produce metadata in XML or RDF there are several possible barriers:
The National Archives set exacting requirements on the Metadata that they expect and on the format of that Metadata. Such constraints enable them to process it automatically, as the semantics of the metadata are already defined. Whilst bespoke tools have previously been developed for validating data in various CSV files, it was felt that a generic open tool which could be shared with suppliers would offer several benefits:
The CSV Schema Language is defined in the CSV Schema Language 1.1 specification (this supersedes the original CSV Schema Language 1.0 specification as of 25 January 2016). It is suggested that the extension .csvs be used for CSV Schema Language files. There is also a working draft of CSV Schema Language 1.2, with a few new features.
For details of the CSV Validator tool and API see https://github.com/digital-preservation/csv-validator.
To understand how to write CSV Schemas in practice, see the example CSV Schema file, digitised_surrogate_tech_acq_metadata_v1.1_TESTBATCH000.csvs, in the GitHub repository digital-preservation/csv-schema/example-schemas. In the example-data subfolder you will find a CSV file, digitised_surrogate_tech_acq_metadata_v1_TESTBATCH000.csv, which complies with the schema. This CSV file refers to XML files in the folder structure below TEST_1