This document represents the specification of the CSV Schema 1.0 language as defined by The National Archives. It is unclear yet whether this document will be submitted to a formal standards body such as the W3C.
CSV (Comma Separated Value) data comes in many shapes and sizes. Apart from [[RFC4180]] which is a fairly recent development (and often ignored), there is a lack of formal definition as to CSV data formats, although in many ways this is one of the strengths of the CSV data format. However, extracting structured information from CSV data for further processing or storage can prove difficult if the CSV data is not well understood or perhaps not even uniform. CSV Schema defines a textual language which can be used to define the data structure, types and rules for CSV data formats.

Introduction

The intention of this document is two-fold:

  1. To be informative to users who are writing CSV Schemas, and provide a reference to the available syntax and functions.
  2. To provide enough detail such that anyone with sufficient technical expertise should be able to implement a CSV Schema parser and/or CSV validator evaluating the rules defined in a CSV Schema.

Background

The National Archives DRI (Digital Repository Infrastructure) system archives digitised and born-digital materials provided by OGDs (Other Government Departments) and occasionally NGOs (Non-Governmental Organisations). For the purposes of Digital Preservation the system processes and archives large amounts of metadata, much of this metadata is created by the supplying organisation or by transcription. The metadata is further processed, and ultimately stored both online in an RDF Triplestore and a majority subset archived in a non-RDF XML format. However it was recognised that the creation of XML or RDF metadata by the supplier was most likely unrealistic for either technical or financial reasons. As such, CSV was recognised as a simple data format that is human readable (to a degree), that almost anyone could create simply; CSV is the lowest common denominator of structured data formats.

The National Archives have strict rules about various CSV file formats that they expect, and how the data in those file formats should be set out. To ensure the quality of their archival metadata it was recognised that CSV files would have to be validated. It was recognised that development of a schema language for CSV (and associated tools) would be of great benefit. It was also further recognised that a general CSV Schema language would be of greater benefit if it was made publicly available and invited collaboration from other organisations and individuals; the problem of CSV data formats is certainly not unique to The National Archives.

CSV Schema is a standard currently guided by The National Archives, but developed in an open source collaborative manner that invites collaboration and contributions from all interested parties.

A reference implementation has been created to prove the standard: The open source CSV Validator application and API, offers both CSV Schema parsing and CSV file validation.

Guiding Principles

The design of the CSV Schema language has been influenced by a few guiding principles, understanding these will help you to understand how and why it is structured the way that it is.

Basics

A CSV Schema is really a rules based language which defines how data in each cell should be formatted. Rules are expressed per-column of the CSV data. Rules are evaluated for each row in the CSV data. A column rule may express constraints based on the content of other columns in the same row, however at present there is no scope for looking forward or backward through rows directly. However, it is possible to check that a cell entry is unique within that column in the CSV file (or that the value of a combination of cells is unique)

A CSV Schema is made up of two main parts:

  1. Prolog

    In turn this comprises (at most) two sections (the second being OPTIONAL):

    1. Version Declaration

      The CSV Schema MUST explicitly state (as its first non-comment line) the version of the CSV Schema language that it uses. This is to allow for future evolution of the CSV Schema language to be easily handled by CSV Schema processors.

    2. Global Directives

      Global Directives apply to all processing of the CSV data. Global Directives for example allow you to define the separator sequence between columns in the CSV data. Global Directives appear before Column Rules and are prefixed with an @ character.

      The use of Global Directives is OPTIONAL, default values are used if they are not specified.

  2. Body

    The Body of the CSV Schema MUST declare, in order, a Column Rule for each Column in the CSV data. If validation of a Column is not desirable, then an empty rule is used.

Let's now illustrate a simple CSV Schema that is concerned with CSV data about names, ages and gender:

version 1.0
@totalColumns 3
name: notEmpty
age: range(0, 120)
gender: is("m") or is("f") or is("t") or is("n") 
            
This CSV Schema basically defines that the CSV data must have 3 columns: the first column, name, must have some sort of value; the second column, age, must be a number between 0 and 120 inclusive; and the third column, gender, must be one of the characters m, f, t or n. An example of CSV data that would match the rule definitions in the CSV schema could be as follows:
name,age,gender
james,21,m
lauren,19,f
simon,57,m
        

An example of CSV data would fail the rule definitions in the CSV schema could be as follows:

name,age,gender
james,4 years,m
lauren,19,f
simon,57,male
        

The Invalid CSV Data example above fails when validated against the CSV Schema because: 1) at row 2 column 2, "4 years" is not a number between 1 and 120 inclusive, and 2) at row 4 column 3, "male" is not one of the characters m, f, t, or n.

Schema structure

The CSV schema language is formally a context-free grammar expressed in Extended Backhaus-Naur Form (EBNF - see also [[RFC5234]])

The following subsections examine the structure of a CSV Schema in more detail. Each subsection comprises definitions of terms, cross-references to other definitions, the relevant portion of the EBNF (links on the lefthandside go to the appendix containing the full EBNF, those on the right to a fuller explanation of those term(s)), and examples of correct usage.

A CSV schema MUST comprise both Prolog and Body.

[1] Schema ::= Prolog Body

Prolog

The Prolog of a CSV Schema MUST contain the Version Declaration and MAY contain one or more Global Directives.

[2] Prolog ::= VersionDecl GlobalDirectives

Version Declaration

The Version Declaration declares explicitly which version of the CSV Schema language is in use. At present this MUST be fixed to version 1.0. If the version is not valid this is considered a Schema Error. The Version Declaration is MANDATORY.

[3] VersionDecl ::= "version 1.0"

Example Version Declaration

		version 1.0
						

Global Directives

The Global Directives allow you to modify the overall processing of a CSV file or how subsequent Column Definitions are evaluated. The use of Global Directives within a CSV Schema is OPTIONAL. The last two Global Directives described (No Header Directive and Ignore Column Name Case Directive) are mutually exclusive, they MUST NOT both be used in a single schema. There is no inherent reason why the Global Directives should be in the order shown, EBNF does not directly cater for unordered lists. You could explicitly list each possible ordering, but that would require 4!=24 orderings to be included in the ENBF. All directives (both Global Directives and Column Directives) used in the CSV Schema are indicated by the Directive Prefix, defined as the character @ i.e. the [[UTF-8]] character code 0x40.

Whitespace is not generally significant, so Global Directives can be entered all on a single line, or each on separate lines (see ).

[4] GlobalDirectives ::= SeparatorDirective? QuotedDirective? TotalColumnsDirective? PermitEmptyDirective? (NoHeaderDirective | IgnoreColumnNameCaseDirective)? /* xgc:unordered */
[5] DirectivePrefix ::= "@"

Separator Directive

The Separator Directive allows you to specify the separator character that is used between columns in the CSV data. As with all Global Directives the Separator Directive is OPTIONAL, if not supplied the default value is assumed. By default the separator is a comma (,) i.e. the [[UTF-8]] character code 0x2c (as specified in [[RFC4180]]).

The Separator Directive takes a MANDATORY parameter in the form of either a Separator Tab Expression or a Separator Character.

A Separator Tab Expression indicates that the separator comprises a tab character, i.e. [[UTF-8]] character code 0x09.

A Separator Character is a Character Literal: the character which is to be treated as the column separator.

[6] SeparatorDirective ::= DirectivePrefix "separator" (SeparatorTabExpr | SeparatorChar)
[7] SeparatorTabExpr ::= "TAB" | '\t'
[8] SeparatorChar ::= CharacterLiteral

Quoted Directive

The Quoted Directive allows you to specify whether or not all columns are quoted. That is to say that their values are encased in quotation mark characters, i.e. [[UTF-8]] character code 0x22.

[9] QuotedDirective ::= DirectivePrefix "quoted"

Total Columns Directive

The Total Columns Directive allows you to specify the total number of data columns expected to make up each row of the CSV file. The parser will also verify that the Body of the CSV Schema contains the same number of Column Rules, a mismatch is considered a Schema Error. The Total Columns Directive is OPTIONAL: when this directive is not used this verification of the number of Column Rules cannot be performed, and it will be assumed that you have supplied the correct number of Column Rules.

[10] TotalColumnsDirective ::= DirectivePrefix "totalColumns" PositiveNonZeroIntegerLiteral

Permit Empty Directive

The Permit Empty Directive allows you to specify that the CSV file can be empty: i.e. there is no row data. The Permit Empty Directive is OPTIONAL: when not present an empty file will cause a validation error. The Permit Empty Directive can be used in conjunction with the No Header Directive thereby permitting a completely empty CSV file. If the No Header Directive is not present then a minimum of one row containing column names must be provided. This directive is NOT available in Schema Version 1.0.

[11] PermitEmptyDirective ::= DirectivePrefix "permitEmpty"

No Header Directive

The No Header Directive is used to indicate that the CSV file to be validated does not contain a header row: i.e. all rows are data rows. The No Header Directive is OPTIONAL: when this directive is not used the parser assumes by default that the first row of the CSV file to be validated contains column names, not data, and so the first row is skipped during validation.

The No Header Directive is mutually exclusive to the use of the Ignore Column Name Case Directive, when one is used, the other MUST NOT be.

[12] NoHeaderDirective ::= DirectivePrefix "noHeader"

Ignore Column Name Case Directive

The Ignore Column Name Case Directive is intended to tell the parser to ignore mismatches in case between the Column Identifiers supplied in a CSV file to be validated and those used in giving the Column Definitions in the schema.

The Ignore Column Name Case Directive is mutually exclusive to the use of the No Header Directive, when one is used, the other MUST NOT be.

[13] IgnoreColumnNameCaseDirective ::= DirectivePrefix "ignoreColumnNameCase"

Example Global Directives

	@separator ';' @quoted @totalColumns 21 @noHeader
					
	@separator TAB
	@quoted
	@totalColumns 21
	@permitEmpty
	@ignoreColumnNameCase
					

Body

The Body of a CSV Schema comprises at least one Body Part, each of which is a combination of OPTIONAL Comments with a Column Definition (in either order). A Column Definition MUST be included.

[14] Body ::= BodyPart+
[15] BodyPart ::= Comment* ColumnDefinition Comment*

Comments

There are two types of Comment: either Single Line or Multiple Line.

A Single Line Comment is started with two forward slashes (//), i.e. the [[UTF-8]] character codes 0x2F 0x2F. It is terminated by any [[UTF-8]] character that creates a line-break.

A Multiple Line Comment is started using the combination of a forward slash (/) and an asterisk (*), i.e. the [[UTF-8]] character codes 0x2F 0x2A. It is terminated by the reverse combination, asterisk (*) forward slash (/), i.e. the [[UTF-8]] character codes 0x2A 0x2F. Any [[UTF-8]] character except asterisk may be used between these comment markers, even if it forces a new line. Comments do not need to start at the beginning of a line, but must be either before or after a complete Column Definition or another Comment.

[16] Comment ::= SingleLineComment | MultiLineComment
[17] SingleLineComment ::= //[\S\t ]* /* xgc:regular-expression */
[18] MultiLineComment ::= \/\*(?:[^*\r\n]+|(?:\r?\n))*\*\/ /* xgc:regular-expression */

Example Comments

	//This Comment is a Single Line Comment it terminates at this line break
	/*This Comment is a Multi Line Comment:
	
	
	it
		
	can
	
	go
	
	on
	
	for as many lines as you like, until you type*/
					

Column Definitions

Column Definitions comprise a Column Identifier or Quoted Column Identifier followed by a colon (:), i.e. the [[UTF-8]] character code 0x3A, followed by a Column Rule. There MUST be a Column Definition for every column in the CSV that will be validated against the Schema, however the Column Rule can be left empty if no validation is needed for a specific column.

[19] ColumnDefinition ::= (ColumnIdentifier | QuotedColumnIdentifier) ":" ColumnRule

Column Identifiers

There are two classes of identifier that can be used for columns, the original simple Column Identifier, and the Quoted Column Identifier.

A Column Identifier is either a Positive Non Zero Integer Literal (most commonly used when the CSV file to be validated has no header row - see No Header Directive) which indicates the offset of the column (starting from 1), or an Ident.

The Quoted Column Identifier allows a greater range of characters to be used in naming the column than can be supported by an Ident, but the identifier must be wrapped in quotation marks ("), i.e. the [[UTF-8]] character code 0x22.

Identifiers MUST be unique within a single Schema.

[20] ColumnIdentifier ::= PositiveNonZeroIntegerLiteral | Ident
[21] QuotedColumnIdentifier ::= StringLiteral

Column Rules

A Column Rule is a combination of any number of Column Validation Expressions, along with OPTIONAL Column Directives. You MAY use an empty Column Rule if there is no requirement for an individual column to be validated.

As Column Validation Expressions are the primary means of applying validation, they are described in their own full section of this document. The range and variety of expressions available make supplying comprehensive examples here impractical, though some will be used to show the basic structure of a Column Rule.

White space is not generally important within a Column Rule, but the whole rule must be on a single line.

[22] ColumnRule ::= ColumnValidationExpr* ColumnDirectives
Column Directives

There are four OPTIONAL Column Directives that are used to modify aspects of how the Column Rules are evaluated. Like Global Directives, Column Directives are indicated by the Directive Prefix, defined as the character @ i.e. the [[UTF-8]] character code 0x40.

The Column Directives are the Optional Directive, the Match Is False Directive, the Ignore Case Directive, and the Warning Directive. The column directives may be specified in any order (though there is no straightforward way to express this in EBNF without listing every possible order).

[23] ColumnDirectives ::= OptionalDirective? MatchIsFalseDirective? IgnoreCaseDirective? WarningDirective?    /* xgc:unordered */ /* xgc:unordered */
Optional Directive

The Optional Directive is used when completion of the data field in the original CSV file to be validated is OPTIONAL. When this directive is used the data in the column is considered valid if the Column Rule evaluates to true, or if the column is empty.

[24] OptionalDirective ::= DirectivePrefix "optional"
Match Is False Directive

The Match Is False Directive is used to flip the result of a test from negative to positive (or vice-versa). It may be very simple to write a condition which matches the data considered to be invalid, while the equivalent for valid data would be very convoluted.

[25] MatchIsFalseDirective ::= DirectivePrefix "matchIsFalse"
Ignore Case Directive

The Ignore Case Directive is used when the case of a column value is not important. Two strings which differ only in the case used for characters within the string would be considered a match for all string related column rules.

[26] IgnoreCaseDirective ::= DirectivePrefix "ignoreCase"
Warning Directive

The Warning Directive is used to convert a Validation Error into a Validation Warning. This is useful if you wish to be alerted to a data condition which is unusual, but not necessarily invalid. For instance, at The National Archives we have come across archival material where the clerk who originally completed a form wrote down an "impossible" date such as 30 February or 31 April. We have to do our best to accept the data as originally supplied (we have no idea if it is the day or month of the date which is actually incorrect), but we also wish to ensure that additional Quality Assurance checking is performed to ensure this is not a transcription error. Warnings are listed in the validation report, but the data file is still considered valid if only warnings are present.

[27] WarningDirective ::= DirectivePrefix "warning"

Column Definitions examples

	a_column_title:			is("somedata") or is("otherdata")        	@optional @matchIsFalse @ignoreCase @warning
	another_column_title:	not("somedata") and not("otherdata")	@ignoreCase
					

The two Column Definitions are both validating the data in their respective columns against the explicit strings somedata and otherdata. Ignoring the Column Directives for the moment, the column rule defined for a_column_title would return true if the CSV data for that column contained either the precise string somedata or otherdata. However, the Optional Directive means a completely empty column would also be acceptable. Also, since the Ignore Case Directive is also applied, the strings SomeData or OTHERDATA (for example) would also be acceptable. But, since the Match Is False Directive is in effect, the validation is inverted, it would actually be any string other than the two specified which would be regarded as acceptable data. Since the Warning Directive is also used, a validation failure would not be considered an error though.

The second Column Definition (with the effect of the Match Is False Directive on the first taken into account) is actually logically equivalent to the first (see De Morgan's Laws). However, since the Optional Directive has not been used, an empty column would not be valid data, and since the Warning Directive has also not been included, a Validation Warning would be raised instead of a Validation Error.

Column Validation Expressions

The key building blocks for Column Rules are Column Validation Expressions. These are divided into two main classes, Non Conditional Expressions and Conditional Expressions. Non Conditional Expressions boil down to checks resulting in a pass or fail (a number of expressions may be combined to produce an overall validation check), Conditional Expressions allow for more subtle checks, if for example you are validating a title column which allows the values Mr, Mrs, Ms, Miss and Dr, You could construct a Conditional Expression which also checks the sex column and if that contains female, then Mr would be regarded as invalid (strictly speaking that would also require the use of an Explicit Context Expression to refer to the other column, but that is a subexpression of the Non Conditional Expression class).

NOTE To increase control over expression applicability and to avoiding creating a left-recursive grammar (which could lead to problems for various parser implementations), Column Validation Expressions have been further split into Combinatorial Expressions and Non Combinatorial Expressions.

[28] ColumnValidationExpr ::= CombinatorialExpr | NonCombinatorialExpr

Combinatorial Expressions

A Combinatorial Expression combines one or more Column Validation Expressions, which allows more complicated tests on the validity of data in a column. There are two types, Or Expressions and And Expressions. They are of equal precedence, and evaluation of Column Validation Expressions is performed from left-to-right.

[29] CombinatorialExpr ::= OrExpr | AndExpr

Or Expressions

An Or Expression is used as a standard boolean operator to indicate that the column data should be treated as being valid if either (or both) the expressions linked by the Or Expression evaluate to true.

[30] OrExpr ::= NonCombinatorialExpr "or" ColumnValidationExpr

And Expressions

An And Expression is used as a standard boolean operator to indicate that the column data should be treated as being valid when both the expressions linked by the And Expression evaluate to true. Use of an explicit And Expression is OPTIONAL: if two Column Validation Expressions are written in succession for the same column they will be treated as having an implicit And Expression joining them.

[31] AndExpr ::= NonCombinatorialExpr "and" ColumnValidationExpr

Non Combinatorial Expressions

A Non Combinatorial Expression is a Column Validation Expression which is evaluated by itself, unless it is combined with another through a Combinatorial Expression. The majority of Column Validation Expressions are of the Non-Combinatorial Expression class.

[32] NonCombinatorialExpr ::= NonConditionalExpr | ConditionalExpr

Non Conditional Expressions

Non Conditional Expressions are divided into three classes of sub-expressions: Single Expressions, External Single Expressions, and Parenthesized Expressions. The first two are individual validation checks (differing in that the second allows access to some resource outside the CSV file being validated), whilst the last provides a mechanism for controlling the evaluation order of complex compound expressions.

[33] NonConditionalExpr ::= SingleExpr | ExternalSingleExpr | ParenthesizedExpr

Single Expressions

Single Expressions are the basic building blocks of Column Rules. There are currently 22 available for use (and some have their own subexpressions used as parameters), although the first is really used as an OPTIONAL modifier to the rest. In many cases values can be provided to the test either as an explicit string (or number where appropriate), or by reference to the value held by another column.

[34] SingleExpr ::= ExplicitContextExpr? (IsExpr | NotExpr | InExpr | StartsWithExpr | EndsWithExpr | RegExpExpr | RangeExpr | LengthExpr | EmptyExpr | NotEmptyExpr | UniqueExpr | UriExpr | XsdDateTimeExpr | XsdDateExpr | XsdTimeExpr | UkDateExpr | DateExpr | PartialUkDateExpr | PartialDateExpr Uuid4Expr | PositiveIntegerExpr)

Explicit Context Expressions

The Explicit Context Expression is used to indicate that the expression following should be tested against the value in a foreign column (explicit context), rather than the current column (which is the default context).

[35] ExplicitContextExpr ::= ColumnRef "/"

Is Expressions

An Is Expression checks that the value of the column is identical to the supplied string or the value in the referenced column.

[36] IsExpr ::= "is(" StringProvider ")"

Not Expressions

A Not Expression checks that the value of the column is not equal to the supplied string or the value in the referenced column.

[37] NotExpr ::= "not(" StringProvider ")"

In Expressions

An In Expression checks that the value of the column contains the supplied string or the value in the referenced column (i.e. the column value is a super string of the supplied value).

[38] InExpr ::= "in(" StringProvider ")"

Starts With Expressions

A Starts With Expression checks that the value of the column starts with the supplied string or the value in the referenced column.

[39] StartsWithExpr ::= "starts(" StringProvider ")"

Ends With Expressions

An Ends With Expression checks that the value of the column ends with the supplied string or the value in the referenced column.

[40] EndsWithExpr ::= "ends(" StringProvider ")"

Regular Expression Expressions

A Regular Expression Expression checks the value of the column against the supplied Regular Expression.

Whilst obviously many of the the other Column Validation Expressions could be written as Regular Expressions, it is felt that that would make the resulting Schema harder to read, and much harder to write for less technical users. As such it is recommended, that if a Column Rule can be written without regular expressions, by instead using other Column Validation Expressions, then that approach should be attempted first.

The Regular Expression syntax used is that from Java's Pattern class.

[41] RegExpExpr ::= "regex(" StringLiteral ")"

Range Expressions

A Range Expression checks that the value of the column is a number lying between, or equal to, the supplied upper and lower bounds. The bounding values are defined as Numeric Literals which accept real numbers expressed as decimals.

[42] RangeExpr ::= "range(" NumericLiteral "," NumericLiteral ")"

Length Expressions

A Length Expression checks that the number of characters in the column meets the supplied definition. You can define the length in one of four ways:

  1. Precisely n characters long. For example length(10) ensures that the length is exactly 10 characters.
  2. At least n characters long. For example length(10, *) ensures that the length is at least 10 or more characters
  3. At most n characters long. For example length(*, 10) ensures that the length is at most 10 characters.
  4. Between m and n characters long (inclusive). For example length(4, 7) ensures that the length is at least 4 characters and at most 7 characters.

In order to do this the expression takes two input parameters, the first is defined as OPTIONAL, the second MUST be supplied. Both take the form of a Positive Integer Or Any expression, which is either a Positive Integer Literal (which actually includes zero), or a Wildcard Literal.

[43] LengthExpr ::= "length(" (PositiveIntegerOrAny ",")? PositiveIntegerOrAny ")"
[44] PositiveIntegerOrAny ::= PositiveIntegerLiteral | WildcardLiteral

Empty Expressions

An Empty Expression checks that the column has no content.

[45] EmptyExpr ::= "empty"

Not Empty Expression

A Not Empty Expression checks that the column has some content, though precisely what does not matter.

[46] NotEmptyExpr ::= "notEmpty"

Unique Expressions

A Unique Expression checks that the column value is unique within the CSV file being validated (within the current column, the value may occur elsewhere in the file in another column, as in a primary-foreign key relationship in a database). You can also specify a comma separated list of Column References in which case the combination of values of those columns (for the current row) must be unique within the whole CSV file.

[47] UniqueExpr ::= "unique" ("(" ColumnRef ("," ColumnRef)* ")")?

URI Expressions

A URI Expression checks that the value in the column is a valid URI as defined in [[!RFC3986]].

[48] UriExpr ::= "uri"

XSD Date Time Expressions

An XSD Date Time Expression checks that the data in the column is expressed as a valid XML Schema dateTime data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]). You can also provide an OPTIONAL from and to date-times (inclusive) to ensure that the value in the column falls within an expected date-time range.

[49] XsdDateTimeExpr ::= "xDateTime" ("(" XsdDateTimeLiteral "," XsdDateTimeLiteral ")")?

XSD Date Expressions

An XSD Date Expression checks that the data in the column is expressed as a valid XML Schema date data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]). You can also provide OPTIONAL from and to dates (inclusive) to ensure that the value in the column falls within an expected date range.

[50] XsdDateExpr ::= "xDate" ("(" XsdDateLiteral "," XsdDateLiteral ")")?

XSD Time Expressions

An XSD Time Expression checks that the data in the column is expressed as a valid XML Schema time data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]). You can also provide OPTIONAL from and to times (inclusive) to ensure that the value in the column falls within an expected time range.

[51] XsdTimeExpr ::= "xTime" ("(" XsdTimeLiteral "," XsdTimeLiteral ")")?

UK Date Expressions

A UK Date Expression checks that the data in the column is expressed as a valid UK-style date: dd/mm/yyyy, a UK Date Literal. You can also provide OPTIONAL from and to dates (inclusive) to ensure that the value in the column falls within an expected date range.

[52] UkDateExpr ::= "ukDate" ("(" UkDateLiteral "," UkDateLiteral ")")?

Date Expression

A Date Expression allows a full date to be constructed from several columns, or strings. The expression takes five arguments: you MUST supply three strings or Column References representing Year, Month and Day (although supplied as strings, these values must in fact be integers); and there are two OPTIONAL parameters to ensure the date falls in a range specified by from XSD Date Expression and to XSD Date Expression.

[53] DateExpr ::= "date(" StringProvider "," StringProvider "," StringProvider ("," XsdDateLiteral "," XsdDateLiteral)? ")"

Partial UK Date Expression

A Partial UK Date Expression is essentially the same as a UK Date Expression, but allows for difficulties in transcribing from original archival material by accepting a question mark (?), i.e. the [[UTF-8]] character code 0x3F in place of illegible digits in any position, or an asterisk (*), i.e. the [[UTF-8]] character code 0x2A in place of a missing value. As dates may not be complete, it is impossible to determine reliably if a date falls within a given range, so there is no option to supply one.

[54] PartialUkDateExpr ::= "partUkDate"

Partial Date Expression

A Partial Date Expression combines elements of Partial UK Date Expression with those of Date Expression, with the date being made up of columns or strings as in Date Expression, but also allowing the characters representing uncertainty as in Partial UK Date Expression. However, the constituent parts of the date MUST be supplied as Year, Month, Day.

[55] PartialDateExpr ::= "partDate(" StringProvider "," StringProvider "," StringProvider ")"

UUID4 Expression

A UUID4 Expression checks that the data in the column is in the form of a Version 4 UUID (Universally Unique Identifier), see [[!RFC4122]]. UUIDs MUST use lowercase hex values.

[56] Uuid4Expr ::= "uuid4"

Positive Integer Expression

A Positive Integer Expression checks that the column contains an integer value, greater than or equal to zero.

[57] PositiveIntegerExpr ::= "positiveInteger"

External Single Expressions

An External Single Expression allows access to resources outside the CSV file being validated in order to verify some information contained within the file. For example, to check that an image file referenced from within a CSV file actually exists, or that a supplied checksum matches the value calculated for a file. The available expressions are the File Exists Expression, Checksum Expression and File Count Expression. Each may be prefixed with an Explicit Context Expression in order to refer to data in a different column.

[59] ExternalSingleExpr ::= ExplicitContextExpr? (FileExistsExpr | ChecksumExpr | FileCountExpr)

File Exists Expressions

A File Exists Expression checks a filesystem to see if the specified file actually exists at the specified path. It takes an OPTIONAL expression in the form of a String Provider which allows you to supply a string (or reference to a string) with a full filepath to prepend to the contents of the current column (in the case that for example it only contains just the name of the file).

[60] FileExistsExpr ::= "fileExists" ("(" StringProvider ")")?

Checksum Expressions

A Checksum Expression allows the verification of a checksum value supplied in a CSV file by calculating the checksum for a specific file, and comparing it against the supplied value. You can also indicate the type of checksum algorithm to be used. You MUST supply both the file location and a checksum algorithm. File location is given in the form of a File Expression.

The EBNF does not specify valid values for the String Literal representing the checksum algorithm, that is instead implementation defined, however it is strongly recommended that implementations SHOULD at least support: MD5, SHA-1 and SHA-256. It is important to note that the checksum value MUST use lowercase hexadecimal characters only.

[61] ChecksumExpr ::= "checksum(" FileExpr "," StringLiteral ")"

File Count Expressions

A File Count Expression allows a column representing the number of files in a particular folder to be verified against the actual files on disk. You MUST provide a File Expression as an input parameter which points to a filesystem folder to compare the count of files against.

[63] FileCountExpr ::= "fileCount(" FileExpr ")"

File related sub-expressions

The sub-expression used in conjunction with External Single Expressions in order to provide input is a generic File Expression which itself takes two parameters. The first parameter is OPTIONAL, a String Provider that is prepended to the second parameter to create a full path in the event that a column holds only a filename rather than a full filepath. You MUST supply the second parameter which is a String Provider, which resolves to the name of the file.

[62] FileExpr ::= "file(" (StringProvider ",")? StringProvider ")"

Input parameters used in Single Expressions and External Single Expressions

Many Single Expressions and External Single Expressions take a String Provider as an input. A String Provider takes the form of either a Column Reference or a String Literal. A Column Reference comprises a dollar sign ($), i.e. the [[UTF-8]] character code 0x24, followed by a Column Identifier or Quoted Column Identifier.

[58] StringProvider ::= ColumnRef | StringLiteral
[35] ColumnRef ::= "$" (ColumnIdentifier | QuotedColumnIdentifier)

Parenthesized Expressions

Parenthesized Expressions are used to vary the standard left-to-right evaluation order of evaluation of Combinatorial Expressions. Parenthesized Expressions can be nested, the deepest level will be evaluated first, working outwards. Equally nested Parenthesized Expressions revert to the standard left-to-right evaluation order.

[64] ParenthesizedExpr ::= "(" ColumnValidationExpr+ ")"

Conditional Expressions

A Conditional Expression is used to apply different Column Validation Expressions to a column on the basis of the result of the evaluation of some other Non Conditional Expression. This is particularly useful when the data expected in one column depends on the value of another column. There is only one form of Conditional Expression at present, the If Expression.

[65] ConditionalExpr ::= IfExpr

If Expressions

The If Expression is the only form of Conditional Expression. It takes three input parameters: the first two of these MUST be used, firstly a Combinatorial Expression or Non Conditional Expression; when that evaluates to true, the second parameter, one or more ColumnValidationExprs are applied; the third parameter is OPTIONAL, this is the else expression which is used when the first parameter evaluates to false.

[66] IfExpr ::= "if(" (CombinatorialExpr | NonConditionalExpr) "," ColumnValidationExpr+ ("," ColumnValidationExpr+)? ")"

Column Expression examples

	piece: is("1") and (in($file_path) and in($resource_uri))          /*The column "piece" must have the specific value 1
	                                                                   the value must also be part of the value of the columns "file_path" and "resource_uri"
	                                                                   explicit And Expression is used between each specified Column Expression*/
	item: range(1,540) unique($piece,$item)                            //this field must contain an integer between 1 and 540 inclusive.
	                                                                   the combination of piece and item must be unique within the file.
	file_uuid: uuid4 unique                                            /*must be a version 4 uuid, and the value must be unique within the file (uuids must be 
	                                                                   lower case). Here an implicit And Expression is used*/
	file_path: fileExists uri starts("file:///")                       /*fileExists checks that there is actually a file of the given name at the 
	                                                                   specified location on the file system which is assumed to be the value held in "file_path".  
	                                                                   We know the location should be in the form of a URI so a URI expression is used, 
	                                                                   and in particular this should be a file url, so we further specify that the data 
	                                                                   in the column must start "file:///" */
	file_checksum: checksum(file($file_path),"SHA-256")                /* Compare the value given in this field to the checksum calculated for the file
	                                                                   found at the location given in the "file_path" field.
	                                                                   Use the specified checksum algorithm (SHA-256)
	                                                                   (must use lowercase hex characters). */
	image_split: is("yes") or is("no")                                 //must be string: yes; or string: no (precisely - case as shown)
	image_split_other_uuid: if($image_split/is("yes"),uuid4,is(""))    //if "image_split" field is yes, must be a uuid4, else must be blank
	image_split_operator: if($image_split/is("yes"),length(1,12) and regex("^[0-9a-zA-Z]{1,12}$"),is(""))
	                                                                   /*If "image_split" field is the string: yes (precisely)
	                                                                   then field must be 12 characters long.  This is further restricted by regex statement
	                                                                   to being only alphanumeric characters (upper and lower case). */
	image_split_timestamp: if($image_split/is("yes"),xDateTime(2013-12-04T00:00:00+00:00,2014-03-04T23:59:59+00:00),is(""))
	                                                                   /*If "image_split" field is string: yes (precisely)
	                                                                   then timestamp for image split, compliant with XSD DateTime data type
	                                                                   and in range 4 December 2013 - 4 March 2014 (from the midnight starting 4 December, 
	                                                                   to last second of 4 March), else it must be blank (ie "image_split" is no).
				

Data types

Most Column Validation Expressions rely on a small number of underlying data types. Some of these are defined by means of a regular expression embedded in the EBNF as indicated by xgc:regular-expression. There are 11 data types, XSD Date Time Literal, XSD Date Literal, XSD Time Literal, UK Date Literal, Positive Non Zero Integer Literal, Positive Integer Literal, Numeric Literal, StringLiteral, Character Literal, Wildcard Literal and Identifier.

XSD Date and Time Types

XSD Date Time Literals

An XSD Date Time Literal MUST be a valid XML Schema dateTime data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]) of the basic form yyyy-mm-ddThh:mm:ss(.MMMMM) (year-month-day followed by time in hours, minutes, seconds and OPTIONAL fractions of a second). There are two OPTIONAL parts, a minus sign MAY be used for BC dates, and there MAY be a suffix indicating the applicable timezone as an offset from GMT/UTC. GMT itself may be indicated by a suffix Z for Zulu. It is represented in the EBNF by a regular expression defining precisely which characters are to be used.

[67] XsdDateTimeLiteral ::= XsdDateWithoutTimezoneComponent "T" XsdTimeLiteral

XSD Date Literals

An XSD Date Literal MUST be a valid XML Schema date data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]) of the basic form yyyy-mm-dd (year-month-day). There is one OPTIONAL part, a minus sign MAY be used for BC dates. It is represented in the EBNF by a regular expression defining precisely which characters are to be used. It is identical to the date part of XSD Date Time Literal.

[68] XsdDateLiteral ::= XsdDateWithoutTimezoneComponent XsdTimezoneComponent

XSD Time Literals

An XSD Time Literal MUST be a valid XML Schema date data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]) of the basic form hh:mm:ss(.MMMMM) (time in hours, minutes, seconds and OPTIONAL fractions of a second). There is one OPTIONAL part, there MAY be a suffix indicating the applicable timezone as an offset from GMT/UTC. GMT itself may be indicated by a suffix Z for Zulu. It is represented in the EBNF by a regular expression defining precisely which characters are to be used. It is identical to the time part of XSD Date Time Literal.

[69] XsdTimeLiteral ::= XsdTimeWithoutTimezoneComponent XsdTimezoneComponent

Common XSD Date and Time Components

The various XSD Date and Time data types from [[!XMLSCHEMA-2]] are made up from common reuseable components that are defined by regular expressions.

[70] XsdDateWithoutTimezoneComponent ::= -?[0-9]{4}-(((0(1|3|5|7|8)|1(0|2))-(0[1-9]|(1|2)[0-9]|3[0-1]))|((0(4|6|9)|11)-(0[1-9]|(1|2)[0-9]|30))|(02-(0[1-9]|(1|2)[0-9]))) /* xgc:regular-expression */
[71] XsdTimeWithoutTimezoneComponent ::= ([0-1][0-9]|2[0-4]):(0[0-9]|[1-5][0-9]):(0[0-9]|[1-5][0-9])(\.[0-9]{3})? /* xgc:regular-expression */
[72] XsdTimezoneComponent ::= ((\+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z)? /* xgc:regular-expression */

UK Date Literals

A UK Date Literal is a data type representing the usual UK format for writing dates, dd/mm/yyyy. It is represented in the EBNF by a regular expression defining precisely which characters are to be used.

[73] UkDateLiteral ::= (((0[1-9]|(1|2)[0-9]|3[0-1])\/(0(1|3|5|7|8)|1(0|2)))|((0[1-9]|(1|2)[0-9]|30)\/(0(4|6|9)|11))|((0[1-9]|(1|2)[0-9])\/02))\/[0-9]{4} /* xgc:regular-expression */

Positive Non Zero Integer Literals

A Positive Non Zero Integer Literal is a data type representing the positive integer natural numbers. It is represented in the EBNF by a regular expression defining precisely which characters are to be used. It is a specialisation of Positive Integer Literal.

[74] PositiveNonZeroIntegerLiteral ::= [1-9][0-9]* /* xgc:regular-expression */

Positive Integer Literals

A Positive Integer Literal is a data type representing the non-negative integer natural numbers. It is represented in the EBNF by a regular expression defining precisely which characters are to be used. It is a specialisation of Numeric Literal.

[75] PositiveIntegerLiteral ::= [0-9]+ /* xgc:regular-expression */

Numeric Literals

A Numeric Literal is a data type representing any real number expressed as an integer or decimal.

[76] NumericLiteral ::= -?[0-9]+(\.[0-9]+)? /* xgc:regular-expression */

String Literals

A String Literal is zero or more characters (excluding quotation mark) encased witin quotation marks, i.e. the [[UTF-8]] character code 0x22.

[77] StringLiteral ::= "\"" [^"]* "\""

Character Literals

A Character Literal is a single non-breaking character encased within apostrophes, i.e. the [[UTF-8]] character code 0x27.

[78] CharacterLiteral ::= "'" [^\r\n\f'] "'"

Wildcard Literals

A Wildcard Literal is a single asterisk (*), i.e. the [[UTF-8]] character code 0x2A.

[79] WildcardLiteral ::= "*"

Identifiers

An Identifier is the set of characters which can be used in an ordinary Column Identifier. Upper and lower case alphabetic characters (unaccented), along with digits 0-9, the hyphen-minus (-), low line (_) and full stop (.), i.e. the [[UTF-8]] character codes 0x2D, 0x5F and 0x2E.

[80] Ident ::= /* [A-Za-z0-9\-_\.]+ */

Errors and Warnings

An implementation MUST first check that the provided CSV Schema(s) are syntactically correct. If not, a Schema Error is produced, and no further validation SHOULD of the CSV Schema(s) or provided CSV files(s) should be undertaken. If the schema check is successful then an implementation MAY continue with further CSV Schema(s) and CSV file validation.

If an implementation performs validation of a CSV file against a CSV Schema, a report SHOULD be produced for each Column Validation Expression that fails validation; This is generally considered a Validation Error, unless the Warning Directive has been used reduce the severity of an error within a specific Column Rule to a Validation Warning.

Schema Errors

A Schema Error is caused by syntax errors in the definition of the CSV Schema. These might include for example: an incorrect Version Declaration, or a mismatch between the number specified for Total Columns Directive and the actual number of Column Definitions given in the Body of the Schema. Schema Errors would also be produced by mismatched Parenthesized Expressions, unrecognised Column Validation Expressions and Explicit Context Expressions which do not match an actual Column Identifier.

An implementation MUST report a Schema Error.

Validation Errors

If column data does not validate successfully against a Column Rule, an implementation SHOULD report a Validation Error. It is implementation defined whether a Validation Error terminates execution, or whether execution continues. If execution continues, any further errors SHOULD be reported.

NOTE The Warning Directive may be used within a Column Rule to specify that what would normally be a Validation Error should be treated only as a Validation Warning.

A Validation Warning can be used when you wish to highlight unexpected values that are encountered in the data, but for some reason they are not to be considered failures of validation. For example within archival documents the date may have been recorded as an impossible date, such as 30 February or 31 April. A transcriber has correctly entered the data as seen on the original document, but yet it is not a valid date. You may wish to highlight these cases for additional QA, but it should not be considered an error.

The text/csv-schema Media Type

This Appendix specifies the media type for CSV Schema Version 1.0. CSV Schema is a language for describing and validating CSV files, as specified in the main body of this document. This media type has being submitted to the IESG (Internet Engineering Steering Group) for review, approval, and registration with IANA (Internet Assigned Numbers Authority.)

The text/csv-schema media type, is intended to be used for transmitting schemas written in the CSV Schema language.

File Extensions

The suggested file extension for use when naming CSV Schema files is .csvs.

CSV Schema Grammar

EBNF

The grammar of CSV Schema uses the same simple EBNF (Extended Backus-Naur Form) notation as [[!XML10]] with the following minor differences.

To increase readability, the EBNF in the main body of this document omits some of these notational features. This appendix is the normative version of the EBNF.

Link conventions used in this appendix:

Schema ::= Prolog Body
Prolog ::= VersionDecl GlobalDirectives
VersionDecl ::= "version 1.0"
GlobalDirectives ::= SeparatorDirective? QuotedDirective? TotalColumnsDirective? (NoHeaderDirective | IgnoreColumnNameCaseDirective)? /* xgc:unordered */
DirectivePrefix ::= "@"
SeparatorDirective ::= DirectivePrefix "separator" (SeparatorTabExpr | SeparatorChar)
SeparatorTabExpr ::= "TAB" | '\t'
SeparatorChar ::= CharacterLiteral
QuotedDirective ::= DirectivePrefix "quoted"
TotalColumnsDirective ::= DirectivePrefix "totalColumns" PositiveNonZeroIntegerLiteral
NoHeaderDirective ::= DirectivePrefix "noHeader"
IgnoreColumnNameCaseDirective ::= DirectivePrefix "ignoreColumnNameCase"
Body ::= BodyPart+
BodyPart ::= Comment* ColumnDefinition Comment*
Comment ::= SingleLineComment | MultiLineComment
SingleLineComment ::= //[\S\t ]* /* xgc:regular-expression */
MultiLineComment ::= \/\*(?:[^*\r\n]+|(?:\r?\n))*\*\/ /* xgc:regular-expression */
ColumnDefinition ::= (ColumnIdentifier | QuotedColumnIdentifier) ":" ColumnRule
ColumnIdentifier ::= PositiveNonZeroIntegerLiteral | Ident
QuotedColumnIdentifier ::= StringLiteral
ColumnRule ::= ColumnValidationExpr* ColumnDirectives
ColumnDirectives ::= OptionalDirective? MatchIsFalseDirective? IgnoreCaseDirective? WarningDirective? /* xgc:unordered */
OptionalDirective ::= DirectivePrefix "optional"
MatchIsFalseDirective ::= DirectivePrefix "matchIsFalse"
IgnoreCaseDirective ::= DirectivePrefix "ignoreCase"
WarningDirective ::= DirectivePrefix "warningDirective"
ColumnValidationExpr ::= CombinatorialExpr | NonCombinatorialExpr
CombinatorialExpr ::= OrExpr | AndExpr
OrExpr ::= NonCombinatorialExpr "or" ColumnValidationExpr
AndExpr ::= NonCombinatorialExpr "and" ColumnValidationExpr
NonCombinatorialExpr ::= NonConditionalExpr | ConditionalExpr
NonConditionalExpr ::= SingleExpr | ExternalSingleExpr | ParenthesizedExpr
SingleExpr ::= ExplicitContextExpr? (IsExpr | NotExpr | InExpr | StartsWithExpr | EndsWithExpr | RegExpExpr | RangeExpr | LengthExpr | EmptyExpr | NotEmptyExpr | UniqueExpr | UriExpr | XsdDateTimeExpr | XsdDateExpr | XsdTimeExpr | UkDateExpr | DateExpr | PartialUkDateExpr | PartialDateExpr | Uuid4Expr | PositiveIntegerExpr)
ExplicitContextExpr ::= ColumnRef "/"
ColumnRef ::= "$" (ColumnIdentifier | QuotedColumnIdentifier)
IsExpr ::= "is(" StringProvider ")"
NotExpr ::= "not(" StringProvider ")"
InExpr ::= "in(" StringProvider ")"
StartsWithExpr ::= "starts(" StringProvider ")"
EndsWithExpr ::= "ends(" StringProvider ")"
RegExpExpr ::= "regex(" StringLiteral ")"
RangeExpr ::= "range(" NumericLiteral "," NumericLiteral ")" /* range is inclusive */
LengthExpr ::= "length(" (PositiveIntegerOrAny ",")? PositiveIntegerOrAny ")"
PositiveIntegerOrAny ::= PositiveIntegerLiteral | WildcardLiteral
EmptyExpr ::= "empty"
NotEmptyExpr ::= "notEmpty"
UniqueExpr ::= "unique" ("(" ColumnRef ("," ColumnRef)* ")")?
UriExpr ::= "uri"
XsdDateTimeExpr ::= "xDateTime" ("(" XsdDateTimeLiteral "," XsdDateTimeLiteral ")")?
XsdDateExpr ::= "xDate" ("(" XsdDateLiteral "," XsdDateLiteral ")")?
XsdTimeExpr ::= "xTime" ("(" XsdTimeLiteral "," XsdTimeLiteral ")")?
UkDateExpr ::= "ukDate" ("(" UkDateLiteral "," UkDateLiteral ")")?
DateExpr ::= "date(" StringProvider "," StringProvider "," StringProvider ("," XsdDateLiteral "," XsdDateLiteral)? ")"
PartialUkDateExpr ::= "partUkDate"
PartialDateExpr ::= "partDate(" StringProvider "," StringProvider "," StringProvider ")"
Uuid4Expr ::= "uuid4"
PositiveIntegerExpr ::= "positiveInteger"
StringProvider ::= ColumnRef | StringLiteral
ExternalSingleExpr ::= ExplicitContextExpr? (FileExistsExpr | ChecksumExpr | FileCountExpr)
FileExistsExpr ::= "fileExists" ("(" StringProvider ")")?
ChecksumExpr ::= "checksum(" FileExpr "," StringLiteral ")"
FileExpr ::= "file(" (StringProvider ",")? StringProvider ")"
FileCountExpr ::= "fileCount(" FileExpr ")"
ParenthesizedExpr ::= "(" ColumnValidationExpr+ ")"
ConditionalExpr ::= IfExpr
IfExpr ::= "if(" (CombinatorialExpr | NonConditionalExpr) "," ColumnValidationExpr+ ("," ColumnValidationExpr+)? ")"

Lexical (Terminal Symbols)

XsdDateTimeLiteral ::= XsdDateWithoutTimezoneComponent "T" XsdTimeLiteral
XsdDateLiteral ::= XsdDateWithoutTimezoneComponent XsdTimezoneComponent
XsdTimeLiteral ::= XsdTimeWithoutTimezoneComponent XsdTimezoneComponent
XsdDateWithoutTimezoneComponent ::= -?[0-9]{4}-(((0(1|3|5|7|8)|1(0|2))-(0[1-9]|(1|2)[0-9]|3[0-1]))|((0(4|6|9)|11)-(0[1-9]|(1|2)[0-9]|30))|(02-(0[1-9]|(1|2)[0-9]))) /* xgc:regular-expression */
XsdTimeWithoutTimezoneComponent ::= ([0-1][0-9]|2[0-4]):(0[0-9]|[1-5][0-9]):(0[0-9]|[1-5][0-9])(\.[0-9]{3})? /* xgc:regular-expression */
XsdTimezoneComponent ::= ((\+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z)? /* xgc:regular-expression */
UkDateLiteral ::= (((0[1-9]|(1|2)[0-9]|3[0-1])\/(0(1|3|5|7|8)|1(0|2)))|((0[1-9]|(1|2)[0-9]|30)\/(0(4|6|9)|11))|((0[1-9]|(1|2)[0-9])\/02))\/[0-9]{4} /* xgc:regular-expression */
PositiveNonZeroIntegerLiteral ::= [1-9][0-9]* /* xgc:regular-expression */ /* A Natural Number, positive integer */
PositiveIntegerLiteral ::= [0-9]+ /* xgc:regular-expression */ /* A Natural Number, non-negative integer */
NumericLiteral ::= -?[0-9]+(\.[0-9]+)? /* xgc:regular-expression */ /* A Real Number, expressed as an integer or decimal */
StringLiteral ::= "\"" [^"]* "\"" /* xgc:regular-expression */ /* Any characters except: quotation mark */
CharacterLiteral ::= "'" [^\r\n\f'] "'" /* xgc:regular-expression */ /* Any characters except: carriage-return, line-break, form-feed and apostrophe */
WildcardLiteral ::= "*"
Ident ::= [A-Za-z0-9\-_\.]+ /* xgc:regular-expression */

Extra-grammatical Constraints

xgc:regular-expression

The right-hand side of the expression is expressed using a Regular Expression. The Regular Expression syntax used is that from Java's Pattern class.

xgc:unordered

Implies that each distinct symbol in the expression may appear in any order.

Acknowledgements

Many thanks to: