This document represents the specification of the CSV Schema Language 1.1 as defined by The National Archives. It is unclear yet whether this document will be submitted to a formal standards body such as the W3C. This version supersedes the original CSV Schema Language 1.0 published on 28 August 2014.
CSV (Comma Separated Value) data comes in many shapes and sizes. Apart from [[RFC4180]] which is a fairly recent development (and often ignored), there is a lack of formal definition as to CSV data formats, although in many ways this is one of the strengths of the CSV data format. However, extracting structured information from CSV data for further processing or storage can prove difficult if the CSV data is not well understood or perhaps not even uniform. CSV Schema defines a textual language which can be used to define the data structure, types and rules for CSV data formats.

Introduction

The intention of this document is two-fold:

  1. To be informative to users who are writing CSV Schemas, and provide a reference to the available syntax and functions.
  2. To provide enough detail such that anyone with sufficient technical expertise should be able to implement a CSV Schema parser and/or CSV validator evaluating the rules defined in a CSV Schema.

Background

The National Archives DRI (Digital Repository Infrastructure) system archives digitised and born-digital materials provided by OGDs (Other Government Departments) and occasionally NGOs (Non-Governmental Organisations). For the purposes of Digital Preservation the system processes and archives large amounts of metadata, much of this metadata is created by the supplying organisation or by transcription. The metadata is further processed, and ultimately stored both online in an RDF Triplestore and a majority subset archived in a non-RDF XML format. However it was recognised that the creation of XML or RDF metadata by the supplier was most likely unrealistic for either technical or financial reasons. As such, CSV was recognised as a simple data format that is human readable (to a degree), that almost anyone could create simply; CSV is the lowest common denominator of structured data formats.

The National Archives have strict rules about various CSV file formats that they expect, and how the data in those file formats should be set out. To ensure the quality of their archival metadata it was recognised that CSV files would have to be validated. It was recognised that development of a schema language for CSV (and associated tools) would be of great benefit. It was also further recognised that a general CSV Schema language would be of greater benefit if it was made publicly available and invited collaboration from other organisations and individuals; the problem of CSV data formats is certainly not unique to The National Archives.

CSV Schema is a standard currently guided by The National Archives, but developed in an open source collaborative manner that invites collaboration and contributions from all interested parties.

A reference implementation has been created to prove the standard: The open source CSV Validator application and API, offers both CSV Schema parsing and CSV file validation.

Guiding Principles

The design of the CSV Schema language has been influenced by a few guiding principles, understanding these will help you to understand how and why it is structured the way that it is.

Basics

A CSV Schema is really a rules based language which defines how data in each cell should be formatted. Rules are expressed per-column of the CSV data. Rules are evaluated for each row in the CSV data. A column rule may express constraints based on the content of other columns in the same row, however at present there is no scope for looking forward or backward through rows directly. However, it is possible to check that a cell entry is unique within that column in the CSV file (or that the value of a combination of cells is unique)

A CSV Schema is made up of two main parts:

  1. Prolog

    In turn this comprises (at most) two sections (the second being OPTIONAL):

    1. Version Declaration

      The CSV Schema MUST explicitly state (as its first non-comment line) the version of the CSV Schema language that it uses. This is to allow for future evolution of the CSV Schema language to be easily handled by CSV Schema processors.

    2. Global Directives

      Global Directives apply to all processing of the CSV data. Global Directives for example allow you to define the separator sequence between columns in the CSV data. Global Directives appear before Column Rules and are prefixed with an @ character.

      The use of Global Directives is OPTIONAL, default values are used if they are not specified.

  2. Body

    The Body of the CSV Schema MUST declare, in order, a Column Rule for each Column in the CSV data. If validation of a Column is not desirable, then an empty rule is used.

Let's now illustrate a simple CSV Schema that is concerned with CSV data about names, ages and gender:

version 1.1
@totalColumns 3
name: notEmpty
age: range(0, 120)
gender: is("m") or is("f") or is("t") or is("n") 
            
This CSV Schema basically defines that the CSV data must have 3 columns: the first column, name, must have some sort of value; the second column, age, must be a number between 0 and 120 inclusive; and the third column, gender, must be one of the characters m, f, t or n. An example of CSV data that would match the rule definitions in the CSV schema could be as follows:
name,age,gender
james,21,m
lauren,19,f
simon,57,m
        

An example of CSV data would fail the rule definitions in the CSV schema could be as follows:

name,age,gender
james,4 years,m
lauren,19,f
simon,57,male
        

The Invalid CSV Data example above fails when validated against the CSV Schema because: 1) at row 2 column 2, "4 years" is not a number between 1 and 120 inclusive, and 2) at row 4 column 3, "male" is not one of the characters m, f, t, or n.

New in CSV Schema Language 1.1 - A brief introduction to the new features of CSV Schema Language 1.1

The last 18 months with CSV Schema Language 1.0 being in regular use at The National Archives has highlighted a few additional Column Validation Expressions that would provide further useful validation, simplify schema writing, or make schemas more readable. In addition the concept of a String Provider has been extended to allow concatenation to produce a final string input to expressions from some set of other String Providers, and also a function to allow removal of a Windows file extension to make certain comparisons more straightforward and robust.

The new Column Validation Expressions are the:

  1. Upper Case Expression, which asserts that all the characters in the column must be uppercase (according to the definitions in [[UTF-8]]).
  2. Lower Case Expression, which asserts that all the characters in the column must be lowercase (according to the definitions in [[UTF-8]]).
  3. Integrity Check Expression, this is effectively the converse of the File Exists Expression, checking that if there is a file present in the folders referred to in a CSV file, it has an explicit reference within that CSV file.
  4. XSD Date Time With Time Zone Expression, this expression adapts the existing XSD Date Time Expression to make the timezone portion mandatory.
  5. Identical Expression, this asserts that all values in a certain column must be identical, but does not specify the precise value within the schema. Within the CSV files received by The National Archives this is expected to be used in conjunction with a Regular Expression Expression to give the general form for a batchcode field for a project, each line in a CSV should have the same batchcode, but we do not want to update the schema for each batch received to state the exact batchcode.
  6. Any Expression, this is effectively a combination of the Is Expression and the Or Expression into one expression.
  7. Switch Expression, this allows a flatter expression of what would otherwise have to be expressed as a set of nested If Expressions. Such expressions can be hard to read and maintain due to the number of sets of brackets that can be involved.
In addition the Range Expression has been extended to allow the creation of ranges that have only an upper or lower bound using similar syntax to that employed by the Length Expression. This also required the addition of Numeric Or Any type (to allow ranges to have negative values as bounds), or the Wildcard Literal. In the Length Expression we used only Positive Integer Or Any since a negative value has no sensible meaning for the length of a field. There is also one new Global Directive, the Permit Empty Directive. This allows a file with no data rows to be treated as valid.

Schema structure

The CSV schema language is formally a context-free grammar expressed in Extended Backhaus-Naur Form (EBNF - see also [[RFC5234]])

The following subsections examine the structure of a CSV Schema in more detail. Each subsection comprises definitions of terms, cross-references to other definitions, the relevant portion of the EBNF (links on the lefthandside go to the appendix containing the full EBNF, those on the right to a fuller explanation of those term(s)), and examples of correct usage.

A Schema MUST comprise both Prolog and Body.

[1] Schema ::= Prolog Body

Prolog

The Prolog of a CSV Schema MUST contain the Version Declaration and MAY contain one or more Global Directives.

[2] Prolog ::= VersionDecl GlobalDirectives

Version Declaration

The Version Declaration declares explicitly which version of the CSV Schema language is in use. This MUST be either version 1.0 or version 1.1. If the version is not valid this is considered a Schema Error. If the version is declared as 1.0 but the CSV Schema attempts to use features of 1.1 this is also considered a Schema Error. The Version Declaration is MANDATORY.

[3] VersionDecl ::= ("version 1.0" | "version 1.1")

Example Version Declaration

		version 1.0
						

Global Directives

The Global Directives allow you to modify the overall processing of a CSV file or how subsequent Column Definitions are evaluated. The use of Global Directives within a CSV Schema is OPTIONAL. The last two Global Directives described (No Header Directive and Ignore Column Name Case Directive) are mutually exclusive, they MUST NOT both be used in a single schema. There is no inherent reason why the Global Directives should be in the order shown, EBNF does not directly cater for unordered lists. You could explicitly list each possible ordering, but that would require 4!=24 orderings to be included in the ENBF. All directives (both Global Directives and Column Directives) used in the CSV Schema are indicated by the Directive Prefix, defined as the character @ i.e. the [[UTF-8]] character code 0x40.

Whitespace is not generally significant, so Global Directives can be entered all on a single line, or each on separate lines (see ).

[4] GlobalDirectives ::= SeparatorDirective? QuotedDirective? TotalColumnsDirective? PermitEmptyDirective? (NoHeaderDirective | IgnoreColumnNameCaseDirective)? /* xgc:unordered */
[5] DirectivePrefix ::= "@"

Separator Directive

The Separator Directive allows you to specify the separator character that is used between columns in the CSV data. As with all Global Directives the Separator Directive is OPTIONAL, if not supplied the default value is assumed. By default the separator is a comma (,) i.e. the [[UTF-8]] character code 0x2c (as specified in [[RFC4180]]).

The Separator Directive takes a MANDATORY parameter in the form of either a Separator Tab Expression or a Separator Character.

A Separator Tab Expression indicates that the separator comprises a tab character, i.e. [[UTF-8]] character code 0x09.

A Separator Character is a Character Literal: the character which is to be treated as the column separator.

[6] SeparatorDirective ::= DirectivePrefix "separator" (SeparatorTabExpr | SeparatorChar)
[7] SeparatorTabExpr ::= "TAB" | '\t'
[8] SeparatorChar ::= CharacterLiteral

Quoted Directive

The Quoted Directive allows you to specify whether or not all columns are quoted. That is to say that their values are encased in quotation mark characters, i.e. [[UTF-8]] character code 0x22. In practice most CSV libraries are able to detect and handle the presence or absence of quotes and handle it appropriately, but implementations of this schema language should be able to decide how to handle this situation.

[9] QuotedDirective ::= DirectivePrefix "quoted"

Total Columns Directive

The Total Columns Directive allows you to specify the total number of data columns expected to make up each row of the CSV file. The parser will also verify that the Body of the CSV Schema contains the same number of Column Rules, a mismatch is considered a Schema Error. The Total Columns Directive is OPTIONAL: when this directive is not used this verification of the number of Column Rules cannot be performed, and it will be assumed that you have supplied the correct number of Column Rules.

[10] TotalColumnsDirective ::= DirectivePrefix "totalColumns" PositiveNonZeroIntegerLiteral

Permit Empty Directive

This is a new expression in CSV Schema Language 1.1

The Permit Empty Directive allows you to specify that the CSV file can be empty: i.e. there is no row data. The Permit Empty Directive is OPTIONAL: when not present an empty file will cause a validation error. The Permit Empty Directive can be used in conjunction with the No Header Directive thereby permitting a completely empty CSV file. If the No Header Directive is not present then a minimum of one row containing column names must be provided.

[11] PermitEmptyDirective ::= DirectivePrefix "permitEmpty"

No Header Directive

The No Header Directive is used to indicate that the CSV file to be validated does not contain a header row: i.e. all rows are data rows. The No Header Directive is OPTIONAL: when this directive is not used the parser assumes by default that the first row of the CSV file to be validated contains column names, not data, and so the first row is skipped during validation.

The No Header Directive is mutually exclusive to the use of the Ignore Column Name Case Directive, when one is used, the other MUST NOT be.

[12] NoHeaderDirective ::= DirectivePrefix "noHeader"

Ignore Column Name Case Directive

The Ignore Column Name Case Directive is intended to tell the parser to ignore mismatches in case between the Column Identifiers supplied in a CSV file to be validated and those used in giving the Column Definitions in the schema.

The Ignore Column Name Case Directive is mutually exclusive to the use of the No Header Directive, when one is used, the other MUST NOT be.

[13] IgnoreColumnNameCaseDirective ::= DirectivePrefix "ignoreColumnNameCase"

Example Global Directives

	@separator ';' @quoted @totalColumns 21 @noHeader
					
	@separator TAB
	@quoted
	@totalColumns 21
	@permitEmpty
	@ignoreColumnNameCase
					

Body

The Body of a CSV Schema comprises at least one Body Part, each of which is a combination of OPTIONAL Comments with a Column Definition (in either order). A Column Definition MUST be included.

[14] Body ::= BodyPart+
[15] BodyPart ::= Comment* ColumnDefinition Comment*

Comments

There are two types of Comment: either Single Line or Multiple Line.

A Single Line Comment is started with two forward slashes (//), i.e. the [[UTF-8]] character codes 0x2F 0x2F. It is terminated by any [[UTF-8]] character that creates a line-break.

A Multiple Line Comment is started using the combination of a forward slash (/) and an asterisk (*), i.e. the [[UTF-8]] character codes 0x2F 0x2A. It is terminated by the reverse combination, asterisk (*) forward slash (/), i.e. the [[UTF-8]] character codes 0x2A 0x2F. Any [[UTF-8]] character except asterisk may be used between these comment markers, even if it forces a new line. Comments do not need to start at the beginning of a line, but must be either before or after a complete Column Definition or another Comment.

[16] Comment ::= SingleLineComment | MultiLineComment
[17] SingleLineComment ::= //[\S\t ]* /* xgc:regular-expression */
[18] MultiLineComment ::= \/\*(?:[^*\r\n]+|(?:\r?\n))*\*\/ /* xgc:regular-expression */

Example Comments

	//This Comment is a Single Line Comment it terminates at this line break
	/*This Comment is a Multi Line Comment:
	
	
	it
		
	can
	
	go
	
	on
	
	for as many lines as you like, until you type*/
					

Column Definitions

Column Definitions comprise a Column Identifier or Quoted Column Identifier followed by a colon (:), i.e. the [[UTF-8]] character code 0x3A, followed by a Column Rule. There MUST be a Column Definition for every column in the CSV that will be validated against the Schema, however the Column Rule can be left empty if no validation is needed for a specific column.

[19] ColumnDefinition ::= (ColumnIdentifier | QuotedColumnIdentifier) ":" ColumnRule

Column Identifiers

There are two classes of identifier that can be used for columns, the original simple Column Identifier, and the Quoted Column Identifier.

A Column Identifier is either a Positive Non Zero Integer Literal (most commonly used when the CSV file to be validated has no header row - see No Header Directive) which indicates the offset of the column (starting from 1), or an Ident.

The Quoted Column Identifier allows a greater range of characters to be used in naming the column than can be supported by an Ident, but the identifier MUST be wrapped in quotation marks ("), i.e. the [[UTF-8]] character code 0x22 (this is implicit in its definition as a String Literal).

Identifiers MUST be unique within a single Schema.

[20] ColumnIdentifier ::= PositiveNonZeroIntegerLiteral | Ident
[21] QuotedColumnIdentifier ::= StringLiteral
Usage
	a_column_identifier
	"a quoted column identifier"
						

Column Rules

A Column Rule is a combination of any number of Column Validation Expressions, along with OPTIONAL Column Directives. You MAY use an empty Column Rule if there is no requirement for an individual column to be validated.

As Column Validation Expressions are the primary means of applying validation, they are described in their own full section of this document. The range and variety of expressions available make supplying comprehensive examples here impractical, though some will be used to show the basic structure of a Column Rule.

White space is not generally important within a Column Rule, but the whole rule must be on a single line.

[22] ColumnRule ::= ColumnValidationExpr* ColumnDirectives
Column Directives

There are four OPTIONAL Column Directives that are used to modify aspects of how the Column Rules are evaluated. Like Global Directives, Column Directives are indicated by the Directive Prefix, defined as the character @ i.e. the [[UTF-8]] character code 0x40.

The Column Directives are the Optional Directive, the Match Is False Directive, the Ignore Case Directive, and the Warning Directive. The column directives may be specified in any order (though there is no straightforward way to express this in EBNF without listing every possible order).

[23] ColumnDirectives ::= OptionalDirective? MatchIsFalseDirective? IgnoreCaseDirective? WarningDirective /* xgc:unordered */
Optional Directive

The Optional Directive is used when completion of the data field in the original CSV file to be validated is OPTIONAL. When this directive is used the data in the column is considered valid if the Column Rule evaluates to true, or if the column is empty.

[24] OptionalDirective ::= DirectivePrefix "optional"
Match Is False Directive

The Match Is False Directive is used to flip the result of a test from negative to positive (or vice-versa). It may be very simple to write a condition which matches the data considered to be invalid, while the equivalent for valid data would be very convoluted.

[25] MatchIsFalseDirective ::= DirectivePrefix "matchIsFalse"
Ignore Case Directive

The Ignore Case Directive is used when the case of a column value is not important. Two strings which differ only in the case used for characters within the string would be considered a match for all string related column rules.

[26] IgnoreCaseDirective ::= DirectivePrefix "ignoreCase"
Warning Directive

The Warning Directive is used to convert a Validation Error into a Validation Warning. This is useful if you wish to be alerted to a data condition which is unusual, but not necessarily invalid. For instance, at The National Archives we have come across archival material where the clerk who originally completed a form wrote down an "impossible" date such as 30 February or 31 April. We have to do our best to accept the data as originally supplied (we have no idea if it is the day or month of the date which is actually incorrect), but we also wish to ensure that additional Quality Assurance checking is performed to ensure this is not a transcription error. Warnings are listed in the validation report, but the data file MAY still considered valid if only warnings are present.

[27] WarningDirective ::= DirectivePrefix "warning"

Column Definitions examples

	a_column_title:			is("somedata") or is("otherdata")        	@optional @matchIsFalse @ignoreCase @warning
	another_column_title:		not("somedata") and not("otherdata")	@ignoreCase
					

The two Column Definitions are both validating the data in their respective columns against the explicit strings somedata and otherdata. Ignoring the Column Directives for the moment, the column rule defined for a_column_title would return true if the CSV data for that column contained either the precise string somedata or otherdata. However, the Optional Directive means a completely empty column would also be acceptable. Also, since the Ignore Case Directive is also applied, the strings SomeData or OTHERDATA (for example) would also be acceptable. But, since the Match Is False Directive is in effect, the validation is inverted, it would actually be any string other than the two specified which would be regarded as acceptable data. Since the Warning Directive is also used, a validation failure would not be considered an error though.

The second Column Definition (with the effect of the Match Is False Directive on the first taken into account) is actually logically equivalent to the first (see De Morgan's Laws). However, since the Optional Directive has not been used, an empty column would not be valid data, and since the Warning Directive has also not been included, a Validation Warning would be raised instead of a Validation Error.

Column Validation Expressions

The key building blocks for Column Rules are Column Validation Expressions. These are divided into two main classes, Non Conditional Expressions and Conditional Expressions. Non Conditional Expressions boil down to checks resulting in a pass or fail (a number of expressions may be combined to produce an overall validation check), Conditional Expressions allow for more subtle checks, if for example you are validating a title column which allows the values Mr, Mrs, Ms, Miss and Dr, You could construct a Conditional Expression which also checks the sex column and if that contains female, then Mr would be regarded as invalid (strictly speaking that would also require the use of an Explicit Context Expression to refer to the other column, but that is a subexpression of the Non Conditional Expression class).

NOTE To increase control over expression applicability and to avoid creating a left-recursive grammar (which could lead to problems for various parser implementations), Column Validation Expressions have been further split into Combinatorial Expressions and Non Combinatorial Expressions.

[28] ColumnValidationExpr ::= CombinatorialExpr | NonCombinatorialExpr

Combinatorial Expressions

A Combinatorial Expression combines one or more Column Validation Expressions, which allows more complicated tests on the validity of data in a column. There are two types, Or Expressions and And Expressions. They are of equal precedence, and evaluation of Column Validation Expressions is performed from left-to-right. See also the Any Expression which is logically equivalent to a series of Is Expressions joined by Or Expressions.

[29] CombinatorialExpr ::= OrExpr | AndExpr

Or Expressions

An Or Expression is used as a standard boolean operator to indicate that the column data should be treated as being valid if either (or both) the expressions linked by the Or Expression evaluate to true.

[30] OrExpr ::= NonCombinatorialExpr "or" ColumnValidationExpr

And Expressions

An And Expression is used as a standard boolean operator to indicate that the column data should be treated as being valid when both the expressions linked by the And Expression evaluate to true. Use of an explicit And Expression is OPTIONAL: if two Column Validation Expressions are written in succession for the same column they will be treated as having an implicit And Expression joining them.

[31] AndExpr ::= NonCombinatorialExpr "and" ColumnValidationExpr

Non Combinatorial Expressions

A Non Combinatorial Expression is a Column Validation Expression which is evaluated by itself, unless it is combined with another through a Combinatorial Expression. The majority of Column Validation Expressions are of the Non-Combinatorial Expression class.

[32] NonCombinatorialExpr ::= NonConditionalExpr | ConditionalExpr

Non Conditional Expressions

Non Conditional Expressions are divided into three classes of sub-expressions: Single Expressions, External Single Expressions, and Parenthesized Expressions. The first two are individual validation checks (differing in that the second allows access to some resource outside the CSV file being validated), whilst the last provides a mechanism for controlling the evaluation order of complex compound expressions.

[33] NonConditionalExpr ::= SingleExpr | ExternalSingleExpr | ParenthesizedExpr

Single Expressions

Single Expressions are the basic building blocks of Column Rules. There are currently 27 available for use as of CSV Schema Language 1.1 (and some have their own subexpressions used as parameters), although the first is really used as an OPTIONAL modifier to the rest. In many cases values can be provided to the test either as an explicit string (or number where appropriate), or by reference to the value held by another column.

[34] SingleExpr ::= ExplicitContextExpr? (IsExpr | AnyExpr | NotExpr | InExpr | StartsWithExpr | EndsWithExpr | RegExpExpr | RangeExpr | LengthExpr | EmptyExpr | NotEmptyExpr | UniqueExpr | UriExpr | XsdDateTimeExpr | XsdDateTimeWithTimeZoneExpr | XsdDateExpr | XsdTimeExpr | UkDateExpr | DateExpr | PartialUkDateExpr | PartialDateExpr | Uuid4Expr | PositiveIntegerExpr | UpperCaseExpr | LowerCaseExpr | IdenticalExpr)

Explicit Context Expressions

The Explicit Context Expression is used to indicate that the expression following should be tested against the value in a foreign column (explicit context), rather than the current column (which is the default context).

[35] ExplicitContextExpr ::= ColumnRef "/"
[36] ColumnRef ::= "$" (ColumnIdentifier | QuotedColumnIdentifier)
Usage
	a_column: is("some string") and $another_column/startsWith("some string") //here two tests are combined on a single line, the second test here looks to the second column 
	another_column:                                                           //to check it's value starts with "some string"
							

Is Expressions

An Is Expression checks that the value of the column is identical to the supplied string or the value in the referenced column.

[37] IsExpr ::= "is(" StringProvider ")"
Usage
	a_column: is("some string")    //the contents of a_column must be the string "some string"
	another_column: is($a_column)  //the contents of another_column must be the value held in a_column, treated as a string
							

Any Expressions

This is a new expression in CSV Schema Language 1.1

An Any Expression checks that the value of the column is identical to one of the supplied strings or the values in the referenced columns. This is logically equivalent to a series of Is Expressions joined by Or Expressions, but slightly more compact to write and maintain.

[38] AnyExpr ::= "any(" StringProvider ("," StringProvider)* ")"
Usage
	a_column: is("some string")
	another_column: any("some other string",$a_column,"another string") //any of the string values given here are valid, including referencing the string held in a_column
							

Not Expressions

A Not Expression checks that the value of the column is not equal to the supplied string or the value in the referenced column.

[39] NotExpr ::= "not(" StringProvider ")"
Usage
	a_column: not("some string")     //the value of a_column must not be the string "some string"
	another_column: not($a_column)   //the value of another_column must not be the value held in a_column
							

In Expressions

An In Expression checks that the value of the column contains the supplied string or the value in the referenced column (i.e. the column value is a super string of the supplied value).

[40] InExpr ::= "in(" StringProvider ")"
Usage
	a_column: in("some string")   //the value of a_column must be a substring of "some string" eg "some" or "string" or "me st" etc
	another_column: in($a_column) //the value of another_column must be a substring of the contents of a_column
							

Starts With Expressions

A Starts With Expression checks that the value of the column starts with the supplied string or the value in the referenced column.

[41] StartsWithExpr ::= "starts(" StringProvider ")"
Usage
	a_column: startsWith("some string")   //the value of a_column must start with the string "some string" eg "some strings" or "some string is here that's really long"
	another_column: startsWith($a_column) //the value of another_column must start with the contents of a_column
							

Ends With Expressions

An Ends With Expression checks that the value of the column ends with the supplied string or the value in the referenced column.

[42] EndsWithExpr ::= "ends(" StringProvider ")"
Usage
	a_column: endsWith("some string")   //the value of a_column must end with the string "some string" eg "here is some string" or "this really long string ends with some string"
	another_column: endsWith($a_column) //the value of another_column must end with the contents of a_column
							

Regular Expression Expressions

A Regular Expression Expression checks the value of the column against the supplied Regular Expression.

Whilst obviously many of the the other Column Validation Expressions could be written as Regular Expressions, it is felt that that would make the resulting Schema harder to read, and much harder to write for less technical users. As such it is recommended, that if a Column Rule can be written without regular expressions, by instead using other Column Validation Expressions, then that approach should be attempted first.

The Regular Expression syntax used is that from Java's Pattern class.

[43] RegExpExpr ::= "regex(" StringLiteral ")"
Usage
	a_column: regex("[bcm]at")   //the value of a_column must match the regular expression [bcm]at ie a string containing "bat", "cat" or "mat"
	another_column: regex("[0-5]") //the value of another_column match the regular expression [0-5] ie a string containing only the digits only 0-5.
						

Range Expressions

The definition of this expression in CSV Schema Language 1.1 extends the definition originally made in CSV Schema Language 1.0

A Range Expression checks that the value of the column is a number lying between, or equal to, the supplied upper and lower bounds. One or other bounding value MAY be left unrestricted (by using the Wildcard Literal), so that the expression can also be used to check that a column is at least some value, or at most some value. One bounding value is defined as a Numeric Or Any expression which accepts a Numeric Literal (a Real Number expressed as a decimal), or the Wildcard Literal, while the other is a Numeric Literal only. Therefore at least one of the bounding values MUST be a Numeric Literal rather than a Wildcard Literal, so valid Range Expressions SHALL define ranges of:

  1. At least n. For example range(10, *) states that the column value MUST be at least 10
  2. At most n. For example range(*, 10) ensures that the column value MUST be at most 10 (and may be negative).
  3. Between m and n(inclusive). For example range(4, 7) ensures that the column value MUST be between 4 and 7 (inclusive).

[44] RangeExpr ::= "range(" (NumericOrAny "," NumericLiteral | NumericLiteral "," NumericOrAny) ")"
[45] NumericOrAny ::= NumericLiteral | WildcardLiteral

Length Expressions

A Length Expression checks that the number of characters in the column meets the supplied definition. You can define the length in one of four ways:

  1. Precisely n characters long. For example length(10) ensures that the length is exactly 10 characters.
  2. At least n characters long. For example length(10, *) ensures that the length is at least 10 or more characters
  3. At most n characters long. For example length(*, 10) ensures that the length is at most 10 characters.
  4. Between m and n characters long (inclusive). For example length(4, 7) ensures that the length is at least 4 characters and at most 7 characters.

In order to do this the expression takes two input parameters, the first is defined as OPTIONAL, the second MUST be supplied. Both take the form of a Positive Integer Or Any expression, which is either a Positive Integer Literal (which actually includes zero), or a Wildcard Literal.

[46] LengthExpr ::= "length(" (PositiveIntegerOrAny ",")? PositiveIntegerOrAny ")"
[47] PositiveIntegerOrAny ::= PositiveIntegerLiteral | WildcardLiteral

Empty Expressions

An Empty Expression checks that the column has no content.

[48] EmptyExpr ::= "empty"
Usage
	a_column: empty //there must be no value in a_column
							

Not Empty Expression

A Not Empty Expression checks that the column has some content, though precisely what does not matter.

[49] NotEmptyExpr ::= "notEmpty"
Usage
	a_column: notEmpty   //there must be some value in a_column, but it can be absolutely anything
							

Unique Expressions

A Unique Expression checks that the column value is unique within the CSV file being validated (within the current column, the value may occur elsewhere in the file in another column, as in a primary-foreign key relationship in a database). You can also specify a comma separated list of Column References in which case the combination of values of those columns (for the current row) must be unique within the whole CSV file.

[50] UniqueExpr ::= "unique" ("(" ColumnRef ("," ColumnRef)* ")")?
Usage
	a_column: unique                                  //a_column must hold a unique value in each row of the CSV file
	another_column: unique($a_column,$another_column) //the combination of values in the two columns must be unique looking across each row of the CSV file
							

URI Expressions

A URI Expression means that the value in the column MUST be a valid URI as defined in [[!RFC3986]].

[51] UriExpr ::= "uri"
Usage
	a_column: uri   //the value of a_column must be a valid URI
							

XSD Date Time Expressions

An XSD Date Time Expression checks that the data in the column is expressed as a valid XML Schema dateTime data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]). You can also provide an OPTIONAL from and to date-times (inclusive) to ensure that the value in the column falls within an expected date-time range.

As the XSD Date Time Expression uses the XSD Date Time Literal the final, timezone, part of the [[!ISO8601]] definition is OPTIONAL.

[52] XsdDateTimeExpr ::= "xDateTime" ("(" XsdDateTimeLiteral "," XsdDateTimeLiteral ")")?
Usage
	a_column: xDateTime                                                 //the value of a_column must be a valid xDateTime
	another_column: xDateTime(2014-10-04T00:00:01Z,2015-12-03T23:59:59) //the value of another_column must be a valid xDateTime and between the two xDateTimes shown (inclusive)
	                                                                    //as shown, the xDateTime values may, or may not, have a component indicating a specific timezone, here Z (Zulu) for UTC (Greenwich Mean Time)
							

XSD Date Time With Time Zone Expressions

An XSD Date Time With Time Zone Expression is identical to XSD Date Time Expression except that it uses XSD Date Time With Time Zone Literal rather than XSD Time Literal, which means that the time zone component in [[!XMLSCHEMA-2]] and [[!ISO8601]] MUST be used. Again, you can also provide an OPTIONAL from and to date-times, (inclusive) to ensure that the value in the column falls within an expected date-time range. These are also defined as XSD Date Time With Time Zone Literal so MUST have the time zone component.

[53] XsdDateTimeWithTimeZoneExpr ::= "xDateTimeTz" ("(" XsdDateTimeWithTimeZoneLiteral "," XsdDateTimeWithTimeZoneLiteral ")")?
Usage
	a_column: xDateTime                                                            //the value of a_column must be a valid xDateTime
	another_column: xDateTime(2014-10-04T00:00:01+02:00,2015-12-03T23:59:59+02:00) //the value of another_column must be a valid xDateTime and between the two xDateTimes shown (inclusive)
	                                                                               //now the time zone component (+02:00) must be included
							

XSD Date Expressions

An XSD Date Expression checks that the data in the column is expressed as a valid XML Schema date data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]). You can also provide OPTIONAL from and to dates (inclusive) to ensure that the value in the column falls within an expected date range.

[54] XsdDateExpr ::= "xDate" ("(" XsdDateLiteral "," XsdDateLiteral ")")?
Usage
	a_column: xDate                                  //the value of a_column must be a valid xDate
	another_column: xDateTime(2014-10-04,2015-12-03) //the value of another_column must be a valid xDate and between the two xDates shown (inclusive)
							

XSD Time Expressions

An XSD Time Expression checks that the data in the column is expressed as a valid XML Schema time data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]). You can also provide OPTIONAL from and to times (inclusive) to ensure that the value in the column falls within an expected time range.

[55] XsdTimeExpr ::= "xTime" ("(" XsdTimeLiteral "," XsdTimeLiteral ")")?
Usage
	a_column: xTime                                                 //the value of a_column must be a valid xTime
	another_column: xTime(00:00:01+02:00,2015-12-03T23:59:59+02:00) //the value of another_column must be a valid xTime and between the two xTimes shown (inclusive)
	                                                                //the time zone component (+02:00) is optional
							

UK Date Expressions

A UK Date Expression checks that the data in the column is expressed as a valid UK-style date: dd/mm/yyyy, a UK Date Literal. You can also provide OPTIONAL from and to dates (inclusive) to ensure that the value in the column falls within an expected date range.

[56] UkDateExpr ::= "ukDate" ("(" UkDateLiteral "," UkDateLiteral ")")?
Usage
	a_column: ukDate                              //the value of a_column must be a valid ukDate
	another_column: ukDate(04/10/2014,03/12/2015) //the value of another_column must be a valid ukDate and between the two ukDates shown (inclusive)
							

Date Expression

A Date Expression allows a full date to be constructed from several columns, or strings. The expression takes five arguments: you MUST supply three strings or Column References representing Year, Month and Day (although supplied as strings, these values must in fact be integers); and there are two OPTIONAL parameters to ensure the date falls in a range specified by from XSD Date Expression and to XSD Date Expression.

[57] DateExpr ::= "date(" StringProvider "," StringProvider "," StringProvider ("," XsdDateLiteral "," XsdDateLiteral)? ")"
Usage
	year_column: date($year_column,$month_column,$day_column)                      /*the full date to be checked is made up from the values from the three columns indicated, 
	                                                                                 as shown, the columns can appear in CSV file in any order, but must appear in the date expression in the order year, month, day*/
	day_column: date($year_column,$month_column,$day_column,2014-10-04,2015-12-03) //this second example shows that you can also supply from and to dates as xDates, both dates are inclusive
	month_column:
							

Partial UK Date Expression

A Partial UK Date Expression is essentially the same as a UK Date Expression, but allows for difficulties in transcribing from original archival material by accepting a question mark (?), i.e. the [[UTF-8]] character code 0x3F in place of illegible digits in any position, or an asterisk (*), i.e. the [[UTF-8]] character code 0x2A in place of a missing value. The names of months may also be supplied as the full name in English, i.e.: January, February, March, April, May, June, July, September, October, November, December. As dates may not be complete, it is impossible to determine reliably if a date falls within a given range, so there is no option to supply one.

[58] PartialUkDateExpr ::= "partUkDate"
Usage
	a_column: partUkDate //the value of a_column must be a valid ukDate, but may also include characters ? and * to represent illegible or missing characters, and month names in full
							

Partial Date Expression

A Partial Date Expression combines elements of Partial UK Date Expression with those of Date Expression, with the date being made up of columns or strings as in Date Expression, but also allowing the characters representing uncertainty as in Partial UK Date Expression. However, the constituent parts of the date MUST be supplied as Year, Month, Day.

[59] PartialDateExpr ::= "partDate(" StringProvider "," StringProvider "," StringProvider ")"
Usage
	year_column: date($year_column,$month_column,$day_column)                      //the full date to be checked is made up from the values from the three columns indicated, 
	day_column:                                                                    //as shown, the columns can appear in CSV file in any order, but must appear in the date expression in the order year, month, day
	month_column:                                                                  //the date may also include characters ? and * to represent illegible or missing characters
							

UUID4 Expression

A UUID4 Expression checks that the data in the column is in the form of a Version 4 UUID (Universally Unique Identifier), see [[!RFC4122]]. UUIDs MUST use lowercase hex values.

[60] Uuid4Expr ::= "uuid4"
Usage
	a_column: uuid //the value of a_column must be a valid version 4 UUID
							

Positive Integer Expression

A Positive Integer Expression checks that the column contains an integer value, greater than or equal to zero.

[61] PositiveIntegerExpr ::= "positiveInteger"
Usage
	a_column: positiveInteger //the value of a_column must be a positive integer (including zero)
							

Upper Case Expression

This is a new expression in CSV Schema Language 1.1

An Upper Case Expression checks that the column content is all upper case, for all code points in the [[!UTF-8]] character set which have a defined case. Non-cased code points (e.g. numeric characters, punctuation and white space) are also permitted.

In Java's Pattern class this could be expressed via the following Regular Expression: "^[\p{Lu}\p{N}\p{P}\s]*$".

[62] UpperCaseExpr ::= "upperCase"
Usage
	a_column: upperCase //the contents of a_column must be all be upper case characters (Unicode aware), or uncased
							

Lower Case Expression

This is a new expression in CSV Schema Language 1.1

A Lower Case Expression checks that the column content is all lower case, for all code points in the [[!UTF-8]] character set which have a defined case. Non-cased code points (e.g. numeric characters, punctuation and white space) are also permitted.

In Java's Pattern class this could be expressed via the following Regular Expression: "^[\p{Ll}\p{N}\p{P}\s]*$".

[63] LowerCaseExpr ::= "lowerCase"
Usage
	a_column: lowerCase //the contents of a_column must be all be lower case characters (Unicode aware), or uncased
							

Identical Expressions

This is a new expression in CSV Schema Language 1.1

An Identical Expression asserts that the value of the column MUST be identical for every row within a CSV file, without having to specify precisely what that value will be when writing the CSV Schema.

[64] IdenticalExpr ::= "identical"
Usage
	batch_date:	identical xDate
	batch_code:	identical regex("^[A-Z]{3,5}B[0-9]{3}$")
								

Consider a file that is expected to contain data relating to one particular batch of a process, and should also be the data relating to one particular day. Over time we will receive many such files, so we do not want to amend the schema each day to say what the valid date for the file is, or what the appropriate batch_code is. Instead we give a generic rule for the field content: it's an XSD Date Expression in the case of batch_date; or that the batch_code will comprise three to five uppercase letters, followed by an uppercase B, followed by three digits; and for each row in the file every date must be identical and every batch_code must be identical.

External Single Expressions

An External Single Expression allows access to resources outside the CSV file being validated in order to verify some information contained within the file. For example, to check that an image file referenced from within a CSV file actually exists, or that a supplied checksum matches the value calculated for a file. The available expressions are the File Exists Expression, Checksum Expression and File Count Expression. Each may be prefixed with an Explicit Context Expression in order to refer to data in a different column.

[65] ExternalSingleExpr ::= ExplicitContextExpr? (FileExistsExpr | IntegrityCheckExpr | ChecksumExpr | FileCountExpr)

File Exists Expressions

A File Exists Expression checks a filesystem to see if the specified file actually exists at the specified path. It takes an OPTIONAL expression in the form of a String Provider which allows you to supply a string (or reference to a string) with a full filepath to prepend to the contents of the current column (in the case that for example it only contains just the name of the file). The default value for this string is an empty string.

See also the Integrity Check Expression which performs the inverse function of ensuring that all files in a given folder structure have been mentioned in a CSV file.

[66] FileExistsExpr ::= "fileExists" ("(" StringProvider ")")?
Usage
	a_column: fileExists                      //the validator should check the filesystem location indicated by the contents of a_column for the existence of such a file
	another_column: fileExists("file:///C:/") //here the string "file:///C:/" is prepended to the contents of another_column before the existence check is made
							

Integrity Check Expressions

This is a new expression in CSV Schema Language 1.1

An Integrity Check Expression checks a filesystem to see if there are any files present that are not specifically mentioned in the CSV file. It takes three expressions as parameters. The first are two OPTIONAL expressions in the form of String Providers. The first (as in the File Exists Expression) allows you to supply a string (or reference to a string) with a full filepath to prepend to the contents of the current column (in the case that for example it only contains just the name of the file). The default value for this string is an empty string. The second parameter allows a string provider to be given to point to an explicit subdirectory relative to the location of the base path. By default this subdirectory is expected to be called "content". If only a single OPTIONAL parameter is supplied, it will be assumed to be the first, so if you wish to set only the second OPTIONAL parameter, you MUST also explicitly supply the first as an empty string ("") The final expression MUST be supplied. This indicates whether references to subfolders are explicitly included in the CSV file, if the CSV file has a row for each subfolder the exact string "includeFolder" should be given, if the subfolders do not have explicit references, the exact string "excludeFolder" should be given.

Default treatment of case sensitivity should follow the norms of the relevant file system, implementations may wish to include some means to over-ride this, but that is outside the scope of the EBNF.

[67] IntegrityCheckExpr ::= "integrityCheck" "(" (StringProvider ",")? (StringProvider ",")? ("\"includeFolder\"" | "\"excludeFolder\"") ")"
Usage
	a_column: integrityCheck("includeFolder")                     /*the validator should check all file system folders for files that are not referenced in the CSV file
	                                                                the "includeFolder" parameter indicates that there are explicit references to all file folders in the CSV file
																    as the second parameter has not been supplied it defaults to the value "content" meaning that all sub folders must
																	sit within a folder with that name*/
	another_column: integrityCheck("file:///C:/","excludeFolder") //here the string "file:///C:/" is prepended to the contents of another_column before the integrity check is made
	third_column: integrityCheck("","","excludeFolder")           //here as an strings are passed for both optional parameters, we indicate that there is no content folder
							

Checksum Expressions

A Checksum Expression allows the verification of a checksum value supplied in a CSV file by calculating the checksum for a specific file, and comparing it against the supplied value. You can also indicate the type of checksum algorithm to be used. You MUST supply both the file location and a checksum algorithm. File location is given in the form of a File Expression.

The EBNF does not specify valid values for the String Literal representing the checksum algorithm, that is instead implementation defined, however it is strongly recommended that implementations SHOULD at least support: MD5, SHA-1 and SHA-256. It is important to note that the checksum value MUST use lowercase hexadecimal characters only.

[68] ChecksumExpr ::= "checksum(" FileExpr "," StringLiteral ")"
Usage
	file_path: uri                                                           //a full filepath for a file
	file_name:                                                               //a filename only for a file
	a_column: checksum(file($file_path),"SHA-256")                           /*a_column contains the SHA-256 checksum value supplied for the file, the validator
                                                                               then computes the checksum for the file at file_path and reports an error if they do not match*/
	another_column: checksum(file("file:///C:/my_folder/",$file_name),"MD5") /*here the string "file:///C:/my_folder" is prepended to the contents of file_name before the 
	                                                                           checksum verification is performed, another_column has an MD5 checksum value in it
							

File Count Expressions

A File Count Expression allows a column representing the number of files in a particular folder to be verified against the actual files on disk. You MUST provide a File Expression as an input parameter which points to a filesystem folder to compare the count of files against.

[69] FileCountExpr ::= "fileCount(" FileExpr ")"
Usage
	file_path: uri                                      //a full filepath for a folder
	another_column: integer fileCount(file($file_path)) //another_column contains an integer value, this is checked against the number of files in the file system folder at file_path
							

File related sub-expressions

The sub-expression used in conjunction with External Single Expressions in order to provide input is a generic File Expression which itself takes two parameters. The first parameter is OPTIONAL, a String Provider that is prepended to the second parameter to create a full path in the event that a column holds only a filename rather than a full filepath. You MUST supply the second parameter which is a String Provider, which resolves to the name of the file.

[70] FileExpr ::= "file(" (StringProvider ",")? StringProvider ")"
Usage

The File Expression is always used as an input parameter to one of the other file expressions, several examples are given above. Here are explained the two parameters that this expression itself takes.

	file("file:///a/string/for/a/URI/representing/the/location/of/a/file")           /*in the simplest case a single parameter is supplied that's a string representing the full URI of a
	                                                                                   file. Since it's a URI, characters such as space must be escaped (a space becomes %20)
																			           you can either use a string literal as here, or pass a reference to a column using $column_name*/
	file("file:///a/string/to/prepend/to/a/filename/to/make/a/full/path","filename") /*Here you provide a string (or reference to a string) that is the base path to prepend to "filename"
	                                                                                   to get the full path to your file*/
							

Input parameters used in Single Expressions and External Single Expressions

Many Single Expressions and External Single Expressions take a String Provider as an input. A String Provider takes the form of either a Column Reference, a String Literal, Concatenation Expression or a No Extension Argument Provider.

A Column Reference comprises a dollar sign ($), i.e. the [[UTF-8]] character code 0x24, followed by a Column Identifier or Quoted Column Identifier.

The following are new expressions in CSV Schema Language 1.1

The final two string providers are recursive, themselves taking one or more String Providers as arguments, and returning a new String Provider. The Concatenation Expression takes two or more String Providers, returning a new string that is the concatenation of all those supplied. You MUST provide at least two parameters. The No Extension Argument Provider removes anything that appears to be a Windows file extension from the end of a supplied String Provider, and returns a new string. A string that does not contain a full stop (.), i.e. the [[UTF-8]] character codes 0x2D will be returned unchanged.

[71] StringProvider ::= ColumnRef | StringLiteral | ConcatExpr | NoExtExpr
[72] ConcatExpr ::= "concat(" StringProvider ("," StringProvider)+ ")"
[73] NoExtExpr ::= "noExt(" StringProvider ")"

Usage

	a_column: is("no file")                                     
	another_column: any("file:///","http://")
	third_column: any("C:","example.com")
	fourth_column: ends(".html")
	fifth_column: is($a_column) or is(concat($another_column,$third_column,"/",noExt($fourth_column),".pdf")
	/*in this rather artificial example, fifth_column must be either "no file" (the value of a_column) or a PDF file with the same basic name as the HTML file named in fourth_column,
	  located at either file:///C:/ or http://example.com/ (in fact as written you could have file:///example.com/ or http://C:/ as well)*/
						

Parenthesized Expressions

Parenthesized Expressions are used to vary the standard left-to-right evaluation order of evaluation of Combinatorial Expressions. Parenthesized Expressions can be nested, the deepest level will be evaluated first, working outwards. Equally nested Parenthesized Expressions revert to the standard left-to-right evaluation order.

[74] ParenthesizedExpr ::= "(" ColumnValidationExpr+ ")"

Conditional Expressions

A Conditional Expression is used to apply different Column Validation Expressions to a column on the basis of the result of the evaluation of some other Non Conditional Expression. This is particularly useful when the data expected in one column depends on the value of another column. In the original CSV Schema Language 1.0 there was only one form of Conditional Expression, the If Expression. CSV Schema Language 1.1 introduces the Switch Expression which allows a more compact and readable form for what would otherwise be written as nested If Expressions.

[75] ConditionalExpr ::= IfExpr | SwitchExpr

If Expressions

The If Expression is the original form of Conditional Expression introduced in CSV Schema Language 1.0. It takes three input parameters: the first two of these MUST be used, first a Combinatorial Expression or Non Conditional Expression; when that evaluates to true, the second parameter, one or more Column Validation Expressions, are applied; the third parameter is OPTIONAL, these are the else Column Validation Expressions used when the first parameter evaluates to false.

[76] IfExpr ::= "if(" (CombinatorialExpr | NonConditionalExpr) "," ColumnValidationExpr+ ("," ColumnValidationExpr+)? ")"

Usage

	a_column: any("true","false")
	another_column: any("yes","no")
	third_column: if($a_column\is("true"),is("some string"),is("some other string")) /*here we look to the value of a_column, if it is "true", third_column must be "some string"
	                                                                                   otherwise third_column must be "some other string"*/
	fourth_column: if(($a_column\is("true") and $another_column\is("yes")),is("some string"),is("some other string"))
	//in fourth_column we use a more complicated combinatorial expression as the initial test, if both a_column is "true" and another_column is "yes", 
	//then fourth_column is "some string", otherwise fourth_column is "some other string".  All column expressions could be used for the test etc, only is is used for simpilcity
						

Switch Expressions

This is a new expression in CSV Schema Language 1.1

The Switch Expression generalises the If Expression. It comprises at least one Switch Case Expression followed by a final OPTIONAL parameter, one or more Column Validation Expressions, (this is the else expression) applied when all previous test expressions have evaluated to false.

Note that evaluation is halted after the first test expression to return true.

Switch Case Expression

The Switch Case Expression takes two parameters (equivalent to the first two non-optional parameters of the If Expression), first a Combinatorial Expression or Non Conditional Expression; when that evaluates to true, the second parameter, one or more Column Validation Expressions, are applied. At least one Switch Case Expression MUST be used within a Switch Expression, but there is no limit on the maximum number used.

[77] SwitchExpr ::= "switch(" SwitchCaseExpr+ ("," ColumnValidationExpr+)? ")"
[78] SwitchCaseExpr ::= "("( CombinatorialExpr | NonConditionalExpr) "," ColumnValidationExpr+ ")"

Usage

	a_column: any("true","false","unknown")
	another_column: if($a_column\is("true"),is("some string"),if($a_column\is("false"),is("some other string"),is("some third string")))
	/*here we look to the value of a_column, if it is "true", another_column must be "some string" otherwise, check if a another_column is "false), 
	if so another_column must be "some other string", otherwise another_column is "some third string".  Nesting if statments like this can quickly get 
	very difficult to read, so instead we can use the switch statement*/
	third_column: switch(($a_column\is("true"),is("some string")),($a_column\is("false"),is("some other string")),is("some third string"))
	//this switch statement for third_column is functionally identical to the nested if statement demonstrated on another_column, and is much easier to extend
	//is there were additional options available for a_column, each bracketed pair such of test and column validation expression, 
	//such as ($a_column\is("true"),is("some string")), is called a Switch Case Expression, as many as required are then followed by a final column validation expression,
	//used if none of the Switch Case Expressions evaluate to true.
						

Column Expression examples

Additional examples for a range of Column Expressions is given below. A greater range of example and other schemas actually used by The National Archives can be found on GitHub. Most of these are extensively commented in order to explain usage. There is a also a set of example files to be downloaded which allow File Exists Expressions and Checksum Expressions and path substitutions to be more easily understood, these are designed to be used with the generic_digitised_surrogate_tech_acq_metadata_v1.1.csvs and generic_digitised_surrogate_tech_acq_metadata_v1.0.csvs schemas, which also helps demonstrate the additional checking that CSV Schema Language 1.1 enables.

	piece: is("1") and (in($file_path) and in($resource_uri))          /*The column "piece" must have the specific value 1
	                                                                   the value must also be part of the value of the columns "file_path" and "resource_uri"
	                                                                   explicit And Expression is used between each specified Column Expression*/
	item: range(1,540) unique($piece,$item)                            //this field must contain an integer between 1 and 540 inclusive.
	                                                                   the combination of piece and item must be unique within the file.
	file_uuid: uuid4 unique                                            /*must be a version 4 uuid, and the value must be unique within the file (uuids must be 
	                                                                   lower case). Here an implicit And Expression is used*/
	file_path: fileExists uri starts(concat("file:///",$piece,"/",$item))     /*fileExists checks that there is actually a file of the given name at the 
	                                                                   specified location on the file system which is assumed to be the value held in "file_path".  
	                                                                   We know the location should be in the form of a URI so a URI expression is used, 
	                                                                   and in particular this should be a file url, so we further specify that the data 
	                                                                   in the column must start "file:///", then (a folder named for) the piece id, then a /, then the item id */
	file_checksum: checksum(file($file_path),"SHA-256")                /* Compare the value given in this field to the checksum calculated for the file
	                                                                   found at the location given in the "file_path" field.
	                                                                   Use the specified checksum algorithm (SHA-256)
	                                                                   (must use lowercase hex characters). */
	image_split: is("yes") or is("no")                                 //must be string: yes; or string: no (precisely - case as shown)
	image_split_other_uuid: if($image_split/is("yes"),uuid4,is(""))    //if "image_split" field is yes, must be a uuid4, else must be blank
	image_split_operator: if($image_split/is("yes"),length(1,12) and regex("^[0-9a-zA-Z]{1,12}$"),is(""))
	                                                                   /*If "image_split" field is the string: yes (precisely)
	                                                                   then field must be 12 characters long.  This is further restricted by regex statement
	                                                                   to being only alphanumeric characters (upper and lower case). */
	image_split_timestamp: if($image_split/is("yes"),xDateTime(2013-12-04T00:00:00+00:00,2014-03-04T23:59:59+00:00),is(""))
	                                                                   /*If "image_split" field is string: yes (precisely)
	                                                                   then timestamp for image split, compliant with XSD DateTime data type
	                                                                   and in range 4 December 2013 - 4 March 2014 (from the midnight starting 4 December, 
	                                                                   to last second of 4 March), in the UTC (Greenwich Meantime) timezone, 
																	   else it must be blank (ie "image_split" is no).
                                                                       As xDateTime, rather than xDateTimeTz, is specified, 
                                                                       the use of the timezone component within the supplied date time is optional, eg both:
                                                                       2013-12-05T12:34:00+00:00 and 2013-12-05T12:34:00 would be acceptable in the metadata/*
				

Data types

Most Column Validation Expressions rely on a small number of underlying data types. Some of these are defined by means of a regular expression embedded in the EBNF as indicated by xgc:regular-expression. There are 11 data types, XSD Date Time Literal, XSD Date Literal, XSD Time Literal, UK Date Literal, Positive Non Zero Integer Literal, Positive Integer Literal, Numeric Literal, String Literal, Character Literal, Wildcard Literal and Identifier.

XSD Date and Time Types

XSD Date Time Literals

An XSD Date Time Literal MUST be a valid XML Schema dateTime data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]) of the basic form yyyy-mm-ddThh:mm:ss(.MMMMM) (year-month-day followed by time in hours, minutes, seconds and OPTIONAL fractions of a second). There are two OPTIONAL parts, a minus sign MAY be used for BC dates, and there MAY be a suffix indicating the applicable timezone as an offset from GMT/UTC. GMT itself may be indicated by a suffix Z for Zulu. It is represented in the EBNF by a regular expression defining precisely which characters are to be used.

[79] XsdDateTimeLiteral ::= XsdDateWithoutTimezoneComponent "T" XsdTimeLiteral

XSD Date Time With Time Zone Literals

An XSD Date Time With Time Zone Literal MUST be a valid XML Schema dateTime data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]) of the basic form yyyy-mm-ddThh:mm:ss(.MMMMM) (year-month-day followed by time in hours, minutes, seconds and OPTIONAL fractions of a second). There is one OPTIONAL part, a minus sign MAY be used for BC dates. However, unlike XSD Date Time Literal there MUST be a suffix indicating the applicable timezone as an offset from GMT/UTC. GMT itself may be indicated by a suffix Z for Zulu, or as +00:00. It is represented in the EBNF by a regular expression defining precisely which characters are to be used.

[80] XsdDateTimeWithTimeZoneLiteral ::= XsdDateWithoutTimezoneComponent "T" XsdTimeWithoutTimezoneComponent XsdTimezoneComponent

XSD Date Literals

An XSD Date Literal MUST be a valid XML Schema date data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]) of the basic form yyyy-mm-dd (year-month-day). There is one OPTIONAL part, a minus sign MAY be used for BC dates. It is represented in the EBNF by a regular expression defining precisely which characters are to be used. It is also used as the date part of XSD Date Time Literal.

[81] XsdDateLiteral ::= XsdDateWithoutTimezoneComponent XsdOptionalTimezoneComponent

XSD Time Literals

An XSD Time Literal MUST be a valid XML Schema date data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]) of the basic form hh:mm:ss(.MMMMM) (time in hours, minutes, seconds and OPTIONAL fractions of a second). There is one OPTIONAL part, there MAY be a suffix indicating the applicable timezone as an offset from GMT/UTC. GMT itself may be indicated by a suffix Z for Zulu. It is represented in the EBNF by a regular expression defining precisely which characters are to be used. It is also used as the time part of XSD Date Time Literal.

[82] XsdTimeLiteral ::= XsdTimeWithoutTimezoneComponent XsdOptionalTimezoneComponent

Common XSD Date and Time Components

The various XSD Date and Time data types from [[!XMLSCHEMA-2]] are made up from common reuseable components that are defined by regular expressions.

[83] XsdDateWithoutTimezoneComponent ::= -?[0-9]{4}-(((0(1|3|5|7|8)|1(0|2))-(0[1-9]|(1|2)[0-9]|3[0-1]))|((0(4|6|9)|11)-(0[1-9]|(1|2)[0-9]|30))|(02-(0[1-9]|(1|2)[0-9]))) /* xgc:regular-expression */
[84] XsdTimeWithoutTimezoneComponent ::= ([0-1][0-9]|2[0-4]):(0[0-9]|[1-5][0-9]):(0[0-9]|[1-5][0-9])(\.[0-9]{3})? /* xgc:regular-expression */
[85] XsdOptionalTimezoneComponent ::= ((\+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z)? /* xgc:regular-expression */
[86] XsdTimezoneComponent ::= ((\+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z) /* xgc:regular-expression */

UK Date Literals

A UK Date Literal is a data type representing the usual UK format for writing dates, dd/mm/yyyy. It is represented in the EBNF by a regular expression defining precisely which characters are to be used.

[87] UkDateLiteral ::= (((0[1-9]|(1|2)[0-9]|3[0-1])\/(0(1|3|5|7|8)|1(0|2)))|((0[1-9]|(1|2)[0-9]|30)\/(0(4|6|9)|11))|((0[1-9]|(1|2)[0-9])\/02))\/[0-9]{4} /* xgc:regular-expression */

Positive Non Zero Integer Literals

A Positive Non Zero Integer Literal is a data type representing the positive integer natural numbers. It is represented in the EBNF by a regular expression defining precisely which characters are to be used. It is a specialisation of Positive Integer Literal.

[88] PositiveNonZeroIntegerLiteral ::= [1-9][0-9]* /* xgc:regular-expression */

Positive Integer Literals

A Positive Integer Literal is a data type representing the non-negative integer natural numbers. It is represented in the EBNF by a regular expression defining precisely which characters are to be used. It is a specialisation of Numeric Literal.

[89] PositiveIntegerLiteral ::= [0-9]+ /* xgc:regular-expression */

Numeric Literals

A Numeric Literal is a data type representing any real number expressed as an integer or decimal.

[90] NumericLiteral ::= -?[0-9]+(\.[0-9]+)? /* xgc:regular-expression */

String Literals

A String Literal is zero or more characters (excluding quotation mark) encased witin quotation marks, i.e. the [[UTF-8]] character code 0x22.

[91] StringLiteral ::= "\"" [^"]* "\""

Character Literals

A Character Literal is a single non-breaking character encased within apostrophes, i.e. the [[UTF-8]] character code 0x27.

[92] CharacterLiteral ::= "'" [^\r\n\f'] "'"

Wildcard Literals

A Wildcard Literal is a single asterisk (*), i.e. the [[UTF-8]] character code 0x2A.

[93] WildcardLiteral ::= "*"

Identifiers

An Identifier is the set of characters which can be used in an ordinary Column Identifier. Upper and lower case alphabetic characters (unaccented), along with digits 0-9, the hyphen-minus (-), low line (_) and full stop (.), i.e. the [[UTF-8]] character codes 0x2D, 0x5F and 0x2E.

[94] Ident ::= /* [A-Za-z0-9\-_\.]+ */

Errors and Warnings

An implementation MUST first check that the provided CSV Schema(s) are syntactically correct. If not, a Schema Error is produced, and no further validation of the CSV Schema(s) or provided CSV files(s) SHOULD be undertaken. If the schema check is successful then an implementation MAY continue with further CSV Schema(s) and CSV file validation.

If an implementation performs validation of a CSV file against a CSV Schema, a report SHOULD be produced for each Column Validation Expression that fails validation; This is generally considered a Validation Error, unless the Warning Directive has been used to reduce the severity of an error within a specific Column Rule to a Validation Warning.

Schema Errors

A Schema Error is caused by syntax errors in the definition of the CSV Schema. These might include for example: an incorrect Version Declaration, or a mismatch between the number specified for Total Columns Directive and the actual number of Column Definitions given in the Body of the Schema. Schema Errors would also be produced by mismatched Parenthesized Expressions, unrecognised Column Validation Expressions and Explicit Context Expressions which do not match an actual Column Identifier.

An implementation MUST report a Schema Error.

Validation Errors

If column data does not validate successfully against a Column Rule, an implementation SHOULD report a Validation Error. It is implementation defined whether a Validation Error terminates execution, or whether execution continues. If execution continues, any further errors SHOULD be reported.

NOTE The Warning Directive may be used within a Column Rule to specify that what would normally be a Validation Error should be treated only as a Validation Warning.

A Validation Warning can be used when you wish to highlight unexpected values that are encountered in the data, but for some reason they are not to be considered failures of validation. For example within archival documents the date may have been recorded as an impossible date, such as 30 February or 31 April. A transcriber has correctly entered the data as seen on the original document, but yet it is not a valid date. You may wish to highlight these cases for additional QA, but it should not be considered an error.

The text/csv-schema Media Type

This Appendix specifies the media type for CSV Schema Version 1.0 and CSV Schema Version 1.1. CSV Schema is a language for describing and validating CSV files, as specified in the main body of this document. This media type has been submitted to the IESG (Internet Engineering Steering Group) for review, approval, and registration with IANA (Internet Assigned Numbers Authority.)

The text/csv-schema media type, is intended to be used for transmitting schemas written in the CSV Schema language.

File Extensions

The suggested file extension for use when naming CSV Schema files is .csvs.

CSV Schema Grammar

EBNF

The grammar of CSV Schema uses the same simple EBNF (Extended Backus-Naur Form) notation as [[!XML10]] with the following minor differences.

To increase readability, the EBNF in the main body of this document omits some of these notational features. This appendix is the normative version of the EBNF.

Link conventions used in this appendix:

Schema ::= Prolog Body
Prolog ::= VersionDecl GlobalDirectives
VersionDecl ::= ("version 1.0" | "version 1.1")
GlobalDirectives ::= SeparatorDirective? QuotedDirective? TotalColumnsDirective? PermitEmptyDirective? (NoHeaderDirective | IgnoreColumnNameCaseDirective)? /* xgc:unordered */
DirectivePrefix ::= "@"
SeparatorDirective ::= DirectivePrefix "separator" (SeparatorTabExpr | SeparatorChar)
SeparatorTabExpr ::= "TAB" | '\t'
SeparatorChar ::= CharacterLiteral
QuotedDirective ::= DirectivePrefix "quoted"
TotalColumnsDirective ::= DirectivePrefix "totalColumns" PositiveNonZeroIntegerLiteral
PermitEmptyDirective ::= DirectivePrefix "permitEmpty"
NoHeaderDirective ::= DirectivePrefix "noHeader"
IgnoreColumnNameCaseDirective ::= DirectivePrefix "ignoreColumnNameCase"
Body ::= BodyPart+
BodyPart ::= Comment* ColumnDefinition Comment*
Comment ::= SingleLineComment | MultiLineComment
SingleLineComment ::= //[\S\t ]* /* xgc:regular-expression */
MultiLineComment ::= \/\*(?:[^*\r\n]+|(?:\r?\n))*\*\/ /* xgc:regular-expression */
ColumnDefinition ::= (ColumnIdentifier | QuotedColumnIdentifier) ":" ColumnRule
ColumnIdentifier ::= PositiveNonZeroIntegerLiteral | Ident
QuotedColumnIdentifier ::= StringLiteral
ColumnRule ::= ColumnValidationExpr* ColumnDirectives
ColumnDirectives ::= OptionalDirective? MatchIsFalseDirective? IgnoreCaseDirective? WarningDirective? /* xgc:unordered */
OptionalDirective ::= DirectivePrefix "optional"
MatchIsFalseDirective ::= DirectivePrefix "matchIsFalse"
IgnoreCaseDirective ::= DirectivePrefix "ignoreCase"
WarningDirective ::= DirectivePrefix "warningDirective"
ColumnValidationExpr ::= CombinatorialExpr | NonCombinatorialExpr
CombinatorialExpr ::= OrExpr | AndExpr
OrExpr ::= NonCombinatorialExpr "or" ColumnValidationExpr
AndExpr ::= NonCombinatorialExpr "and" ColumnValidationExpr
NonCombinatorialExpr ::= NonConditionalExpr | ConditionalExpr
NonConditionalExpr ::= SingleExpr | ExternalSingleExpr | ParenthesizedExpr
SingleExpr ::= ExplicitContextExpr? (IsExpr | AnyExpr | NotExpr | InExpr | StartsWithExpr | EndsWithExpr | RegExpExpr | RangeExpr | LengthExpr | EmptyExpr | NotEmptyExpr | UniqueExpr | UriExpr | XsdDateTimeExpr | XsdDateTimeWithTimeZoneExpr | XsdDateExpr | XsdTimeExpr | UkDateExpr | DateExpr | PartialUkDateExpr | PartialDateExpr | Uuid4Expr | PositiveIntegerExpr | UpperCaseExpr | LowerCaseExpr | IdenticalExpr)
ExplicitContextExpr ::= ColumnRef "/"
ColumnRef ::= "$" (ColumnIdentifier | QuotedColumnIdentifier)
IsExpr ::= "is(" StringProvider ")"
AnyExpr ::= "any(" StringProvider ")"
NotExpr ::= "not(" StringProvider ")"
InExpr ::= "in(" StringProvider ")"
StartsWithExpr ::= "starts(" StringProvider ")"
EndsWithExpr ::= "ends(" StringProvider ")"
RegExpExpr ::= "regex(" StringLiteral ")"
RangeExpr ::= "range(" (NumericOrAny "," NumericLiteral | NumericLiteral "," NumericOrAny) ")" /* range is inclusive */
NumericOrAny ::= NumericLiteral | WildcardLiteral
LengthExpr ::= "length(" (PositiveIntegerOrAny ",")? PositiveIntegerOrAny ")"
PositiveIntegerOrAny ::= PositiveIntegerLiteral | WildcardLiteral
EmptyExpr ::= "empty"
NotEmptyExpr ::= "notEmpty"
UniqueExpr ::= "unique" ("(" ColumnRef ("," ColumnRef)* ")")?
UriExpr ::= "uri"
XsdDateTimeExpr ::= "xDateTime" ("(" XsdDateTimeLiteral "," XsdDateTimeLiteral ")")?
XsdDateTimeWithTimeZoneExpr ::= "xDateTimeTz" ("(" XsdDateTimeWithTimeZoneLiteral "," XsdDateTimeWithTimeZoneLiteral ")")?
XsdDateExpr ::= "xDate" ("(" XsdDateLiteral "," XsdDateLiteral ")")?
XsdTimeExpr ::= "xTime" ("(" XsdTimeLiteral "," XsdTimeLiteral ")")?
UkDateExpr ::= "ukDate" ("(" UkDateLiteral "," UkDateLiteral ")")?
DateExpr ::= "date(" StringProvider "," StringProvider "," StringProvider ("," XsdDateLiteral "," XsdDateLiteral)? ")"
PartialUkDateExpr ::= "partUkDate"
PartialDateExpr ::= "partDate(" StringProvider "," StringProvider "," StringProvider ")"
Uuid4Expr ::= "uuid4"
PositiveIntegerExpr ::= "positiveInteger"
UpperCaseExpr ::= "upperCase"
LowerCaseExpr ::= "lowerCase"
IdenticalExpr ::= "positiveInteger"
ExternalSingleExpr ::= ExplicitContextExpr? (FileExistsExpr | IntegrityCheckExpr | ChecksumExpr | FileCountExpr)
FileExistsExpr ::= "fileExists" ("(" StringProvider ")")?
IntegrityCheckExpr ::= "integrityCheck" "(" (StringProvider ",")? (StringProvider ",")? ("\"includeFolder\"" | "\"excludeFolder\"") ")"
ChecksumExpr ::= "checksum(" FileExpr "," StringLiteral ")"
FileExpr ::= "file(" (StringProvider ",")? StringProvider ")"
FileCountExpr ::= "fileCount(" FileExpr ")"
StringProvider ::= ColumnRef | StringLiteral | ConcatExpr | NoExtExpr
ConcatExpr ::= "concat(" StringProvider ("," StringProvider)+ ")"
NoExtExpr ::= "noExt(" StringProvider ")"
ParenthesizedExpr ::= "(" ColumnValidationExpr+ ")"
ConditionalExpr ::= IfExpr | SwitchExpr
IfExpr ::= "if(" (CombinatorialExpr | NonConditionalExpr) "," ColumnValidationExpr+ ("," ColumnValidationExpr+)? ")"
SwitchExpr ::= "switch(" SwitchCaseExpr+ ("," ColumnValidationExpr+)? ")"
SwitchCaseExpr ::= "if(" (CombinatorialExpr | NonConditionalExpr) "," ColumnValidationExpr+ ")"

Lexical (Terminal Symbols)

XsdDateTimeLiteral ::= XsdDateWithoutTimezoneComponent "T" XsdTimeLiteral
XsdDateTimeWithTimeZoneLiteral ::= XsdDateWithoutTimezoneComponent "T" XsdTimeWithoutTimezoneComponent XsdTimezoneComponent
XsdDateLiteral ::= XsdDateWithoutTimezoneComponent XsdOptionalTimezoneComponent
XsdTimeLiteral ::= XsdTimeWithoutTimezoneComponent XsdTimezoneComponent
XsdDateWithoutTimezoneComponent ::= -?[0-9]{4}-(((0(1|3|5|7|8)|1(0|2))-(0[1-9]|(1|2)[0-9]|3[0-1]))|((0(4|6|9)|11)-(0[1-9]|(1|2)[0-9]|30))|(02-(0[1-9]|(1|2)[0-9]))) /* xgc:regular-expression */
XsdTimeWithoutTimezoneComponent ::= ([0-1][0-9]|2[0-4]):(0[0-9]|[1-5][0-9]):(0[0-9]|[1-5][0-9])(\.[0-9]{3})? /* xgc:regular-expression */
XsdOptionalTimezoneComponent ::= ((\+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z)? /* xgc:regular-expression */
XsdTimezoneComponent ::= ((\+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z) /* xgc:regular-expression */
UkDateLiteral ::= (((0[1-9]|(1|2)[0-9]|3[0-1])\/(0(1|3|5|7|8)|1(0|2)))|((0[1-9]|(1|2)[0-9]|30)\/(0(4|6|9)|11))|((0[1-9]|(1|2)[0-9])\/02))\/[0-9]{4} /* xgc:regular-expression */
PositiveNonZeroIntegerLiteral ::= [1-9][0-9]* /* xgc:regular-expression */ /* A Natural Number, positive integer */
PositiveIntegerLiteral ::= [0-9]+ /* xgc:regular-expression */ /* A Natural Number, non-negative integer */
NumericLiteral ::= -?[0-9]+(\.[0-9]+)? /* xgc:regular-expression */ /* A Real Number, expressed as an integer or decimal */
StringLiteral ::= "\"" [^"]* "\"" /* xgc:regular-expression */ /* Any characters except: quotation mark */
CharacterLiteral ::= "'" [^\r\n\f'] "'" /* xgc:regular-expression */ /* Any characters except: carriage-return, line-break, form-feed and apostrophe */
WildcardLiteral ::= "*"
Ident ::= [A-Za-z0-9\-_\.]+ /* xgc:regular-expression */

Extra-grammatical Constraints

xgc:regular-expression

The right-hand side of the expression is expressed using a Regular Expression. The Regular Expression syntax used is that from Java's Pattern class.

xgc:unordered

Implies that each distinct symbol in the expression may appear in any order.

Acknowledgements

Many thanks to: