The intention of this document is two-fold:
The National Archives DRI (Digital Repository Infrastructure) system archives digitised and born-digital materials provided by OGDs (Other Government Departments) and occasionally NGOs (Non-Governmental Organisations). For the purposes of Digital Preservation the system processes and archives large amounts of metadata, much of this metadata is created by the supplying organisation or by transcription. The metadata is further processed, and ultimately stored both online in an RDF Triplestore and a majority subset archived in a non-RDF XML format. However it was recognised that the creation of XML or RDF metadata by the supplier was most likely unrealistic for either technical or financial reasons. As such, CSV was recognised as a simple data format that is human readable (to a degree), that almost anyone could create simply; CSV is the lowest common denominator of structured data formats.
The National Archives have strict rules about various CSV file formats that they expect, and how the data in those file formats should be set out. To ensure the quality of their archival metadata it was recognised that CSV files would have to be validated. It was recognised that development of a schema language for CSV (and associated tools) would be of great benefit. It was also further recognised that a general CSV Schema language would be of greater benefit if it was made publicly available and invited collaboration from other organisations and individuals; the problem of CSV data formats is certainly not unique to The National Archives.
CSV Schema is a standard currently guided by The National Archives, but developed in an open source collaborative manner that invites collaboration and contributions from all interested parties.
A reference implementation has been created to prove the standard: The open source CSV Validator application and API, offers both CSV Schema parsing and CSV file validation.
The design of the CSV Schema language has been influenced by a few guiding principles, understanding these will help you to understand how and why it is structured the way that it is.
The language should be expressible in plain text and should be simple enough that non-technical domain experts could easily write it without having to know a programming language or data/document modelling language such as XML, JSON or RDF.
Note, the CSV Schema Language is NOT itself expressed in CSV, it is expressed in a simple text format.
A schema rule is written for each column of the CSV file. Each set of column rules are asserted against each row of the CSV file. Each rule in the CSV Schema operates on the current context (e.g. defined Column and parsed Row), unless otherwise specified. This makes the rules short and concise.
CSV files may be very large and so the CSV Schema Language was designed with concern for implementations, that although not required by the specification, MAY wish to read and process CSV data as a stream. Few operations require mnemonization of data from the CSV file, and where they do this is limited and should be optimisable to keep memory use to a minimum.
We try to do the right thing by default. CSV files and their brethren (Tab Separated Values etc.) can come in many shapes and sizes, by default we assume the CSV data format will comply with [[RFC4180]], of course we allow you to customize this behaviour in the CSV Schema.
This is worth stressing as it was something we had to keep sight of ourselves during development; CSV Schema is a simple data definition and validation language for CSV!
A CSV Schema is really a rules based language which defines how data in each cell should be formatted. Rules are expressed per-column of the CSV data. Rules are evaluated for each row in the CSV data. A column rule may express constraints based on the content of other columns in the same row, however at present there is no scope for looking forward or backward through rows directly. However, it is possible to check that a cell entry is unique within that column in the CSV file (or that the value of a combination of cells is unique)
A CSV Schema is made up of two main parts:
In turn this comprises (at most) two sections (the second being OPTIONAL):
The CSV Schema MUST explicitly state (as its first non-comment line) the version of the CSV Schema language that it uses. This is to allow for future evolution of the CSV Schema language to be easily handled by CSV Schema processors.
Global Directives apply to all processing of the CSV data. Global Directives for example allow you to define the separator sequence between columns in the CSV data. Global Directives appear before Column Rules and are prefixed with an @
character.
The use of Global Directives is OPTIONAL, default values are used if they are not specified.
The Body of the CSV Schema MUST declare, in order, a Column Rule for each Column in the CSV data. If validation of a Column is not desirable, then an empty rule is used.
Let's now illustrate a simple CSV Schema that is concerned with CSV data about names, ages and gender:
version 1.1 @totalColumns 3 name: notEmpty age: range(0, 120) gender: is("m") or is("f") or is("t") or is("n")This CSV Schema basically defines that the CSV data must have 3 columns: the first column, name, must have some sort of value; the second column, age, must be a number between 0 and 120 inclusive; and the third column, gender, must be one of the characters m, f, t or n. An example of CSV data that would match the rule definitions in the CSV schema could be as follows:
name,age,gender james,21,m lauren,19,f simon,57,m
An example of CSV data would fail the rule definitions in the CSV schema could be as follows:
name,age,gender james,4 years,m lauren,19,f simon,57,male
The Invalid CSV Data example above fails when validated against the CSV Schema because: 1) at row 2 column 2, "4 years" is not a number between 1 and 120 inclusive, and 2) at row 4 column 3, "male" is not one of the characters m, f, t, or n.
The last 18 months with CSV Schema Language 1.0 being in regular use at The National Archives has highlighted a few additional Column Validation Expressions that would provide further useful validation, simplify schema writing, or make schemas more readable. In addition the concept of a String Provider has been extended to allow concatenation to produce a final string input to expressions from some set of other String Providers, and also a function to allow removal of a Windows file extension to make certain comparisons more straightforward and robust.
The new Column Validation Expressions are the:
The CSV schema language is formally a context-free grammar expressed in Extended Backhaus-Naur Form (EBNF - see also [[RFC5234]])
The following subsections examine the structure of a CSV Schema in more detail. Each subsection comprises definitions of terms, cross-references to other definitions, the relevant portion of the EBNF (links on the lefthandside go to the appendix containing the full EBNF, those on the right to a fuller explanation of those term(s)), and examples of correct usage.
A Schema MUST comprise both Prolog and Body.
[1] | Schema | ::= | Prolog Body |
The Prolog of a CSV Schema MUST contain the Version Declaration and MAY contain one or more Global Directives.
[2] | Prolog | ::= | VersionDecl GlobalDirectives |
The Version Declaration declares explicitly which version of the CSV Schema language is in use.
This MUST be either version 1.0
or version 1.1
.
If the version is not valid this is considered a Schema Error.
If the version is declared as 1.0 but the CSV Schema attempts to use features of 1.1 this is also considered a Schema Error.
The Version Declaration is MANDATORY.
[3] | VersionDecl | ::= | ("version 1.0" | "version 1.1") |
version 1.0
The Global Directives allow you to modify the overall processing of a CSV file or how subsequent Column Definitions are evaluated.
The use of Global Directives within a CSV Schema is OPTIONAL.
The last two Global Directives described (No Header Directive and Ignore Column Name Case Directive) are mutually exclusive,
they MUST NOT both be used in a single schema.
There is no inherent reason why the Global Directives should be in the order shown, EBNF does not directly cater for unordered lists.
You could explicitly list each possible ordering, but that would require 4!=24 orderings to be included in the ENBF.
All directives (both Global Directives and Column Directives) used in the CSV Schema are indicated by the Directive Prefix,
defined as the character @
i.e. the [[UTF-8]] character code 0x40
.
Whitespace is not generally significant, so Global Directives can be entered all on a single line, or each on separate lines (see ).
[4] | GlobalDirectives | ::= | SeparatorDirective? QuotedDirective? TotalColumnsDirective? PermitEmptyDirective? (NoHeaderDirective | IgnoreColumnNameCaseDirective)? | /* xgc:unordered */ |
[5] | DirectivePrefix | ::= | "@" |
The Separator Directive allows you to specify the separator character that is used between columns in the CSV data.
As with all Global Directives the Separator Directive is OPTIONAL, if not supplied the default value is assumed.
By default the separator is a comma (,)
i.e. the [[UTF-8]] character code 0x2c
(as specified in [[RFC4180]]).
The Separator Directive takes a MANDATORY parameter in the form of either a Separator Tab Expression or a Separator Character.
A Separator Tab Expression indicates that the separator comprises a 0x09
.
A Separator Character is a Character Literal: the character which is to be treated as the column separator.
[6] | SeparatorDirective | ::= | DirectivePrefix "separator" (SeparatorTabExpr | SeparatorChar) |
[7] | SeparatorTabExpr | ::= | "TAB" | '\t' |
[8] | SeparatorChar | ::= | CharacterLiteral |
The Quoted Directive allows you to specify whether or not all columns are quoted.
That is to say that their values are encased in quotation mark characters, i.e. [[UTF-8]] character code 0x22
.
In practice most CSV libraries are able to detect and handle the presence or absence of quotes and handle it appropriately,
but implementations of this schema language should be able to decide how to handle this situation.
[9] | QuotedDirective | ::= | DirectivePrefix "quoted" |
The Total Columns Directive allows you to specify the total number of data columns expected to make up each row of the CSV file. The parser will also verify that the Body of the CSV Schema contains the same number of Column Rules, a mismatch is considered a Schema Error. The Total Columns Directive is OPTIONAL: when this directive is not used this verification of the number of Column Rules cannot be performed, and it will be assumed that you have supplied the correct number of Column Rules.
[10] | TotalColumnsDirective | ::= | DirectivePrefix "totalColumns" PositiveNonZeroIntegerLiteral |
This is a new expression in CSV Schema Language 1.1
The Permit Empty Directive allows you to specify that the CSV file can be empty: i.e. there is no row data. The Permit Empty Directive is OPTIONAL: when not present an empty file will cause a validation error. The Permit Empty Directive can be used in conjunction with the No Header Directive thereby permitting a completely empty CSV file. If the No Header Directive is not present then a minimum of one row containing column names must be provided.
[11] | PermitEmptyDirective | ::= | DirectivePrefix "permitEmpty" |
The No Header Directive is used to indicate that the CSV file to be validated does not contain a header row: i.e. all rows are data rows. The No Header Directive is OPTIONAL: when this directive is not used the parser assumes by default that the first row of the CSV file to be validated contains column names, not data, and so the first row is skipped during validation.
The No Header Directive is mutually exclusive to the use of the Ignore Column Name Case Directive, when one is used, the other MUST NOT be.
[12] | NoHeaderDirective | ::= | DirectivePrefix "noHeader" |
The Ignore Column Name Case Directive is intended to tell the parser to ignore mismatches in case between the Column Identifiers supplied in a CSV file to be validated and those used in giving the Column Definitions in the schema.
The Ignore Column Name Case Directive is mutually exclusive to the use of the No Header Directive, when one is used, the other MUST NOT be.
[13] | IgnoreColumnNameCaseDirective | ::= | DirectivePrefix "ignoreColumnNameCase" |
@separator ';' @quoted @totalColumns 21 @noHeader
@separator TAB @quoted @totalColumns 21 @permitEmpty @ignoreColumnNameCase
The Body of a CSV Schema comprises at least one Body Part, each of which is a combination of OPTIONAL Comments with a Column Definition (in either order). A Column Definition MUST be included.
[14] | Body | ::= | BodyPart+ |
[15] | BodyPart | ::= | Comment* ColumnDefinition Comment* |
There are two types of Comment: either Single Line or Multiple Line.
A Single Line Comment is started with two forward slashes (//)
, i.e. the [[UTF-8]] character codes 0x2F 0x2F
.
It is terminated by any [[UTF-8]] character that creates a line-break.
A Multiple Line Comment is started using the combination of a forward slash (/)
and an asterisk (*)
,
i.e. the [[UTF-8]] character codes 0x2F 0x2A
.
It is terminated by the reverse combination, asterisk (*)
forward slash (/)
,
i.e. the [[UTF-8]] character codes 0x2A 0x2F
.
Any [[UTF-8]] character except asterisk may be used between these comment markers, even if it forces a new line.
Comments do not need to start at the beginning of a line, but must be either before or after a complete Column Definition or another Comment.
[16] | Comment | ::= | SingleLineComment | MultiLineComment | |
[17] | SingleLineComment | ::= | //[\S\t ]* | /* xgc:regular-expression */ |
[18] | MultiLineComment | ::= | \/\*(?:[^*\r\n]+|(?:\r?\n))*\*\/ | /* xgc:regular-expression */ |
//This Comment is a Single Line Comment it terminates at this line break /*This Comment is a Multi Line Comment: it can go on for as many lines as you like, until you type*/
Column Definitions comprise a Column Identifier or Quoted Column Identifier followed by a colon (:)
,
i.e. the [[UTF-8]] character code 0x3A
, followed by a Column Rule.
There MUST be a Column Definition for every column in the CSV that will be validated against the Schema, however the Column Rule can be left empty if no validation is needed for a specific column.
[19] | ColumnDefinition | ::= | (ColumnIdentifier | QuotedColumnIdentifier) ":" ColumnRule |
There are two classes of identifier that can be used for columns, the original simple Column Identifier, and the Quoted Column Identifier.
A Column Identifier is either a Positive Non Zero Integer Literal (most commonly used when the CSV file to be validated has no header row - see No Header Directive) which indicates the offset of the column (starting from 1), or an Ident.
The Quoted Column Identifier allows a greater range of characters to be used in naming the column than can be supported by an Ident,
but the identifier MUST be wrapped in quotation marks (")
,
i.e. the [[UTF-8]] character code 0x22
(this is implicit in its definition as a String Literal).
Identifiers MUST be unique within a single Schema.
[20] | ColumnIdentifier | ::= | PositiveNonZeroIntegerLiteral | Ident |
[21] | QuotedColumnIdentifier | ::= | StringLiteral |
a_column_identifier "a quoted column identifier"
A Column Rule is a combination of any number of Column Validation Expressions, along with OPTIONAL Column Directives. You MAY use an empty Column Rule if there is no requirement for an individual column to be validated.
As Column Validation Expressions are the primary means of applying validation, they are described in their own full section of this document. The range and variety of expressions available make supplying comprehensive examples here impractical, though some will be used to show the basic structure of a Column Rule.
White space is not generally important within a Column Rule, but the whole rule must be on a single line.
[22] | ColumnRule | ::= | ColumnValidationExpr* ColumnDirectives |
There are four OPTIONAL Column Directives that are used
to modify aspects of how the Column Rules are evaluated.
Like Global Directives, Column Directives are indicated by the Directive Prefix,
defined as the character @
i.e. the [[UTF-8]] character code 0x40
.
The Column Directives are the Optional Directive, the Match Is False Directive, the Ignore Case Directive, and the Warning Directive. The column directives may be specified in any order (though there is no straightforward way to express this in EBNF without listing every possible order).
[23] | ColumnDirectives | ::= | OptionalDirective? MatchIsFalseDirective? IgnoreCaseDirective? WarningDirective | /* xgc:unordered */ |
The Optional Directive is used when completion of the data field in the original CSV file to be validated is OPTIONAL. When this directive is used the data in the column is considered valid if the Column Rule evaluates to true, or if the column is empty.
[24] | OptionalDirective | ::= | DirectivePrefix "optional" |
The Match Is False Directive is used to flip the result of a test from negative to positive (or vice-versa). It may be very simple to write a condition which matches the data considered to be invalid, while the equivalent for valid data would be very convoluted.
[25] | MatchIsFalseDirective | ::= | DirectivePrefix "matchIsFalse" |
The Ignore Case Directive is used when the case of a column value is not important. Two strings which differ only in the case used for characters within the string would be considered a match for all string related column rules.
[26] | IgnoreCaseDirective | ::= | DirectivePrefix "ignoreCase" |
The Warning Directive is used to convert a Validation Error into a Validation Warning. This is useful if you wish to be alerted to a data condition which is unusual, but not necessarily invalid. For instance, at The National Archives we have come across archival material where the clerk who originally completed a form wrote down an "impossible" date such as 30 February or 31 April. We have to do our best to accept the data as originally supplied (we have no idea if it is the day or month of the date which is actually incorrect), but we also wish to ensure that additional Quality Assurance checking is performed to ensure this is not a transcription error. Warnings are listed in the validation report, but the data file MAY still considered valid if only warnings are present.
[27] | WarningDirective | ::= | DirectivePrefix "warning" |
a_column_title: is("somedata") or is("otherdata") @optional @matchIsFalse @ignoreCase @warning another_column_title: not("somedata") and not("otherdata") @ignoreCase
The two Column Definitions are both validating the data in their respective columns against the explicit strings somedata
and otherdata
.
Ignoring the Column Directives for the moment, the column rule defined for a_column_title
would return true if the CSV data for that column
contained either the precise string somedata
or otherdata
. However, the Optional Directive means a completely empty column
would also be acceptable. Also, since the Ignore Case Directive is also applied, the strings SomeData
or OTHERDATA
(for example)
would also be acceptable. But, since the Match Is False Directive is in effect, the validation is inverted, it would actually be any string other
than the two specified which would be regarded as acceptable data. Since the Warning Directive is also used, a validation failure would not be considered
an error though.
The second Column Definition (with the effect of the Match Is False Directive on the first taken into account) is actually logically equivalent to the first (see De Morgan's Laws). However, since the Optional Directive has not been used, an empty column would not be valid data, and since the Warning Directive has also not been included, a Validation Warning would be raised instead of a Validation Error.
The key building blocks for Column Rules are Column Validation Expressions.
These are divided into two main classes, Non Conditional Expressions and Conditional Expressions.
Non Conditional Expressions boil down to checks resulting in a pass or fail (a number of expressions may be combined to produce an overall validation check), Conditional Expressions allow for more subtle checks,
if for example you are validating a title
column which allows the values Mr
, Mrs
, Ms
, Miss
and Dr
,
You could construct a Conditional Expression which also checks the sex
column and if that contains female
,
then Mr
would be regarded as invalid (strictly speaking that would also require the use of an Explicit Context Expression to refer to the other column,
but that is a subexpression of the Non Conditional Expression class).
NOTE To increase control over expression applicability and to avoid creating a left-recursive grammar (which could lead to problems for various parser implementations), Column Validation Expressions have been further split into Combinatorial Expressions and Non Combinatorial Expressions.
[28] | ColumnValidationExpr | ::= | CombinatorialExpr | NonCombinatorialExpr |
A Combinatorial Expression combines one or more Column Validation Expressions, which allows more complicated tests on the validity of data in a column. There are two types, Or Expressions and And Expressions. They are of equal precedence, and evaluation of Column Validation Expressions is performed from left-to-right. See also the Any Expression which is logically equivalent to a series of Is Expressions joined by Or Expressions.
[29] | CombinatorialExpr | ::= | OrExpr | AndExpr |
An Or Expression is used as a standard boolean operator to indicate that the column data should be treated as being valid if either (or both) the expressions linked by the Or Expression evaluate to true.
[30] | OrExpr | ::= | NonCombinatorialExpr "or" ColumnValidationExpr |
An And Expression is used as a standard boolean operator to indicate that the column data should be treated as being valid when both the expressions linked by the And Expression evaluate to true. Use of an explicit And Expression is OPTIONAL: if two Column Validation Expressions are written in succession for the same column they will be treated as having an implicit And Expression joining them.
[31] | AndExpr | ::= | NonCombinatorialExpr "and" ColumnValidationExpr |
A Non Combinatorial Expression is a Column Validation Expression which is evaluated by itself, unless it is combined with another through a Combinatorial Expression. The majority of Column Validation Expressions are of the Non-Combinatorial Expression class.
[32] | NonCombinatorialExpr | ::= | NonConditionalExpr | ConditionalExpr |
Non Conditional Expressions are divided into three classes of sub-expressions: Single Expressions, External Single Expressions, and Parenthesized Expressions. The first two are individual validation checks (differing in that the second allows access to some resource outside the CSV file being validated), whilst the last provides a mechanism for controlling the evaluation order of complex compound expressions.
[33] | NonConditionalExpr | ::= | SingleExpr | ExternalSingleExpr | ParenthesizedExpr |
Single Expressions are the basic building blocks of Column Rules. There are currently 27 available for use as of CSV Schema Language 1.1 (and some have their own subexpressions used as parameters), although the first is really used as an OPTIONAL modifier to the rest. In many cases values can be provided to the test either as an explicit string (or number where appropriate), or by reference to the value held by another column.
[34] | SingleExpr | ::= | ExplicitContextExpr? (IsExpr | AnyExpr | NotExpr | InExpr | StartsWithExpr | EndsWithExpr | RegExpExpr | RangeExpr | LengthExpr | EmptyExpr | NotEmptyExpr | UniqueExpr | UriExpr | XsdDateTimeExpr | XsdDateTimeWithTimeZoneExpr | XsdDateExpr | XsdTimeExpr | UkDateExpr | DateExpr | PartialUkDateExpr | PartialDateExpr | Uuid4Expr | PositiveIntegerExpr | UpperCaseExpr | LowerCaseExpr | IdenticalExpr) |
The Explicit Context Expression is used to indicate that the expression following should be tested against the value in a foreign column (explicit context), rather than the current column (which is the default context).
[35] | ExplicitContextExpr | ::= | ColumnRef "/" |
[36] | ColumnRef | ::= | "$" (ColumnIdentifier | QuotedColumnIdentifier) |
a_column: is("some string") and $another_column/starts("some string") //here two tests are combined on a single line, the second test here looks to the second column another_column: //to check it's value starts with "some string"
An Is Expression checks that the value of the column is identical to the supplied string or the value in the referenced column.
[37] | IsExpr | ::= | "is(" StringProvider ")" |
a_column: is("some string") //the contents of a_column must be the string "some string" another_column: is($a_column) //the contents of another_column must be the value held in a_column, treated as a string
This is a new expression in CSV Schema Language 1.1
An Any Expression checks that the value of the column is identical to one of the supplied strings or the values in the referenced columns. This is logically equivalent to a series of Is Expressions joined by Or Expressions, but slightly more compact to write and maintain.
[38] | AnyExpr | ::= | "any(" StringProvider ("," StringProvider)* ")" |
a_column: is("some string") another_column: any("some other string",$a_column,"another string") //any of the string values given here are valid, including referencing the string held in a_column
A Not Expression checks that the value of the column is not equal to the supplied string or the value in the referenced column.
[39] | NotExpr | ::= | "not(" StringProvider ")" |
a_column: not("some string") //the value of a_column must not be the string "some string" another_column: not($a_column) //the value of another_column must not be the value held in a_column
An In Expression checks that the value of the column contains the supplied string or the value in the referenced column (i.e. the column value is a super string of the supplied value).
[40] | InExpr | ::= | "in(" StringProvider ")" |
a_column: in("some string") //the value of a_column must be a substring of "some string" eg "some" or "string" or "me st" etc another_column: in($a_column) //the value of another_column must be a substring of the contents of a_column
A Starts With Expression checks that the value of the column starts with the supplied string or the value in the referenced column.
[41] | StartsWithExpr | ::= | "starts(" StringProvider ")" |
a_column: starts("some string") //the value of a_column must start with the string "some string" eg "some strings" or "some string is here that's really long" another_column: starts($a_column) //the value of another_column must start with the contents of a_column
An Ends With Expression checks that the value of the column ends with the supplied string or the value in the referenced column.
[42] | EndsWithExpr | ::= | "ends(" StringProvider ")" |
a_column: ends("some string") //the value of a_column must end with the string "some string" eg "here is some string" or "this really long string ends with some string" another_column: ends($a_column) //the value of another_column must end with the contents of a_column
A Regular Expression Expression checks the value of the column against the supplied Regular Expression.
Whilst obviously many of the the other Column Validation Expressions could be written as Regular Expressions, it is felt that that would make the resulting Schema harder to read, and much harder to write for less technical users. As such it is recommended, that if a Column Rule can be written without regular expressions, by instead using other Column Validation Expressions, then that approach should be attempted first.
The Regular Expression syntax used is that from Java's Pattern class.
[43] | RegExpExpr | ::= | "regex(" StringLiteral ")" |
a_column: regex("[bcm]at") //the value of a_column must match the regular expression [bcm]at ie a string containing "bat", "cat" or "mat" another_column: regex("[0-5]") //the value of another_column match the regular expression [0-5] ie a string containing only the digits only 0-5.
The definition of this expression in CSV Schema Language 1.1 extends the definition originally made in CSV Schema Language 1.0
A Range Expression checks that the value of the column is a number lying between, or equal to, the supplied upper and lower bounds. One or other bounding value MAY be left unrestricted (by using the Wildcard Literal), so that the expression can also be used to check that a column is at least some value, or at most some value. One bounding value is defined as a Numeric Or Any expression which accepts a Numeric Literal (a Real Number expressed as a decimal), or the Wildcard Literal, while the other is a Numeric Literal only. Therefore at least one of the bounding values MUST be a Numeric Literal rather than a Wildcard Literal, so valid Range Expressions SHALL define ranges of:
range(10, *)
states that the column value MUST be at least 10range(*, 10)
ensures that the column value MUST be at most 10 (and may be negative).range(4, 7)
ensures that the column value MUST be between 4 and 7 (inclusive).[44] | RangeExpr | ::= | "range(" (NumericOrAny "," NumericLiteral | NumericLiteral "," NumericOrAny) ")" |
[45] | NumericOrAny | ::= | NumericLiteral | WildcardLiteral |
A Length Expression checks that the number of characters in the column meets the supplied definition. You can define the length in one of four ways:
length(10)
ensures that the length is exactly 10 characters.length(10, *)
ensures that the length is at least 10 or more characterslength(*, 10)
ensures that the length is at most 10 characters.length(4, 7)
ensures that the length is at least 4 characters and at most 7 characters.In order to do this the expression takes two input parameters, the first is defined as OPTIONAL, the second MUST be supplied. Both take the form of a Positive Integer Or Any expression, which is either a Positive Integer Literal (which actually includes zero), or a Wildcard Literal.
[46] | LengthExpr | ::= | "length(" (PositiveIntegerOrAny ",")? PositiveIntegerOrAny ")" |
[47] | PositiveIntegerOrAny | ::= | PositiveIntegerLiteral | WildcardLiteral |
An Empty Expression checks that the column has no content.
[48] | EmptyExpr | ::= | "empty" |
a_column: empty //there must be no value in a_column
A Not Empty Expression checks that the column has some content, though precisely what does not matter.
[49] | NotEmptyExpr | ::= | "notEmpty" |
a_column: notEmpty //there must be some value in a_column, but it can be absolutely anything
A Unique Expression checks that the column value is unique within the CSV file being validated (within the current column, the value may occur elsewhere in the file in another column, as in a primary-foreign key relationship in a database). You can also specify a comma separated list of Column References in which case the combination of values of those columns (for the current row) must be unique within the whole CSV file.
[50] | UniqueExpr | ::= | "unique" ("(" ColumnRef ("," ColumnRef)* ")")? |
a_column: unique //a_column must hold a unique value in each row of the CSV file another_column: unique($a_column,$another_column) //the combination of values in the two columns must be unique looking across each row of the CSV file
A URI Expression means that the value in the column MSUT be a valid URI as defined in [[!RFC3986]].
[51] | UriExpr | ::= | "uri" |
a_column: uri //the value of a_column must be a valid URI
An XSD Date Time Expression checks that the data in the column is expressed as a valid XML Schema dateTime data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]). You can also provide an OPTIONAL from and to date-times (inclusive) to ensure that the value in the column falls within an expected date-time range.
As the XSD Date Time Expression uses the XSD Date Time Literal the final, timezone, part of the [[!ISO8601]] definition is OPTIONAL.
[52] | XsdDateTimeExpr | ::= | "xDateTime" ("(" XsdDateTimeLiteral "," XsdDateTimeLiteral ")")? |
a_column: xDateTime //the value of a_column must be a valid xDateTime another_column: xDateTime(2014-10-04T00:00:01Z,2015-12-03T23:59:59) //the value of another_column must be a valid xDateTime and between the two xDateTimes shown (inclusive) //as shown, the xDateTime values may, or may not, have a component indicating a specific timezone, here Z (Zulu) for UTC (Greenwich Mean Time)
An XSD Date Time With Time Zone Expression is identical to XSD Date Time Expression except that it uses XSD Date Time With Time Zone Literal rather than XSD Time Literal, which means that the time zone component in [[!XMLSCHEMA-2]] and [[!ISO8601]] MUST be used. Again, you can also provide an OPTIONAL from and to date-times, (inclusive) to ensure that the value in the column falls within an expected date-time range. These are also defined as XSD Date Time With Time Zone Literal so MUST have the time zone component.
[53] | XsdDateTimeWithTimeZoneExpr | ::= | "xDateTimeTz" ("(" XsdDateTimeWithTimeZoneLiteral "," XsdDateTimeWithTimeZoneLiteral ")")? |
a_column: xDateTime //the value of a_column must be a valid xDateTime another_column: xDateTime(2014-10-04T00:00:01+02:00,2015-12-03T23:59:59+02:00) //the value of another_column must be a valid xDateTime and between the two xDateTimes shown (inclusive) //now the time zone component (+02:00) must be included
An XSD Date Expression checks that the data in the column is expressed as a valid XML Schema date data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]). You can also provide OPTIONAL from and to dates (inclusive) to ensure that the value in the column falls within an expected date range.
[54] | XsdDateExpr | ::= | "xDate" ("(" XsdDateLiteral "," XsdDateLiteral ")")? |
a_column: xDate //the value of a_column must be a valid xDate another_column: xDateTime(2014-10-04,2015-12-03) //the value of another_column must be a valid xDate and between the two xDates shown (inclusive)
An XSD Time Expression checks that the data in the column is expressed as a valid XML Schema time data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]). You can also provide OPTIONAL from and to times (inclusive) to ensure that the value in the column falls within an expected time range.
[55] | XsdTimeExpr | ::= | "xTime" ("(" XsdTimeLiteral "," XsdTimeLiteral ")")? |
a_column: xTime //the value of a_column must be a valid xTime another_column: xTime(00:00:01+02:00,2015-12-03T23:59:59+02:00) //the value of another_column must be a valid xTime and between the two xTimes shown (inclusive) //the time zone component (+02:00) is optional
A UK Date Expression checks that the data in the column is expressed as a valid UK-style date: dd/mm/yyyy
, a UK Date Literal.
You can also provide OPTIONAL from and to dates (inclusive)
to ensure that the value in the column falls within an expected date range.
[56] | UkDateExpr | ::= | "ukDate" ("(" UkDateLiteral "," UkDateLiteral ")")? |
a_column: ukDate //the value of a_column must be a valid ukDate another_column: ukDate(04/10/2014,03/12/2015) //the value of another_column must be a valid ukDate and between the two ukDates shown (inclusive)
A Date Expression allows a full date to be constructed from several columns, or strings. The expression takes five arguments: you MUST supply three strings or Column References representing Year, Month and Day (although supplied as strings, these values must in fact be integers); and there are two OPTIONAL parameters to ensure the date falls in a range specified by from XSD Date Expression and to XSD Date Expression.
[57] | DateExpr | ::= | "date(" StringProvider "," StringProvider "," StringProvider ("," XsdDateLiteral "," XsdDateLiteral)? ")" |
year_column: date($year_column,$month_column,$day_column) /*the full date to be checked is made up from the values from the three columns indicated, as shown, the columns can appear in CSV file in any order, but must appear in the date expression in the order year, month, day*/ day_column: date($year_column,$month_column,$day_column,2014-10-04,2015-12-03) //this second example shows that you can also supply from and to dates as xDates, both dates are inclusive month_column:
A Partial UK Date Expression is essentially the same as a UK Date Expression,
but allows for difficulties in transcribing from original archival material by accepting a question mark (?)
,
i.e. the [[UTF-8]] character code 0x3F
in place of illegible digits in any position, or an asterisk (*)
,
i.e. the [[UTF-8]] character code 0x2A
in place of a missing value. The names of months may also be supplied as the full name in English, i.e.:
January, February, March, April, May, June, July, September, October, November, December.
As dates may not be complete, it is impossible to determine reliably if a date falls within a given range, so there is no option to supply one.
[58] | PartialUkDateExpr | ::= | "partUkDate" |
a_column: partUkDate //the value of a_column must be a valid ukDate, but may also include characters ? and * to represent illegible or missing characters, and month names in full
A Partial Date Expression combines elements of Partial UK Date Expression with those of Date Expression, with the date being made up of columns or strings as in Date Expression, but also allowing the characters representing uncertainty as in Partial UK Date Expression. However, the constituent parts of the date MUST be supplied as Year, Month, Day.
[59] | PartialDateExpr | ::= | "partDate(" StringProvider "," StringProvider "," StringProvider ")" |
year_column: date($year_column,$month_column,$day_column) //the full date to be checked is made up from the values from the three columns indicated, day_column: //as shown, the columns can appear in CSV file in any order, but must appear in the date expression in the order year, month, day month_column: //the date may also include characters ? and * to represent illegible or missing characters
A UUID4 Expression checks that the data in the column is in the form of a Version 4 UUID (Universally Unique Identifier), see [[!RFC4122]]. UUIDs MUST use lowercase hex values.
[60] | Uuid4Expr | ::= | "uuid4" |
a_column: uuid4 //the value of a_column must be a valid version 4 UUID
A Positive Integer Expression checks that the column contains an integer value, greater than or equal to zero.
[61] | PositiveIntegerExpr | ::= | "positiveInteger" |
a_column: positiveInteger //the value of a_column must be a positive integer (including zero)
This is a new expression in CSV Schema Language 1.1
An Upper Case Expression checks that the column content is all upper case, for all code points in the [[!UTF-8]] character set which have a defined case. Non-cased code points (e.g. numeric characters, punctuation and white space) are also permitted.
In Java's Pattern class this could be expressed via the following Regular Expression: "^[\p{Lu}\p{N}\p{P}\s]*$".
[62] | UpperCaseExpr | ::= | "upperCase" |
a_column: upperCase //the contents of a_column must be all be upper case characters (Unicode aware), or uncased
This is a new expression in CSV Schema Language 1.1
A Lower Case Expression checks that the column content is all lower case, for all code points in the [[!UTF-8]] character set which have a defined case. Non-cased code points (e.g. numeric characters, punctuation and white space) are also permitted.
In Java's Pattern class this could be expressed via the following Regular Expression: "^[\p{Ll}\p{N}\p{P}\s]*$".
[63] | LowerCaseExpr | ::= | "lowerCase" |
a_column: lowerCase //the contents of a_column must be all be lower case characters (Unicode aware), or uncased
This is a new expression in CSV Schema Language 1.1
An Identical Expression asserts that the value of the column MUST be identical for every row within a CSV file, without having to specify precisely what that value will be when writing the CSV Schema.
[64] | IdenticalExpr | ::= | "identical" |
batch_date: identical xDate batch_code: identical regex("^[A-Z]{3,5}B[0-9]{3}$")
Consider a file that is expected to contain data relating to one particular batch of a process, and should also be the data relating to one particular day. Over time we will receive many such files, so we do not want to amend the schema each day to say what the valid date for the file is, or what the appropriate batch_code is. Instead we give a generic rule for the field content: it's an XSD Date Expression in the case of batch_date; or that the batch_code will comprise three to five uppercase letters, followed by an uppercase B, followed by three digits; and for each row in the file every date must be identical and every batch_code must be identical.
An External Single Expression allows access to resources outside the CSV file being validated in order to verify some information contained within the file. For example, to check that an image file referenced from within a CSV file actually exists, or that a supplied checksum matches the value calculated for a file. The available expressions are the File Exists Expression, Checksum Expression and File Count Expression. Each may be prefixed with an Explicit Context Expression in order to refer to data in a different column.
[65] | ExternalSingleExpr | ::= | ExplicitContextExpr? (FileExistsExpr | IntegrityCheckExpr | ChecksumExpr | FileCountExpr) |
A File Exists Expression checks a filesystem to see if the specified file actually exists at the specified path. It takes an OPTIONAL expression in the form of a String Provider which allows you to supply a string (or reference to a string) with a full filepath to prepend to the contents of the current column (in the case that for example it only contains just the name of the file). The default value for this string is an empty string.
See also the Integrity Check Expression which performs the inverse function of ensuring that all files in a given folder structure have been mentioned in a CSV file.
[66] | FileExistsExpr | ::= | "fileExists" ("(" StringProvider ")")? |
a_column: fileExists //the validator should check the filesystem location indicated by the contents of a_column for the existence of such a file another_column: fileExists("file:///C:/") //here the string "file:///C:/" is prepended to the contents of another_column before the existence check is made
This is a new expression in CSV Schema Language 1.1
An Integrity Check Expression checks a filesystem to see if there are any files present that are not specifically mentioned in the CSV file. It takes three expressions as parameters. The first are two OPTIONAL expressions in the form of String Providers. The first (as in the File Exists Expression) allows you to supply a string (or reference to a string) with a full filepath to prepend to the contents of the current column (in the case that for example it only contains just the name of the file). The default value for this string is an empty string. The second parameter allows a string provider to be given to point to an explicit subdirectory relative to the location of the base path. By default this subdirectory is expected to be called "content". If only a single OPTIONAL parameter is supplied, it will be assumed to be the first, so if you wish to set only the second OPTIONAL parameter, you MUST also explicitly supply the first as an empty string ("") The final expression MUST be supplied. This indicates whether references to subfolders are explicitly included in the CSV file, if the CSV file has a row for each subfolder the exact string "includeFolder" should be given, if the subfolders do not have explicit references, the exact string "excludeFolder" should be given.
Default treatment of case sensitivity should follow the norms of the relevant file system, implementations may wish to include some means to over-ride this, but that is outside the scope of the EBNF.
[67] | IntegrityCheckExpr | ::= | "integrityCheck" "(" (StringProvider ",")? (StringProvider ",")? ("\"includeFolder\"" | "\"excludeFolder\"") ")" |
a_column: integrityCheck("includeFolder") /*the validator should check all file system folders for files that are not referenced in the CSV file the "includeFolder" parameter indicates that there are explicit references to all file folders in the CSV file as the second parameter has not been supplied it defaults to the value "content" meaning that all sub folders must sit within a folder with that name*/ another_column: integrityCheck("file:///C:/","excludeFolder") //here the string "file:///C:/" is prepended to the contents of another_column before the integrity check is made third_column: integrityCheck("","","excludeFolder") //here as an strings are passed for both optional parameters, we indicate that there is no content folder
A Checksum Expression allows the verification of a checksum value supplied in a CSV file by calculating the checksum for a specific file, and comparing it against the supplied value. You can also indicate the type of checksum algorithm to be used. You MUST supply both the file location and a checksum algorithm. File location is given in the form of a File Expression.
The EBNF does not specify valid values for the String Literal representing the checksum algorithm, that is instead implementation defined, however
it is strongly recommended that implementations SHOULD at least support: MD5
, SHA-1
and SHA-256
. It is important to note that the checksum value MUST
use lowercase hexadecimal characters only.
[68] | ChecksumExpr | ::= | "checksum(" FileExpr "," StringLiteral ")" |
file_path: uri //a full filepath for a file file_name: //a filename only for a file a_column: checksum(file($file_path),"SHA-256") /*a_column contains the SHA-256 checksum value supplied for the file, the validator then computes the checksum for the file at file_path and reports an error if they do not match*/ another_column: checksum(file("file:///C:/my_folder/",$file_name),"MD5") /*here the string "file:///C:/my_folder" is prepended to the contents of file_name before the checksum verification is performed, another_column has an MD5 checksum value in it
A File Count Expression allows a column representing the number of files in a particular folder to be verified against the actual files on disk. You MUST provide a File Expression as an input parameter which points to a filesystem folder to compare the count of files against.
[69] | FileCountExpr | ::= | "fileCount(" FileExpr ")" |
file_path: uri //a full filepath for a folder another_column: positiveInteger fileCount(file($file_path)) //another_column contains an integer value, this is checked against the number of files in the file system folder at file_path
The sub-expression used in conjunction with External Single Expressions in order to provide input is a generic File Expression which itself takes two parameters. The first parameter is OPTIONAL, a String Provider that is prepended to the second parameter to create a full path in the event that a column holds only a filename rather than a full filepath. You MUST supply the second parameter which is a String Provider, which resolves to the name of the file.
[70] | FileExpr | ::= | "file(" (StringProvider ",")? StringProvider ")" |
The File Expression is always used as an input parameter to one of the other file expressions, several examples are given above. Here are explained the two parameters that this expression itself takes.
file("file:///a/string/for/a/URI/representing/the/location/of/a/file") /*in the simplest case a single parameter is supplied that's a string representing the full URI of a file. Since it's a URI, characters such as space must be escaped (a space becomes %20) you can either use a string literal as here, or pass a reference to a column using $column_name*/ file("file:///a/string/to/prepend/to/a/filename/to/make/a/full/path","filename") /*Here you provide a string (or reference to a string) that is the base path to prepend to "filename" to get the full path to your file*/
Many Single Expressions and External Single Expressions take a String Provider as an input. A String Provider takes the form of either a Column Reference, a String Literal, Concatenation Expression or a No Extension Argument Provider.
A Column Reference comprises a dollar sign ($)
, i.e. the [[UTF-8]] character code 0x24
,
followed by a Column Identifier or Quoted Column Identifier.
The following are new expressions in CSV Schema Language 1.1
The final two string providers are recursive, themselves taking one or more String Providers as arguments,
and returning a new String Provider.
The Concatenation Expression takes two or more String Providers,
returning a new string that is the concatenation of all those supplied. You MUST provide at least two parameters.
The No Extension Argument Provider removes anything that appears to be a Windows
file extension from the end of a supplied String Provider, and returns a new string.
A string that does not contain a full stop (.)
, i.e. the [[UTF-8]] character codes 0x2D
will be returned unchanged.
[71] | StringProvider | ::= | ColumnRef | StringLiteral | ConcatExpr | NoExtExpr |
[72] | ConcatExpr | ::= | "concat(" StringProvider ("," StringProvider)+ ")" |
[73] | NoExtExpr | ::= | "noExt(" StringProvider ")" |
a_column: is("no file") another_column: any("file:///","https://") third_column: any("C:","example.com") fourth_column: ends(".html") fifth_column: is($a_column) or is(concat($another_column,$third_column,"/",noExt($fourth_column),".pdf") /*in this rather artificial example, fifth_column must be either "no file" (the value of a_column) or a PDF file with the same basic name as the HTML file named in fourth_column, located at either file:///C:/ or https://example.com/ (in fact as written you could have file:///example.com/ or https://C:/ as well)*/
Parenthesized Expressions are used to vary the standard left-to-right evaluation order of evaluation of Combinatorial Expressions. Parenthesized Expressions can be nested, the deepest level will be evaluated first, working outwards. Equally nested Parenthesized Expressions revert to the standard left-to-right evaluation order.
[74] | ParenthesizedExpr | ::= | "(" ColumnValidationExpr+ ")" |
A Conditional Expression is used to apply different Column Validation Expressions to a column on the basis of the result of the evaluation of some other Non Conditional Expression. This is particularly useful when the data expected in one column depends on the value of another column. In the original CSV Schema Language 1.0 there was only one form of Conditional Expression, the If Expression. CSV Schema Language 1.1 introduces the Switch Expression which allows a more compact and readable form for what would otherwise be written as nested If Expressions.
[75] | ConditionalExpr | ::= | IfExpr | SwitchExpr |
The If Expression is the original form of Conditional Expression introduced in CSV Schema Language 1.0.
It takes three input parameters: the first two of these MUST be used, first a Combinatorial Expression or Non Conditional Expression;
when that evaluates to true
, the second parameter, one or more Column Validation Expressions, are applied; the third parameter is OPTIONAL,
these are the else Column Validation Expressions used when the first parameter evaluates to false
.
[76] | IfExpr | ::= | "if(" (CombinatorialExpr | NonConditionalExpr) "," ColumnValidationExpr+ ("," ColumnValidationExpr+)? ")" |
a_column: any("true","false") another_column: any("yes","no") third_column: if($a_column\is("true"),is("some string"),is("some other string")) /*here we look to the value of a_column, if it is "true", third_column must be "some string" otherwise third_column must be "some other string"*/ fourth_column: if(($a_column\is("true") and $another_column\is("yes")),is("some string"),is("some other string")) //in fourth_column we use a more complicated combinatorial expression as the initial test, if both a_column is "true" and another_column is "yes", //then fourth_column is "some string", otherwise fourth_column is "some other string". All column expressions could be used for the test etc, only is is used for simpilcity
This is a new expression in CSV Schema Language 1.1
The Switch Expression generalises the If Expression. It comprises at least one Switch Case Expression followed by a final OPTIONAL parameter,
one or more Column Validation Expressions, (this is the else expression) applied when all previous test expressions have evaluated to false
.
Note that evaluation is halted after the first test expression to return true
.
The Switch Case Expression takes two parameters (equivalent to the first two non-optional parameters of the If Expression),
first a Combinatorial Expression or Non Conditional Expression;
when that evaluates to true
, the second parameter, one or more Column Validation Expressions, are applied.
At least one Switch Case Expression MUST be used within a Switch Expression, but there is no limit on the maximum number used.
[77] | SwitchExpr | ::= | "switch(" SwitchCaseExpr+ ("," ColumnValidationExpr+)? ")" |
[78] | SwitchCaseExpr | ::= | "("( CombinatorialExpr | NonConditionalExpr) "," ColumnValidationExpr+ ")" |
a_column: any("true","false","unknown") another_column: if($a_column\is("true"),is("some string"),if($a_column\is("false"),is("some other string"),is("some third string"))) /*here we look to the value of a_column, if it is "true", another_column must be "some string" otherwise, check if a_column is "false), if so another_column must be "some other string", otherwise another_column is "some third string". Nesting if statments like this can quickly get very difficult to read, so instead we can use the switch statement*/ third_column: switch(($a_column\is("true"),is("some string")),($a_column\is("false"),is("some other string")),is("some third string")) //this switch statement for third_column is functionally identical to the nested if statement demonstrated on another_column, and is much easier to extend //if there were additional options available for a_column, each bracketed pair such of test and column validation expression, //such as ($a_column\is("true"),is("some string")), is called a Switch Case Expression, as many as required are then followed by a final column validation expression, //used if none of the Switch Case Expressions evaluate to true.
Additional examples for a range of Column Expressions is given below. A greater range of example and other schemas actually used by The National Archives can be found on GitHub. Most of these are extensively commented in order to explain usage. There is a also a set of example files to be downloaded which allow File Exists Expressions and Checksum Expressions and path substitutions to be more easily understood, these are designed to be used with the generic_digitised_surrogate_tech_acq_metadata_v1.1.csvs and generic_digitised_surrogate_tech_acq_metadata_v1.0.csvs schemas, which also helps demonstrate the additional checking that CSV Schema Language 1.1 enables.
piece: is("1") and (in($file_path) and in($resource_uri)) /*The column "piece" must have the specific value 1 the value must also be part of the value of the columns "file_path" and "resource_uri" explicit And Expression is used between each specified Column Expression*/ item: range(1,540) unique($piece,$item) //this field must contain an integer between 1 and 540 inclusive. the combination of piece and item must be unique within the file. file_uuid: uuid4 unique /*must be a version 4 uuid, and the value must be unique within the file (uuids must be lower case). Here an implicit And Expression is used*/ file_path: fileExists uri starts(concat("file:///",$piece,"/",$item)) /*fileExists checks that there is actually a file of the given name at the specified location on the file system which is assumed to be the value held in "file_path". We know the location should be in the form of a URI so a URI expression is used, and in particular this should be a file url, so we further specify that the data in the column must start "file:///", then (a folder named for) the piece id, then a /, then the item id */ file_checksum: checksum(file($file_path),"SHA-256") /* Compare the value given in this field to the checksum calculated for the file found at the location given in the "file_path" field. Use the specified checksum algorithm (SHA-256) (must use lowercase hex characters). */ image_split: is("yes") or is("no") //must be string: yes; or string: no (precisely - case as shown) image_split_other_uuid: if($image_split/is("yes"),uuid4,is("")) //if "image_split" field is yes, must be a uuid4, else must be blank image_split_operator: if($image_split/is("yes"),length(1,12) and regex("^[0-9a-zA-Z]{1,12}$"),is("")) /*If "image_split" field is the string: yes (precisely) then field must be 12 characters long. This is further restricted by regex statement to being only alphanumeric characters (upper and lower case). */ image_split_timestamp: if($image_split/is("yes"),xDateTime(2013-12-04T00:00:00+00:00,2014-03-04T23:59:59+00:00),is("")) /*If "image_split" field is string: yes (precisely) then timestamp for image split, compliant with XSD DateTime data type and in range 4 December 2013 - 4 March 2014 (from the midnight starting 4 December, to last second of 4 March), in the UTC (Greenwich Meantime) timezone, else it must be blank (ie "image_split" is no). As xDateTime, rather than xDateTimeTz, is specified, the use of the timezone component within the supplied date time is optional, eg both: 2013-12-05T12:34:00+00:00 and 2013-12-05T12:34:00 would be acceptable in the metadata/*
Most Column Validation Expressions rely on a small number of underlying data types. Some of these are defined by means of a regular expression embedded in the EBNF as indicated by xgc:regular-expression. There are 11 data types, XSD Date Time Literal, XSD Date Literal, XSD Time Literal, UK Date Literal, Positive Non Zero Integer Literal, Positive Integer Literal, Numeric Literal, String Literal, Character Literal, Wildcard Literal and Identifier.
An XSD Date Time Literal MUST be a valid XML Schema dateTime data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]) of the basic form yyyy-mm-ddThh:mm:ss(.MMMMM) (year-month-day followed by time in hours, minutes, seconds and OPTIONAL fractions of a second). There are two OPTIONAL parts, a minus sign MAY be used for BC dates, and there MAY be a suffix indicating the applicable timezone as an offset from GMT/UTC. GMT itself may be indicated by a suffix Z for Zulu. It is represented in the EBNF by a regular expression defining precisely which characters are to be used.
[79] | XsdDateTimeLiteral | ::= | XsdDateWithoutTimezoneComponent "T" XsdTimeLiteral |
An XSD Date Time With Time Zone Literal MUST be a valid XML Schema dateTime data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]) of the basic form yyyy-mm-ddThh:mm:ss(.MMMMM) (year-month-day followed by time in hours, minutes, seconds and OPTIONAL fractions of a second). There is one OPTIONAL part, a minus sign MAY be used for BC dates. However, unlike XSD Date Time Literal there MUST be a suffix indicating the applicable timezone as an offset from GMT/UTC. GMT itself may be indicated by a suffix Z for Zulu, or as +00:00. It is represented in the EBNF by a regular expression defining precisely which characters are to be used.
[80] | XsdDateTimeWithTimeZoneLiteral | ::= | XsdDateWithoutTimezoneComponent "T" XsdTimeWithoutTimezoneComponent XsdTimezoneComponent |
An XSD Date Literal MUST be a valid XML Schema date data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]) of the basic form yyyy-mm-dd (year-month-day). There is one OPTIONAL part, a minus sign MAY be used for BC dates. It is represented in the EBNF by a regular expression defining precisely which characters are to be used. It is also used as the date part of XSD Date Time Literal.
[81] | XsdDateLiteral | ::= | XsdDateWithoutTimezoneComponent XsdOptionalTimezoneComponent |
An XSD Time Literal MUST be a valid XML Schema date data type (see [[!XMLSCHEMA-2]] and [[!ISO8601]]) of the basic form hh:mm:ss(.MMMMM) (time in hours, minutes, seconds and OPTIONAL fractions of a second). There is one OPTIONAL part, there MAY be a suffix indicating the applicable timezone as an offset from GMT/UTC. GMT itself may be indicated by a suffix Z for Zulu. It is represented in the EBNF by a regular expression defining precisely which characters are to be used. It is also used as the time part of XSD Date Time Literal.
[82] | XsdTimeLiteral | ::= | XsdTimeWithoutTimezoneComponent XsdOptionalTimezoneComponent |
The various XSD Date and Time data types from [[!XMLSCHEMA-2]] are made up from common reuseable components that are defined by regular expressions.
[83] | XsdDateWithoutTimezoneComponent | ::= | -?[0-9]{4}-(((0(1|3|5|7|8)|1(0|2))-(0[1-9]|(1|2)[0-9]|3[0-1]))|((0(4|6|9)|11)-(0[1-9]|(1|2)[0-9]|30))|(02-(0[1-9]|(1|2)[0-9]))) | /* xgc:regular-expression */ |
[84] | XsdTimeWithoutTimezoneComponent | ::= | ([0-1][0-9]|2[0-4]):(0[0-9]|[1-5][0-9]):(0[0-9]|[1-5][0-9])(\.[0-9]{3})? | /* xgc:regular-expression */ |
[85] | XsdOptionalTimezoneComponent | ::= | ((\+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z)? | /* xgc:regular-expression */ |
[86] | XsdTimezoneComponent | ::= | ((\+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z) | /* xgc:regular-expression */ |
A UK Date Literal is a data type representing the usual UK format for writing dates, dd/mm/yyyy. It is represented in the EBNF by a regular expression defining precisely which characters are to be used.
[87] | UkDateLiteral | ::= | (((0[1-9]|(1|2)[0-9]|3[0-1])\/(0(1|3|5|7|8)|1(0|2)))|((0[1-9]|(1|2)[0-9]|30)\/(0(4|6|9)|11))|((0[1-9]|(1|2)[0-9])\/02))\/[0-9]{4} | /* xgc:regular-expression */ |
A Positive Non Zero Integer Literal is a data type representing the positive integer natural numbers. It is represented in the EBNF by a regular expression defining precisely which characters are to be used. It is a specialisation of Positive Integer Literal.
[88] | PositiveNonZeroIntegerLiteral | ::= | [1-9][0-9]* | /* xgc:regular-expression */ |
A Positive Integer Literal is a data type representing the non-negative integer natural numbers. It is represented in the EBNF by a regular expression defining precisely which characters are to be used. It is a specialisation of Numeric Literal.
[89] | PositiveIntegerLiteral | ::= | [0-9]+ | /* xgc:regular-expression */ |
A Numeric Literal is a data type representing any real number expressed as an integer or decimal.
[90] | NumericLiteral | ::= | -?[0-9]+(\.[0-9]+)? | /* xgc:regular-expression */ |
A String Literal is zero or more characters (excluding quotation mark) encased witin quotation marks, i.e. the [[UTF-8]] character code 0x22
.
[91] | StringLiteral | ::= | "\"" [^"]* "\"" |
A Character Literal is a single non-breaking character encased within apostrophes, i.e. the [[UTF-8]] character code 0x27
.
[92] | CharacterLiteral | ::= | "'" [^\r\n\f'] "'" |
A Wildcard Literal is a single asterisk (*)
, i.e. the [[UTF-8]] character code 0x2A
.
[93] | WildcardLiteral | ::= | "*" |
An Identifier is the set of characters which can be used in an ordinary Column Identifier. Upper and lower case alphabetic characters (unaccented), along with
digits 0-9, the hyphen-minus (-)
, low line (_)
and full stop (.)
, i.e. the [[UTF-8]] character codes 0x2D
, 0x5F
and 0x2E
.
[94] | Ident | ::= | /* [A-Za-z0-9\-_\.]+ */ |
An implementation MUST first check that the provided CSV Schema(s) are syntactically correct. If not, a Schema Error is produced, and no further validation of the CSV Schema(s) or provided CSV files(s) SHOULD be undertaken. If the schema check is successful then an implementation MAY continue with further CSV Schema(s) and CSV file validation.
If an implementation performs validation of a CSV file against a CSV Schema, a report SHOULD be produced for each Column Validation Expression that fails validation; This is generally considered a Validation Error, unless the Warning Directive has been used to reduce the severity of an error within a specific Column Rule to a Validation Warning.
A Schema Error is caused by syntax errors in the definition of the CSV Schema. These might include for example: an incorrect Version Declaration, or a mismatch between the number specified for Total Columns Directive and the actual number of Column Definitions given in the Body of the Schema. Schema Errors would also be produced by mismatched Parenthesized Expressions, unrecognised Column Validation Expressions and Explicit Context Expressions which do not match an actual Column Identifier.
An implementation MUST report a Schema Error.
If column data does not validate successfully against a Column Rule, an implementation SHOULD report a Validation Error. It is implementation defined whether a Validation Error terminates execution, or whether execution continues. If execution continues, any further errors SHOULD be reported.
NOTE The Warning Directive may be used within a Column Rule to specify that what would normally be a Validation Error should be treated only as a Validation Warning.
A Validation Warning can be used when you wish to highlight unexpected values that are encountered in the data, but for some reason they are not to be considered failures of validation. For example within archival documents the date may have been recorded as an impossible date, such as 30 February or 31 April. A transcriber has correctly entered the data as seen on the original document, but yet it is not a valid date. You may wish to highlight these cases for additional QA, but it should not be considered an error.
This Appendix specifies the media type for CSV Schema Version 1.0 and CSV Schema Version 1.1. CSV Schema is a language for describing and validating CSV files, as specified in the main body of this document. This media type has been submitted to the IESG (Internet Engineering Steering Group) for review, approval, and registration with IANA (Internet Assigned Numbers Authority.)
The text/csv-schema
media type, is intended to be used for transmitting schemas written in the CSV Schema language.
The suggested file extension for use when naming CSV Schema files is .csvs
.
The grammar of CSV Schema uses the same simple EBNF (Extended Backus-Naur Form) notation as [[!XML10]] with the following minor differences.
/*
and */
symbols.xgc:
prefix is an extra-grammatical constraint, the details of which are explained in To increase readability, the EBNF in the main body of this document omits some of these notational features. This appendix is the normative version of the EBNF.
Link conventions used in this appendix:
XsdDateTimeLiteral | ::= | XsdDateWithoutTimezoneComponent "T" XsdTimeLiteral | ||
XsdDateTimeWithTimeZoneLiteral | ::= | XsdDateWithoutTimezoneComponent "T" XsdTimeWithoutTimezoneComponent XsdTimezoneComponent | ||
XsdDateLiteral | ::= | XsdDateWithoutTimezoneComponent XsdOptionalTimezoneComponent | ||
XsdTimeLiteral | ::= | XsdTimeWithoutTimezoneComponent XsdTimezoneComponent | ||
XsdDateWithoutTimezoneComponent | ::= | -?[0-9]{4}-(((0(1|3|5|7|8)|1(0|2))-(0[1-9]|(1|2)[0-9]|3[0-1]))|((0(4|6|9)|11)-(0[1-9]|(1|2)[0-9]|30))|(02-(0[1-9]|(1|2)[0-9]))) | /* xgc:regular-expression */ | |
XsdTimeWithoutTimezoneComponent | ::= | ([0-1][0-9]|2[0-4]):(0[0-9]|[1-5][0-9]):(0[0-9]|[1-5][0-9])(\.[0-9]{3})? | /* xgc:regular-expression */ | |
XsdOptionalTimezoneComponent | ::= | ((\+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z)? | /* xgc:regular-expression */ | |
XsdTimezoneComponent | ::= | ((\+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z) | /* xgc:regular-expression */ | |
UkDateLiteral | ::= | (((0[1-9]|(1|2)[0-9]|3[0-1])\/(0(1|3|5|7|8)|1(0|2)))|((0[1-9]|(1|2)[0-9]|30)\/(0(4|6|9)|11))|((0[1-9]|(1|2)[0-9])\/02))\/[0-9]{4} | /* xgc:regular-expression */ | |
PositiveNonZeroIntegerLiteral | ::= | [1-9][0-9]* | /* xgc:regular-expression */ /* A Natural Number, positive integer */ | |
PositiveIntegerLiteral | ::= | [0-9]+ | /* xgc:regular-expression */ /* A Natural Number, non-negative integer */ | |
NumericLiteral | ::= | -?[0-9]+(\.[0-9]+)? | /* xgc:regular-expression */ /* A Real Number, expressed as an integer or decimal */ | |
StringLiteral | ::= | "\"" [^"]* "\"" | /* xgc:regular-expression */ /* Any characters except: quotation mark */ | |
CharacterLiteral | ::= | "'" [^\r\n\f'] "'" | /* xgc:regular-expression */ /* Any characters except: carriage-return, line-break, form-feed and apostrophe */ | |
WildcardLiteral | ::= | "*" | ||
Ident | ::= | [A-Za-z0-9\-_\.]+ | /* xgc:regular-expression */ |
The right-hand side of the expression is expressed using a Regular Expression. The Regular Expression syntax used is that from Java's Pattern class.
Implies that each distinct symbol in the expression may appear in any order.
Many thanks to: