CSV Validator is a CSV validation and reporting tool which implements
CSV Schema Language.
Released as Open Source under the Mozilla Public Licence version 2.0.
This page is for CSV Validator 1.1 (and related minor releases), the equivalent page for the previous 1.0 release is now to be found at csv-validator-1.0.html.
The CSV Validator will take a CSV Schema file and a CSV file, verify that the CSV Schema itself is syntactically correct and then assert that each rule in the CSV Schema holds true for the CSV file. The Validator is predominantly written in Scala 2.11 and run's on any platform with a Java Virtual Machine (JVM). The Validator toolset provides:
The National Archives receive Metadata along with Collections of Digitised or Born-Digital Collections. Whilst The National Archives typically process Metadata in XML and RDF, it was recognised that it was too difficult and/or expensive for many suppliers to produce the desired metadata in XML and/or RDF, as such it was decided that Metadata would be received in CSV format.
Our experience shows that when suppliers are asked to produce metadata in XML or RDF there are several possible barriers:
The National Archives set exacting requirements on the Metadata that they expect and the format of that Metadata. Such constraints enable them to automatically process it, as the semantics of the metadata are already defined. Whilst previous bespoke tools have been developed in the past for validating data in various CSV files, it was felt that a generic open tool which could be shared with suppliers would offer several benefits:
CSV Validator 1.1 introduces the new features of CSV Schema Language 1.1. Other changes from the original 1.0 release include full column name validation: a schema error will now be produced if the names of columns listed in the schema file do not match the actual names of the columns in the CSV file to be validated (when a header row is present). The output pane of the GUI version of the validator is now cleared when a new validation run is begun, rather than leaving output of a previous run visible until new output is produced.
Release candidates for CSV Validator 1.2, introducing the new features in the draft CSV Schema Language 1.2 are also being made available. So far (other than the version number shown), there are no interface changes, so features and screenshots should be equally applicable.
The core of the CSV Validator is written in Scala 2.11, Scala runs in the Java Virtual Machine, so requires Java 7 or newer to be installed on your computer in order to run. The latest release build is automatically published from the source code repository on GitHub, to Maven Central currently (26 January 2016), Version 1.1.5.
To install, download the appropriate application.zip file for either csv-validator-ui (for the GUI Application) or csv-validator-cmd (for the Command Line Application). Simply extract the entire contents of the zip to your desired installation location. The validation components are identical, but csv-validator-ui contains additional components to create the GUI interface. The other difference between the packages is in the .bat and shell scripts supplied. For csv-validator-ui they launch the GUI, for csv-validator-cmd they accept command line parameters to allow the CSV file and CSV schema to be selected (and other options to be used).
The latest release candidate for CSV Validator 1.2 can also be found on Maven Central currently (26 January 2017), Version 1.2-RC2.
In order to run the CSV Validator via the GUI, navigate to the directory where you installed csv-validator, and then to the bin subdirectory and run either:
To validate data you must supply a full filepath for the csv file to be validated, and for the schema it is to be validated against. You can either type, or copy-and-paste, the filepath into the appropriate box, these are labelled "CSV file" and "CSV Schema file" respectively. Alternatively, click the buttons labelled "..." (highlighted in the image below):
This will open up a standard "File open" dialogue, allowing you to navigate to the file in the file system, and select it:
Clicking "Open" will populate the related text box:
While in the most straightforward cases, you can validate a CSV file immediately after supplying the filepaths, in many instances you will need to supply Settings options. Using the Settings is equivalent to providing options when running via the Command Line (note that there is one additional option at the command line that is not available within Settings, to produce a detailed report on the parse of the schema itself). To open up Settings, click on the downward facing double arrow below the file dialogue buttons (highlighted below):
Having opened up Settings, you will see that there are six sets of options:
CSV Encoding, which has a dropdown menu giving the choice of:
CSV Schema Encoding, which has a dropdown menu giving the choice of:
Fail on first error?;
Enforce case-sensitive filepath checks?;
Validate CSV for valid UTF-8 characters, and;
With this setting selected, validation will terminate when the first
Validation Error is encountered in the CSV file.
If there are no validation errors the whole CSV file will be validated and return
You should be aware that the CSV Schema can declare that some validation tests will produce only validation warnings, rather than validation errors.
A validation warning will not stop validation, even with "Fail on first error" set.
This setting is equivalent to using the -f true or --failfast true options when running via the Command Line.
With this setting selected, validation treats filepaths as being case sensitive, even if the file system itself is not (e.g. on Windows NTFS). Without doing this data which appears to validate on a Windows machine may then fail when further processing is carried out on a system where the file system is case sensitive (e.g. Unix/Linux). This setting is equivalent to using the -c true or --case-sensitive-paths true options when running via the Command Line.
This setting is on by default, meaning that the validator also verifies that the CSV file appears to contain only UTF-8 characters. It is ignored if CSV Encoding has been changed to something other than UTF-8, and can otherwise be disabled by unticking the checkbox. This setting is equivalent to using the --disable-utf8-validation option when running via the Command Line.
Path Substitutions are a more slippery concept to grasp. As described in the Background section above, the original use case at The National Archives was for a tool to validate metadata files. These files generally describe a set of digital files being transferred to The National Archives and so contain information, including a filepath and checksum, about each file in a particular transfer. In addition to checking that the metadata files are correctly structured, we wish to check that the file actually exists at the filepath specified, and that when recalculated, the checksum for the files matches the value supplied in the CSV file. To this end, the CSV Schema language contains a set of so-called External Single Expressions, which allow reference to other file system resources outside the CSV file being validated. However, it is likely that after transfer the drive letter or mount point (at least) will have changed from the original. Path Substitutions allow us to overcome this at validation time by effectively performing a simple find-and-replace within the file during validation which maps the supplied filepath to the equivalent on the system where validation is actually taking place. It is also important to note that the CSV Validator expects filepaths to be supplied as file URIs (so on Windows systems, you may need to reverse the slashes).
Many path substitutions can be supplied to allow for the case where the content files are in many different locations.
To create a Path Substitution in the GUI, click the "Add Path Substitution..." button (highlighted):
This will open a popup:
in the first box enter the text string which will be found in the supplied CSV file, and in the second the text string which should be substituted. So considering the examples already used in this document, the CSV file is found at Y:\UAT_2\ADM362B003\tech_acq_metadata_v1_ADM362B003.csv and we can see that there is a folder named ADM_362 alongside the CSV file, that folder contains the images which the CSV file describes. When created, the metadata CSV file and images were on the C: drive, instead of the Y: drive, and they are also now in a project-related folder structure. So the filepaths in the metadata CSV file all begin file:///c:/ADM_362/ - when we run the validator, we actually need to look for the content files at file:///y:/UAT_2/ADM362B003/ADM_362/ when we are checking for their existence and verifying the associated checksums. So we enter those filepath fragments into the "From:" and "To:" boxes respectively, and click OK. This gives:
Creating a Path Substitution in this way is equivalent to running via the Command Line with the -p or --path flags and supplying a key:value pair.
To edit a substitution, simply double-click on the field you wish to edit:
Hit <Enter> or <Tab>, or click on the other side of the substitution, to complete the edit.
To completely remove a path substitution, right-click on an existing path substitution definition, this will give a context menu with a single option,
Remove Path Substitution", clicking that will delete the path substitution definition.
Having selected a CSV file and schema, and chosen which settings (if any) to apply, validation is performed by clicking the large "Validate" button. As of Version 1.0-RC2, there is a progress bar (but due to a threading issue) that and output to the report pane (below the "Validate" button) is currently only updated when all rows have been validated (Issue #34 has been raised to address this). If checksum verification is being performed, the validation can take a considerable amount of time, as each file described in the CSV file has to be read in full in order to calculate the checksum. The mouse pointer will change to the egg-timer (or equivalent) while validation is taking place.
If the data is completely clean, and validates perfectly against the schema, the only output returned to the output pane will be
If you are developing schemas, it is possible to make syntax errors in the schema, for example a mismatch in the declared number of columns
(using the Total Columns Directive) against the actual number of
Column Definitions included in the schema (eg
@totalColumns = 41 but number of columns defined = 42 at line: 2, column: 1).
The schema itself is always checked before validation of the data begins, and schema errors always terminate validation.
Or, if the Version Declaration has been omitted from the schema, or is incorrect,
you will see a schema error saying that the Version Declaration is not present
[1.1] failure: version 1.0 missing or incorrect). You will also see this if you accidentally switch over the input filepaths in the validator,
and put the filepath for the CSV data file into the field for the schema and vice versa:
When Validation Errors occur they will be reported in the output pane on the completion of validation.
The report will be prefixed
Error: and will then show the Validation Rule where the error has occurred,
the line and data column from the CSV file where the error has occurred, the actual data present in the file, and for rules such as checksum validation,
what the computed value of the data was.
The line number refers to data lines only, so if the CSV file contains a header row, you may see an apparent discrepancy in the line numbers displayed when you view
the data in a text editor or spreadsheet program, compared to the line number indicated by the csv-validator.
Error: checksum(file($file_path), "SHA-256") file "file:///TEST_1/1/1/1_1_001.xml" checksum match fails for line: 1, column: file_checksum, value: "fb58b56a17af0f52cf794c108e0c1574a3a2c02b25e22699668bb43801028432". Computed checksum value:"fb58b56a17af0f52cf794c108e0c1574a3a2c02b25e22699668bb43801028431" FAIL
The next image shows a more varied selection of Validation Errors, and demonstrates that the basic format of the error messages is consistent:
The full text of each error message is shown below, the errors are: a checksum mismatch, note that both the checksum read from the CSV file, and the actual computed checksum value are shown; then a regex error, the regex shown only allows the number before .xml to go up to 010, rather than value of 011 that has been put in the CSV file; the concatenation of the various strings and field values shown does not match the value actually found in the CSV file (in this instance you would have to manually check the CSV file to see which field is in error); the image_width value is outside the specified range; and finally, an incorrect value ahs been supplied for image_format.
Error: checksum(file($file_path), "SHA-256") file "file:///TEST_1/2/2/2_2_010.xml" checksum match fails for line: 40, column: file_checksum, value: "f7ab46b417eecb57c43e62996646780c1e9e2aa7da4af361dc4616c5aff26f65". Computed checksum value:"d59a033ebd100b08f1c93a0475c9f33f898bb13740fed61f8e86435d01a70e86" Error: regex(".*__((00[1-9])|(010)).xml$") fails for line: 41, column: file_path, value: "file:///TEST_1/2/2/2_2_011.xml" Error: is(concat("http://datagov.nationalarchives.gov.uk/66/", $department, "/", $series, "/", $piece, "/", $item, "/", $file_uuid)) fails for line: 41, column: resource_uri, value: "http://datagov.nationalarchives.gov.uk/66/TEST/1/2/2/ab677099-950e-4853-b2d4-6d39dc1b8722" Error: range(2000,*) fails for line: 41, column: image_width, value: "234" Error: is("x-fmt/392") fails for line: 41, column: image_format, value: "x-fmt/391" FAIL
If you wish to save the validation output as a report, for example, in order to send it to whoever supplied the data so that they can correct it, simply click the "Save" button towards the bottom right of the CSV Validator window. If you have opened the settings pane, you may need to close it again, or move your mouse to the bottom of the window and drag downwards to expand it, in order to be able to see this button. This will open a standard save dialogue box. Save the output as a simple text file at your desired location.
All standard means of closing the window should work, or click the close button at the bottom right of the validator window. If you have opened the settings pane, you may need to close it again, or move your mouse to the bottom of the window and drag downwards to expand it, in order to be able to see this button.
In order to run the CSV Validator via the command line, set your working directory to the bin subdirectory of the installation directory and execute either:
validate.baton Windows operating systems, or;
validateshell script on Unix/Linux.
Error: Missing argument
Error: Missing argument Try --help for more information.
Subsequently, executing with the
--help argument, should produce the following help text:
CSV Validator - Command Line 1.0 Usage: validate [options] <csv-path> <csv-schema-path> --help Prints this usage text -t | --trace-parser Prints a trace of the parser parse -f <value> | --fail-fast <value> Stops on the first validation error rather than reporting all errors -p:<key>=<value> | --path:<key>=<value> Allows you to substitute a file path (or part of) in the CSV for a different file path -c <value> | --case-sensitive-paths <value> Enforces case-sensitive file path checking. Useful when validating on case-insensitive filesystems like Windows NTFS -x <value> | --csv-encoding <value> Defines the charset encoding used in the CSV file -y <value> | --csv-schema-encoding <value> Defines the charset encoding used in the CSV Schema file --disable-utf8-validation Disable UTF-8 validation for CSV files. --show-progress Show progress <csv-path> The path to the CSV file to validate <csv-schema-path> The path to the CSV Schema file to use for validation
As a minimum you must supply two command line parameters in order to validate a CSV file. The first parameter is the path to the CSV file you wish to validate, the second the path to the schema. If a the path contains spaces, wrap it appropriately, e.g. in double quotes (") on Windows.
Nine options (plus --help, which merely reproduces the usage text given above) are available to modify certain behaviours of the CSV Validator, or to provide debugging assistance.
The trace option produces terminal output describing how the CSV Schema has been parsed by the CSV Validator. Complex schemas have sometimes taken several minutes to parse, or even hung completely, in earlier versions of the CSV Validator. This output allows developers to understand where these problems have arisen in order to troubleshoot the parser.
This option is only available at the command line, and not via the Settings panel in the GUI version of the application.
To use the fail fast option, supply either of the flags
--fail-fast along with a (boolean) value.
1 as the supplied value switches the option on. With this option in force,
validation will be terminated when the first Validation Error is encountered.
By default (or if you supply
0 with this option flag) validation continues to the end of the CSV file
being validated, and all validation errors are reported.
However, Schema Errors always cause immediate termination of validation, regardless of the use of this option.
To use the path substitution option, supply either of the flags
--path along with a
The reasons for using path substitutions are discussed in more detail in Path Substitutions section under the use of GUI.
key represents the string to be found in the CSV file being validated, the
value the string to replace that with.
For the example path substitution given above, (in the Path Substitutions section under the use of GUI) the command line parameter would be entered as:
Note that the paths are expressed as a URL, so characters (such as spaces) that are not permitted in URLs will need to be escaped (see Example CSV Schemas below).
To use the case sensitive paths option, supply either of the flags
--case-sensitive-paths along with a (boolean) value.
1 as the supplied value switches the option on. With this option in force,
file paths used with Column Validation Expressions such as
File Exists Expressions will be treated as being case sensitive even on operating systems where
that would not normally be the case, such as Windows NTFS.
By default (or if you supply
0 with this option flag) the behaviour of the host operating system will be used.
This helps to avoid inconsistent validation results when validating the same CSV file and schema on different operating systems.
To specify that the CSV file that you are validating uses a different character set encoding than the default (
UTF-8 without byte order marker),
supply either of the flags
--csv-encoding along with the name of the character set encoding.
To specify that the CSV Schema file that you are parsing uses a different character set encoding than the default (
UTF-8 without byte order marker),
supply either of the flags
--csv-schema-encoding along with the name of the character set encoding.
By default the validator also verifies that the CSV file appears to contain only UTF-8 characters.
UTF-8 validation is automatically disabled if the flags
have been used to specify an encoding other than UTF-8. To disable UTF-8 validation for some other reason,
use the flag --disable-utf8-validation (there is deliberately no short form for this flag)
Setting the flag
--show-progress causes the validator to produce an output statement for each line of the CSV file,
processing 1 of 41. The first number represents the line number within the file (including any header row),
while the second number represents the total number of lines in the file (including any header row).
This is likely to be extended in future releases to allow finer grained control over the frequency of output, particularly for larger CSV files.
The file system path for the CSV to be validated, expressed in the standard manner for the operating system you are using.
The file system path for the CSV Schema to be used for validation, expressed in the standard manner for the operating system you are using.
All output is written to stdout by default, and so can be written to file by the usual terminal redirection means. Reporting is otherwise identical to that produced by the GUI client, which is described in the section Validating a CSV file and understanding program output above.
In order to understand how to use the Validator in practice, see the example CSV Schema file, digitised_surrogate_tech_acq_metadata_v1_TESTBATCH000.csvs, in the GitHub repository digital-preservation/csv-schema. In the example-data subfolder you will find a CSV file, digitised_surrogate_tech_acq_metadata_v1_TESTBATCH000.csv, which complies with the schema. This CSV file refers to XML files in the folder structure below TEST_1
If you download the csv-schema repository and run the csv-validator (by either of the methods described above), you will find that initially it reports validation errors
as it cannot find the XML files for the fileExists or checksum column validation tests. If you create a path substitution, replacing file:///TEST_1/ with
file:///<your system path>/TEST_1/ (if you simply placed the repository on your C:\ drive on a Windows system, <your system path> will look something like,
C:\csv-schema\example-schemas\example-data, but as it's a URL, you should reverse the slashes, C:/csv-schema/example-schemas/example-data - in addition if there are any spaces in the path,
you must escape them by replacing them with %20) you should then find that all tests pass. On a Unix-based system you will not need to reverse the slashes,
but spaces and other special characters will still need to be escaped. There is an additional complication that the % character is treated as a
special character in Windows scripts, and itself needs to be escaped, so you should actually use %%20 to obtain the desired behaviour (there can be a further complication
if you are calling out from a controlling script to launch the validator ie
call validate.bat where the outer script will do a first replacement of the escaped
character, and so within the call statement you would actually need to use %%%%20 to obtain the desired behaviour (this is passed to the call statement itself as %%20).
Performance of the CSV Validator is generally limited by the rate at which data can be read from storage. This is particularly true when the CSV Schema used includes checksum verification for files referred to in the metadata file, when the time taken to complete validation is almost entirely driven to read the files and compute the checksums.
The CSV file being validated is read line-by-line so the size of the file does not generally affect the memory required for processing,
even when handling CSV files of the order of 0.5 gigabytes in total size, and a million rows.
However, if the schema uses the
unique Column Validation Expression to ensure that a given column
(or combination of columns) has a unique value for each row of the CSV file, then a hash map is built by the validator to check the uniqueness.
With this constraint, and a CSV file of the order of 0.5 gigabytes, and a million rows,
it may be necessary to increase the Max Heap Size available to the Java Virtual Machine so that validation can complete.
If the heap is not sufficient, validation will stall, and after a period of time Java will return an error message similar to:
java.lang.OutOfMemoryError: Java heap space."
Changing the heap size involves setting the -Xmx option, either via system settings (this will probably require administrative rights on the machine),
or by amending the the .bat or shell script used to launch the validator. Search within the relevant file for
immediately after that, insert a space and something like
-Xmx3072m - both these example set the heap size to 3 gigabytes.
The largest value that has been found necessary at The National Archives is 6 gigabytes. Max heap size should always be less than the available system memory.
System restrictions on 32 bit versions of Windows can severely restrict the actual maximum that will be applied.