File Management
CsvPath can be used as an operational framework, as well as a validation language. Using it this way requires only that you work with instances of the CsvPaths
class, rather than the CsvPath
class used for one-off validations. It's a small change that makes a big difference.
CsvPaths
applies csvpath statements to delimited files and stores the results for further use. This page explains how and where data and validation files are managed.
There are three file sets:
Delimited data files
Csvpath validation files
Results data and metadata
Each of these types is managed in a specific location in a structured way. CsvPaths
allows you to put your files wherever you like, but when you register files for use they are imported into a structured environment.
It is worth pointing out that while your CsvPaths
instance's results files are created for your use, the data files and csvpath validation files areas are for the library's internal use, not yours. You should never need to touch these files. Nevertheless, it is good to know how your data and validation rules are being managed.
Data files
CsvPaths
's validation is applied to named-files. Named-files are just file paths that are referred to by simple names. E.g. you might have a file at
/my/data/files/quarterly-statement.xlsx
You would present that file to CsvPaths's FileManager like this:
When you run this code CsvPaths copies /my/data/files/quarterly-statement.xlsx
into its named-files area. By default the named-files location is ./inputs/named_files
. Your file would live in the "home folder" for that name: ./inputs/named_files/quarterly
. Within that home folder you would see a directory called quarterly-statement.xlsx
.
Inside the quarterly-statement.xlsx
directory you would find one or more .xlsx
files with sha256 hash names. A sha256
is an cryptographic algorthm that fingerprints data. Its fingerprints look like: 12467d811d1589ede586e3a42c41046641bedc1c73941f4c21e2fd2966f188b4
This mathematical fingerprint is unique to quarterly-statement.xlsx
's exact set of characters. When your CsvPaths
instance imports your files it captures a version and names it with its fingerprint. If you update a file and reimport it you will see another hash-named file containing the new version. In this way CsvPaths
keeps an exact identity for every file, version-by-version.
In your data file's home directory you will also see a manifest.json
file. manifest.json
contains the record of changes to the file. The contents look like:
Each time you make a change to quarterly-statement.xlsx
and re-present it to CsvPaths
's file manager a new object will be added to this JSON structure with information about the new version. The JSON object's file
key ultimately identifies the physical file used to run a validation, like this:
Here is a screenshot of the structure:
And again in text, the structure is like this:
Csvpath validation files
CsvPath Language validation files contain one or more csvpath validation statements separated by ---- CSVPATH ----
delimiters. You can create them as single files or single-csvpath files assembled by directory or JSON structure. However you do it, when you add your named-paths to a CsvPaths
instance's PathsManager
, your csvpath files are copied into a central area. By default CsvPaths
's named-paths live at ./inputs/named_paths
.
The named-paths structure is a bit simpler than named-files. Your paths changes are noted, but the files are not versioned. And all named-paths files under one name are stored as a single file called group.csvpaths
. Like in named-files, a manifest.json lives in each name's home directory. manifest.json keeps track of file changes using this JSON structure:
Each time you re-add your csvpaths under a named-paths name your CsvPaths instance will add another JSON object tracking the file fingerprint and time. The contents of the csvpath's scanning and matching parts is captured in each run's metadata. The date of the run can be easily compared to the dates the csvpaths were changed. We expect that your real version control will be done using Git or another configuration management system, as you would do with any other development assets.
You may also notice other JSON files called definition.json
in the named-paths home directories. definition.json
is captured when you use JSON to define your named-paths group(s). It is just a stright capture of the JSON copied and renamed definition.json
, with no other modifications. That means if you have a JSON file that defines three named-paths groups each of the named-paths group home directories will have that same JSON file copied in as definition.json
, and each copy will have the complete JSON for all three named-paths groups.
For example, a JSON named-paths definition from the Another Example how-to article looked like:
Both the orders
and the top_matter_import
named-paths home directories received their own identical copy of the above JSON in a definition.json
.
Here is the structure of the named-paths area:
And in text form, the structure is:
The results files
The files created as the results of running named-paths against a named-file are super important. And there are more of them with more choices and opportunities for your work approach. Read all about results files here.
Last updated