The Reference Data Types
CsvPath uses a namespace-like path to point to data in various places. These paths are called references. References are integrated into the match components, print output, and the structure of a csvpath. If you want to do lookups from one csvpath to the results or metadata of another, you use a reference. When you need to print data with the `print()` function, you need references.
A reference has this structure:
Let's break this down a bit more.
Part | Description | Example |
---|---|---|
The six data types are always the second component of a reference. Their position in the reference is: `$root.datatype.name.name`.
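To make the positional rule concrete, here is a minimal sketch that splits a reference string into its components and checks that the second one is one of the six data types. `split_reference` and the `Reference` class are hypothetical names used only for illustration; they are not part of the CsvPath API, and the real parser may handle more cases.

```python
from dataclasses import dataclass
from typing import Optional

# The six data types named in this section.
DATA_TYPES = {"variables", "headers", "csvpath", "csvpaths", "metadata", "results"}


@dataclass
class Reference:
    paths_name: str                 # empty string stands for the current csvpath
    data_type: str
    name: Optional[str] = None
    tracking: Optional[str] = None


def split_reference(ref: str) -> Reference:
    """Hypothetical helper: split a reference string into its parts."""
    if not ref.startswith("$"):
        raise ValueError("a reference starts with $")
    parts, current, quoted = [], "", False
    for ch in ref[1:]:
        if ch == '"':
            quoted = not quoted     # dots inside quotes belong to the name
        elif ch == "." and not quoted:
            parts.append(current)
            current = ""
        else:
            current += ch
    parts.append(current)
    if len(parts) < 2 or parts[1] not in DATA_TYPES:
        raise ValueError(f"second component must be one of {sorted(DATA_TYPES)}")
    if len(parts) > 4:
        raise ValueError("a reference has at most four components")
    parts += [None] * (4 - len(parts))  # pad the optional trailing components
    return Reference(*parts)


ref = split_reference("$mypaths.variables.cities.Boston")
print(ref.data_type, ref.name, ref.tracking)
```

The sketch treats an empty paths name, as in `$.variables.my_variable`, as a valid first component, and keeps a quoted header name such as `"my header"` together as one component.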
The types are pretty simple.
`variables` are variables. Variables from completed runs are available from the CsvPath that the reference points to. They are lost only when the Python instance shuts down.
`headers` are headers. The header names and indexes are available post-run. The line-by-line data associated with the headers may or may not be available, depending on whether the run method captured data. At this time CsvPath doesn't offer a way for a reference to point to a header value in an individual row.
`csvpath` is either runtime data about the current csvpath or post-run residual data about another named-paths group the reference points to.
`csvpaths` is the namespace for the identities of the individual csvpaths in the named-paths group.
`metadata` is descriptive data about the csvpath the reference points to.
`results` references point to the `data.csv` intermediate results of the csvpaths in a named-paths group. Each csvpath's `data.csv` is automatically collected (unless configured not to) and positioned in a standard location so that it can be referenced and piped into other csvpaths.
The `csvpath` data type's fields include:
`stopped`: `True` if the csvpath stopped the CsvPath from processing using the `stop()` function. Stopping a CsvPath that is run by a CsvPaths instance does not affect any other CsvPath instances that the parent CsvPaths is also running.
`failed`: `True` if the csvpath failed the CSV file using the `fail()` function. A CsvPath instance that enters the failed state continues to process lines until the end of the CSV file or until the csvpath stops the run by calling the `stop()` function.
`delimiter`: the CSV file delimiter. The default is `","`.
`quotechar`: the character the CSV file uses to quote header names and values
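The difference between `stopped` and `failed` can be illustrated with a toy loop. This is a sketch only: `RunState` and `run` are hypothetical stand-ins, not CsvPath classes, and the comparison against `fail_on` and `stop_on` simply simulates a csvpath calling `fail()` or `stop()` on a given line.

```python
from dataclasses import dataclass


@dataclass
class RunState:
    """Toy stand-in for the stopped and failed fields (not the CsvPath class)."""
    stopped: bool = False
    failed: bool = False
    lines_seen: int = 0


def run(lines, fail_on=None, stop_on=None) -> RunState:
    state = RunState()
    for line in lines:
        state.lines_seen += 1
        if line == fail_on:
            state.failed = True     # failing marks the file; processing continues
        if line == stop_on:
            state.stopped = True    # stopping ends this run only
            break
    return state


only_failed = run(["a", "b", "c", "d"], fail_on="b")
failed_then_stopped = run(["a", "b", "c", "d"], fail_on="b", stop_on="c")
print(only_failed.lines_seen, failed_then_stopped.lines_seen)
```

In the first run every line is still processed after the failure. In the second, the run ends at the line that triggers the stop.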
The parts of the csvpath as their original text strings:
`scan_part`: something like `$myfile[1-10+20-30]`
`match_part`: something like `[concat("validation", "is", "good")]`
The counts of lines, total lines, matches, and scans:
`count_matches`: the 1-based count of the matches that have happened so far in the scan. Counts are 1-based unless noted otherwise.
`count_lines`: the 1-based count of the number of lines seen. These are also referred to as data lines because by default CsvPath skips blanks. You may also see references to "physical" lines. Physical lines means the number of line feeds in the file, regardless of whether they create blank lines.
`count_scans`: the 1-based count of lines seen by the scan so far. If the scan is for 1+3+5 and CsvPaths is at line 3, the count will be 2.
`total_lines`: the 0-based count of all the physical lines in the file. This number is determined before the first line is scanned.
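The distinction between physical lines, data lines, and scanned lines can be sketched in a few lines of plain Python. The helper names below are hypothetical, and the sketch only models the definitions given above; it does not reproduce how CsvPath computes these counts internally.

```python
def physical_and_data_lines(text: str):
    """Return (line feeds in the text, non-blank lines in the text)."""
    physical = text.count("\n")
    data = sum(1 for line in text.split("\n") if line.strip())
    return physical, data


def count_scans(scanned_lines, current_line: int) -> int:
    """How many of the lines selected by the scan have been reached so far."""
    return sum(1 for n in scanned_lines if n <= current_line)


sample = "a,b\n\n1,2\n3,4\n"          # one blank line in the middle
print(physical_and_data_lines(sample))
print(count_scans([1, 3, 5], current_line=3))
```

For a scan of 1+3+5 with the current line at 3, the second helper returns 2, matching the example in the `count_scans` description.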
The validation failed and run stopped properties: `failed` and `stopped`, described above.
Basic timing:
`line_time`: the cumulative time spent processing lines so far
`last_line_time`: the time spent processing the line before the current line
`headers`: a string created from the currently set headers. This is largely for debugging. Keep in mind that the headers can be reset on demand using the `reset_headers()` function. Resetting headers is fairly common due to the irregular way CSV files are often constructed.
The `metadata` fields come from the comments around a csvpath and from the CsvPath files, paths, and results managers.
Metadata's most important contribution is the identity of a csvpath. You set the identity of a csvpath by adding an ID or name field to a comment above or below the csvpath. The ID can be written as:
id: my id
ID: my id
Id: my id
All three forms are recognized. If none is found, the same three forms of the metadata key `name` are looked for. The identity is used for importing csvpaths using `import()`. It is also used by header references and for traceability in validation printouts and logging.
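The lookup order described above can be sketched as a small function over a metadata dictionary. `find_identity` is a hypothetical name, and the sketch assumes the metadata has already been parsed into a plain `dict`; the actual CsvPath implementation may differ.

```python
from typing import Optional

# The three recognized spellings of the ID key, then the same forms of name.
ID_KEYS = ("id", "ID", "Id")
NAME_KEYS = ("name", "NAME", "Name")


def find_identity(metadata: dict) -> Optional[str]:
    """Return the csvpath identity, preferring an ID key over a name key."""
    for key in ID_KEYS + NAME_KEYS:
        if key in metadata:
            return metadata[key]
    return None


print(find_identity({"Id": "orders check", "name": "ignored"}))
print(find_identity({"NAME": "fallback name"}))
```

An ID in any of its three forms wins over a name, and a csvpath with neither key has no identity in this model.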
The other metadata coming from the managers includes:
`paths_name`: the named-paths name
`file_name`: the named-file name
`data_lines`: the count of total data lines. This is a 1-based count of all lines with data. It is set before the first line.
`csvpaths_applied`: the number of csvpaths that will be applied to the CSV file, all keyed under `paths_name`
`csvpaths_completed`: the number of csvpaths completed to that point. This number is static after a run is complete. At that point `csvpaths_completed` may not equal `csvpaths_applied` if some csvpaths stopped themselves using the `stop()` function.
`valid`: whether the file is considered valid according to all the paths applied so far
Much of the information above is available less conveniently from other sources. More importantly, csvpath comments can provide user defined keyed-metadata values. These are similar to tags in AWS, GCP, and Azure. Using metadata fields in your comments can be a huge win for the long-term maintainability of large csvpath collections. Keys take the form of a word with a colon at the end. For example:
This comment would result in two entries in the metadata collection: one under the key `name` and the other under the key `description`. You can add a colon to end a metadata field without starting a new one, like this:
In that example there are two metadata fields: `name` and `description`. The additional comment, "more testing is needed", is not captured in the metadata fields because the `description` field was closed with a following colon and no new field was started. In this way, `name` and `description` are machine-readable, while "more testing is needed" is only for humans.
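The field rules above can be modeled with a short tokenizer. This is a rough sketch under stated assumptions, not CsvPath's parser: `parse_comment_metadata` is a hypothetical name, the sample comment text is invented for illustration, and the sketch assumes the closing colon appears as a standalone token separated by whitespace.

```python
def parse_comment_metadata(comment: str) -> dict:
    """Collect key: value fields from a comment (illustrative model only)."""
    fields = {}
    key = None
    for token in comment.split():
        if token == ":":
            key = None                   # a bare colon closes the current field
        elif token.endswith(":"):
            key = token[:-1]             # a word with a trailing colon opens a field
            fields[key] = []
        elif key is not None:
            fields[key].append(token)    # words belong to the open field
        # words seen while no field is open are for humans only
    return {k: " ".join(words) for k, words in fields.items()}


# Hypothetical comment text; the closing colon is assumed to stand alone.
comment = "name: order check description: validates totals : more testing is needed"
print(parse_comment_metadata(comment))
```

Under these assumptions the trailing words after the bare colon are dropped, so only `name` and `description` end up in the resulting dictionary.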
Part | Description | Example |
---|---|---|
`$` | The root of the csvpath | |
paths-name | The name of a group of csvpaths. This is referred to as a named-paths name. In `print()` statements the name can be empty to indicate the csvpath the reference is in. | `$test.csv[*][yes()]`, `$mypaths.variables.my_variable`, `$.variables.my_variable` |
type-of-data | `variables`, `headers`, `csvpath`, `csvpaths`, `metadata`, `results` | `$mypaths.metadata.description` |
name-of-data-item | Any name. In the case of headers the name can be quoted or can be the index of the header. In the `csvpaths` type the name is the identity of a specific csvpath within the named-paths group. | `$mypaths.headers."my header"`, `$mypaths.headers.0` |
tracking value name | This is called a tracking value. Tracking values are keys in `dict` variables. In the case of references they can also be an index into a `stack()` variable. | `$mypaths.variables.cities.Boston` |
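To show how a tracking value narrows a variable reference, the sketch below resolves a name and an optional tracking value against a plain dictionary of variables. `resolve_variable` and the sample data are hypothetical, and the sketch assumes a stack is held as a Python list indexed from 0; CsvPath's own resolution logic may differ.

```python
def resolve_variable(variables: dict, name: str, tracking=None):
    """Look up a variable, then apply a tracking value if one is given."""
    value = variables[name]
    if tracking is None:
        return value
    if isinstance(value, dict):
        return value[tracking]           # tracking value is a key in a dict variable
    if isinstance(value, (list, tuple)):
        return value[int(tracking)]      # tracking value is an index into a stack
    raise TypeError(f"variable {name!r} does not support tracking values")


# Invented sample data standing in for the variables of a completed run.
variables = {
    "my_variable": 42,
    "cities": {"Boston": 3, "Salem": 1},
    "seen": ["first", "second"],
}

print(resolve_variable(variables, "cities", "Boston"))   # like $mypaths.variables.cities.Boston
print(resolve_variable(variables, "seen", "1"))
```

A reference without a tracking value, such as `$mypaths.variables.my_variable`, corresponds to calling the helper with only the variable name.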