
The Reference Data Types


Last updated 3 months ago

CsvPath uses namespace-like paths to point to data in various places. These paths are called references. References are integrated into the match components, print output, and the structure of a csvpath. If you want to do lookups from one csvpath to the results or metadata of another, you use a reference. You also use references when you print data with the print() function.

The Parts Of a Reference

A reference has this structure:

$paths-name.data-type.name.child

Let's break this down a bit more.

  • $

    • Description: The root of the csvpath.

  • paths-name

    • Description: The name of a group of csvpaths or a named-file. This is referred to as a named-paths name or a named-file name. In print() statements the name can be empty to indicate the currently active csvpath the reference is in.

    • Examples: $test.csv[*][yes()], $mypaths.variables.my_variable, $.variables.my_variable

  • type-of-data

    • Description: One of csvpath, csvpaths, files, headers, metadata, results, or variables.

    • Example: $mypaths.metadata.description

  • name-of-data-item

    • Description: Any name. In the case of headers the name can be quoted or can be the index of the header. In the csvpaths type the name is the identity of a specific csvpath within the named-paths group.

    • Examples: $mypaths.headers."my header", $mypaths.headers.0

  • tracking value name

    • Description: This is called a tracking value. Tracking values are keys in dict variables. In the case of references they can also be an index into a stack() variable.

    • Example: $mypaths.variables.cities.Boston
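To make the four-part structure concrete, here is a minimal sketch of splitting a dotted reference into its parts. It is not CsvPath's own parser: the Reference tuple, the parse_reference function, and the regular expression are hypothetical names introduced only for illustration. The sketch handles only the dotted form ($paths-name.data-type.name.child), not a full csvpath such as $test.csv[*][yes()].

```python
import re
from typing import NamedTuple, Optional

# The seven data types named in this page.
DATA_TYPES = {"csvpath", "csvpaths", "files", "headers",
              "metadata", "results", "variables"}

# A segment is either a double-quoted string or a run of non-dot characters.
_SEGMENT = re.compile(r'"([^"]*)"|([^.]+)')


class Reference(NamedTuple):
    """Hypothetical container for the parts of a reference."""
    paths_name: str                       # empty string means "the current csvpath"
    data_type: str
    name: Optional[str] = None
    tracking_value: Optional[str] = None


def parse_reference(ref: str) -> Reference:
    """Split a dotted reference into root name, data type, name, and tracking value."""
    if not ref.startswith("$"):
        raise ValueError(f"a reference must start with $: {ref!r}")
    # Everything between $ and the first dot is the named-paths or named-file name.
    paths_name, _, rest = ref[1:].partition(".")
    parts = [quoted or bare for quoted, bare in _SEGMENT.findall(rest)]
    if not parts or parts[0] not in DATA_TYPES:
        raise ValueError(f"unknown or missing data type in {ref!r}")
    if len(parts) > 3:
        raise ValueError(f"too many parts in {ref!r}")
    return Reference(paths_name, *parts)


print(parse_reference("$mypaths.variables.cities.Boston"))
print(parse_reference('$mypaths.headers."my header"'))
```

Quoted names are matched before bare ones so that a header name containing spaces or dots stays in one piece.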

Seven Types Of Data

The seven data types are always the second component of a reference. Their position in the reference is: $root.datatype.name.name.

The types are pretty simple.

  • csvpath is either runtime data about the current csvpath or post-run residual data about another named-paths group that the reference points to.

  • csvpaths is the namespace for the identities of the individual csvpaths in the named-paths group.

  • headers are headers. The header names and indexes are available post-run. The data associated with the headers, line-by-line, may or may not be available, depending on whether the run method captured data. At this time CsvPath doesn't offer a way for a reference to point to a header value in an individual row.

  • metadata is descriptive data about the csvpath the reference is pointing to

  • variables are variables. Variables from completed runs are available from the CsvPath that the reference points to. We only lose the variables when the Python instance shuts down.
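As an illustration of how a variables reference with a tracking value might be resolved, the sketch below looks a name up in a plain dict of variables and then applies the tracking value as a dict key or as a stack index. The resolve_variable function is hypothetical, and treating the stack index as 0-based is an assumption, not something this page states.

```python
def resolve_variable(variables, name, tracking_value=None):
    """Look up a variable and, optionally, one tracked value inside it."""
    value = variables[name]
    if tracking_value is None:
        return value
    if isinstance(value, dict):
        # Tracking values are keys in dict variables.
        return value[tracking_value]
    if isinstance(value, (list, tuple)):
        # Assumption: a stack is list-like and the index is 0-based.
        return value[int(tracking_value)]
    raise TypeError(f"variable {name!r} does not hold tracked values")


# Hypothetical variables left over from a completed run.
variables = {"cities": {"Boston": 3, "Salem": 1}, "seen": ["a", "b"], "total": 10}

print(resolve_variable(variables, "cities", "Boston"))  # like $mypaths.variables.cities.Boston
print(resolve_variable(variables, "seen", "1"))
print(resolve_variable(variables, "total"))
```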

The Csvpath Runtime Fields

The csvpath data type's fields include:

  • stopped — True if the csvpath stopped the CsvPath from processing using the stop() function. Stopping a CsvPath that is run by a CsvPaths instance does not affect any other CsvPath instances that the parent CsvPaths is also running.

  • failed — True if the csvpath failed the CSV file using the fail() function. A CsvPath instance that enters the failed state continues to process lines until the end of the CSV file or until the csvpath stops the run by calling the stop() function.

  • delimiter — the CSV file delimiter. By default a ","

  • quotechar — the CSV file character used to quote header names and values

  • The parts of the csvpath as their original text strings:

    • scan_part — something like $myfile[1-10+20-30]

    • match_part — something like [concat("validation", "is", "good")]

  • The counts of lines, total lines, matches, and scans

    • count_matches — the 1-based count (all counts are 1-based) of the matches that have happened so far in the scan

    • count_lines — the 1-based count of the number of lines seen. These are also referred to as data lines because by default CsvPath skips blanks. You may also see references to "physical" lines. Physical lines means the number of line feeds in the file, regardless of whether they create blank lines.

    • count_scans — the 1-based count of lines seen by the scan so far. If the scan is for 1+3+5 and CsvPaths is at line 3 the count will be 2.

    • total_lines — the 0-based count of all the physical lines in the file. This number is found before the first line is scanned.

  • The validation failed and run stopped properties

  • Basic timing:

    • line_time — the cumulative time processing lines so far

    • last_line_time — the time spent processing the line before the current line

  • headers — a string created from the currently set headers. This is largely for debugging. Keep in mind that the headers can be reset on demand using the reset_headers() function. Resetting headers is fairly common due to the irregular way CSV files are often constructed.
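The relationship between the counts can be sketched with a toy model. This is not how CsvPath computes them internally: the Counts class and tally function are hypothetical, and numbering physical lines from 0 is an assumption. It reproduces the example above, where a scan for lines 1, 3, and 5 has a count_scans of 2 when processing reaches line 3.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    count_lines: int = 0    # data lines seen (blank lines are skipped)
    count_scans: int = 0    # lines seen that the scan selected
    count_matches: int = 0  # scanned lines that also matched


def tally(rows, scanned, is_match, current_line):
    """Count data lines, scanned lines, and matches up to and including current_line."""
    counts = Counts()
    for line_number, row in enumerate(rows):
        if line_number > current_line:
            break
        if not any(cell.strip() for cell in row):
            continue  # a blank line is a physical line but not a data line
        counts.count_lines += 1
        if line_number in scanned:
            counts.count_scans += 1
            if is_match(row):
                counts.count_matches += 1
    return counts


rows = [["city"], ["Boston"], [""], ["Salem"], ["Lowell"], ["Quincy"]]
counts = tally(rows, scanned={1, 3, 5},
               is_match=lambda row: row[0].startswith("S"), current_line=3)
print(counts)
```

In this run the blank line at position 2 is not counted as a data line, so count_lines is 3 while count_scans is 2 and count_matches is 1.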

The Metadata Fields

The metadata fields come from the comments around a csvpath and from the CsvPath files, paths, and results managers.

Metadata's most important contribution is the identity of a csvpath. You set the identity of a csvpath by adding an ID or name field to a comment above or below the csvpath. The ID field can take any of these forms:

id: my id

ID: my id

Id: my id

All three forms will be recognized. If no ID field is found, the same three forms of the name key are looked for. The identity is used for importing csvpaths using import(). It is also used by header references and for traceability in validation printouts and logging.
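The lookup order described above can be sketched as follows. The find_identity function is hypothetical, and the assumption that the name key is checked as name, NAME, and Name (mirroring the three ID forms) is an inference from the text, not confirmed behavior.

```python
ID_KEYS = ("id", "ID", "Id")
# Assumption: the name key accepts the same three capitalizations as the ID key.
NAME_KEYS = ("name", "NAME", "Name")


def find_identity(metadata):
    """Return the csvpath identity from its metadata, preferring ID over name."""
    for key in ID_KEYS + NAME_KEYS:
        value = metadata.get(key)
        if value:
            return value.strip()
    return None


print(find_identity({"Id": "my id"}))
print(find_identity({"name": "orders", "description": "nightly batch"}))
```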

The other metadata coming from the managers includes:

  • paths_name — the named-paths name

  • file_name — the named-file name

  • data_lines — the count of total data lines. This is a 1-based count of all lines with data. It is set before the first line.

  • csvpaths_applied — the number of csvpaths that will be applied to the CSV file, all keyed under paths_name

  • csvpaths_completed — the number of csvpaths completed to that point. This number is static after a run is complete. At that point csvpaths_completed may not equal csvpaths_applied if there are csvpaths that were stopped by the csvpath itself using the stop() function.

  • valid — tells us if the file is considered valid according to all the paths applied so far.

Much of the information above is available less conveniently from other sources. More importantly, csvpath comments can provide user-defined, keyed metadata values. These are similar to tags in AWS, GCP, and Azure. Using metadata fields in your comments can be a huge win for the long-term maintainability of large csvpath collections. Keys take the form of a word with a colon at the end. For example:

~ name: Order Batch File Validations
  description: The orders file arrives nightly between 1 - 3 a.m.
~ 

This comment would result in two entries in the metadata collection: one under the key name and the other under the key description. You can add a colon to end a metadata field without starting a new one, like this:

~ name: header reset import
  description: this csvpath is used to reset the headers if they change :
  more testing is needed ~   

In that example there are two metadata fields: name and description. The additional comment, more testing is needed, is not captured in the metadata fields because the description field was closed with a following colon and no new field was started. In this way, name and description are machine-readable, while more testing is needed is only for humans.
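The comment rules above (a word followed by a colon starts a field, and a lone colon closes the current field) can be sketched as a small parser. This is an illustrative approximation, not CsvPath's actual comment parser; the parse_comment_metadata function and its key pattern are hypothetical.

```python
import re

# Assumption: a key is a single word immediately followed by a colon.
_KEY = re.compile(r"^[A-Za-z][\w-]*:$")


def parse_comment_metadata(comment):
    """Collect key: value fields from the text of a csvpath comment."""
    fields = {}
    current = None
    # Drop the surrounding ~ delimiters, then walk the whitespace-separated tokens.
    for token in comment.strip().strip("~").split():
        if token == ":":
            current = None            # a lone colon closes the open field
        elif _KEY.match(token):
            current = token[:-1]      # a word ending in a colon opens a new field
            fields[current] = []
        elif current is not None:
            fields[current].append(token)
        # tokens outside any field are human-only commentary and are ignored
    return {key: " ".join(words) for key, words in fields.items()}


comment = """~ name: header reset import
  description: this csvpath is used to reset the headers if they change :
  more testing is needed ~"""
print(parse_comment_metadata(comment))
```

Applied to the second example, the sketch keeps name and description and drops the trailing human-only note.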

The two remaining data types are files and results:

  • files indicates that the reference is a pointer to a named-file version.

  • results references point to the data.csv intermediate results of the csvpaths in a named-paths group. Each csvpath's data.csv is automatically collected (unless configured not to) and positioned in a standard location so that it can be referenced and piped into other csvpaths.