Named-file Reference Queries

The simplest way to use CsvPath is to refer to files, groups of csvpaths, and results by name. Sometimes, though, a simple name is not enough. At that point you need CsvPath's Reference Language.

By default, CsvPath returns the most recent file or results for a name. Templates add a layer of complexity, but even then, asking for a file or results by name works well if you just need the most recent item. However, with or without templates, you sometimes need to select data files dynamically. There are several typical reasons:

  • You may need to rerun a file for testing or to account for a bug

  • Maybe you missed a run

  • You are running multiple files at once

  • There are multiple systems using different files from the same named-file folder tree

The way we select files dynamically is to use references. References are essentially queries in the simple XPath-like CsvPath Reference Language. Like XPath, CsvPath Reference Language is a way to pick out resources according to their location within a dataset.

Where are references used?

References are mainly used to pick out files within named-files, individual csvpaths in named-paths groups, or named-results. Their form is:

$root.datatype.name_one.name_three

The four sections are:

  • root: a named-file name, named-paths name, or named-results name

  • datatype: an indicator of the type of data we're looking for

  • name_one: the most important id/name/date, etc. we're pointing to

  • name_three: a secondary id/name/date, etc. that helps determine what the reference is to

Wait, name_three? What happened to name_two? We're coming to that below. Spoiler, name_one and name_three can be split to create a name_two and name_four. But that feature is not often used at this time.
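To make the four sections concrete, here is a minimal sketch that splits a reference string into its parts. It is an illustration only, not CsvPath's own parser: the `Reference` type and `split_reference` function are hypothetical names, and the sketch assumes that root and datatype never contain a dot.

```python
from typing import NamedTuple, Optional


class Reference(NamedTuple):
    """Hypothetical container for the four sections of a reference."""
    root: str                  # empty string for a local "$." reference
    datatype: str
    name_one: Optional[str]
    name_three: Optional[str]


def split_reference(ref: str) -> Reference:
    """Split "$root.datatype.name_one.name_three" on its first three dots."""
    if not ref.startswith("$"):
        raise ValueError("a reference starts with a dollar sign")
    # maxsplit=3 keeps any further dots inside the last section
    parts = ref[1:].split(".", 3)
    if len(parts) < 2:
        raise ValueError("a reference needs at least a root and a datatype")
    # pad the optional sections so unpacking always sees four values
    parts += [None] * (4 - len(parts))
    root, datatype, name_one, name_three = parts
    return Reference(root, datatype, name_one or None, name_three or None)


ref = split_reference("$orders.files.acme/2025:all.2025-03-21_:before")
print(ref.root, ref.datatype, ref.name_one, ref.name_three)
```

A name that itself contains a dot, such as a filename with an extension, would need extra handling that this sketch leaves out.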

The datatypes are like XPath's axes, or a scope: they tell you where the root name is. Effectively, they determine what kind of reference you are dealing with. The main datatypes are:

  • files: for references to a named-file

  • csvpaths: for references to a named-paths group

  • results: for references to named-results

Secondarily, there are datatypes to indicate the internal data structures of a running csvpath:

  • variables: to give access to a csvpath's variables

  • metadata: to pull user-defined metadata fields

  • csvpath: for access to runtime stats like the match count or line number

  • headers: to access the current header values, line by line

These latter four datatypes are primarily used "locally", from within a csvpath referring to data of the same csvpath. However, references to variables from different runs are a useful non-local use of these types. Local references forgo the root name and instead start with just $., i.e. dollar sign dot.

References use "pointers" to fill in parts of name_one and name_three. A pointer looks like a colon followed by a word or number. Using a pointer you can complete a date, provide an index, indicate a day, etc. The pointers are:

  • :all — returns all matches. :all is the default behavior. Using :all is just more explicit.

  • :from — used in the csvpaths datatype to indicate a run should start from a certain csvpath

  • :to — like :from, but indicating the run should stop with a certain csvpath

  • :before — selects all files registered before a date

  • :after — selects all files registered after a date

  • :first — selects the first file in a set of matching files

  • :last — selects the most recent or last match

  • :yesterday — converts to the timestamp of 0:00:00 on the previous day

  • :today — converts to the timestamp of 0:00:00 on the present day

  • :n (n = any integer from 0 to 99) — indicates which match to return out of a set of matches
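The pointer syntax above can be sketched as a small helper that separates the name portion of a section from its trailing pointers. This is a hypothetical illustration rather than CsvPath's implementation; `KNOWN_POINTERS` and `split_pointers` are names made up for the example.

```python
# Word pointers taken from the list above; :n is handled separately
KNOWN_POINTERS = {
    "all", "from", "to", "before", "after",
    "first", "last", "yesterday", "today",
}


def split_pointers(section: str) -> tuple[str, list[str]]:
    """Return (name, pointers) for a section like "EMEA/annual/20:first"."""
    name, *pointers = section.split(":")
    for pointer in pointers:
        if pointer in KNOWN_POINTERS:
            continue
        # an index pointer is any integer from 0 to 99
        if pointer.isdecimal() and 0 <= int(pointer) <= 99:
            continue
        raise ValueError(f"unknown pointer: {pointer!r}")
    return name, pointers


print(split_pointers("EMEA/annual/20:first"))
print(split_pointers(":today:1"))
```

A section may consist only of pointers, in which case the name part comes back as an empty string.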

Examples

Named-file references

$orders.files.EMEA/annual/20:first

This points to the orders named-file. It looks for a dataset in or below the EMEA/annual folder. 20 may indicate a folder that has a name starting with 20. The reference would also match a filename starting with 20. Of all the registered files found, the first version registered is selected.

$orders.files.EMEA/annual/20:all

By default a reference will return all items unless there is a pointer indicating only a single match should be returned; e.g. :last returns a single match, if any. The default return, therefore, is :all. But in some cases you might want to be explicit and include :all, even if you don't have to.

$orders.files.:5

This reference returns the sixth file registered under the named-file name. (Remember that indexes are 0-based.)
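The selection behavior of :all, :first, :last, and :n can be pictured as operations on a list of matches kept in registration order. The sketch below assumes exactly that ordering, oldest first; `select` is a hypothetical helper and the file names are invented for the example.

```python
from typing import Optional


def select(matches: list[str], pointer: Optional[str] = None) -> list[str]:
    """Apply a selection pointer to matches listed oldest first."""
    if pointer is None or pointer == "all":
        return list(matches)      # the default: every match
    if pointer == "first":
        return matches[:1]        # first registered, if any
    if pointer == "last":
        return matches[-1:]       # most recently registered, if any
    if pointer.isdecimal():
        index = int(pointer)      # 0-based, so "5" is the sixth match
        return matches[index:index + 1]
    raise ValueError(f"not a selection pointer: {pointer!r}")


# Invented registrations, in the order they were registered
registered = [f"orders-{i}.csv" for i in range(7)]
print(select(registered, "5"))
```

Slicing instead of indexing means an out-of-range pointer yields an empty list rather than an error, which mirrors the "single match, if any" wording above.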

$orders.files.:today:1

:yesterday and :today are stand-ins for dates that make dynamic references a bit easier, though using an actual date is also straightforward. This reference pulls the files registered under the orders named-file today and returns the second one.
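As a rough picture of what the day pointers stand for, the following sketch converts :today and :yesterday to a midnight timestamp. The use of UTC is an assumption made for the example, and `resolve_day_pointer` is a hypothetical name, not part of CsvPath.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def resolve_day_pointer(pointer: str, now: Optional[datetime] = None) -> datetime:
    """Return 0:00:00 of the day that "today" or "yesterday" refers to."""
    # Assumption: days are measured in UTC; pass `now` to pin the clock
    now = now or datetime.now(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if pointer == "today":
        return midnight
    if pointer == "yesterday":
        return midnight - timedelta(days=1)
    raise ValueError(f"not a day pointer: {pointer!r}")


fixed_now = datetime(2025, 3, 21, 14, 30, tzinfo=timezone.utc)
print(resolve_day_pointer("today", fixed_now))
print(resolve_day_pointer("yesterday", fixed_now))
```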

$orders.files.a0de7c859e9058d5e05784b49c7d426cc5844359255aa143b44832f339a8b055

This reference matches a file by the SHA256 hash value of its content. It is an exact match. For practical purposes, two files with different content never share a hash value.
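If you want to see what such a hash looks like for your own data, Python's standard `hashlib` module can compute one. This sketch assumes the fingerprint is the SHA256 of the raw bytes; `content_fingerprint` is a hypothetical helper and the sample CSV content is invented.

```python
import hashlib
import io
from typing import BinaryIO


def content_fingerprint(stream: BinaryIO, chunk_size: int = 65536) -> str:
    """Return the SHA256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    # iter() with a sentinel stops when read() returns empty bytes
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# An in-memory stand-in for a file opened with open(path, "rb")
sample = io.BytesIO(b"id,amount\n1,9.99\n")
fingerprint = content_fingerprint(sample)
print(len(fingerprint), fingerprint)
```

The digest is always 64 hexadecimal characters, the same shape as the value in the reference above.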

$orders.files.2025-01-01_14-30-00:before

This reference returns all the files registered before Jan 1, 2025 at 2:30 PM UTC. The match on the datetime is progressive, meaning that any part of the datetime you don't specify is filled in with 0s.
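Progressive matching can be sketched as completing a partial datetime with default values. The helper below is a hypothetical illustration, not CsvPath's code. It fills missing time parts with 0; because a Python datetime cannot hold a month or day of 0, it fills those with 1, which is an assumption of this sketch.

```python
import re
from datetime import datetime, timezone


def complete_datetime(partial: str) -> datetime:
    """Complete a partial "YYYY-MM-DD_HH-MM-SS" string into a UTC datetime."""
    numbers = [int(n) for n in re.findall(r"\d+", partial)]
    if not numbers or len(numbers) > 6:
        raise ValueError(f"cannot read a datetime from {partial!r}")
    # defaults for year, month, day, hour, minute, second (year is required)
    defaults = [None, 1, 1, 0, 0, 0]
    filled = numbers + defaults[len(numbers):]
    return datetime(*filled, tzinfo=timezone.utc)


print(complete_datetime("2025-01-01_14-30-00"))
print(complete_datetime("2025-03-21_"))
```

With this completion in hand, a :before comparison reduces to an ordinary less-than test between two datetimes.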

$orders.files.acme/2025:all.2025-03-21_:before

This is another reference that returns any number of files registered before a date. This time we're scoping the datetime limitation to those files that have a named-files path beginning with acme/2025. To be clear, the reference would match any file located at staging/orders/acme/2025-EMEA-invoices.csv and registered before the start of March 21, 2025, assuming staging is the name we configured for our named-files area. It might match other filenames within the orders named-file as well, for example staging/orders/acme/2025-US-invoices.csv.
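The combination of a path prefix and a date limit can be pictured as a two-condition filter over registrations. The registration records, their timestamps, and the `files_before` helper below are all invented for illustration; CsvPath keeps its own record of registrations.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Invented (path within the named-file, registration time) pairs
registrations = [
    ("acme/2024-EMEA-invoices.csv", datetime(2025, 1, 5, 10, 0, tzinfo=UTC)),
    ("acme/2025-EMEA-invoices.csv", datetime(2025, 3, 20, 9, 0, tzinfo=UTC)),
    ("acme/2025-US-invoices.csv", datetime(2025, 3, 21, 8, 0, tzinfo=UTC)),
]


def files_before(records, prefix: str, cutoff: datetime) -> list[str]:
    """Keep paths that start with prefix and were registered before cutoff."""
    return [
        path
        for path, registered_at in records
        if path.startswith(prefix) and registered_at < cutoff
    ]


# Mirrors acme/2025:all combined with 2025-03-21_:before
cutoff = datetime(2025, 3, 21, tzinfo=UTC)
print(files_before(registrations, "acme/2025", cutoff))
```

Here only the EMEA file qualifies: the 2024 file fails the prefix test, and the US file was registered after the start of March 21.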
