File Management

CsvPath Framework is a DataOps infrastructure framework. Using it for just its delimited data validation language is totally reasonable for some use cases, but doing so leaves a lot of advantages on the table.

Using the Framework as automation infrastructure requires only that you work with instances of the CsvPaths class, rather than just the CsvPath class. We use CsvPath instances directly only for one-off validations. If you haven't already been using CsvPaths instances, it is a small change that makes a big difference.

CsvPaths applies csvpath statements to delimited files and stores the results for further use. This page explains how and where data and validation files are managed.

There are three file sets:

  • Delimited data files

  • Csvpath validation files

  • Results data and metadata

Each of these types is managed in a specific location in a structured way. CsvPaths allows you to put your files wherever you like, but when you register files for use they are imported into a structured environment.

It is worth pointing out that while your CsvPaths instance's results files are created for your use, the data files and csvpath validation files areas are for the library's internal use, not yours. You should never need to touch these files. Nevertheless, it is good to know how your data and validation rules are being managed.

Data files

CsvPaths's validation is applied to named-files. Named-files are just file paths that are referred to by simple names. E.g. you might have a file at

/my/data/files/quarterly-statement.xlsx

You would present that file to CsvPaths's FileManager like this:

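from csvpath import CsvPaths
csvpaths = CsvPaths()
# Register the file with the file manager under the simple name "quarterly".
csvpaths.file_manager.add_named_file(name="quarterly", path="/my/data/files/quarterly-statement.xlsx")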

When you run this code CsvPaths copies /my/data/files/quarterly-statement.xlsx into its named-files area. By default the named-files location is ./inputs/named_files. Your file would live in the "home folder" for that name: ./inputs/named_files/quarterly. Within that home folder you would see a directory called quarterly-statement.xlsx.

Inside the quarterly-statement.xlsx directory you would find one or more .xlsx files with SHA-256 hash names. SHA-256 is a cryptographic hash algorithm that fingerprints data. Its fingerprints look like: 12467d811d1589ede586e3a42c41046641bedc1c73941f4c21e2fd2966f188b4

This fingerprint is, for practical purposes, unique to the exact bytes of quarterly-statement.xlsx. When your CsvPaths instance imports a file it captures that version and names it with its fingerprint. If you update the file and reimport it, you will see another hash-named file containing the new version. In this way CsvPaths keeps an exact identity for every file, version by version.
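
The fingerprinting idea can be sketched with Python's standard hashlib module. This is an illustration only, not the library's own code: the helper names fingerprint and versioned_path are hypothetical, and the directory layout is simply the one described above.

```python
import hashlib
from pathlib import Path


def fingerprint(path: str) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large data files need not fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def versioned_path(name: str, path: str, root: str = "inputs/named_files") -> Path:
    """Hypothetical helper: where this version would sit in the layout described above."""
    source = Path(path)
    # <root>/<name>/<original file name>/<fingerprint><extension>
    return Path(root) / name / source.name / f"{fingerprint(path)}{source.suffix}"
```

Because the name is derived from the content, two imports of byte-identical files map to the same hash-named file, while any change to the content produces a new one.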

In your data file's home directory you will also see a manifest.json file. manifest.json contains the record of changes to the file. The contents look like:

[{
 "type": "csv",
 "file": "inputs/named_files/quarterly/quarterly-statement.xlsx/12467d811d1589ede586e3a42c41046641bedc1c73941f4c21e2fd2966f188b4.xlsx",
 "fingerprint": "12467d811d1589ede586e3a42c41046641bedc1c73941f4c21e2fd2966f188b4",
 "time": "2024-11-15 15:35:00.477177",
 "from": "tests/test_resources/named_files/quarterly-statement.xlsx"
 }]

Each time you change quarterly-statement.xlsx and re-present it to CsvPaths's file manager, a new object is added to this JSON structure with information about the new version. The JSON object's file key identifies the physical file used when you run a validation, like this:

from csvpath import CsvPaths
paths = CsvPaths()
# Run the "validations" named-paths against the "quarterly" named-file.
paths.collect_paths(pathsname="validations", filename="quarterly")
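
To make the manifest's role concrete, here is a minimal sketch that reads a named-file manifest and returns its newest entry. You should never need to do this yourself, since the library manages these files. The function name latest_version is hypothetical, and the sketch assumes that new versions are appended to the end of the JSON list.

```python
import json
from pathlib import Path


def latest_version(manifest_path: str) -> dict:
    """Return the most recently added entry of a named-file manifest."""
    entries = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    if not entries:
        raise ValueError(f"no versions recorded in {manifest_path}")
    # Assumption: new versions are appended, so the last entry is the newest.
    return entries[-1]
```

Under that assumption, latest_version("inputs/named_files/quarterly/manifest.json")["file"] would give the path of the hash-named file for the newest version.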

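Based on the description above, the structure is like this:

inputs/
  named_files/
    quarterly/
      manifest.json
      quarterly-statement.xlsx/
        12467d811d1589ede586e3a42c41046641bedc1c73941f4c21e2fd2966f188b4.xlsx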

Csvpath validation files

CsvPath Language validation files contain one or more csvpath validation statements separated by ---- CSVPATH ---- delimiters. You can write them as a single multi-statement file, or as single-csvpath files that are assembled by directory or by a JSON structure. However you do it, when you add your named-paths to a CsvPaths instance's PathsManager, your csvpath files are copied into a central area. By default CsvPaths's named-paths live at ./inputs/named_paths.
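
As a rough illustration of the delimiter convention, the following sketch splits the text of a multi-statement file into its individual csvpath statements. It is not the library's parser; the function name split_csvpaths is hypothetical, and the sketch assumes the delimiter never appears inside a statement.

```python
DELIMITER = "---- CSVPATH ----"


def split_csvpaths(text: str) -> list[str]:
    """Split the text of a validation file into individual csvpath statements."""
    parts = (part.strip() for part in text.split(DELIMITER))
    # Drop empty pieces, for example those left by a trailing delimiter.
    return [part for part in parts if part]
```

Joining the statements again with the same delimiter gives back a single multi-statement file, which is the form in which each named-paths group is stored.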

The named-paths structure is a bit simpler than named-files. Changes to your paths are noted, but the files are not versioned. All named-paths files under one name are stored as a single file called group.csvpaths. As with named-files, a manifest.json lives in each name's home directory. manifest.json keeps track of file changes using this JSON structure:

[{
 "file": "inputs/named_paths/orders/group.csvpaths",
 "fingerprint": "262161605f8ad39d3a30cc2f9bb75f688ecad66e9b66cb9a3c9c928f7c899cb7",
 "time": "2024-11-15 15:34:35.130181"
}]

Each time you re-add your csvpaths under a named-paths name, your CsvPaths instance adds another JSON object tracking the file fingerprint and time. The contents of each csvpath's scanning and matching parts are captured in each run's metadata, so the date of a run can easily be compared to the dates the csvpaths were changed. We expect that your real version control will be done using Git or another configuration management system, as you would do with any other development assets.
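
The date comparison mentioned above can be sketched as follows. Given the list loaded from a named-paths manifest.json and the time of a run, the sketch picks the manifest entry that was current when the run happened. The function name version_in_effect is hypothetical, and the sketch assumes the time values use the format shown in the manifest sample.

```python
from datetime import datetime
from typing import Optional


def version_in_effect(manifest: list[dict], run_time: datetime) -> Optional[dict]:
    """Return the latest manifest entry recorded at or before run_time, or None."""

    def recorded(entry: dict) -> datetime:
        # Manifest times look like "2024-11-15 15:34:35.130181".
        return datetime.fromisoformat(entry["time"])

    earlier = [entry for entry in manifest if recorded(entry) <= run_time]
    return max(earlier, key=recorded, default=None)
```

A result of None means the run predates every recorded change, which would suggest the manifest and the run belong to different environments.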

You may also notice JSON files called definition.json in the named-paths home directories. definition.json is captured when you use JSON to define your named-paths group(s). It is just a straight copy of your JSON file, renamed definition.json, with no other modifications. That means if you have a JSON file that defines three named-paths groups, each group's home directory will have that same JSON file copied in as definition.json, and each copy will have the complete JSON for all three groups.

For example, the JSON named-paths definition from the Another Example how-to article looks like this:

{
    "orders": [
        "tests/examples/example_2_2/csvpaths/metadata.csvpath",
        "tests/examples/example_2_2/csvpaths/file_length.csvpath",
        "tests/examples/example_2_2/csvpaths/reset.csvpath",
        "tests/examples/example_2_2/csvpaths/categories.csvpath",
        "tests/examples/example_2_2/csvpaths/prices.csvpath",
        "tests/examples/example_2_2/csvpaths/sku_upc.csvpath"
    ],
    "top_matter_import": [
        "tests/examples/example_2_2/csvpaths/top_matter_import.csvpath"
    ]
}

Both the orders and the top_matter_import named-paths home directories received their own identical copy of the above JSON in a definition.json.
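
The copying behavior described above can be imitated in a few lines of standard-library Python. This is only an illustration of the outcome, not the library's implementation; the function name copy_definition and its root parameter are hypothetical.

```python
import json
import shutil
from pathlib import Path


def copy_definition(definition_file: str, root: str = "inputs/named_paths") -> list[Path]:
    """Copy the whole JSON definition, unmodified, into each group's home directory."""
    groups = json.loads(Path(definition_file).read_text(encoding="utf-8"))
    copies = []
    for name in groups:
        home = Path(root) / name
        home.mkdir(parents=True, exist_ok=True)
        target = home / "definition.json"
        # Every group receives the complete file, not just its own entry.
        shutil.copyfile(definition_file, target)
        copies.append(target)
    return copies
```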

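Based on the description above, the structure of the named-paths area for these two groups is:

inputs/
  named_paths/
    orders/
      definition.json
      group.csvpaths
      manifest.json
    top_matter_import/
      definition.json
      group.csvpaths
      manifest.json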

The results files

The files created as the results of running named-paths against a named-file are very important. There are also more of them, with more choices and opportunities for how you approach your work. Results files are covered in detail on their own page.

Last updated 3 months ago