Trusted Publishing

Last updated 2 months ago

We refer to the trusted publishing space as the Archive. By default the Archive is at ./archive, but you can name it anything you like and house it on any of the backend storage systems CsvPath Framework supports.

The Archive is the richest, most flexible, and most complex of the data file areas in CsvPath Framework. It is a kind of namespace. For large DataOps groups, or companies with many data partners, we recommend using multiple archives to make your data estate more manageable. If you use multiple trusted publishing archives, they can live together in one backend system or be split over multiple systems of different types.

Within the Archive the first level is a list of named-results. A named-result takes the same name as the named-paths group that generated it. When you ask CsvPath Framework for the named-results using just the plain named-results name, it gives you the results from the most recent run.

Within a named-results directory you can have a set of run directories. (Sometimes we refer to these as runs, run_dirs, or run homes.) Alternatively, you can define a directory structure within a named-results directory that organizes results according to how you want to access your data. That organization is defined by the template stored with the named-paths group.

Within the template-defined organizational folder you come to a set of run_dirs, each named with a datestamp. The datestamp is in the form %Y-%m-%d_%H-%M-%S_nn, where the optional nn is a number from 0 to 99 that disambiguates runs that have the same datestamp. For example: 2025-02-28_14-32-59_0.
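As an illustration of the naming scheme described above, here is a minimal Python sketch that parses and builds run_dir names. The helper names `parse_run_dir` and `format_run_dir` are hypothetical and are not part of CsvPath Framework. Treating a missing nn as 0 is an assumption made only so the results sort cleanly.

```python
import re
from datetime import datetime
from typing import Optional, Tuple

# Datestamp format taken from the text above; the trailing _nn is optional.
RUN_DIR_FORMAT = "%Y-%m-%d_%H-%M-%S"
RUN_DIR_PATTERN = re.compile(
    r"(\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})(?:_(\d{1,2}))?"
)


def parse_run_dir(name: str) -> Optional[Tuple[datetime, int]]:
    """Return (timestamp, disambiguator) for a run_dir name, or None if it does not match."""
    match = RUN_DIR_PATTERN.fullmatch(name)
    if match is None:
        return None
    try:
        stamp = datetime.strptime(match.group(1), RUN_DIR_FORMAT)
    except ValueError:
        # Right shape but not a real date or time, e.g. month 13
        return None
    # Assumption: a missing nn is treated as 0
    suffix = int(match.group(2)) if match.group(2) else 0
    return stamp, suffix


def format_run_dir(stamp: datetime, suffix: Optional[int] = None) -> str:
    """Build a run_dir name, appending _nn only when a disambiguator is given."""
    name = stamp.strftime(RUN_DIR_FORMAT)
    return name if suffix is None else f"{name}_{suffix}"
```

Sorting the tuples returned by `parse_run_dir` orders runs chronologically, with the disambiguator breaking ties between runs that share a datestamp.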

Optionally, the named-results layout template can have one or more suffix directories below the run_dir. These add value only if they help a downstream system identify what data is available in the run without other downstream data consumers having to browse a larger directory structure above the run_dir.

Below the suffix directories, if any, are the individual csvpath results directories. (These are sometimes referred to as instance directories.) There is one instance result directory for each csvpath in the named-paths group that generated the named-results. Collectively, these instance directories hold the final output of the run.
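The layout described above (Archive, named-results, optional template folders, run_dir, optional suffix directories, instance directories) can be walked with plain filesystem code. The sketch below assumes a local-filesystem Archive. It does not use the CsvPath Framework API, and all function names are hypothetical. It also assumes that instance directories are the leaf directories below a run_dir, which is a heuristic rather than a documented rule.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

# run_dir names follow %Y-%m-%d_%H-%M-%S with an optional _nn
RUN_DIR_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}(_\d{1,2})?")


def _sort_key(run_dir: Path) -> Tuple[str, int]:
    # The datestamp is the first 19 characters; compare nn as a number so _10 sorts after _9
    name = run_dir.name
    return name[:19], int(name[20:] or 0)


def find_run_dirs(named_results_dir: Path) -> List[Path]:
    """Return every run_dir below a named-results directory, oldest first."""
    if not named_results_dir.is_dir():
        return []
    runs = [
        p for p in named_results_dir.rglob("*")
        if p.is_dir() and RUN_DIR_PATTERN.fullmatch(p.name)
    ]
    return sorted(runs, key=_sort_key)


def latest_run_dir(archive: Path, named_results: str) -> Optional[Path]:
    """Return the most recent run_dir for a named-results name, or None if there are no runs."""
    runs = find_run_dirs(archive / named_results)
    return runs[-1] if runs else None


def instance_dirs(run_dir: Path) -> List[Path]:
    """Return leaf directories below a run_dir, skipping any suffix directories above them."""
    return sorted(
        p for p in run_dir.rglob("*")
        if p.is_dir() and not any(child.is_dir() for child in p.iterdir())
    )
```

Because the search is recursive, the same code works whether or not a template adds organizational folders above the run_dir or suffix directories below it.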

CsvPath Framework generates up to seven common files for each csvpath instance. They are:

| File | Purpose |
| --- | --- |
| data.csv | Holds the matched rows from the file being validated and/or upgraded. |
| unmatched.csv | Contains rows that didn't match. These rows could be considered valid, invalid, or just extra, depending on how the run is set up. |
| vars.json | Holds the variables set in the csvpath. CsvPath Framework allows you to set and use variables in your csvpath statements. These are somewhat similar to session variables in an application server. |
| meta.json | Contains user-defined metadata, per-csvpath configuration settings, and runtime csvpath info. |
| errors.json | A set of detailed error event dictionaries that give user-friendly error messages and developer-friendly data about any validation or csvpath errors. |
| printouts.txt | Any number of user-defined print statement streams, collected in a multi-section output file. |
| manifest.json | CsvPath Framework-generated metadata about the run the results come from. |
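Since a csvpath instance produces up to seven of these files, a consumer may want to check which ones are present before reading them. The following sketch inspects an instance directory on a local filesystem. The function names are hypothetical and it makes no assumptions about the contents of the JSON files beyond their being valid JSON.

```python
import json
from pathlib import Path
from typing import Any, Dict

# File names as listed in the table above
COMMON_FILES = (
    "data.csv",
    "unmatched.csv",
    "vars.json",
    "meta.json",
    "errors.json",
    "printouts.txt",
    "manifest.json",
)


def inventory(instance_dir: Path) -> Dict[str, bool]:
    """Report which of the common result files exist in an instance directory."""
    return {name: (instance_dir / name).is_file() for name in COMMON_FILES}


def load_json_outputs(instance_dir: Path) -> Dict[str, Any]:
    """Parse whichever of the JSON result files are present, keyed by file name."""
    loaded: Dict[str, Any] = {}
    for name in COMMON_FILES:
        path = instance_dir / name
        if name.endswith(".json") and path.is_file():
            with path.open(encoding="utf-8") as handle:
                loaded[name] = json.load(handle)
    return loaded
```

Files that a run did not produce are reported as absent by `inventory` and left out of the result of `load_json_outputs`, so neither function raises on a partial set.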

We will dig into these results outputs in many other parts of this documentation.

The trusted publishing area is the most flexible and content rich