Storage Backends

You have a few simple choices for where to store your assets


Last updated 3 months ago

CsvPath Framework stores data and csvpath files in three locations that it manages for you:

  • The archive

  • Data files

  • Csvpath files

These locations are settable in config/config.ini. By default the archive is at ./archive. The defaults for data files and csvpath files are ./inputs/named_files and ./inputs/named_paths, respectively.
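As a rough illustration, the three locations could be read out of a config file along these lines. This is a minimal sketch using Python's standard configparser. The section and key names ([results] archive, [inputs] files and csvpaths) are assumptions to check against your own config/config.ini, and read_locations is a hypothetical helper, not part of CsvPath Framework.

```python
import configparser

# Assumed layout of config/config.ini. The section and key names here are
# illustrative; compare them with the file your project actually uses.
DEFAULT_CONFIG = """
[results]
archive = ./archive

[inputs]
files = ./inputs/named_files
csvpaths = ./inputs/named_paths
"""


def read_locations(config_text: str) -> dict[str, str]:
    """Return the three storage locations found in an INI-style config."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return {
        "archive": parser.get("results", "archive"),
        "named_files": parser.get("inputs", "files"),
        "named_paths": parser.get("inputs", "csvpaths"),
    }


locations = read_locations(DEFAULT_CONFIG)
for name, path in locations.items():
    print(f"{name}: {path}")
```

The same helper works on a real file if you pass it the text of your config.ini.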

You can set these three areas to point to locations in five types of storage. (With more options on the way!)

  • The local filesystem

  • AWS S3

  • An SFTP server

  • Azure Blob Storage

  • Google Cloud Storage

The Archive

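The archive serves two purposes:

  • It is the trusted publisher, an immutable and idempotent archive of known-good and known-bad data, serving downstream data consumers

  • It is a namespace that allows separate CsvPath Framework projects to group their data publishing or separate their data. You can name a CsvPath project's archive anything you like.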

The archive is organized in a hierarchy like this:

All your results go to the archive. The metadata collected characterizes the data and its processing completely, so you can easily tell if a set of results is known-good or known-bad. Either way, you have a complete record for the purposes of traceability and explainability. Also, keep in mind that the archive can be a data source for csvpaths that are unrelated to the process of creating it.

Named-files Inputs

Named-files are the untrusted source data that CsvPath Framework identifies, validates, upgrades, and stages in the archive for downstream use. Named-files are one-name, one-file, unlike named-paths, where one name applies to a group of csvpaths. Why is it useful to name a file? There are two main advantages:

  • A name represents a changing data stream, allowing a process to address new data in a uniform way over time

  • Less important, but a factor: names can be shorter and more memorable than a complete file path

The first bullet has three aspects:

  • Changing versions

  • Changing file names

  • Serialized data delivery

CsvPath Framework versions your input data files. What it does is simpler than Git and other version control systems. CsvPath just stores data immutably and adds a new file when it gets new bytes. The structure is:

You can imagine a company that receives orders files having an orders named-file tree that looks like this:

You can see all three of the ways CsvPath Framework expects named-file data to change. First, there is a progression in time from left to right: March, April, May. Second, in april-2025.csv you can see two versions of the data with at least one byte's difference between them. And third, in May the company changed how it names files and seems to have begun breaking the dataset down by region. The lowest-level boxes represent the actual data files; april-2025.csv is actually a directory within CsvPath's files tree. As each of these files arrives it is handled as the orders file. As the current orders file, it has the same csvpaths applied to it in exactly the same way.
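The "store immutably, add a new file when the bytes change" idea can be sketched in a few lines of Python. This is not CsvPath Framework's implementation. The register_bytes helper, the SHA-256 fingerprint, and the hash-based file names are all assumptions chosen for illustration. Only the layout (a directory per arriving file name, holding one file per distinct version) follows the description above.

```python
import hashlib
import tempfile
from pathlib import Path


def register_bytes(named_file_root: Path, filename: str, data: bytes) -> tuple[Path, bool]:
    """Store data under <root>/<filename>/<sha256>.csv.

    Returns the stored path and True if a new version was written,
    or False if identical bytes were already registered.
    """
    version_dir = named_file_root / filename
    version_dir.mkdir(parents=True, exist_ok=True)
    fingerprint = hashlib.sha256(data).hexdigest()
    target = version_dir / f"{fingerprint}.csv"
    if target.exists():
        return target, False  # same bytes: nothing new to store
    target.write_bytes(data)  # new bytes: add a version, never overwrite
    return target, True


with tempfile.TemporaryDirectory() as tmp:
    orders = Path(tmp) / "named_files" / "orders"
    _, first = register_bytes(orders, "april-2025.csv", b"id,qty\n1,5\n")
    _, repeat = register_bytes(orders, "april-2025.csv", b"id,qty\n1,5\n")
    _, changed = register_bytes(orders, "april-2025.csv", b"id,qty\n1,6\n")
    version_count = len(list((orders / "april-2025.csv").iterdir()))
    print(first, repeat, changed, version_count)
```

Registering the same bytes twice adds nothing, while a one-byte change adds a second version next to the first.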

Named-paths Inputs

Named-paths live in a similar directory structure to data files. Each named-paths name identifies some number of csvpaths that are run against data files as a group. Having csvpaths in groups has several advantages:

  • Validations, canonicalization, and upgrading can be broken down into small testable and composable steps

  • We can turn individual csvpaths on and off, or make a number of other settings, so that each csvpath can run with the settings that serve a specific purpose.

  • The data results can be separated out, chained into flows, or reused by other csvpaths using references

The first bullet is the big one. Imagine a data analyst who has to check data against a defense acquisition CSV or Excel validation standard running to hundreds of pages. Yes, that's a thing. They might not put all their validations in one named-paths group, but you can imagine a named-paths group where each csvpath validated data against one rule. There could easily be hundreds of csvpaths in that one group. Sounds terrible, right? But imagine trying to validate using one csvpath for hundreds of rules. That would be so much worse in every way!
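To make the one-rule-per-check idea concrete, here is an analogy in plain Python. It does not use CsvPath syntax or the CsvPath API. Rule, run_group, the column names, and the sample data are all hypothetical. The sketch only shows why small, individually switchable checks are easier to test and report on than one large check.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """One small, independently testable check (hypothetical stand-in for one csvpath)."""
    name: str
    check: Callable[[dict[str, str]], bool]
    enabled: bool = True


def run_group(rules: list[Rule], rows: list[dict[str, str]]) -> dict[str, list[int]]:
    """Run every enabled rule against every row; collect failing row numbers per rule."""
    failures: dict[str, list[int]] = {rule.name: [] for rule in rules if rule.enabled}
    for row_number, row in enumerate(rows, start=1):
        for rule in rules:
            if rule.enabled and not rule.check(row):
                failures[rule.name].append(row_number)
    return failures


rules = [
    Rule("id is present", lambda r: r["id"] != ""),
    Rule("qty is a whole number", lambda r: r["qty"].isdigit()),
    Rule("region is known", lambda r: r["region"] in {"east", "west"}, enabled=False),
]

data = "id,qty,region\n1,5,east\n,3,north\n3,x,west\n"
rows = list(csv.DictReader(io.StringIO(data)))
failures = run_group(rules, rows)
print(failures)
```

Because each rule reports separately, a failure points at one rule and one row, and the disabled region rule can be switched back on without touching the others.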

Picking your location

Now, for all that, the actual thing we want to do here, choosing and configuring where we put our files, turns out to be super easy. Open your config/config.ini file. You should see a [results] section and an [inputs] section. Within [results] there is an archive key. It takes a path to your archive. And within [inputs] there is a file key and a csvpaths key. Those point to your file inputs. Use relative or fully qualified file system paths. For Azure, S3, GCP, and SFTP use URI-form locations like:

  • s3://csvpath-example-1/named_files

  • sftp://my-server/csvpath/archive

  • azure://csvpath/storage/named_paths

  • gs://csvpath_def/trusted_publisher_archive
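If you want to sanity-check which backend a configured location points at, the URI scheme is enough to tell. The sketch below assumes only the four schemes listed above, plus plain paths for the local file system. backend_for is a hypothetical helper, not a CsvPath Framework function.

```python
from urllib.parse import urlsplit

# Schemes taken from the example locations above.
REMOTE_BACKENDS = {
    "s3": "AWS S3",
    "sftp": "SFTP",
    "azure": "Azure Blob Storage",
    "gs": "Google Cloud Storage",
}


def backend_for(location: str) -> str:
    """Name the storage backend implied by a configured location string."""
    if "://" not in location:
        # Relative or fully qualified file system path.
        return "local filesystem"
    scheme = urlsplit(location).scheme.lower()
    if scheme not in REMOTE_BACKENDS:
        raise ValueError(f"unrecognized scheme in {location!r}")
    return REMOTE_BACKENDS[scheme]


examples = [
    "./archive",
    "s3://csvpath-example-1/named_files",
    "sftp://my-server/csvpath/archive",
    "azure://csvpath/storage/named_paths",
    "gs://csvpath_def/trusted_publisher_archive",
]
for location in examples:
    print(f"{location} -> {backend_for(location)}")
```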

Each of these three config.ini keys can point to a different backend. You can mix and match the local file system, S3, SFTP, Google Cloud Storage, and Azure any way you like. The only constraint is the additional latency of moving storage from the local hard disk to a remote backend. Test to make sure you're good with the latency, given your use case. If you need to, consider moving your compute closer to your storage. For example, you could put CsvPath Framework into an AWS Lambda or a Fargate container.

Over time, CsvPath Framework will probably support more backends. For many use cases the network storage options you already have today are super easy and effective. Give them a try!

There are more how-to notes in the storage backend how-tos.