
Namespacing With the Archive


Last updated 5 months ago

The archive is a concept you will first meet as a top-level directory. It is where your results end up when you use CsvPaths instances to run your csvpaths. By default the archive is at ./archive.

You can set the archive location in your config.ini file. Your config file is, again by default, at ./config/config.ini. In it, look for the [results] section and archive key. In a newly generated config.ini it will look like this:
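A newly generated config.ini includes, among other sections, a [results] block along the lines of this sketch (the exact defaults in your version of CsvPath may differ):

```ini
[results]
# Where run results are written. Renaming this value
# namespaces your runs, e.g. archive = TinPenny.
archive = archive
```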

You can, of course, have the archive live anywhere and be called anything. What is more important is understanding the two main things the archive offers:

  • Long-term immutable storage for versioned releases -- basically, being an archive!

  • A means of namespacing results

One of the decisions you will need to make in setting up CsvPath is how closely coupled you want your operations to be across data sources and destinations. There's no right or wrong answer. If you keep all your data partner assets together in the same install, you get the same core benefits of the Collect, Store, Validate Pattern as you would if you created one install per data partner.

But at the margin, multiple installs, one per data partner, can make things easier to operate. Creating a new installation is easy: pick a location and use Pip, Poetry, or another Python tool to set up a project. Generally it's a one-liner like poetry new my_project. The new project has an archive directory, or will after your first CsvPaths instance runs. By naming that archive uniquely you create a namespace specific to a certain data operation.

So it's easy to namespace your work with different data partners. But is it really helpful? In many cases, it is neither good nor bad. If all your data partners' results go in the same archive you will have more directories in one place. That typically means more directories on one physical drive or in one object store bucket. But what else? Consider the lineage information a single run generates:

  • A job for data staging (add_named_file)

  • A job for csvpath loading (add_named_paths)

  • A job for running a named-paths group (e.g. collect_by_line)

  • One job for every csvpath in the named-paths group

  • And events for the states all of these pass through as they operate on data

Visually and in terms of listed events, that's a lot of information. Well-structured, highly consistent information, for sure, but still a lot. Here is a simple run in Marquez. The information presented would be similar for any OpenLineage server.

But remember, the second box from the right, labeled Group:transfer.Instance:1, represents just one of any number of csvpaths in the named-paths group named transfer. And each of those csvpaths has up to six standard output files (data.csv, unmatched.csv, errors.json, vars.json, printouts.txt, meta.json), a manifest.json, and any number of ad hoc transfer destinations. All of these will be viewable in the UI. Even in a list view, it can be a lot.

To make it easier to narrow down your view, whether for good compartmentalization (a narrower "blast radius"), load-sharing between teams, or any other need, you can use separate archives to partition your data partners' assets. The result is multiple archive folders and, in Marquez, a drop-down selection like this one in the top left corner:

Here you can see that the TinPenny namespace contains our assets. The other two namespaces shown, archive and default, presumably have other assets that are unrelated to TinPenny's. This narrower scope is much easier to work with. And using Marquez's date-windowing and search makes things even more straightforward.

You can reset your archive directory in config.ini at any time. Go ahead, give it a try.
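As a minimal sketch of that try-it-yourself step, the snippet below uses Python's standard library configparser to point the [results] archive key at a partner-specific name. The TinPenny partner name is illustrative, and the config path shown is just the default location mentioned above; adjust both for your install.

```python
import configparser
from pathlib import Path

# The default CsvPath config location; adjust for your install.
config_path = Path("config/config.ini")

# Load the existing config, if any. read() tolerates a missing file.
config = configparser.ConfigParser()
config.read(config_path)

# Point the archive at a partner-specific namespace.
# "TinPenny" is an illustrative partner name.
if not config.has_section("results"):
    config.add_section("results")
config.set("results", "archive", "TinPenny")

# Write the updated config back out.
config_path.parent.mkdir(parents=True, exist_ok=True)
with open(config_path, "w") as f:
    config.write(f)
```

The next CsvPaths instance you create will then write its results under the TinPenny archive, keeping that partner's events in their own namespace.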

The first role, that of archiving results, is covered elsewhere in this documentation. The role of namespacing is also extremely important, at least for larger CsvPath implementations. If you are using OpenLineage and Marquez with a single shared archive, all your events are namespaced the same way. That's not a bad thing, really, because Marquez's search is terrific. But it is a lot of information to wade through when you are browsing or checking regular runs.