Source Staging

Last updated 2 months ago

CsvPath Framework collects all inbound files into a staging area. This area is:

  • A permanent immutable record of all versions of inbound files

  • The source for the validation and upgrading engine

  • Available for inspection by individuals triaging downstream problems

  • Accessible to any systems that don't want the validation, upgrading, and metadata that CsvPath Framework runs offer. (We anticipate that the number of such systems is approximately zero, but access to the raw source files is available all the same.)

The source staging area can mirror any current directory layout. The "Path to file" box in the diagram above represents any file system structure you like. The structure is defined on a named-file by named-file basis using a template. We cover templates later in this documentation.

The "File name (as a directory)" box is just what it says: a directory named for a source file. For example, if an inbound raw source file is named 2025-apr-01-sales-emea.csv, it lives in a directory named 2025-apr-01-sales-emea.csv.

The file's actual bytes live in files named by SHA256 hash values. Each hash fingerprint is unique to the exact content of one version of the file. If a new copy of 2025-apr-01-sales-emea.csv arrives a day later and differs from the original by a single character, CsvPath Framework stores the new version in a file named by the hash of the new content.
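The content-addressed layout described above can be sketched in a few lines of plain Python. This is a minimal illustration, not CsvPath Framework's implementation: the function name `stage_file` is hypothetical, the template-defined "Path to file" level is left out, and naming the stored file by the bare hex digest is an assumption.

```python
import hashlib
from pathlib import Path


def stage_file(staging_root: Path, named_file: str, filename: str, content: bytes) -> Path:
    """Store content at <root>/<named_file>/<filename>/<sha256 hex digest>.

    Hypothetical helper for illustration only; the real layout also includes
    a template-defined path between the named-file name and the file name.
    """
    digest = hashlib.sha256(content).hexdigest()
    # The original file name becomes a directory; the bytes live in a file
    # named by the fingerprint of the content.
    version_dir = staging_root / named_file / filename
    version_dir.mkdir(parents=True, exist_ok=True)
    target = version_dir / digest
    if not target.exists():
        # Identical content always maps to the same name, so an existing
        # version is never overwritten.
        target.write_bytes(content)
    return target
```

Staging two copies of 2025-apr-01-sales-emea.csv that differ by one character produces two files side by side in the same directory, while staging identical bytes twice resolves to the same file.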

The named-file name is an abstract name like orders or EMEA-orders or Q2-orders-Acme-EMEA. It is whatever you like. The path within the named-file is constructed according to a template that is based on the path where your managed file transfer (MFT) system received the file. That means there can be multiple paths within one named-file name. Likewise, the name of the data file is likely to change over time. When it does, CsvPath Framework captures the new name and its new hash fingerprint.

The abstract named-file name can be used stand-alone in starting a run. When you do that, CsvPath assumes you mean the most recent file that was registered with that name.

Alternatively, you can qualify a named-file name with the full path to a file. You can also use a partial path to find one or more files. A partial path can include pointers that dynamically select a version of one or more files registered under the named-file name at a given location, within an arrival window, or both.
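To make the two lookup styles concrete, here is a small sketch of the selection logic in plain Python. It does not use CsvPath Framework's reference syntax or API. The record fields (`path`, `arrived`), the sample data, and the function names `latest` and `select` are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical registration records for one named-file, e.g. "orders".
REGISTRATIONS = [
    {"path": "emea/2025-apr-01-sales-emea.csv",
     "arrived": datetime(2025, 4, 1, 9, 0, tzinfo=timezone.utc)},
    {"path": "emea/2025-apr-01-sales-emea.csv",
     "arrived": datetime(2025, 4, 2, 9, 0, tzinfo=timezone.utc)},
    {"path": "apac/2025-apr-01-sales-apac.csv",
     "arrived": datetime(2025, 4, 3, 9, 0, tzinfo=timezone.utc)},
]


def latest(records):
    """Bare named-file name: pick the most recently registered version."""
    return max(records, key=lambda r: r["arrived"])


def select(records, path_prefix="", start=None, end=None):
    """Partial path and/or arrival window: every matching version, oldest first.

    The window is half-open: start is inclusive, end is exclusive.
    """
    matches = [
        r for r in records
        if r["path"].startswith(path_prefix)
        and (start is None or r["arrived"] >= start)
        and (end is None or r["arrived"] < end)
    ]
    return sorted(matches, key=lambda r: r["arrived"])
```

With this sample data, `latest(REGISTRATIONS)` picks the APAC file because it arrived last, while `select(REGISTRATIONS, "emea/")` returns both EMEA versions in arrival order.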

We will explain how this flexibility works and is helpful later in these docs.

Finally, the named-file directory contains a manifest.json that tracks arrival times, identities, and other automatically generated metadata.
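As a rough picture of what such a manifest might hold, the sketch below appends one entry per arrival to a JSON file. The field names (`file`, `fingerprint`, `time`) and the helper `record_arrival` are assumptions made for illustration; the actual manifest.json schema written by CsvPath Framework may differ.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def record_arrival(manifest_path: Path, source_name: str, content: bytes) -> dict:
    """Append one arrival record to a JSON manifest and return it.

    Hypothetical schema: a list of {"file", "fingerprint", "time"} objects.
    """
    # Load existing entries, or start a new manifest on first arrival.
    entries = json.loads(manifest_path.read_text()) if manifest_path.exists() else []
    entry = {
        "file": source_name,
        "fingerprint": hashlib.sha256(content).hexdigest(),
        "time": datetime.now(timezone.utc).isoformat(),
    }
    entries.append(entry)
    manifest_path.write_text(json.dumps(entries, indent=2))
    return entry
```

Because entries are only ever appended, the manifest doubles as an arrival history: the last entry describes the most recent version, and earlier entries remain available for triage.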

The source staging area is for named versions of raw inbound files