
Handling Variability

Named-files have to account for several things that are important to data operations:

| Reason for variability | Example |
| --- | --- |
| Business process change over time | May's orders are not the same as June's orders, so they probably go under a different file name, a different directory, or both. Likewise, Q1 orders are not the same as March orders. |
| File name changes that humans make to track file contents or system changes | May's orders may be in 2025-05-30-orders.csv, whereas June's orders could be in 2025-06-30.csv, or even 2025-06-30-SAP.csv if a new system was switched on. |
| Revisions to the content of files that are notionally the same | The books may have closed on April 2025 orders too soon, resulting in the production of a second 2025-04-30.csv file with slightly different data. |
| Sets of files that collectively make up a single unit of data | For practicality, Q1 orders for EMEA may be too large for a single data file. |
| Multiple downstream readers may need different cuts of the data | Again, for practical reasons, we might want to split up the March orders for fifteen US regions so that compensation calculations and Sales decision analytics can each get just the information they need. |

There may be other reasons, or cases similar to these.

CsvPath Framework starts from the premise that doing everything as simply as possible, and in exactly the same way, will be the most efficient approach with the lowest risk. But it recognizes that the world isn't that simple. In reality there are good reasons to go beyond named-files simply referring to the most recently registered file and, instead, to take account of all the variability we typically see.

To deal with variability in our source named-files we need two things: 1) templates that organize named-file layouts, and 2) query-like references that allow us to pick out the files we want to validate, upgrade, and publish.
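To make these two ideas concrete, here is a minimal Python sketch. It does not use CsvPath Framework's API. The `Registration` class, the `apply_template`, `versions_of`, and `latest` functions, and the `:0` / `:filename` placeholder syntax are all hypothetical names invented for illustration. The sketch assumes a named-file is a name that accumulates registered versions over time, that a template rearranges segments of the source path into a storage layout, and that a reference is a filter over those versions.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Registration:
    """One registered version of a named-file (hypothetical structure)."""
    named_file: str
    source_path: str
    registered_on: date


def apply_template(template: str, source_path: str) -> str:
    """Build a storage path from a template.

    ':N' is replaced by the N-th segment of the source path (0-based) and
    ':filename' by its last segment. This placeholder syntax is an
    assumption made for this sketch.
    """
    segments = source_path.strip("/").split("/")

    def fill(match):
        token = match.group(1)
        return segments[-1] if token == "filename" else segments[int(token)]

    return re.sub(r":(filename|\d+)", fill, template)


def versions_of(registrations: List[Registration], named_file: str,
                on_or_before: Optional[date] = None) -> List[Registration]:
    """Return a named-file's versions, oldest first, optionally cut off at a date."""
    found = [r for r in registrations if r.named_file == named_file]
    if on_or_before is not None:
        found = [r for r in found if r.registered_on <= on_or_before]
    return sorted(found, key=lambda r: r.registered_on)


def latest(registrations: List[Registration], named_file: str,
           on_or_before: Optional[date] = None) -> Optional[Registration]:
    """Return the most recently registered version, or None if there is none."""
    found = versions_of(registrations, named_file, on_or_before)
    return found[-1] if found else None


# Example data echoing the table: a revised April file, then renamed May and June files.
registrations = [
    Registration("orders", "orders/2025/2025-04-30.csv", date(2025, 5, 2)),
    Registration("orders", "orders/2025/2025-04-30.csv", date(2025, 5, 9)),
    Registration("orders", "orders/2025/2025-05-30-orders.csv", date(2025, 6, 1)),
    Registration("orders", "orders/2025/2025-06-30-SAP.csv", date(2025, 7, 1)),
]

newest = latest(registrations, "orders")
as_of_may = latest(registrations, "orders", on_or_before=date(2025, 5, 31))
layout = apply_template(":1/:0/:filename", newest.source_path)
```

In this sketch, `newest` is the default behavior of taking the most recently registered file, while `as_of_may` picks the revised April file by date rather than by file name. That is the kind of selection a query-like reference is meant to provide.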

