
Organizing Inbound Data

Organizing your preboarding data layout upfront makes the onboarding process efficient


Last updated 2 months ago

Every data onboarding process starts with a data preboarding stage

Preboarding is:

  • Data collection

  • Dataset identification (a.k.a. file registration)

  • Validation

  • Upgrading

  • Staging and archiving

After data is preboarded it is considered known and trustworthy and ready for ETL/ELT, workflow processes, use in applications, and the data lake.

All onboarding processes include these data preboarding steps. The only question is how effective, efficient, and low-risk their implementation is. Often companies under-invest in the preboarding stage, resulting in manual validation and handling, more support issues, and rework. Shortcuts in preboarding have substantial long-term costs to the business.

Underinvestment in preboarding exposes your company to:

  • Manual handling costs

  • Support time

  • Rework by developers

  • Refunds to annoyed customers

So How Does Preboarding Work?

Files arrive through a Managed File Transfer (MFT) process. MFT is a big topic. It includes:

  • SFTP / FTPS

  • MFT servers providing a range of protocols and limited workflow support

  • AS2 / AS4 file transfer, often used with EDI-related files

  • Cloud buckets attached to cloud functions or other compute

  • etc.

In a few cases we see files dropped into a common area, differentiated only by file name or content. But that is rare. Files are typically stored in one of a few ways:

  • By time of arrival

  • By data partner

  • By target application

  • By jurisdiction

  • By transaction or business process

These organizing concepts are often layered one on top of another. For example, a layout aligned to an orders business process may include date and sales region in a hierarchy like this:

These are file directories in a file system holding CSV files. Clearly, this data layout makes it easy to find all orders in 2025 but much harder to find all the files holding a given salesperson's orders. Another layout might provide easy access in a different way.
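The trade-off described above can be sketched in a few lines of Python. This is a minimal illustration, not CsvPath Framework code: the directory scheme `orders/<year>/<month>/<region>/<file>`, the file names, and both helper functions are hypothetical stand-ins for whatever layout you actually use.

```python
from pathlib import PurePosixPath

# Hypothetical inbound files laid out as orders/<year>/<month>/<region>/<file>.
FILES = [
    PurePosixPath("orders/2025/01/emea/alice_orders.csv"),
    PurePosixPath("orders/2025/01/apac/bob_orders.csv"),
    PurePosixPath("orders/2025/02/emea/bob_orders.csv"),
    PurePosixPath("orders/2024/12/emea/alice_orders.csv"),
]


def under(prefix, files):
    """Return the files below a directory prefix.

    On a real file system this is a walk of a single subtree.
    """
    prefix = PurePosixPath(prefix)
    return [f for f in files if prefix in f.parents]


def by_salesperson(name, files):
    """Return the files for one salesperson.

    The layout does not encode the salesperson as a directory,
    so every file name in every subtree has to be inspected.
    """
    return [f for f in files if f.name.startswith(name + "_")]


orders_2025 = under("orders/2025", FILES)      # one subtree
alice_files = by_salesperson("alice", FILES)   # scattered across years
```

Here the 2025 orders sit under one directory, while the salesperson's files are spread across the 2024 and 2025 subtrees. A layout keyed by salesperson first would reverse which question is cheap to answer.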

You have options!

How you arrange your data matters! Of course, at different times you may need different layouts. Since your data will eventually land in a database or warehouse or application, presumably your needs will ultimately be met. However, you have to consider how you store your data in preboarding in order to make onboarding and long-term reference to the source data efficient.
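To make the idea of alternative layouts concrete, here is a small sketch that renders the same file's metadata into three different directory schemes. The field names, the layout names, and the `render` helper are all assumptions made up for this example. They are not part of CsvPath Framework.

```python
# Hypothetical metadata describing one inbound file. Field names are illustrative.
record = {
    "partner": "acme",
    "process": "orders",
    "year": "2025",
    "month": "01",
    "region": "emea",
    "filename": "orders.csv",
}

# Three hypothetical layouts, each putting a different organizing concept first.
LAYOUTS = {
    "by_arrival": "{year}/{month}/{process}/{partner}/{filename}",
    "by_partner": "{partner}/{process}/{year}/{month}/{filename}",
    "by_process": "{process}/{year}/{month}/{region}/{filename}",
}


def render(layout, rec):
    """Build the storage path for a record under the named layout."""
    return LAYOUTS[layout].format(**rec)


paths = {name: render(name, record) for name in LAYOUTS}
```

Whichever concept comes first in the path is the one that is cheapest to browse and query, so the choice is worth making deliberately before files start to accumulate.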

[Figure: All of these approaches make sense in the right context]
[Figure: Preboarding is the critical first step of the data onboarding process]
[Figure: Inbound data layout is important. There are four broad approaches.]
[Figure: An example of one way to lay out inbound data files]