DATA PREBOARDING

Why we all need to care about data preboarding and the trusted publisher model


Last updated 2 months ago

CsvPath is the leading tool for automated data preboarding. It is a purpose-built, open source Python framework, integrated with a wide variety of popular DataOps tools, that acts as a trusted publisher between MFT (managed file transfer) and the data lake and applications.

What is Data Preboarding?

Data preboarding is the receiving process for external batch data. It is the first part of a robust data onboarding process. Preboarding assigns a durable identity, validates that the data meets expectations, upgrades it for productivity, and stages it in an immutable known-good archive for downstream consumers. Your data lake deserves a data publisher it can trust! Once data is preboarded it is no longer considered external.
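To make the four steps concrete, here is a minimal sketch in plain Python. It is not the CsvPath Framework API: the `preboard` function, its parameters, and the archive layout are all hypothetical, chosen only to illustrate identity, validation, upgrading, and immutable staging using the standard library.

```python
import csv
import hashlib
import io
import stat
from pathlib import Path


def preboard(source: Path, archive: Path, expected_header: list[str]) -> Path:
    """Hypothetical helper (not part of CsvPath): preboard one CSV file."""
    raw = source.read_bytes()

    # 1. Durable identity: a content hash names these exact bytes.
    fingerprint = hashlib.sha256(raw).hexdigest()

    # 2. Validation: here, only a check that the header matches expectations.
    rows = list(csv.reader(io.StringIO(raw.decode("utf-8"), newline="")))
    if not rows or rows[0] != expected_header:
        raise ValueError(f"{source.name}: header does not match {expected_header}")

    # 3. Upgrade: a trivial cleanup that trims stray whitespace in every cell.
    upgraded = [[cell.strip() for cell in row] for row in rows]

    # 4. Staging: one directory per fingerprint (assumed layout), made read-only.
    dest = archive / source.stem / fingerprint / source.name
    if dest.exists():
        return dest  # the same bytes were already staged; nothing to redo
    dest.parent.mkdir(parents=True, exist_ok=True)
    with dest.open("w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(upgraded)
    dest.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return dest
```

Calling `preboard(Path("inbox/orders.csv"), Path("archive"), ["id", "name"])` would return the path of the staged copy, or raise `ValueError` if the header is not what was expected. A real preboarding process does far more at each step; the sketch only shows how the steps fit together.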

Data preboarding may be a new term to you, or not; either way it is not a new concept. All data is preboarded on its way into the organization. The question is, how well does your onboarding process work? The experience of most companies is that the process is less reliable, holds more risk, and is much more expensive than is comfortable. Manual and error-prone preboarding commonly diverts more than 2% of revenues to overhead. That's north of $20,000 per million or more than $20 million per billion in revenue. That adds up!

How does the CsvPath Framework help?

CsvPath is a drop-in replacement for rickety data landing zones. It is laser-focused on automated data preboarding. The Framework focuses on making the overall onboarding process efficient, fast, and safe by generating trustworthy data — and doing it in a way that scales operationally to any number of data partners. A company with one data partner needs effective preboarding. A company with a thousand data partners needs efficient preboarding that never fails. CsvPath Framework can help!

CsvPath brings many capabilities to the table:

  • An opinionated framework for collecting, identifying, validating and publishing data that enables you to spin up a new data partner project literally in seconds

  • Powerful schema and rules-based validation that has never before been available for delimited data

  • Explainability-focused metadata production that gives you the power to know exactly what happened as your data evolved

  • Out-of-the-box integrations for lineage tracking, observability, MFT (managed file transfer), and more

With CsvPath Framework you are signing up for a well-known pattern that settles the architecture and design questions up-front, leaving your team focused on data quality and accountability. And with CsvPath's automation-forward approach, you can scale down manual data quality efforts and scale up data throughput.

How to get started

Data preboarding is everywhere. And yet it is dramatically undertooled. We're on a mission to upgrade preboarding and make CsvPath Framework the world's trusted publisher. Welcome aboard!

If you are a developer, take a look at the Quickstart and the Your First Validation exercises. They will get you up and running and introduce the CLI, the fastest way to get started. Reading about schemas vs. rules-based validation would be useful. Take a look at the How-tos and DataOps integrations sections. There is also a cheatsheet and a section on validation language basics. And there is more information on the GitHub site.

For a higher-level view on the topics of edge governance and data preboarding, try the atesta analytics whitepapers. They are CsvPath focused, but speak to the overarching operational and organizational needs.
[Figure: CsvPath is a pre-packaged, automation-focused preboarding process that ends garbage-in-garbage-out.]

[Figure: A checklist of the capabilities of a preboarding architecture like CsvPath Framework: durable identification, validation, data upgrading, canonicalization, and consistent immutable staging as a trusted publisher to downstream data users.]
[Figure: A high-level data flow diagram showing how data files and validation/upgrading files are combined to create known-good data for downstream data consumers.]