DATA PREBOARDING
Why we all need to care about data preboarding and the trusted publisher model
Last updated
CsvPath is the leading tool for automated data preboarding. It is a purpose-built open source Python framework integrated with a wide variety of popular DataOps tools that acts as a trusted publisher between MFT and the data lake and applications.
Data preboarding is the receiving process for external batch data. Preboarding assigns a durable identity, validates that the data meets expectations, upgrades it for productivity, and stages it in an immutable known-good archive for downstream consumers. Your data lake deserves a data publisher it can trust! Once data is preboarded it is no longer considered external.
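The four steps above (identify, validate, upgrade, stage) can be pictured with a minimal standard-library sketch. This is not CsvPath's API: the `preboard` function, the content-hash identity, the whitespace-trimming "upgrade", and the archive layout are all assumptions made for illustration only.

```python
import csv
import hashlib
import io
import stat
from pathlib import Path


def preboard(source: Path, archive_root: Path, expected_headers: list[str]) -> Path:
    """Hypothetical sketch of the four preboarding steps; not CsvPath's API."""
    raw = source.read_bytes()

    # 1. Identity: a content hash names the file the same way wherever it travels.
    identity = hashlib.sha256(raw).hexdigest()

    # 2. Validation: the header and every row width must match expectations.
    rows = list(csv.reader(io.StringIO(raw.decode("utf-8"))))
    if not rows or rows[0] != expected_headers:
        raise ValueError(f"unexpected header in {source.name}")
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(expected_headers):
            raise ValueError(f"line {line_no} has {len(row)} fields")

    # 3. Upgrade: a stand-in improvement that trims stray whitespace from values.
    cleaned = [[value.strip() for value in row] for row in rows]

    # 4. Stage: write under an identity-named directory, then mark it read-only.
    target = archive_root / identity / source.name
    if target.exists():
        return target  # already preboarded; the archived copy is never rewritten
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(cleaned)
    target.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return target
```

Calling `preboard(Path("inbox/orders.csv"), Path("archive"), ["id", "name"])` would return the path of the staged copy, and calling it again with the same bytes returns the same path without rewriting anything. A production process would also record metadata about each step, which this sketch leaves out.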
Data preboarding may be a new term to you, or not; either way it is not a new concept. All data is preboarded on its way into the organization. The question is, how well does the process work? The experience of most companies is that the process is less reliable, holds more risk, and is much more expensive than is comfortable. Manual, error-prone preboarding commonly diverts more than 2% of revenues to overhead. That's $20,000 per million or $20 million per billion in revenue. That adds up!
CsvPath is a drop-in replacement for rickety data landing zones. It is laser-focused on automated data preboarding. The Framework focuses on generating trustworthy data, and doing it in a way that scales operationally to any number of data partners. A company with one data partner needs effective preboarding. A company with a thousand data partners needs efficient preboarding that never fails.
CsvPath brings many capabilities to the table:
An opinionated framework for collecting, identifying, validating and publishing data that enables you to spin up a new data partner project literally in seconds
Powerful schema and rules-based validation that has never before been available for delimited data
Explainability-focused metadata production that gives you the power to know exactly what happened as your data evolved
Out-of-the-box integrations for lineage tracking, observability, MFT (managed file transfer), and more
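To make the difference between schema validation and rules-based validation concrete, here is a small plain-Python sketch. It does not use CsvPath's validation language; the column names, the `SCHEMA` and `RULES` tables, and the `validate` function are hypothetical and exist only to show that a schema checks each value's shape while rules check relationships between values.

```python
import csv
import io

# Hypothetical schema: column name -> converter that raises on a bad value.
SCHEMA = {"order_id": int, "quantity": int, "unit_price": float, "total": float}

# Hypothetical rules: cross-field conditions that a schema alone cannot express.
RULES = {
    "quantity must be positive": lambda r: r["quantity"] > 0,
    "total must equal quantity * unit_price": lambda r: abs(
        r["total"] - r["quantity"] * r["unit_price"]
    ) < 0.005,
}


def validate(text):
    """Return (line number, message) pairs for every problem found."""
    errors = []
    # Data rows start on line 2 because line 1 holds the header.
    for line_no, raw in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        try:
            # Schema pass: every declared column must exist and convert cleanly.
            typed = {name: convert(raw[name]) for name, convert in SCHEMA.items()}
        except (KeyError, ValueError, TypeError) as exc:
            errors.append((line_no, f"schema: {exc}"))
            continue
        # Rules pass: runs only on rows that already satisfy the schema.
        for description, rule in RULES.items():
            if not rule(typed):
                errors.append((line_no, f"rule: {description}"))
    return errors
```

Given a file where one row has a non-numeric quantity and another has a total that does not match quantity times unit price, `validate` reports the first as a schema error and the second as a rule error, each tagged with its line number.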
With the CsvPath Framework you are signing up for a well-known pattern that settles the architecture and design questions up-front, leaving your team focused on data quality and accountability. And with CsvPath's automation-forward approach, you can scale down manual data quality efforts and scale up data throughput.
If you are a developer, take a look at the Quickstart and the Your First Validation exercises. They will get you up and running and introduce the CLI, the fastest way to get started. Reading about schemas vs. rules-based validation would also be useful. Take a look at the How-tos and DataOps integrations sections. There is a cheatsheet and a guide to validation language basics. And there is more information on the GitHub site.
For a higher-level view of edge governance and data preboarding, try the atesta analytics whitepapers. They are CsvPath-focused, but speak to the overarching operational and organizational needs.
Data preboarding is everywhere. And yet it is dramatically undertooled. We're on a mission to upgrade preboarding and make CsvPath Framework the world's trusted publisher. Welcome aboard!