CsvPath Framework
  • CsvPath
  • DATA PREBOARDING
  • Getting Started
    • Quickstart
    • Organizing Inbound Data
      • Dataflow Diagram
      • The Three Data Spaces
        • Source Staging
        • Validation Assets
        • Trusted Publishing
      • How Data Progresses Through CsvPath Framework
        • Staging
          • Data Identity
          • Handling Variability
            • Templates
            • Named-file Reference Queries
          • Registration API and CLI
            • Loading
            • Going CLI-only
        • Validation and Upgrading
          • Templates
          • Run Using the API
          • Running In the CLI
          • Named-paths Reference Queries
        • Publishing
          • Inspect Run Results
            • Result API
            • More Templates and References
          • Export Data and Metadata
    • Csv and Excel Validation
      • Your First Validation, The Lazy Way
      • Your First Validation, The Easy Way
      • Your First Validation, The Hard Way
    • DataOps Integrations
      • Getting Started with CsvPath + OpenTelemetry
      • Getting Started With CsvPath + OpenLineage
      • Getting Started with CsvPath + SFTPPlus
        • SFTPPlus Implementation Checklist
      • Getting Started with CsvPath + CKAN
    • How-tos
      • How-to videos
      • Storage backend how-tos
        • Store source data and/or named-paths and/or the archive in AWS S3
        • Loading files from S3, SFTP, or Azure
        • Add a file by https
        • Store source data and/or named-paths and/or the archive in Azure
        • Store source data and/or named-paths and/or the archive in Google Cloud Storage
      • CsvPath in AWS Lambda
      • Call a webhook at the end of a run
      • Setup notifications to Slack
      • Send run events to Sqlite
      • Execute a script at the end of a run
      • Send events to MySQL or Postgres
      • Sending results by SFTP
      • Another (longer) Example
        • Another Example, Part 1
        • Another Example, Part 2
      • Working with error messages
      • Sending results to CKAN
      • Transfer a file out of CsvPath
      • File references and rewind/replay how-tos
        • Replay Using References
        • Doing rewind / replay, part 1
        • Doing rewind / replay, part 2
        • Referring to named-file versions
      • Config Setup
      • Debugging Your CsvPaths
      • Creating a derived file
      • Run CsvPath on Jenkins
    • A Helping Hand
  • Topics
    • The CLI
    • High-level Topics
      • Why CsvPath?
      • CsvPath Use Cases
      • Paths To Production
      • Solution Storming
    • Validation
      • Schemas Or Rules?
      • Well-formed, Valid, Canonical, and Correct
      • Validation Strategies
    • Python
      • Python vs. CsvPath
      • Python Starters
    • Product Comparisons
      • The Data Preboarding Comparison Worksheet
    • Data, Validation Files, and Storage
      • Named Files and Paths
      • Where Do I Find Results?
      • Storage Backends
      • File Management
    • Language Basics
    • A CsvPath Cheatsheet
    • The Collect, Store, Validate Pattern
    • The Modes
    • The Reference Data Types
    • Manifests and Metadata
    • Serial Or Breadth-first Runs?
    • Namespacing With the Archive
    • Glossary
  • Privacy Policy
Powered by GitBook
On this page
  • Comparisons
  • Data Frames
  • XPath and Schematron
  • SQL
  • Excel Functions and Macros
  • Once More, Why CsvPath?
  1. Topics
  2. High-level Topics

Why CsvPath?

As has been said on other pages, in technology there are always multiple ways to get to a result. So why should you use CsvPath when there are other options?

PreviousHigh-level TopicsNextCsvPath Use Cases

Last updated 9 months ago

CsvPath has the benefit of being simple and quick enough for technically savvy non-developers, as well for as developers who would rather spend their time on other things. Because CsvPath is purpose-built for CSV validation, it is not hard to use well and doesn't tempt you to treat it as a Swiss Army knife.

Beyond that, there are the comparisons. If the alternative is not validating your CSV, using CsvPath is a nobrainer. If you are looking at the four nearest alternatives, there are some more specific arguments in favor of CsvPath.

Comparisons

The nearest alternatives are Python data frames libraries, XPath in the form of Schematron, SQL, and Excel functions and macros. Each has a number of drawbacks or gaps. Let's go through them. Then we'll recap out why you should consider CsvPath.

Data Frames

Pandas, Polars, et. al. are great for data science and data engineering. They are fast and rich. The downsides of using data frames tools for CSV validation are:

  • The cost is essentially the cost of building your own validation software

  • Many Python developers who know data frames, or want to, would prefer not to be using it for simple validation. Developers who don't know data frames face a steep learning curve.

  • Over time, Python scripts will drift and grow barnacles. That's the nature of using a general purpose programming language for simple tasks with a long life and typically too little ownership. A built-for-task validation framework is more likely to stay on point and be reasonably well organized over the long term as maintainers come and go.

XPath and Schematron

Schematron is a terrific tool, still. When it was new it was a fresh idea that was simple — an a-ha! at first sight. Many years and three ISO-standard versions later, Schematron is still a great idea, but not looking so simple or World-fire-setting. Part of that is XPath. Never an easy technology, it has only gotten more challenging as it has grown.

For the comparison with CsvPath, though, the important thing is that Schematron does not do CSV. If it did, we'd use it! Instead, you would have to get your CSV into XML first. Now, it is true that CSV generally has to be transformed into something. But XML is not the most common target. If you are targeting XML, there's your answer. But if you're targeting a SQL database, APIs, Excel or something similar, converting to XML first would be a development, infrastructure, and people expense that carries its own QA concerns. Not the ideal path.

SQL

SQL is probably the answer to beat. (Apologies to the Pandas and Polars crowd). Most often your CSV data is going to end up in a SQL database, anyway. We think data validation and QC during ETL/ELT and in the database is a great idea. Everyone should be doing it! However, there is room for a function-specific validation tool that sits up front of the ETL and downstream of the database.

You want to check your data as early and as late as possible. On the inbound side, keeping your data lake clean and your business processes and business partner relationships as simple as possible means checking validity ASAP. And on the downstream side, you know your data better than any one of your business partners, so you can be more through in your QC and do it in the last mile, just before you hand off your data. In short, SQL isn't the answer to beat, CsvPath is SQL's +1.

Excel Functions and Macros

The two biggest problems with using Excel for validating CSVs are:

  • Excel struggles with large files

  • Most Excel-based QC is human-driven and couldn't be easily automated within Excel

Unreliable, expensive, human-centered processes are at the core of what what CsvPath is trying to fix. Even when Excel can open and be productive with a file, there is the basic problems of QC based on human inspection. Instead you want to get the process externalized from Excel—ideally having nothing to do with Excel—and manage it in a controlled and automated way, same as you would for any devops or dataops process. While it is a change and change is hard, it always ends up being easier and less costly. And using a purpose-built CSV validation tool is a key part of how that quality goes up while costs come down.

Once More, Why CsvPath?

It's simple. You want to automate quality to get expensive and fallible humans out of the loop. You want to apply QC as early in the data intake process as possible. You also need to apply QC as late as possible, just before your data files go out the door. Doing that requires a purpose-built technology that makes validation assets simple and maintainable. And lowering costs requires enabling a wide range of participants to help craft validation rules, rather than relying on a small number of power users who are probably needed on other things.

You can use a range of tools for CSV validation, of course! But sometimes when you are holding a hammer the thing you're looking at is actually a nail.

The World has a love / hate relationship with Excel. It is everywhere and comfortable for most people. But it is also slow, quirky and arcane, hard to manage, and, , very very likely to result in an ever growing pile of ungoverned spreadsheets with bugs. What's not to love? Most of it.

according to data science researchers who cared enough to look
There are probably dozens of ways to do CSV validation. These are four of the leading contenders.