Python vs. CsvPath
CsvPath is developed in Python. Python provides the basic CSV parsing plumbing. We love Python! But (and you knew a but was coming) Python is not the best tool for CSV and Excel validation in many cases.
There are some situations where CsvPath is not likely to be the answer:
Only ad-hoc CSVs or Excel files
Low expectations for correctness or only an occasional need to validate
Just one format
Low volume
In these cases, you might benefit from CsvPath, but you don't need it.
CsvPath was written for automated daily batch processing, large scale data collection, and many or changing data formats with high fidelity requirements. If you have those challenges, need is the better word — you need CsvPath. And if you find yourself or your colleagues checking the work of computers by hand, even more so! Sadly, we've seen that too often.
So why not Python?
The problems with doing CSV validation in Python include:
More code: in the example below, the Python is roughly twice the length of the CsvPath
Less readable: we want more people contributing to getting the rules right, not just one developer. While Python is readable, it is programming logic, not simple, declarative validation primitives. The more concise and self-documenting, the better.
Lack of guardrails: Python is a general-purpose language that can do anything, while CsvPath is function-specific. This means that over the long term your csvpaths will drift less and hew to good patterns more.
Higher test burden: we would advocate putting your test effort into testing the constantly changing data, not the more constant code that processes it. CsvPath helps you do that by taking on much of the testing burden.
Those are good reasons. They are about CsvPath, the language.
In retrospect, the reference implementation, the Python library, probably should have had a different name. (Yeah, there may never be another implementation of CsvPath, but we aspire to a Rust version, so who knows?)
Benefits of the CsvPath library
The CsvPath library brings a wealth of additional reasons to choose CsvPaths over rolling your own Python solution. Some of those features are delivered transparently or with minimal effort. Others are essentially hooks for application developers to weave CsvPath into their DataOps environment.
Some of those capabilities include:
Flexible error handling policies
Programmatic, config, and csvpath-driven logging
Printout capture and other reporting features
Data lines matching and capture
Metadata and runtime data capture
File metrics caching and file nicknames
Data exchange between csvpaths
Validation strategies that speed performance, offer chain-of-responsibility patterns, etc.
Multiple ways to organize csvpaths to suit different ways of managing validation
Lots of good stuff. Not everyone needs all that, but it's there. If you are a DataOps, DevOps, or data processing application developer, some of that has got to sound interesting.
How about that Python example?
In Another Example, Part 1 and Another Example, Part 2 we created a CsvPath validation for a simplified retail goods order file. The rules were:
There would be top-matter to skip before the header line
The top-matter would include two metadata fields to capture
The header line would have > 10 headers
Every data line below the headers would have a price in a certain format in the last position
There would be UPC and SKU values
The value under the category header would match one of a list
There would be more than 27 lines
This is not an unusual set of validation rules, although we picked it because the top-matter is a good entry-level challenge.
The resulting csvpath clocked in at ~60 lines, but with a ton of whitespace and comments. Minus the whitespace, about 30 lines. Since CsvPath is a declarative path language, more like SQL or XPath than Python, we could arbitrarily bring the number of lines down closer to 7, technically. But whitespace is free.
A just-get-it-done Python version came in at about 130 lines. Now, lines of code is not the main issue, but it's a good first indicator. More importantly, readability suffers significantly.
But the thing that most makes the blood run cold: the lack of patterns and guardrails is immediately noticeable. If you think in terms of a partnership of any kind—commercial, government, research, what have you—you want to know your partner's code is disciplined and verified, not messy, siloed, and unloved. With CSVs the latter is too often the case. CsvPath cannot fix that problem completely, but it is a good first step. When you look at the example code below, imagine you have files in 100 formats coming in daily. Now imagine 1000 formats. And now imagine the developers playing hot-potato with the CSV code. You get the idea.
And without further ado, here's the Python implementation to compare to Another Example, Part 2. Put your DataOps engineering manager hat on and judge for yourself.
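As a rough illustration of what hand-rolled Python for these rules tends to look like, here is a minimal sketch. It is not the full ~130-line version, and every specific in it is an assumption made for the sake of the example: the `# key: value` form of the top-matter, the `UPC`, `SKU`, and `category` header names, the `$12.99`-style price format, the category list, and the choice to count "more than 27 lines" as every row read from the file. The function name `validate_orders` is likewise hypothetical.

```python
import csv
import io
import re

# Every constant here is an assumption made for illustration only.
METADATA_RE = re.compile(r"^#\s*([^:]+):\s*(.+)$")  # assumed top-matter form: "# key: value"
PRICE_RE = re.compile(r"^\$\d+\.\d{2}$")            # assumed price format, e.g. $12.99
ALLOWED_CATEGORIES = {"OFFICE", "COMPUTING", "FURNITURE"}  # hypothetical list
MIN_HEADERS = 11  # "more than 10 headers"
MIN_LINES = 28    # "more than 27 lines", counted here as every row read


def validate_orders(text):
    """Check an order file passed as a string; return (metadata, errors)."""
    metadata, errors = {}, []
    headers = None
    line_count = 0
    for line_count, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if headers is None:
            # Top-matter: the first row wide enough is treated as the header line.
            if len(row) >= MIN_HEADERS:
                headers = [h.strip() for h in row]
                for name in ("UPC", "SKU", "category"):
                    if name not in headers:
                        errors.append(f"line {line_count}: no {name} header")
            else:
                found = METADATA_RE.match(",".join(row))
                if found:
                    metadata[found.group(1).strip()] = found.group(2).strip()
            continue
        if not row:
            continue  # ignore blank lines below the header
        record = dict(zip(headers, row))
        if not PRICE_RE.match(row[-1].strip()):
            errors.append(f"line {line_count}: bad price {row[-1]!r}")
        for name in ("UPC", "SKU"):
            if not record.get(name, "").strip():
                errors.append(f"line {line_count}: missing {name}")
        if record.get("category", "").strip() not in ALLOWED_CATEGORIES:
            errors.append(f"line {line_count}: unknown category")
    if headers is None:
        errors.append("no header line with more than 10 headers")
    if len(metadata) != 2:
        errors.append(f"expected 2 metadata fields, found {len(metadata)}")
    if line_count < MIN_LINES:
        errors.append(f"expected more than 27 lines, found {line_count}")
    return metadata, errors
```

Even at this size the rules are buried in control flow: the header test, the metadata capture, and the per-line checks all share one loop, and each new file format would need its own variation of it.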