Your First Validation, The Hard Way
Super simple rules to demonstrate how validation works
Last updated
Super simple rules to demonstrate how validation works
Last updated
As you saw in the Quickstart, CsvPath is easy to set up. But what value does it add?
CsvPath data quality checks for CSV and Excel files is a unique capability. Let's see how it work by making the Quickstart into something a little more realistic.
In this example we're going to first take a rules-based approach that will give you a sense of the CsvPath Language's power. In a following part, we will turn around and create the simplest thing that could possibly work for the same simple example, giving you a sense for how easy the CsvPath Language can can be. And in the third part, we remove the Python by adding the CsvPath Library's command line interface.
Our mission is to write a csvpath that can keep some bad data from getting loaded into our data management systems. Our goals for the data are:
There cannot be blanks; every header must have data
The files' lastname
header cannot have values greater than 30 characters
The first header must be firstname
Our strategy is to collect any lines that break these rules. The validity of the file as a whole will depend on there being no failing lines.
By default, the CsvPath library config raises errors when it runs into problems. This can halt a run for a simple built-in validation error. For e.g. add("five", 3)
does not equal 8
because "five" is not a number. Finding that error is good, but in some cases you might not want the validation run to stop with a big noisy error.
To have your CsvPath
instance neatly print errors, but not raise exceptions, you can make a quick change to the config.ini
file. By default the config file is in ./config/config.ini
. Make the change to this line:
As you can see, in my config.ini
the CsvPath
instance error policy is on line 8. Just remove the keyword raise
from that line, save, and you're good to go. Now, for that incorrect addition above, you get a concise error message like this, rather than a big exception stack dump: Line 3: Wrong value in match component 0: mismatch in add
Paste this rule into the csvpath string. It goes in the matching part, after the scanning part. The scanning part of a csvpath is at the beginning. It looks like this: [1*][
~ Apply three rules to check if a CSV file meets expectations ~
Your file should now look something like:
Csvpaths can include line breaks, so you can format your csvpath any way you like.
What did we just do? We added a comment saying what our csvpath does. That's all. Comments are completely optional, but they are useful.
Here is the file so far.
Let's continue.
Our csvpath's first rule is that valid files have data under every header. CsvPath looks at the 0th line to determine the file's headers. Of course, some files don't include headers. We'll see how to handle that later.
Add a test for blanks below the comment. Make it replace the yes()
function:
What does this statement do?
As we iterate through the CSV file line by line, the all()
looks at values and returns True
if all of them have data. We pass all()
a header()
function to direct it to look in all the headers.
In our last script we used the fast_forward()
method to run though the CSV file without interacting with it. This time, let's collect the offending lines. To do that, we need to use not()
to match any line that has blanks. We also have to switch from fast_forward()
to collect()
.
And one more change. Rather than using is_valid, let's say that the count of invalid lines determines if our file is valid. collect()
returns the lines. Let's just print out how many.
When you've made these changes, your script should look like:
Try it out. You should see that our file is still valid. Here's our CSV file again.
To see that the script works, edit the CSV file to make a blank. Take Sam Cat's lastname out.
This time your script should tell you that there is one invalid line.
Next let's create our rule about long lastnames. Our requirements said that lastnames cannot be longer than 30 characters.
Add this line to your csvpath below the line about blanks.
This line uses functions to check a particular header's length. We're saying that values under the lastname
header must be less than 30 characters long. Csvpaths use #
to indicate a header.
Your csvpath should now look like:
Pretty straightforward, right? But wait! We can do better. Swap out the line you just added for:
That's much better! There are more than 75 built-in functions so there is often a simpler way to do things. The list of functions is on the Github repo.
Here's what your file should be like now.
Now, we have a problem. The problem is that we are ANDing these two rules together. What's the problem? When we are ANDing, by default, we match a line if all the rules are true. In this case, though, we are expecting lines to match if only one of the two rules is true.
What we need is an OR operation. That's easy! Just add this line right below where you created your CsvPath instance.
This change will make CsvPath match lines where any one or more rules are true. This way, we can call out the possible problems with the file and collect lines that match those problems without the rules interfering with each other.
Your script should now look like:
Try running it. You should still see that there is one invalid line. After you confirm that, try creating a very long lastname for Sam the Cat and no lastname for Fred the Dog. CsvPath should collect the two invalid lines.
Our validation csvpath wants the first column to be called firstname
. That seems like an easy rule to satisify, right? Just type it that way.
We will of course do that. But consider if your csvpath is going to be validating files from a nightly batch job. Since we don't want to rock the keyboard at 3 a.m., we need a rule to check for us, even if it is just a simple rule.
Add this below the max length rule:
This rule says that the 0th header must be the firstname
header. Headers can be accessed by their numeric position — their index. The index is 0-based, like a Python list. That means the first header is #0
. In our rule, the function looks up the name of the header indicated by 0
. If it doesn't equal "firstname"
we collect the invalid line.
Let's pause on an interesting point.
CsvPath applies a csvpath to a CSV file line by line. It can collect each line that matches its rules. CsvPath is all about matching lines.
In our rule, the column()
function will be called on every line, not just the header line. That means that if the first header is not firstname
, our rule would collect every line in the file. For our purposes, that would indicate that all the lines were bad. In reality, in that case, we would consider the file as a whole invalid, not the individual lines.
What to do? Easy! Change the line you just added so that the rule looks like:
Now the rule doesn't determine if individual lines match match the header name requirement. Instead it marks the file as invalid. This allows you to collect the lines with problems and also know if the file has problems that cut across the lines.
Your script should now look something like:
There are a couple of new things here. First there is a qualifier, nocontrib
, on the not()
that wraps the header_name()
function. Qualifiers modify the behavior of functions and other match components. You can read all about qualifiers here.
The nocontrib
qualifier says that the function it is on should not determine if the line matches. It does not contribute. Again, we need this because every line will match on the header name problem.
The second new thing is the when/do. That's the ->
operator. When/do says that when the left hand side is True
the right hand side happens. In this case, when there is a header name mismatch we fail the file using the fail()
. If you are concerned that this when/do could be executed many times, you are right. But in this case that doesn't matter because calling fail()
is cheap in performance terms and always has the same effect.
Now when you run your script you will see... nothing new. Let's add a print statement to let us know if the file overall is valid. Put this at the bottom of your script.
Now you should see something like:
You might wonder if we should really be saying that the file as a whole is invalid only if we find a problem with the headers. Isn't it a problem that two lines are invalid, as well? That is a very reasonable opinion. CsvPath cannot make assumption about if you are collecting invalid lines or valid ones. That is a decision about your validation strategy—as we have said, there are always multiple ways to do things.
However, we can easily mark the file invalid when we have lines failing. Just add this to the bottom of your csvpath string:
And for consistency's sake, modify your error message:
Now our validation script is giving us complete and actionable information about our file. There's much more we could do, of course—we're just scratching the surface! We'll explore some of those options and improvements in the next example.
Here is the final state of the script to compare against your own.