Another Example, Part 1
Last updated
Last updated
In the first example we created validation rules for a straightforward file. In this next example we'll take on a harder format, handle more rules, and set ourselves up for production use. Let's dive in!
Some CSV files have documentation or ancillary data at the top, above the main data lines. In Excel this is even more common. While the additional data is useful, it complicates recognizing and handling headers correctly. Validation becomes more difficult. We're going to validate an example like some we've seen in the wild.
In this example, you will see more functions, headers, qualifiers, and references. We'll explain them as we go. Afterwards, to learn more about these topics you can take a look at:
Functions, including the list of all built-in functions
And, for a general overview of the match components, start here.
This is a simplified order for goods from retail stores. It is a monthly report delivered automatically.
Here is the file. It is shortened, but it is enough for this example.
There are so many things to possibly validate here. Types of products, IDs, prices. Just sticking with simple checks we could come up with many rules. We'll create a csvpath that applies six rules while dealing with the complicating top-matter.
As far as that top-matter, our csvpath has some requirements. It needs to:
Recognize that the header line is line 9 (0-based)
Discard the lines that are comments — unless
There is information we want to capture in that line, specifically: user and run ID
Doable? Absolutely!
And let's try doing it without just skipping the top 7 lines. Skipping the lines would obviously be trivial. Just create the scanning part of the csvpath like:
But let's say we don't trust that those lines will always be there or a consistent number to skip. And, anyway, we want that user and run ID metadata.
We need to pull two fields from the comments:
The username of the person to contact about the file
The run ID of the process that generated the file
As well, say that the orders files will always:
Be more than 10 headers wide
Have just 1 header row per file
Have more than 10 data rows
And of course there will be requirements for the fields:
The product category must be correct
Lines must have UPCs and SKUs
Prices must be in the right format
If this were an actual order file in a real-world situation we could imaging collecting average selling prices, checking the vender's name, looking for duplicate lines, and more. All things csvpaths can do.
We'll step through the strategies for each of the bullets. Then see them together in one file with a trivial Python script that runs the csvpath. Then we'll switch to CsvPaths to create a more long-term management friendly version. That will show you the advantages of small modular csvpaths working together.
You can capture the metadata using regular expressions. The comment lines are prefixed with a #
character. We can use that and a regex to grab the values we want and put them in variables.
Variables are new; we haven't seen them in the previous examples. Everytime you run a csvpath your CsvPath instance collects several things:
Metadata
Runtime stats
Variables
You can read about these types of data here. Variables are like session variables in web development. They come in three flavors:
Strings or numbers
Dictionaries of named "tracking values"
Stacks of strings or numbers
We'll see each of these types. The variables are available in:
Your CsvPath instance at path.variables
(assuming your CsvPath is called "path")
In your csvpath strings using @varname
Within any csvpath string print()
statements at $.variables.varname
Again, we'll see each of these uses of variables as we go.
So, then, let's capture that metadata! Take your last example and replace the rules with these two lines. You can also remove the validity check at the bottom, update the comment, etc.
Your file will look something like:
These two match components look at comment lines and capture data.
Notice that the starts_with()
function looks at the 0th header, #0
. We used the #
character to indicate headers in our last example. This is the same thing, except this time we're pointing to the headers using their 0-based index number. #0
is the first header. We use #0
because there is no proper header line at the top of the file so we don't have a name of the 0th header.
#0 looks a lot like the CSV file's comments. But we know comments in CsvPath use the ~
character. This CSV file just happens to use #
as its comment character.
The starts_with()
function is much like similarly named functions in Python, SQL, etc. It returns True
if a variable starts with a certain string.
Next we have a when/do expression. When the left-hand side of ->
is True, we do what is on the right-hand side.
The variable @runid
will capture the ID of the run if it is found in the comments. It has the notnone
qualifier. Qualifiers are explained on this page. The notnone
qualifier does what it sounds like. It prevents the variable from capturing the None
value. None
is the Python way of saying null.
Regular expressions
Regular expressions, or regexes, match substrings. They are powerful, inscrutable, incredibly useful in virtually every language and many tools, and even sometimes fun. Regardless, they are a deep topic on their own. We won't dissect these expressions here. You can learn more about regexes from many tutorials, including this one. This regex debugger may also be a help. The page on CsvPath's regex tools is here.
In brief, we're saying capture a value from a line that has the substring Run ID:
in the 0th header.
To recap, now that we have our metadata fields captured from the CSV file's comments your script should look like this:
When you run it you will see something like:
Notice that because this is the first time I ran this script in a new project CsvPath created a default config file and a logs directory. But more importantly, we got our two metadata values in the CsvPath instance's variables. Perfect!
Moving on from the comments at the top of the file, we know there is one header row. Our requirements said there are >= 10 headers. That's easy to spot. One way to do it might be:
These two match components handle those requirements.
The first one skips a line if the line doesn't have enough headers. At the top of the file, those are probably comment lines and we already took care of the comments. This is an illustration of how order matters in CsvPath.
Match components are activated from left to right, top to bottom. What I do in match component A may affect match component B. Or, in this case, we're just skipping B, C, D, etc. altogether.
count_headers_in_line()
explains itself pretty clearly. The reset_headers()
function updates the CsvPath instance's understanding of what the file's headers are. When we use reset_headers()
we are saying take the current line as being the header row. That resets the names of the headers to be the values in the current row.
When we see the number of line values jump to 10 or more we can safely assume we hit the header row and act accordingly.
Here is a more complete approach that includes a few nice-to-have things:
The three parts are:
We skip any line with less than 10 headers
Then we reset the headers if they jump to 10 or more
When we see the headers reset the first time we print an alert
The skip()
function is a way to jump ahead to the next line from anywhere in a csvpath. Again, order is important in csvpaths. When we use skip()
we are saying skip ahead to the next line without doing anything else. In this case, we use the lt()
, or less than, function to tell skip()
if it should act.
In the next block, we use the mismatch()
function. mismatch()
tells us the difference between how many header values we see in a line versus the number of headers for the file. Imagine if a SQL database table had four columns, but in one row there were only three columns. That could not happen in a SQL database, but in a CSV file, it happens frequently.
We use the string "signed"
to tell mismatch() that it should give us a positive or negative number, not just the absolute value. Without "signed"
when there are four headers in the file but only three in a row mismatch reports 1
. With "signed" mismatch() reports -1
. In our case, we only care if we suddenly have many more headers, so we need to see a positive number.
The last of the three parts has a lot going on. Here's the match component by itself:
This is our first use of the print()
function. Unlike most general purpose langages, in CsvPath printing is a big deal. The reason is that the act of validating a file means using rules to find unexpected things and communicating that information to the user. Schematron is the schema language most like CsvPath. It points to problems and communicate with the user in plain English. (Or your humans language of choice). XSD, JSON Schema, etc. all have the dual purpose of defining data structures and communicating when data doesn't match.
CsvPath has some nice printing tricks up its sleeve. We're using a couple of them.
First, the qualifiers. Qualifiers are awesome. They can do so many neat things to make match components more powerful. In this case we are using both onchange
and once
. Both do exactly what they sound like:
print()
with onchange
only prints when the print string is different. That is useful when you are printing variables that may not change in some lines. With onchange
you only see the print()
when it has something new to say.
print()
with once
simply prints just one time. That's it, one and done.
print()
with both onchange
and once
is even more interesting. It means that print()
takes its first chance to print. The first printout is considered a change. But then once
prevents any more printouts. This is good because mismatch()
will report 0
as the number of headers that are unexpectedly found or missing. That 0
would be considered a change by print()
. We don't need to be told that nothing happened.
As you can see, print()
can take a second argument that is executed only if print()
itself runs. We use this to nest a skip()
that jumps us to the next line. As with most things in CsvPath, we could have done this other ways, but this is a good approach.
Your script should now look like:
Running your script should look like what you would expect:
Congrats! You are done with the hard part. The rest of the validations are simple business rules.
Next we'll check if the product category is correct. This is a pretty straightforward rule.
Following the same pattern we saw in the last example, we are going to identify a problem row and print a validation message. The in()
function looks at the value of the #category
header and checks if it is in a delimited string.
We could equally well ask in()
to check against the values of one or more other match components. Or, we could even use a reference to point in()
towards a list of values created by a different csvpath. But for now we're keeping it simple with the delimited list.
Notice that we asked print()
to activate a fail()
function, as well as printing a message. fail()
sets the is_valid
property of a csvpath to False
. We are saying that the CSV file is invalid. If we needed to, we could use this information programmatically in Python, in CsvPath print()
messages, or to take action in other match components using the failed()
or valid()
functions as triggers.
When you run your script now, you see nothing changed. That's because all our categories are correct. To make sure your rule works change line twelve's category from OFFICE to OPERA.
Now you assuming your script looks like this:
When you run it you should see something like this:
Moving right along, the next rule is that prices must exist and be in the correct format.
Again we use the same rule pattern. In this case, we tap regular expressions again to check that a price:
Exists in the last header
Starts with a $, optionally
Is made of numbers
Has a decimal point
Has at most two numbers byond the decimal point
This probably isn't the perfect price field description, but it is pretty good.
The new thing here is end()
function. end()
is a pointer to the last header. If we're not certain what the last header name or index is we can use end()
to refer to it. Our options are:
Use a header name, possibly quoted, like #"a price"
Use an index like #14
Use end()
, possibly with an offset integer like like end(-2)
In this case, let's say we know price is always the last column.
Drop this rule in your script and run it. If your file follows how we've been creating the example, you will hear some complaints from Python. It is unhappy about invalid escapes. Ah, validity. Easy to fix. Change your line to:
The additional \
chars escape the regular expression escapes for Python. This is normal regular expression fun. We all give up some amount of our sanity for accessing the power of regexes.
Your script should now look like:
It is definitely getting long. But we're getting close to complete. And in Part 2 of this example we'll see how to make the validation csvpath much simpler and more manageable.
Run your script and you should see output like this:
Retailers, distributors, and manufacturers use Universal Product Codes to identify products. Retailers use Stock Keeping Units to identify and count products in inventory. Both are very important numbers to the business. If they were missing the orders file would be for sure invalid.
For all that, though, checking that they are present is easy. By now you should expect the pattern we are following: when/do. We're checking the #SKU and #UPC headers. Remember that CsvPath is case sensitive.
Paste or type in these two lines at the bottom of the csvpath string in your script.
When you run your script you again see no additional issues. So let's make one. In line 11 remove the value in the SKU field, the fourth header from the end: 9933757492
. It should look like this:
Run your script again. You should see something close to:
If that works correctly your script looks like this:
We're getting there! One more rule to go.
Our requirements say we are expecting a file that is greater than 10 lines. That gives room for the comments at the top and at least a one or two orders. If we don't see at least 10 lines we fail the file.
Let's use this match component:
We'll use print()
's once
qualifier to make sure we don't generate a lot of noise. Also note that we use a double period in two places. Doubling up periods is an escape for when you want to put a period directly at the end of a reference in a print()
. That is the only place it is needed.
With all our requirements met let's take a moment to clean up a bit. We don't need to display the updated headers, now that we know that is working. And we can put the last total_lines() check first in the list of concerns. Also, a few more comments wouldn't be a bad idea. Here's how your script will look with those changes:
I bumped up the number of lines requirement so that we could see the rule's effect. When you run this final version of your script you should see something like:
Congrats! You have a very functional csvpath to check your CSV files with.
As we saw in the first two examples, the Python side of CsvPath is easy.
Step-by-step:
We import CsvPath and create an instance that will run our csvpath
Assuming we drop our csvpath into the ORDER_RULES variable, we parse our csvpath to prepare it to run
Since we aren't actually using the matched lines programmatically let's switch to the fast_forward()
method. With path.fast_forward()
we are asking CsvPath to run the CSV file through our csvpath without stopping or returning any data to us. We do that because our validations all print their gotchas report-style.
Last, we print out our csvpath's verdict: is the CSV file goood, or not
For this simple example that's enough. But in a production setting you might imagine sending an email, updating a database, moving the file to a good or not good directory, or the like.
That's a lot. But the result is well organized, commented, and self-documenting. Of course, longer validation rulesets are easy to find, but this one isn't trivial.
Still, we have to ask, is there a better way? Something more manageable over the long term? An approach that might be easier to develop and debug?
Yes, absolutely! We'll take all that on in Part 2 of this example.