Another Example, Part 2
In this second part of the orders data file example we are going to modify our csvpath and its Python runner to be more automation- and management-friendly.
Requirements
First let's call out the requirements. What are we trying to achieve?
Separate the code from the csvpath
Break down the validation rules
Give ourselves more options for reporting and error handling
Set ourselves up for making references to other data for lookups
Let's say more.
Separate the code
Separating the code from the csvpath is straightforward. We don't want to colocate the csvpath in a Python variable. That constrains our formatting choices, complicates both updates and the code itself, and makes the solution single-use. We don't want to keep that physical file reference built into the csvpath, either; ideally the root of our path names no file at all, so the CSV can be supplied at runtime.
Componentize the csvpath rules
Secondly, having our six validation concerns in one file makes us work harder in development. With six rules you have to take all of them into account when you test your csvpath. Writing and testing one rule at a time, separately, is simpler. Our solution, ideally, should let us have six csvpaths that we can run and manage as a group.
Capture print statements and errors
Third, printing validation messages is an excellent way to do data validation. This kind of reporting may seem simple, but simple is good. That said, could we wish for more control over the output? Sure. We might want to craft an email or some other kind of report. It would be nice to be able to do that without scraping the command line.
Likewise with error handling. We'll test our csvpath, of course! But at runtime things happen. And keep in mind that the data may change; we don't control that. So we'd like robust error handling with outputs we can inspect.
References to other data
Finally, our csvpath doesn't currently refer to outside data, but it could. And, given what the data is, there is a strong possibility that in a real scenario we would want to check UPCs or company names, or other things against known-good values. To do that we need our references to be able to find other data sets. The way our script is currently set up, we can't do that.
The Solution
We're going to use a CsvPaths instance. CsvPaths is a manager class. It organizes CSV files, csvpath strings, and CsvPath instances and their run results.
A CsvPaths instance creates an instance of CsvPath for each csvpath it runs. It has managers that find files, organize the sets of csvpaths to run, and capture the results of runs. And it gives you options that let you decide the best way to run multiple csvpaths for your situation.
As you may already know from other pages, the main difference in setting up a CsvPaths instance is that you need to point its managers to your files. Your results live in the results manager. You access them using the same name as you used for the paths that created the results.
That's about it. Easy!
The Python
In this second part of the example, we're going to change our csvpath string more than the Python code. But the Python motivates our changes to the csvpath. So let's start with the Python.
Create a new directory and make a Python file with the contents below. Also create a csvpaths directory and a csvs directory.
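Here's a rough sketch of the kind of runner we're building. The manager and method names follow the ones discussed below (add_named_file, add_named_paths_from_dir, fast_forward_paths, get_named_results), but treat the exact keyword arguments, the orders.csv file name, and the group names as assumptions to adapt to your setup.

```python
from csvpath import CsvPaths

# create the manager class. print_default=False keeps print() output
# off the terminal; the printouts are still captured with the results.
paths = CsvPaths(print_default=False)

# named-file: a short name pointing to the physical CSV file
paths.file_manager.add_named_file(name="orders", path="csvs/orders.csv")

# named-paths: load every .csvpath file in the csvpaths directory.
# loaded without a name, each file becomes its own named-paths group,
# named for the file minus its extension.
paths.paths_manager.add_named_paths_from_dir(directory="csvpaths")

# run one named-paths group, serially, against the named file.
# "metadata" is one of the rule files we create below.
paths.fast_forward_paths(pathsname="metadata", filename="orders")

# pull the results back out under the same name used for the csvpaths
results = paths.results_manager.get_named_results("metadata")
print(f"ran {len(results)} csvpath(s)")
```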
None of this is complicated stuff. Let's go through it.
Printing and logging
First we import the CsvPath library and create an instance of CsvPaths. We create it with print_default=False.
print_default=False prevents the CsvPath instances that run your csvpaths from printing to the default command line Printer. Your print statements will still be captured and available with your results from the ResultsManager. By default both things would happen: you would see results on the command line and you would also get print statements with your results.
Now that you've seen print_default=False, go ahead and delete it. We'll want to see the output in the terminal, at least at first.
Next, we do another completely optional thing: set the logging level. We're just seeing how to do it, for future reference. Out of the box, CsvPath is already set to warn by default. You can read more about setting up logging and error handling policies here.
Named-files, named-paths, named-results
On to more important things. First, we set up the named-files and named-paths. Named-files are just short names that point to full filesystem paths. They are convenient and help you keep your CSV files organized.
Named-paths are more interesting. These are sets of csvpath strings that can be run as a group. You set them up by one of:
Creating a dict in Python or using a JSON file
Pointing to a directory full of csvpath files
Pointing to a file that contains multiple csvpath strings
We'll try each. When we're done with Part 2 of our orders file example, we're going to have the last one—multiple csvpaths in a single file—but this time in a more DataOps-friendly way than in Part 1.
For more background, you can read more about named-paths and named-files here.
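To show the shape of the first option, here is a hypothetical snippet that registers a named-paths group from a Python list of csvpath strings. The add_named_paths call and its keyword arguments are assumptions, and the two csvpaths are trivial placeholders rather than real rules.

```python
# hypothetical: register a named-paths group programmatically.
# the csvpath strings here are placeholders, not the orders rules.
paths.paths_manager.add_named_paths(
    name="orders",
    paths=[
        '$[*][ print("first placeholder csvpath ran") ]',
        '$[*][ print("second placeholder csvpath ran") ]',
    ],
)
```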
The fast_forward_paths() method
After we set up the named files and named paths we have CsvPaths do fast_forward_paths(). This is similar to the fast_forward() method we called on our CsvPath instance in the first part of the example. The difference is that CsvPaths runs all the csvpaths in the named-paths you are using. That's CsvPaths's job: running multiple csvpaths against CSV files.
With fast_forward_paths(), CsvPaths runs your csvpaths serially. That means a few things:
Order is guaranteed
Results of earlier csvpaths are available for later csvpaths to use
Every row is seen before the next csvpath is run, making side-effects like print() simpler
Every csvpath has to iterate over the same CSV file, which is not the most efficient approach
There is an equally simple-to-use fast forward method that runs the csvpaths breadth-first: fast_forward_by_line(). That method solves the efficiency problem. Let's leave it as a topic for another time, beyond a quick look at its shape below.
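Just to show what calling it would look like, here's a one-line sketch; it assumes the method takes the same keyword arguments as fast_forward_paths(), which you should verify against your version of the library.

```python
# breadth-first run: every csvpath sees line 1, then every csvpath sees
# line 2, and so on, so the CSV file is only iterated once.
# keyword argument names are assumed to mirror fast_forward_paths().
paths.fast_forward_by_line(pathsname="orders", filename="orders")
```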
At the bottom of the Python we pull the results from the results manager using the same name as our named-paths name. The results live under that name because they are the output of running those csvpaths.
Improving Our Csvpath
The cool part happens over in the csvpath.
We want to achieve a few things.
Easy development of independent rules
Separated results, printouts, and error handling
Overall better flexibility
We're going to do this by creating six csvpaths, in place of the one we have now. When we're done we'll have the option of putting them all in one file or in multiple files in one directory.
There's only one real challenge with breaking down our big csvpath into multiple little ones, and it is specific to this example: the top matter that precedes the data in our CSV file.
Each csvpath has to handle the top matter. It definitely complicates things. That's exactly why we used it as the example! We're talking about the top-matter handling from our Part 1 csvpath.
Since this part came first in our csvpath and shielded all the remaining match components below it, we need a way to give that same shielding to our match components in their individual csvpaths. And we don't want to repeat this code six times.
top_matter_import.csvpath
The answer is to use the import() function. We can import this fragment into our other csvpaths to get the same effect without cut-and-paste. Creating an importable csvpath is simple because it's just a csvpath. Create a file named top_matter_import.csvpath, copy in the top-matter handling csvpath from Part 1, and you're done.
We're going to create six more csvpath files. Most will import this one. There's no reason to go slowly, since you already created all of these once in a single file, so here we go.
metadata.csvpath
Wait, what's that ---- CSVPATH ---- block? I'm glad you asked. It is a separator between csvpaths that live in the same file. It is of course completely optional at this stage because we're using multiple files with one csvpath in each. Doesn't hurt to have it there, though.
reset.csvpath
In this file you'll see our top_matter_import for the first time.
You may have noticed that these csvpath files have comments outside the csvpath itself. These comments are important. While we're not doing anything with them at the moment, we could add metadata fields that describe the csvpath. If you add an ID or a name metadata field to a comment, you can use it to reference the individual csvpath, even when it is bundled with others under the same named-paths name. Important, for sure, but we'll look at how to do that another time.
Also notice that we call stop(). This ends processing for this csvpath. It does not affect the other csvpaths that our CsvPaths instance is also running as part of this batch. CsvPaths will count the stop() calls, and if all csvpaths in the batch are stopped it will stop the run early. Since we're running our csvpaths serially, when CsvPaths gets the stop() it just skips on to the next csvpath.
file_length.csvpath
categories.csvpath
Two more csvpaths to go!
prices.csvpath
sku_upc.csvpath
There, that's all of them.
You should now have seven .csvpath files in your csvpaths dir. It's worth pointing out that you don't have to use the .csvpath extension. We use it because it makes clear what is contained in the file. But whatever works for you is fine. If you choose another file extension, take a quick look at config/config.ini to let CsvPath know what your extension of choice is.
Back to your modified script. Here's where we left it. From this you can run any of the csvpath files you just created. Each file will be named in the named-paths manager by its name (minus the extension). Or you could iterate over your csvpaths by name. Ultimately, this isn't the way we want to run them, but it is a good test.
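As a quick test, a loop like the following would run them one at a time. The group names come from the six file names; the method and keyword names are the same assumptions as in the sketch above.

```python
# each .csvpath file loaded from the directory is its own named-paths
# group, named for the file minus its extension
rule_names = ["metadata", "reset", "file_length", "categories", "prices", "sku_upc"]

for name in rule_names:
    # run one rule file against the orders CSV
    paths.fast_forward_paths(pathsname=name, filename="orders")
    # results are stored under the same name
    results = paths.results_manager.get_named_results(name)
    print(f"{name}: {len(results)} result(s)")
```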
The results should be completely unsurprising. These are the same csvpath steps we created in Part 1, just pulled apart for easier development and management. But take a look back at Part 1: that script is far more complex than these small, mostly independent csvpaths. The change is a win for both rapid development and long-term maintainability.
Deployment Choices
We're not quite done, though. We want to run all of the csvpaths as a single unit. Sure, we can loop over them, as we just did. But we can do better! Let's take one more step: bringing them all together in a neat package.
This need raises the big question: do we want all the working csvpaths in one file or one directory, or do we want to create a JSON file that identifies the groups of csvpaths that are used together? So many options! Luckily all are easy to do.
All are good approaches. But the JSON option sure has a lot going for it. Let's look at that option first.
The JSON option
Using a JSON file to define the group of csvpaths is not hard. Create a JSON file that looks like this:
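Something along these lines should work. The shape shown here, a mapping of group names to lists of csvpath files, plus the specific file names and locations, are assumptions; check the named-paths documentation for the exact schema your version expects.

```json
{
    "orders": [
        "csvpaths/metadata.csvpath",
        "csvpaths/reset.csvpath",
        "csvpaths/file_length.csvpath",
        "csvpaths/categories.csvpath",
        "csvpaths/prices.csvpath",
        "csvpaths/sku_upc.csvpath"
    ],
    "top_matter_import": [
        "csvpaths/top_matter_import.csvpath"
    ]
}
```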
This creates a "named-paths" group named orders. We need a second named-paths group for the top matter csvpath. It has to be separate so that it can be imported.
You would run orders csvpaths using Python like this:
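Roughly like this, assuming the paths manager has a JSON loader along these lines; the method name, its keyword argument, and the JSON file name are assumptions to check against the library docs.

```python
# load both named-paths groups ("orders" and "top_matter_import")
# from the JSON file, then run the orders group
paths.paths_manager.add_named_paths_from_json(file_path="orders.json")
paths.fast_forward_paths(pathsname="orders", filename="orders")
```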
That's easy. At the cost of creating one more file, you get all the benefits with very little work.
The directory option
Now the directory option. Oh, wait, we already did that. Just make two minor changes in the Python, sketched below.
We still need two named-paths groups, so we have to create a directory for each. Call the directories orders and top_matter_import, though the names don't matter. Copy the csvpath files into their respective directories.
Then we just import the directories, giving each a name. Last time we used add_named_paths_from_dir we did not specify a name, resulting in each file being its own named-paths group.
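The two changes might look something like this; the directory locations and keyword argument names are assumptions.

```python
# one named-paths group per directory, this time with explicit names
paths.paths_manager.add_named_paths_from_dir(directory="csvpaths/orders", name="orders")
paths.paths_manager.add_named_paths_from_dir(
    directory="csvpaths/top_matter_import", name="top_matter_import"
)
```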
The one file option
The last option, putting all your csvpaths in one file, takes the most work. It's not hard, but it takes a minute, unlike the other two options.
Create a file called csvpaths/orders.csvpath. Open the file. We're going to fill it with the six csvpaths we created in the six files. They will be separated by the ---- CSVPATH ---- separator. That's four dashes, a space, the word CSVPATH in uppercase, another space, and four more dashes.
Paste in the paths between the separators. Use the order we named them in our Python script:
metadata
reset
file_length
categories
prices
sku_upc
The order is not super important to this example. But order is important in CsvPaths in general, so it's good to get used to being aware of it.
We're leaving aside our top matter csvpath. It needs to be imported so it needs to be its own named-paths group.
Now orders.csvpath should hold all six csvpaths, in that order, with the ---- CSVPATH ---- separator between them. And top_matter_import.csvpath stays exactly as we created it earlier: a single importable csvpath.
Now, with our orders file still in the csvs directory, those same two lines of Python provide the lightning. Your script should look like:
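Here is a rough sketch of what that script could look like. It assumes a loader for a multi-csvpath file named along the lines of add_named_paths_from_file; that name, its keyword arguments, and the CSV file name are assumptions, so adjust them to your version of the library.

```python
from csvpath import CsvPaths

paths = CsvPaths()

# the CSV file we are validating (file name assumed for this sketch)
paths.file_manager.add_named_file(name="orders", path="csvs/orders.csv")

# two named-paths groups: the six rules bundled in one file, plus the
# importable top-matter fragment in its own file
paths.paths_manager.add_named_paths_from_file(
    name="orders", file_path="csvpaths/orders.csvpath"
)
paths.paths_manager.add_named_paths_from_file(
    name="top_matter_import", file_path="csvpaths/top_matter_import.csvpath"
)

# run the whole orders group, serially, against the orders CSV
paths.fast_forward_paths(pathsname="orders", filename="orders")

# and pull the results under the same name
results = paths.results_manager.get_named_results("orders")
print(f"{len(results)} csvpaths ran")
```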
As you can see, this approach has the fewest files. It is definitely a bit less flexible than the other options. But its conciseness lets you manage fewer assets, version control and view all the csvpaths together, and run them in the most obviously deterministic way. (Emphasis on the word "obviously"; the JSON option is also completely deterministic.)
Options are good! You can pick whichever deployment option is best for your requirements and work style. Regardless of the choice, you know you are getting the same result.
And that's it. Congrats on a job well done! You now have an automation-friendly rule set using a pattern that will scale to any size DataOps operation. And you have three options for how to package and deploy your CsvPath files. Way to go!