Serial Or Breadth-first Runs?
Using CsvPaths, you have a choice to run multiple csvpaths against a file serially or line-by-line. What is the difference? Why would we choose one over the other?
The methods
On CsvPaths, look to these methods for serial runs:
collect_paths()
fast_forward_paths()
next_paths()
Using these methods, CsvPaths runs a csvpath through every line in the CSV file before it moves to the next csvpath and restarts the file from the first line.
For "breadth-first" runs, look for these CsvPaths methods. They have every csvpath in the run examine each line before CsvPaths moves forward to the next line:
collect_by_line()
fast_forward_by_line()
next_by_line()
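The difference between the two families of methods comes down to the order in which csvpaths and lines are visited. The following is a minimal sketch of that ordering in plain Python. It is a toy model, not the CsvPaths implementation: the function names `serial_order` and `breadth_first_order` are hypothetical and exist only for illustration.

```python
from typing import List, Tuple

# One visit = (csvpath name, line number)
Visit = Tuple[str, int]


def serial_order(path_names: List[str], line_count: int) -> List[Visit]:
    """Serial: each csvpath sees every line before the next csvpath starts."""
    return [(name, n) for name in path_names for n in range(line_count)]


def breadth_first_order(path_names: List[str], line_count: int) -> List[Visit]:
    """Breadth-first: every csvpath sees a line before the run moves to the next line."""
    return [(name, n) for n in range(line_count) for name in path_names]


if __name__ == "__main__":
    print(serial_order(["first", "second"], 2))
    print(breadth_first_order(["first", "second"], 2))
```

In the serial ordering the outer loop is over csvpaths, so the file is restarted once per csvpath. In the breadth-first ordering the outer loop is over lines, so the file is read once.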
How to choose
Usually, the choice between a top-to-bottom serial run and a side-to-side breadth-first run is not a big decision. In the usual case it just doesn't matter. However, as your use of CsvPath expands and your needs grow, there are cases where the choice becomes important.
The most important property of a breadth-first run is that it allows one csvpath to modify the inputs to the next csvpath. This is because all the csvpaths are working on the same data in memory.
As a simple example, this output came from a property inventory CSV made available by the City of Boston. The top run is a serial run of two csvpaths. The bottom run is a breadth-first run of the same two csvpaths. Notice the change in capitalization in the #4 and #6 headers in the bottom, breadth-first run:
Here is the Python code that generated those results:
Line 3: we create our CsvPaths runner
Line 4: pick a CSV file
Lines 6 and 9: there are two simple csvpaths that we will run serially and breadth-first
Line 14: the serial run
Line 17: the breadth-first run
And here are the two csvpaths. The top, first csvpath:
And the second, bottom csvpath:
The file is larger than 80 MB, but we're only looking at line 10 and the match components are simple, so this is a quick test. The file manager also caches some metadata, which allows fast iterations. Together the csvpaths print out the tables in the screenshot above.
You can see that in lines 3 and 4 of the first csvpath we changed the city and street name headers by lower-casing and upper-casing them. In the serial execution that had no effect on the second csvpath. In the breadth-first run, by contrast, you can see that what we did in property.csvpath had an impact on what property2.csvpath received.
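The effect can be imitated in a few lines of plain Python. This is a hedged sketch under stated assumptions, not CsvPaths code: the headers `City`, `Street`, and `Zip`, the sample line, and all function names are made up for illustration. The model assumes that a serial run gives each csvpath a fresh copy of the headers when the file restarts, while a breadth-first run shares one copy in memory.

```python
from typing import Callable, List, Optional

# A step stands in for a csvpath: it receives the current headers and a line,
# and returns a snapshot of the headers it saw.
Step = Callable[[List[str], List[str]], List[str]]


def recase_headers(headers: List[str], line: List[str]) -> List[str]:
    """Stand-in for the first csvpath: changes header case in place."""
    headers[0] = headers[0].lower()
    headers[1] = headers[1].upper()
    return list(headers)


def report_headers(headers: List[str], line: List[str]) -> List[str]:
    """Stand-in for the second csvpath: only reports the headers it sees."""
    return list(headers)


def run_serial(steps: List[Step], headers: List[str],
               lines: List[List[str]]) -> List[Optional[List[str]]]:
    """Each step gets its own copy of the headers (assumption: file restarts)."""
    seen: List[Optional[List[str]]] = []
    for step in steps:
        own_headers = list(headers)  # fresh copy per step
        last = None
        for line in lines:
            last = step(own_headers, line)
        seen.append(last)
    return seen


def run_breadth_first(steps: List[Step], headers: List[str],
                      lines: List[List[str]]) -> List[Optional[List[str]]]:
    """All steps share one copy of the headers while working line by line."""
    shared = list(headers)  # one copy shared by every step
    seen: List[Optional[List[str]]] = [None] * len(steps)
    for line in lines:
        for i, step in enumerate(steps):
            seen[i] = step(shared, line)
    return seen


if __name__ == "__main__":
    headers = ["City", "Street", "Zip"]            # hypothetical headers
    lines = [["Boston", "Main St", "02108"]]       # hypothetical data
    steps = [recase_headers, report_headers]
    print(run_serial(steps, headers, lines))
    print(run_breadth_first(steps, headers, lines))
```

In this model the second step sees the original header case in the serial run and the modified case in the breadth-first run, which mirrors the behavior described above.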
Here is a visual that highlights the main considerations of serial vs. breadth-first runs. There aren't many. However, when they become important, they are very important.