Rewind / replay
How to start a run from an intermediate point in a past run
In Replay Using References we looked at how to start a run based on an intermediate state of a past run. Let's walk through an example. It will be quite artificial, but it will give you a concrete understanding that you can bring to more real-world situations.
Implementing the rewind concept builds on several parts:
Named-paths groups
Csvpath identities
Collected data
source-mode: preceding
References
In brief: we rewind to a csvpath step in a named-paths group of csvpaths. Each csvpath creates an intermediate output by selecting lines and, optionally, changing or adding data. Source-mode preceding pipes the intermediate data from one csvpath to the next. And references give us a way to point to the intermediate data and pull it into a csvpath from the named-paths group to create a new run.
We'll start with a simple three-csvpath named-paths group. Each csvpath is unimaginatively named: source1, source2, source3. The csvpaths pipe their output from one into the next. We run the csvpaths in the group serially. After the first run we find something to improve in the second csvpath. We want to run the group again after our improvements. But ideally we would skip the first csvpath because that one is time-consuming. What to do?
For more background on how results are stored please read: Where Do I Find Results?
Our named-paths group, called "sourcemode", looks like this:
You can run this group using this Python:
As usual, we create a CsvPaths instance and register our delimited file with a simple name. Likewise with our named-paths group. In this case we're putting all three csvpaths in one file, so that's all we have to tell our CsvPaths instance about.
After running this super simple setup, this is what we get.
The structure is:
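Roughly, each run of the named-paths group gets a date-stamped directory with one subdirectory per csvpath. The sketch below uses the names that appear in this example; the exact layout is an assumption:

```
sourcemode/                    # named-paths group
  2024-11-12_08-03-54/         # date-stamped run directory
    source1/
      data.csv                 # lines collected by source1
      meta.json                # settings and inputs for this csvpath
    source2/
      data.csv
      meta.json
    source3/
      data.csv
      meta.json
```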
The key result file for this example is data.csv. Every csvpath that a CsvPath instance uses to collect lines from a delimited file sends its collected data to a data.csv file. (You can choose not to store the data on disk, and of course you can also simply not collect it.)
You can see that each of these csvpaths modified the data in a way that affects the final results coming out of source3.
source1 added a count header and limited the lines collected to only those with both firstname and lastname
source2 appended a working header with a random number from 0 to 5, and limited the lines collected to only those where count is greater than 4
And, finally, source3 removed the count header
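To make the pipeline concrete, here is the same three-stage flow mimicked in plain Python. This is not the csvpath DSL, and the matching details (for instance, that count is a running match count) are assumptions; it only illustrates how each stage's output feeds the next:

```python
import random

def source1(rows):
    # Add a count header; keep only lines with both firstname and lastname.
    out = []
    n = 0
    for row in rows:
        if row.get("firstname") and row.get("lastname"):
            n += 1
            out.append(dict(row, count=n))
    return out

def source2(rows):
    # Append a working header with a random number from 0 to 5;
    # keep only lines where count is greater than 4.
    return [dict(r, working=random.randint(0, 5)) for r in rows if r["count"] > 4]

def source3(rows):
    # Remove the count header.
    return [{k: v for k, v in r.items() if k != "count"} for r in rows]

rows = [
    {"firstname": "Ada", "lastname": "Lovelace"},
    {"firstname": "Alan", "lastname": "Turing"},
    {"firstname": "Grace", "lastname": "Hopper"},
    {"firstname": "", "lastname": "Nameless"},       # dropped by source1
    {"firstname": "Edsger", "lastname": "Dijkstra"},
    {"firstname": "Barbara", "lastname": "Liskov"},
]
final = source3(source2(source1(rows)))
```

Because source2 keeps only rows with count greater than 4, a single row survives the pipeline, with working attached and count stripped.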
The result is this:
That's all well and good. But let's make a change in source2 and set ourselves up to use the output of the first run of source1 as our input. To keep it simple, just change working to thinking.
Great! Now, because we want to save bytes and watts, we're going to rewind to source2, using source1's data.csv in our new run. Here's the Python:
I can hear you muttering that's it?!
Yeah, that's it. In fact, the last line is a complete add-on: you may or may not want to inspect the results programmatically. Basically, rewind is a one-liner.
The two references are pretty straightforward. The datatypes are important:
results indicates that we're looking at the data resulting from running our named-paths
csvpaths is the datatype that represents the named-paths group we're working with
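Putting those datatypes in context, a reference has the shape $<named-paths name>.<datatype>.<rest>. The exact strings below are assumptions based on the conventions this page describes, parsed with plain Python just to show the anatomy:

```python
# Hypothetical rewind references, following the pattern described on this page:
# one points at result data, the other at csvpaths in the named-paths group.
data_ref = "$sourcemode.results.:last.source1"     # source1's data from the last run
paths_ref = "$sourcemode.csvpaths.source2:from"    # source2 and its successors

def split_ref(ref):
    # A reference is $<named-paths name>.<datatype>.<rest>
    name, datatype, rest = ref.lstrip("$").split(".", 2)
    return name, datatype, rest

print(split_ref(data_ref))   # ('sourcemode', 'results', ':last.source1')
print(split_ref(paths_ref))  # ('sourcemode', 'csvpaths', 'source2:from')
```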
The only other things to pay attention to are the tokens embedded in the references. There are four tokens you can use in references. They start with a colon:
For the date-stamps of runs:
:last
:first
For the position of csvpaths in their named-paths group:
:from
:to
They do just what you'd expect. The :first and :last tokens replace the right-hand side of run date-stamps to make them easier to remember and manipulate programmatically. The :from and :to tokens are appended to csvpath identities to indicate that we don't want just the csvpath identified, but also its predecessors or successors.
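One way to internalize the four tokens is a toy resolver. The lists and resolution logic below are assumptions that mirror the behavior just described, not the library's implementation:

```python
# Runs are date-stamped, oldest first; csvpaths are listed in group order.
runs = ["2024-11-12_08-03-54", "2024-11-12_08-03-54.0"]
group = ["source1", "source2", "source3"]

def resolve_run_token(token):
    # :first and :last stand in for the earliest and latest run date-stamps.
    return runs[0] if token == ":first" else runs[-1]

def resolve_position_token(identity, token):
    # identity:from means the csvpath and its successors;
    # identity:to means the csvpath and its predecessors.
    i = group.index(identity)
    return group[i:] if token == ":from" else group[: i + 1]
```

Under this reading, source2:from selects source2 and source3, which is exactly the portion of the group the rewind re-runs.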
The result after source3 is exactly what we were looking for. Which isn't much, in this trivial example, but still.
How do we know that we successfully did a rewind? Well, a couple of things. The biggest tell is that we don't have a results directory for source1.
You can also look at the source2 meta.json from the rewind run to see what the settings and inputs were. In a fully automated DataOps Collect, Store, Validate pattern, this is where the rubber hits the road. Or at least one of the places.
You can see on line 3 that we're sourcing data from source1. On line 16 you can see that our configuration calls for us to grab the data.csv output of the last csvpath.
Here is the same source2 meta.json information from the first run. Because we did the runs in one script, they ended up in adjacent directories: the first in 2024-11-12_08-03-54 and the second in 2024-11-12_08-03-54.0. Notice that in the metadata shown above and below, both times we pull source2's data from 2024-11-12_08-03-54, not from the second run, 2024-11-12_08-03-54.0.
You can see that we are pulling data from the named-file sourcemode, not a physical file path. And you can see that we captured a source-mode-source metadata key to identify how we swapped in source1's data.csv instead of using the sourcemode file.
And that's about it. Rewind and replay are not hard. In fact, pretty darned easy, right?