Getting Started With CsvPath + OpenLineage
Get started with Edge Data Governance the easy way. The instructions on this page should take you 15 to 45 minutes, depending on network speed, Docker startup time, and so on.
First, a bit on what we're aiming to do and why.
Lineage is about tracking the changes to data sets and their usage over time with the goal of explaining how every state in the data lifecycle happened. Clear lineage data makes finding, explaining, and fixing problems easier. To get a clear view of the lineage of a data set you need metadata — lots of it — and a way to analyze the information to tell the story of how things happened.
OpenLineage is an open standard for event-based lineage capture. Marquez is the server and webapp providing the reference API to collect and display OpenLineage events. CsvPath is an OpenLineage event source that provides copious metadata describing how your data moves through a consistent onboarding lifecycle.
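To make "event-based" concrete, here is a minimal sketch of what a pair of OpenLineage run events might look like when built as plain Python dicts. The field names follow the OpenLineage run event model. The job name, the producer URL, and the exact schema version in schemaURL are placeholders we made up for illustration. They are not what CsvPath actually emits.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type: str, namespace: str, job_name: str, run_id: str) -> dict:
    """Assemble a minimal OpenLineage-style run event as a plain dict."""
    return {
        "eventType": event_type,  # for example START, COMPLETE, or FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},  # every event of one run shares this ID
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [],   # datasets read by the run (left empty in this sketch)
        "outputs": [],  # datasets written by the run (left empty in this sketch)
        # Placeholder values: a real producer identifies itself here.
        "producer": "https://example.com/hypothetical-producer",
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunEvent",
    }


# A run is typically described by a START event followed by a COMPLETE event.
run_id = str(uuid.uuid4())
start = build_run_event("START", "Sunshine_Inc", "lineage_example", run_id)
complete = build_run_event("COMPLETE", "Sunshine_Inc", "lineage_example", run_id)
print(json.dumps(complete, indent=2))
```

A lineage server such as Marquez stitches events with the same runId into one run, and groups jobs by namespace.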
Together these open source tools fill the gap between MFT (managed file transfer) and the typical data lake architecture. They provide an unprecedented level of visibility into your data onboarding operation. With workflow, transformation, and processing tools like dbt, Airflow, and Spark also throwing off OpenLineage events, you now have a straightforward way to collect end-to-end lineage: from data partner, to data lake, to analytics and applications, and back out to the world as a data product or service.
That sets up your CsvPath library. For this example we only need the CLI so we're almost done. We'll create a dummy CsvPath language file to run and some dummy data in a moment.
Clone the Marquez GitHub repository:
In the marquez directory do:
After the images download and the server starts you should be done setting up Marquez.
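If you'd like to confirm that the server is really up before moving on, a quick check along these lines may help. This is a sketch that assumes the Marquez API is listening at http://localhost:5000/api/v1 (the web UI is on port 3000) and that the namespaces endpoint returns a JSON object with a namespaces list. Adjust the URL if your Docker setup maps ports differently.

```python
import json
import urllib.error
import urllib.request

# Assumption: the Marquez API is on port 5000; change this if yours differs.
MARQUEZ_API = "http://localhost:5000/api/v1"


def namespace_names(payload: dict) -> list:
    """Pull the namespace names out of a /namespaces response body."""
    return [ns["name"] for ns in payload.get("namespaces", [])]


def marquez_is_up(base_url: str = MARQUEZ_API, timeout: float = 3.0) -> bool:
    """Return True if the namespaces endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/namespaces", timeout=timeout) as resp:
            print("Marquez is up; namespaces:", namespace_names(json.load(resp)))
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # Connection refused, timeout, or a non-JSON answer all land here.
        print("Marquez is not reachable yet:", exc)
        return False


if __name__ == "__main__":
    marquez_is_up()
```

On a fresh install you would expect to see only the default namespace.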
Create another file called lineage_example.csvpath. Paste in this:
Fire up the CsvPath CLI. Do:
If you are not using Poetry, have a look at pyproject.toml to see the plain command to use to start the CLI.
The CLI will look like this:
Select named-files and then add named-file. You'll be asked for a name. Give the name test. Then you will see options for an individual file, a JSON list of files, or adding a directory of files:
Select file. You will see a listing of your directory. Pick test.csv:
After CsvPath adds your input data file you go back to the top menu. This time select named-paths and then add named-paths. You should see:
You'll be asked for a name. Give the name lineage_example. You will again be asked if you are picking a file of csvpaths, a directory, or a JSON file. Again pick file. You will be presented with your directory:
Pick your lineage_example.csvpath file. And you're done with that part of the setup. Next let's modify the config.ini slightly.
We also need to uncomment the [listeners] and [marquez] settings. When you've made those changes your config file should look like:
Notice we made the archive name Sunshine_Inc. Do the same. Marquez doesn't like spaces so be sure to use the _.
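If you want to double-check your edits before restarting the CLI, a small script like the one below can do it. It is a sketch under assumptions: the [listeners] and [marquez] section names come from this page, but we are guessing that the archive name lives under an archive key in a [results] section, so both are parameters you can change. The sample text is a trimmed stand-in, not the real generated file, and it leaves out the keys inside each section.

```python
import configparser

# Trimmed stand-in for ./config/config.ini. The [results]/archive location is
# an assumption; the keys inside [listeners] and [marquez] are omitted here.
SAMPLE = """\
[results]
archive = Sunshine_Inc

[listeners]

[marquez]
"""


def check_config(text: str, archive_section: str = "results", archive_key: str = "archive") -> list:
    """Return a list of human-readable problems; an empty list means all good."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    problems = []
    # A section header that is still commented out is invisible to the parser.
    for section in ("listeners", "marquez"):
        if not parser.has_section(section):
            problems.append(f"[{section}] is missing or still commented out")
    archive = parser.get(archive_section, archive_key, fallback=None)
    if archive is None:
        problems.append(f"no {archive_key} setting found in [{archive_section}]")
    elif any(ch.isspace() for ch in archive):
        problems.append(f"archive name {archive!r} contains whitespace; use _ instead")
    return problems


print(check_config(SAMPLE) or "config looks fine")
```

To check your real file, read it with open("./config/config.ini").read() and pass the text to check_config.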
At the top level select run:
You will be asked to pick the file to run from a list. There is one option, so pick that.
Next you will be asked for the named-paths group. Again you'll have a list of one, so pick the one.
And finally you'll be asked to pick a run strategy by method name. If you've been doing other examples you'll know that collect keeps the matching rows and fast-forward does not. For our purposes it doesn't matter which we choose, but pick collect.
You should get a message indicating that your run completed:
We're good. We should see our run in Marquez.
Switch to the Jobs vertical tab on the left-hand side.
Then look at the top right for the namespaces dropdown. Select Sunshine_Inc. If you don't see our namespace right away, refresh the page.
You should now see your job events!
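If you prefer a terminal to a browser, you can also ask Marquez for the jobs it recorded. The sketch below assumes the Marquez API is at http://localhost:5000/api/v1 and that the jobs endpoint for a namespace returns a JSON object with a jobs list. Treat both as assumptions to verify against your install.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Marquez API is on port 5000; change this if yours differs.
MARQUEZ_API = "http://localhost:5000/api/v1"


def jobs_url(namespace: str, base_url: str = MARQUEZ_API) -> str:
    """Build the URL that lists the jobs recorded in one namespace."""
    return f"{base_url}/namespaces/{urllib.parse.quote(namespace, safe='')}/jobs"


def job_names(payload: dict) -> list:
    """Pull the job names out of a jobs response body."""
    return [job["name"] for job in payload.get("jobs", [])]


def list_jobs(namespace: str, base_url: str = MARQUEZ_API, timeout: float = 5.0) -> list:
    """Fetch and return the job names for a namespace."""
    with urllib.request.urlopen(jobs_url(namespace, base_url), timeout=timeout) as resp:
        return job_names(json.load(resp))


if __name__ == "__main__":
    try:
        for name in list_jobs("Sunshine_Inc"):
            print(name)
    except OSError as exc:
        print("Could not reach Marquez:", exc)
```

The names printed should match what the Jobs tab shows for the Sunshine_Inc namespace.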
Click on Group:lineage_example.Instance:first lin... to open your core job. The other jobs you see were for staging assets. You can learn about everything you are seeing on other pages of this site.
And there you have it. A local install of Marquez integrated with a CsvPath project. Clarity and consistency! Not bad for a few minutes' work. And a good start on the journey to stronger edge governance and operational efficiency.
To start, create a new CsvPath project. As usual we'll use Poetry, but of course you can use pip or any Python project tool. Call your project lineage_example.
Install Docker Desktop, if you don't already have it. You'll need to create a Docker Hub account. It should be painless.
Next install Marquez. Read this page because it's interesting and tells you much more about Marquez than our page does.
Back to CsvPath. Create a file in your project directory called test.csv. Paste in our usual test data.
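If you don't have the usual test data handy, a small stand-in file may be enough to exercise the lineage flow. The sketch below writes a tiny CSV with a header row and reads it back as a sanity check. The column names and rows are invented placeholders, not the site's actual test data.

```python
import csv
import tempfile
from pathlib import Path

# Invented placeholder rows; swap in the real test data if you have it.
ROWS = [
    ["firstname", "lastname", "say"],
    ["Ada", "Frog", "ribbit"],
    ["Ben", "Bat", "squeak"],
]


def write_sample(path: Path) -> int:
    """Write the placeholder rows to path and return the number of data rows."""
    with path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(ROWS)
    return len(ROWS) - 1  # do not count the header


# Demo in a temporary directory so nothing in your project is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "test.csv"
    count = write_sample(target)
    with target.open(newline="", encoding="utf-8") as handle:
        header, *data = list(csv.reader(handle))
    print(f"wrote {count} data rows with columns {header}")
```

To create the file for this walkthrough, call write_sample(Path("test.csv")) from your project directory.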
Before we can run your files we need to stage them in the CsvPath framework's inputs directory. We also need to tell the CsvPath library that it should send events to Marquez. We'll add the files first because that will give CsvPath the opportunity to create directories and config files for us.
Open ./config/config.ini. We want to make two changes. First we'll change the archive name. You don't really need to do this, but since your example isn't real work, why not keep it separate?
Now we can run our csvpath! Restart your CLI so it has your config changes.
Open http://localhost:3000/. Remember, you'll start out looking at the default namespace. default is empty. We pushed our events to Sunshine_Inc.