# OpenLineage

First, a bit on what we're aiming to do and why. Lineage is about tracking the changes to data sets and their usage over time, with the goal of explaining how every state in the data lifecycle happened. Clear lineage data makes finding, explaining, and fixing problems easier. To get a clear view of the lineage of a data set you need metadata, lots of it, and a way to analyze the information to tell the story of how things happened.

[OpenLineage](https://openlineage.io/) is an open standard for event-based lineage capture. [Marquez](https://peppy-sprite-186812.netlify.app/) is the server and webapp providing the reference API to collect and display OpenLineage events. CsvPath is an OpenLineage event source that provides copious metadata describing how your data moves through a consistent onboarding lifecycle.

Together these open source tools **fill the gap between MFT (managed file transfer) and the typical data lake architecture**. They provide an unprecedented level of visibility into your data onboarding operation. With workflow, transformation, and processing tools like **dbt**, **Airflow**, and **Spark** also emitting OpenLineage events, you now have a straightforward way to collect end-to-end lineage: from data partner, to data lake, to analytics and applications, and back out to the world as a data product or service.
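To make "event-based" a little more concrete, here is a minimal sketch of the general shape of an OpenLineage run event, built as a plain Python dictionary. The field names follow the OpenLineage specification as we understand it. The `producer` URL, the `schemaURL` version, and the helper name `make_run_event` are assumptions for illustration only; the events CsvPath actually emits carry far more metadata and are not reproduced here.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, namespace, job_name, run_id, inputs=()):
    """Build a bare-bones OpenLineage-style run event (illustrative only)."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Assumption: placeholder producer URI, not CsvPath's real one.
        "producer": "https://example.com/lineage-sketch",
        # Assumption: one published schema version; yours may differ.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [],
    }


def to_json(event):
    """Serialize the event the way it would travel over HTTP."""
    return json.dumps(event, indent=2)


def new_run_id():
    """Run IDs are UUIDs; START and COMPLETE events for one run share the ID."""
    return str(uuid.uuid4())
```

A collector such as Marquez stitches events that share a `runId` into a single run, and events that share a job `namespace` and `name` into a job's history.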

## How to start

To start, create a new CsvPath project. As usual we'll use Poetry, but of course you can use pip or any Python project tool. Call your project `lineage_example`.

```
poetry new lineage_example
```

That sets up your CsvPath library. For this example we only need the CLI so we're almost done. We'll create a dummy CsvPath language file to run and some dummy data in a moment.

Install [Docker Desktop](https://www.docker.com/products/docker-desktop/), if you don't already have it. You'll need to create a Docker Hub account. It should be painless.

Next install Marquez. [Read this page](https://peppy-sprite-186812.netlify.app/docs/quickstart) because it's interesting and tells you much more about Marquez than our page does. Clone the Marquez GitHub repository:

```bash
git clone https://github.com/MarquezProject/marquez && cd marquez
```

In the `marquez` directory do:

```bash
./docker/up.sh
```

After the images download and the server starts you should be done setting up Marquez.

Back to CsvPath. Create a file in your project directory called `test.csv`. Paste in our usual test data.

```csv
firstname,lastname,say
David,Kermit,hi!
Fish,Bat,blurgh...
Frog,Bat,ribbit...
Bug,Bat,sniffle sniffle...
Bird,Bat,flap flap...
Ants,Bat,skriffle...
Slug,Bat,oozeeee...
Frog,Bat,growl
```

Create another file called `lineage_example.csvpath`. Paste in this:

```xquery
~ id: first lineage example ~
$[*][ yes()]
```
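If you would rather script this step than paste by hand, the following is a small sketch that writes both files and reads the CSV back with Python's standard `csv` module as a sanity check. It uses only the standard library and does not touch CsvPath itself. The function name `write_example_files` is ours, purely for illustration.

```python
import csv
from pathlib import Path

TEST_CSV = """firstname,lastname,say
David,Kermit,hi!
Fish,Bat,blurgh...
Frog,Bat,ribbit...
Bug,Bat,sniffle sniffle...
Bird,Bat,flap flap...
Ants,Bat,skriffle...
Slug,Bat,oozeeee...
Frog,Bat,growl
"""

CSVPATH = """~ id: first lineage example ~
$[*][ yes()]
"""


def write_example_files(directory="."):
    """Write test.csv and lineage_example.csvpath, then return the parsed rows."""
    base = Path(directory)
    base.mkdir(parents=True, exist_ok=True)
    (base / "test.csv").write_text(TEST_CSV, encoding="utf-8")
    (base / "lineage_example.csvpath").write_text(CSVPATH, encoding="utf-8")
    # Read the data file back to confirm the header and row count are intact.
    with open(base / "test.csv", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Calling `write_example_files()` from your project directory should leave you with the same two files described above, plus a list of eight row dictionaries you can inspect.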

Before we can run your files we need to stage them in the CsvPath framework's `inputs` directory. We also need to tell the CsvPath library that it should send events to Marquez. We'll add the files first because that will give CsvPath the opportunity to create directories and config files for us. Fire up the CsvPath CLI. Do:

```
poetry run cli
```

If you are not using Poetry have a look at `pyproject.toml` to see the plain command to use to start the CLI. The CLI will look like this:

Select `named-files` and then `add named-file`. You'll be asked for a name. Give the name `test`. Then you will see options for an individual file, a JSON list of files, or adding a directory of files:

Select `file`. You will see a listing of your directory. Pick `test.csv`:

After CsvPath adds your input data file you go back to the top menu. This time select `named-paths` and then `add named-paths`. You should see:

You'll be asked for a name. Give the name `lineage_example`. You will again be asked if you are picking a file of csvpaths, a directory, or a JSON file. Again pick file. You will be presented with your directory:

Pick your `lineage_example.csvpath` file. And you're done with that part of the setup. Next let's modify the `config.ini` slightly.

Open `./config/config.ini`. We want to make two changes. First we'll change the archive name. You don't really need to do this, but since your example isn't real work, why not separate it? We also need to uncomment the `[listeners]` and `[marquez]` settings. When you've made those changes your config file should look like:

Notice we made the archive name `Sunshine_Inc`. Do the same. Marquez doesn't like spaces so be sure to use the `_`.

Now we can run our csvpath! Restart your CLI so it picks up your config changes. At the top level select `run`:

You will be asked to pick the file to run from a list. There is one option, so pick that.

Next you will be asked for the named-paths group. Again you'll have a list of one, so pick the one.

And finally you'll be asked to pick a run strategy by method name. If you've been doing other examples you'll know that `collect` keeps the matching rows and `fast-forward` does not. For our purposes it doesn't matter which we choose, but pick `collect`.

You should get a message indicating that your run completed:

We're good. We should see our run in Marquez.
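If you want to confirm from a script that the events arrived, a sketch like the following asks the Marquez REST API for the jobs in a namespace. It assumes the API is listening on `http://localhost:5000`, which we believe is the quickstart default but may differ on your machine, and that the response is a JSON object with a `jobs` list whose entries have a `name`. The helper names are ours.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: default Marquez API address from the Docker quickstart.
MARQUEZ_API = "http://localhost:5000"


def jobs_url(namespace, base=MARQUEZ_API):
    """Build the URL that lists jobs in one Marquez namespace."""
    return f"{base}/api/v1/namespaces/{quote(namespace, safe='')}/jobs"


def list_job_names(namespace, base=MARQUEZ_API, timeout=5):
    """Fetch the job names Marquez has recorded for a namespace."""
    with urlopen(jobs_url(namespace, base), timeout=timeout) as response:
        payload = json.load(response)
    # Assumption: the payload has the shape {"jobs": [{"name": ...}, ...]}.
    return [job["name"] for job in payload.get("jobs", [])]


# Example, with Marquez running:
#   for name in list_job_names("Sunshine_Inc"):
#       print(name)
```

An empty list usually means the events went to a different namespace or that the listener settings in `config.ini` were not picked up.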

Open [http://localhost:3000](http://localhost:3000). Remember, you'll start out looking at the `default` namespace. `default` is empty. We pushed our events to `Sunshine_Inc`. Switch to the Jobs vertical tab on the left-hand side.

Then look at the top right for the namespaces dropdown. Select `Sunshine_Inc`. If you don't see our namespace right away, refresh the page.

You should now see your job events!

Click on [Group:lineage\_example.Instance:first lin...](http://localhost:3000/lineage/job/Sunshine_Inc/Group%3Alineage_example.Instance%3Afirst%20lineage%20example) to open your core job. There will be other jobs that were for staging assets. You can learn about everything you are seeing on other pages of this site.
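The link above also shows how Marquez addresses a job in its web UI: the namespace and the job name are percent-encoded into the path. As a small sketch, assuming the `/lineage/job/<namespace>/<job>` pattern seen in that link holds for other jobs too, you can build such links yourself. The function name `marquez_job_url` is hypothetical.

```python
from urllib.parse import quote


def marquez_job_url(namespace, job_name, web_base="http://localhost:3000"):
    """Build a Marquez web UI link to a job's lineage view."""
    # safe="" makes quote() encode ":" and "/" too, as the link above does.
    encoded_namespace = quote(namespace, safe="")
    encoded_job = quote(job_name, safe="")
    return f"{web_base}/lineage/job/{encoded_namespace}/{encoded_job}"


# Example:
#   marquez_job_url(
#       "Sunshine_Inc",
#       "Group:lineage_example.Instance:first lineage example",
#   )
```

The job name here combines the named-paths group, `lineage_example`, with the `id` from the csvpath's comment, `first lineage example`.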

And there you have it. A local install of Marquez integrated with a CsvPath project. Clarity and consistency! Not bad for a few minutes' work. And a good start on the journey to stronger edge governance and operational efficiency.