Getting Started With CsvPath + OpenLineage

Get started with Edge Data Governance the easy way. The instructions on this page should take you 15 to 45 minutes, depending on network speed, Docker startup time, and so on.


Last updated 6 months ago

First a bit on what we're aiming to do and why.

Lineage is about tracking the changes to data sets and their usage over time with the goal of explaining how every state in the data lifecycle happened. Clear lineage data makes finding, explaining, and fixing problems easier. To get a clear view of the lineage of a data set you need metadata — lots of it — and a way to analyze the information to tell the story of how things happened.

OpenLineage is an open standard for event-based lineage capture. Marquez is the server and webapp providing the reference API to collect and display OpenLineage events. CsvPath is an OpenLineage event source that provides copious metadata describing how your data moves through a consistent onboarding lifecycle.
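To make "event-based lineage capture" concrete, here is a minimal sketch of the shape of an OpenLineage run event, built as a plain Python dict. It is not the event CsvPath itself emits. The job name, the producer URI, and the schemaURL version are illustrative assumptions, and the namespace simply reuses the Sunshine_Inc archive name from the walkthrough on this page.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type: str, namespace: str, job_name: str, run_id: str) -> dict:
    """Build a minimal OpenLineage-style run event as a plain dict."""
    return {
        # Lifecycle marker for the run, e.g. START or COMPLETE
        "eventType": event_type,
        # Timestamp of the event in ISO 8601 format, UTC
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Events that share a runId describe the same run
        "run": {"runId": run_id},
        # The job is identified by a namespace plus a name
        "job": {"namespace": namespace, "name": job_name},
        # Hypothetical producer URI, for illustration only
        "producer": "https://example.com/lineage_example",
        # Illustrative spec version; use the one your tooling targets
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
    }


# One run produces several events that are tied together by the run id.
run_id = str(uuid.uuid4())
start = make_run_event("START", "Sunshine_Inc", "lineage_example", run_id)
complete = make_run_event("COMPLETE", "Sunshine_Inc", "lineage_example", run_id)

print(json.dumps(start, indent=2))
```

A server such as Marquez stitches events like these into a run history for each job, which is why the namespace and job name need to be stable from run to run.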

Together these open source tools fill the gap between MFT (managed file transfer) and the typical data lake architecture. They provide an unprecedented level of visibility into your data onboarding operation. With workflow, transformation, and processing tools like dbt, Airflow, and Spark also throwing off OpenLineage events, you now have a straightforward way to collect end-to-end lineage: from data partner, to data lake, to analytics and applications, and back out to the world as a data product or service.

Figure: An end-to-end lineage schematic

How to start

To start, create a new CsvPath project. As usual we'll use Poetry, but of course you can use Pip or any other Python project tool. Call your project lineage_example.

poetry new lineage_example

That sets up your CsvPath library. For this example we only need the CLI, so we're almost done. We'll create a dummy CsvPath language file to run and some dummy data in a moment.

Install Docker Desktop, if you don't already have it. You'll need to create a Docker Hub account. It should be painless.

Next install Marquez. Read the Marquez project's own documentation, because it's interesting and tells you much more about Marquez than our page does.

Clone the Marquez GitHub repository:

git clone https://github.com/MarquezProject/marquez && cd marquez

In the marquez directory do:

./docker/up.sh

After the images download and the server starts you should be done setting up Marquez.

Back to CsvPath. Create a file in your project directory called test.csv. Paste in our usual test data:

firstname,lastname,say
David,Kermit,hi!
Fish,Bat,blurgh...
Frog,Bat,ribbit...
Bug,Bat,sniffle sniffle...
Bird,Bat,flap flap...
Ants,Bat,skriffle...
Slug,Bat,oozeeee...
Frog,Bat,growl

Create another file called lineage_example.csvpath. Paste in this:

~ id: first lineage example ~
$[*][ yes()]

Before we can run your files we need to stage them in the CsvPath Framework's inputs directory. We also need to tell the CsvPath library that it should send events to Marquez. We'll add the files first because that gives CsvPath the opportunity to create directories and config files for us.

Fire up the CsvPath CLI. Do:

poetry run cli

If you are not using Poetry, have a look at pyproject.toml to see the plain command that starts the CLI.

The CLI opens at its top menu. Select named-files and then add named-file. You'll be asked for a name. Give the name test. Then you will see options for an individual file, a JSON list of files, or adding a directory of files.

Select file. You will see a listing of your directory. Pick test.csv.

After CsvPath adds your input data file you go back to the top menu. This time select named-paths and then add named-paths.

You'll be asked for a name. Give the name lineage_example. You will again be asked if you are picking a file of csvpaths, a directory, or a JSON file. Again pick file. You will be presented with your directory.

Pick your lineage_example.csvpath file. And you're done with that part of the setup. Next let's modify the config.ini slightly.

Open ./config/config.ini. We want to make two changes. First we'll change the archive name. You don't really need to do this, but since your example isn't real work, why not keep it separate?

We also need to uncomment the [listeners] and [marquez] settings.

In our example we made the archive name Sunshine_Inc. Do the same. Marquez doesn't like spaces, so be sure to use the underscore.

Now we can run our csvpath! Restart your CLI so it picks up your config changes.

At the top level select run.

You will be asked to pick the file to run from a list. There is one option, so pick that.

Next you will be asked for the named-paths group. Again you'll have a list of one, so pick the one.

And finally you'll be asked to pick a run strategy by method name. If you've been doing other examples you'll know that collect keeps the matching rows and fast-forward does not. For our purposes it doesn't matter which we choose, but pick collect.

You should get a message indicating that your run completed.

We're good. We should see our run in Marquez.

Open http://localhost:3000/. Remember, you'll start out looking at the default namespace. The default namespace is empty. We pushed our events to Sunshine_Inc.

Switch to the Jobs vertical tab on the left-hand side.

Then look at the top right for the namespaces dropdown. Select Sunshine_Inc. If you don't see our namespace right away, refresh the page.

You should now see your job events!

Click on Group:lineage_example.Instance:first lin... to open your core job. There will be other jobs that were for staging assets. You can learn about everything you are seeing on other pages of this site.

And there you have it. A local install of Marquez integrated with a CsvPath project. Clarity and consistency! Not bad for a few minutes' work. And a good start on the journey to stronger edge governance and operational efficiency.
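If you would rather confirm the result without the web UI, the same jobs can be listed over the Marquez REST API. The sketch below is a hedged example, not part of CsvPath. It assumes the API started by ./docker/up.sh listens on http://localhost:5000 (the web UI is on port 3000), that the namespace is the Sunshine_Inc archive name used above, and the helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: default Marquez API address for a local docker/up.sh install
MARQUEZ_API = "http://localhost:5000/api/v1"


def jobs_url(namespace: str, base: str = MARQUEZ_API) -> str:
    """Build the URL that lists the jobs recorded in one namespace."""
    return f"{base}/namespaces/{quote(namespace, safe='')}/jobs"


def job_names(payload: dict) -> list:
    """Pull the job names out of a jobs-listing response body."""
    return [job["name"] for job in payload.get("jobs", [])]


def fetch_job_names(namespace: str) -> list:
    """Ask a running Marquez server for the job names in a namespace."""
    with urlopen(jobs_url(namespace), timeout=5) as resp:
        return job_names(json.load(resp))
```

With Marquez up and a run completed, calling fetch_job_names("Sunshine_Inc") should return a list that includes the job for your lineage_example group along with the staging jobs. If the list comes back empty, check that the archive name in config.ini matches the namespace you are querying.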