Python Starters

A few really basic scripts to get you started


Last updated 3 months ago

The CsvPath library was built to be easy to use. You can do a lot with very little Python scaffolding. Let's look at some code. Your own scripts will no doubt do more and meet more stringent requirements, but editing is easier than starting from a blank page. As you get started, you should probably keep the Python public interface docs open in a tab.

Simplest CsvPath Runner Ever

import sys
from csvpath import CsvPath

if __name__ == "__main__":
    with open(sys.argv[1]) as file:
        csvpath = file.read()
        path = CsvPath().fast_forward(csvpath)

The script above reads the file you name as a command line argument. That file contains a csvpath, and the csvpath must identify the data file it runs against. It might look like this:

$my_test_file.csv[*][print("$.headers.firstname was born on $.headers.dob")]

After parsing the csvpath the script fast-forwards through all CSV lines. Any print statements or other side effects happen, as you would expect, and you don't have to iterate or collect the lines. If your csvpath marks the file invalid you can see that on the is_valid property of path.
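If you want a shell script or scheduler to react to the outcome, one option is to turn is_valid into a process exit code. The following is a minimal sketch, not part of the library: the helper names read_csvpath, exit_code, and main are hypothetical, and the exit codes (0 valid, 1 invalid, 2 bad usage) are just a common convention. The only CsvPath calls used are the ones shown above.

```python
import sys
from pathlib import Path


def read_csvpath(csvpath_file: str) -> str:
    """Return the csvpath statement stored in the given text file."""
    return Path(csvpath_file).read_text(encoding="utf-8")


def exit_code(is_valid: bool) -> int:
    """Map a validity flag to an exit code: 0 for valid, 1 for invalid."""
    return 0 if is_valid else 1


def main(argv: list) -> int:
    if len(argv) != 2:
        print(f"usage: {argv[0]} <csvpath-file>", file=sys.stderr)
        return 2  # hypothetical code for "bad usage"
    # Imported late so the helpers above work without the library installed.
    from csvpath import CsvPath

    # Same call as the script above: parse, then scan every line.
    path = CsvPath().fast_forward(read_csvpath(argv[1]))
    return exit_code(path.is_valid)
```

To use this as a script, end the file with sys.exit(main(sys.argv)) under the usual if __name__ == "__main__": guard.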

Iterating Matches

import sys
from csvpath import CsvPath

if __name__ == "__main__":
    with open(sys.argv[1]) as file:
        csvpath = file.read()
        path = CsvPath().parse(csvpath)
        for line in path.next():
            print(f"{line}")

This script, like the first, takes the path to a csvpath file as a command line argument. Here we iterate over all the lines that match and print them out.
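Printing is rarely the end goal, so here is a sketch of writing the matched lines to CSV instead. It assumes that each line yielded by next() is a sequence of values that csv.writer can write as a row; check that against the public interface docs before relying on it. The function names write_matches and matched_lines are hypothetical.

```python
import csv
import io
from typing import Iterable, Iterator, Sequence, TextIO


def write_matches(lines: Iterable[Sequence[object]], out: TextIO) -> int:
    """Write each matched line as one CSV row and return the row count."""
    writer = csv.writer(out, lineterminator="\n")
    count = 0
    for line in lines:
        writer.writerow(line)
        count += 1
    return count


def matched_lines(csvpath_text: str) -> Iterator[Sequence[object]]:
    """Yield matching lines using the same calls as the script above."""
    # Imported late so write_matches can be used without the library.
    from csvpath import CsvPath

    path = CsvPath().parse(csvpath_text)
    yield from path.next()


# Stand-in data showing the writer on its own, no csvpath needed.
buffer = io.StringIO()
written = write_matches([["Ada", "1815-12-10"], ["Alan", "1912-06-23"]], buffer)
```

With the library installed, write_matches(matched_lines(csvpath), sys.stdout) would stream the matches to standard output.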

A Multi-csvpath Starter

from csvpath import CsvPaths

if __name__=="__main__":
    paths = CsvPaths()
    paths.paths_manager.add_named_paths_from_file(name="autogen", file_path="assets/created_by_autogen.csvpath")
    paths.file_manager.add_named_file(name="usage", path="assets/usage_report_excerpt.csv")
    paths.fast_forward_paths(pathsname="autogen", filename="usage")
    results = paths.results_manager.get_named_results("autogen")

In this case we are setting up a number of csvpath statements that live in a single file. They will run against a usage report CSV. We call fast_forward_paths to run the csvpaths sequentially without collecting or iterating on the matched lines.
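Once a run finishes, you will usually want a quick summary of how it went. The sketch below wraps the two results_manager calls that appear on this page, get_named_results and is_valid, in a small helper. The helper name summarize_run and the dictionary keys are hypothetical, and nothing is assumed about what a result object contains beyond being countable.

```python
from typing import Any


def summarize_run(paths: Any, pathsname: str) -> dict:
    """Build a small, JSON-friendly summary of a named-paths run.

    `paths` is expected to be a CsvPaths instance that has already
    run the named-paths group identified by `pathsname`.
    """
    manager = paths.results_manager
    results = manager.get_named_results(pathsname)
    return {
        "pathsname": pathsname,
        "valid": manager.is_valid(pathsname),
        "results_count": len(results),
    }
```

After the fast_forward_paths call above, summarize_run(paths, "autogen") would return the summary for that run.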

An Airflow Stub

from __future__ import annotations

import json
import pendulum
from airflow.decorators import dag, task
from csvpath import CsvPaths

@dag(
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=["csvpaths validation"],
)
def validation_api():

    @task(multiple_outputs=True)
    def validate():
        paths = CsvPaths()
        paths.file_manager.add_named_files_from_dir("./example_2_part_2/csvs")
        paths.paths_manager.add_named_paths_from_dir(directory="./example_2_part_2/csvpaths")
        paths.collect_paths(filename="March-2024", pathsname="orders")
        results = paths.results_manager.get_named_results("orders")
        return {"valid": paths.results_manager.is_valid("orders"), "results_count": len(results)}

    validate()

validation_api()    

There's nothing notable about this Airflow stub. Unsurprisingly, the CsvPath library fits well in an Airflow solution. This code is just a placeholder for the awesome Airflow things you will do.

CsvPath AutoGen typically auto-generates multi-csvpath validations based on your example data. These require a CsvPaths instance. Setting up CsvPaths is not very different from setting up CsvPath.