CsvPath Framework
  • CsvPath
  • DATA PREBOARDING
  • Getting Started
    • Quickstart
    • Organizing Inbound Data
      • Dataflow Diagram
      • The Three Data Spaces
        • Source Staging
        • Validation Assets
        • Trusted Publishing
      • How Data Progresses Through CsvPath Framework
        • Staging
          • Data Identity
          • Handling Variability
            • Templates
            • Named-file Reference Queries
          • Registration API and CLI
            • Loading
            • Going CLI-only
        • Validation and Upgrading
          • Templates
          • Run Using the API
          • Running In the CLI
          • Named-paths Reference Queries
        • Publishing
          • Inspect Run Results
            • Result API
            • More Templates and References
          • Export Data and Metadata
    • Csv and Excel Validation
      • Your First Validation, The Lazy Way
      • Your First Validation, The Easy Way
      • Your First Validation, The Hard Way
    • DataOps Integrations
      • Getting Started with CsvPath + OpenTelemetry
      • Getting Started With CsvPath + OpenLineage
      • Getting Started with CsvPath + SFTPPlus
        • SFTPPlus Implementation Checklist
      • Getting Started with CsvPath + CKAN
    • How-tos
      • How-to videos
      • Storage backend how-tos
        • Store source data and/or named-paths and/or the archive in AWS S3
        • Loading files from S3, SFTP, or Azure
        • Add a file by https
        • Store source data and/or named-paths and/or the archive in Azure
        • Store source data and/or named-paths and/or the archive in Google Cloud Storage
      • CsvPath in AWS Lambda
      • Call a webhook at the end of a run
      • Setup notifications to Slack
      • Send run events to Sqlite
      • Execute a script at the end of a run
      • Send events to MySQL or Postgres
      • Sending results by SFTP
      • Another (longer) Example
        • Another Example, Part 1
        • Another Example, Part 2
      • Working with error messages
      • Sending results to CKAN
      • Transfer a file out of CsvPath
      • File references and rewind/replay how-tos
        • Replay Using References
        • Doing rewind / replay, part 1
        • Doing rewind / replay, part 2
        • Referring to named-file versions
      • Config Setup
      • Debugging Your CsvPaths
      • Creating a derived file
      • Run CsvPath on Jenkins
    • A Helping Hand
  • Topics
    • The CLI
    • High-level Topics
      • Why CsvPath?
      • CsvPath Use Cases
      • Paths To Production
      • Solution Storming
    • Validation
      • Schemas Or Rules?
      • Well-formed, Valid, Canonical, and Correct
      • Validation Strategies
    • Python
      • Python vs. CsvPath
      • Python Starters
    • Product Comparisons
      • The Data Preboarding Comparison Worksheet
    • Data, Validation Files, and Storage
      • Named Files and Paths
      • Where Do I Find Results?
      • Storage Backends
      • File Management
    • Language Basics
    • A CsvPath Cheatsheet
    • The Collect, Store, Validate Pattern
    • The Modes
    • The Reference Data Types
    • Manifests and Metadata
    • Serial Or Breadth-first Runs?
    • Namespacing With the Archive
    • Glossary
  • Privacy Policy
Powered by GitBook
On this page
  • Why go with SQL?
  • Filesystem overwrites and races
  • Setting up SQL databases
  • Configure the listeners
  • Configure the database URI
  • The database structure
  1. Getting Started
  2. How-tos

Send events to MySQL or Postgres

Some multi-user archives may benefit from capturing events to the database

PreviousExecute a script at the end of a runNextSending results by SFTP

Last updated 2 months ago

CsvPath Framework is great for both individual developer users and large groups of DataOps pros taking care of many data partnerships. The difference between these scenarios is important, however. You can read more about it here.

Why go with SQL?

There are three obvious reasons to consider setting up a database for CsvPath Framework:

  • A SQL database is an immutable record of changes; whereas, manifest.json files can be overwritten or deleted if the archive and staging areas are not protected

  • In some cases other applications can more easily use SQL to get CsvPath Framework data

Filesystem overwrites and races

In general, CsvPath Framework assumes it has privileged access to protected file system space for its archive, the source file staging area, and the named-paths area. That is absolutely the right way to go in production. In dev environments, though, things can certainly be a bit looser.

If you are working with your peers in a shared archive or a shared source file staging or named-paths area you should consider using a SQL database to capture events. When you work on the same files you run the risk over overwrites. That is no different from any shared data system — for e.g. you can have the same problem in JIRA, Git, or Sharepoint.

However, CsvPath's filesystem data writes are necessarily not all atomic. In certain cases, there is a race condition around file system writes. That is typically not a consideration for automated runs. When CsvPath is automated typically there is a single Python process working on a particular run, not two Python processes competing over one run. Moreover, most CsvPath Framework automations will have their own protected area namespaced by named-paths name or archive name or both.

Still, in development environments and for some corner cases you might want your audit trail to have more immutability. For this the SQL databases have you covered. All metadata writes, whether to the manifest.json files in the file system, to a local Sqlite, or to a SQL database server, are handled by event listeners. You've probably seen the many optional listeners in your config/config.ini. When you write to the file system you overwrite any existing file in the location. But when your events go to SQL you are creating an even more immutable record that can disambiguate any cross-talk in the manifests due to a non-ideal setup of your automation jobs or a shared dev env without namespacing.

Setting up SQL databases

Regardless, to configure for SQL what you need is:

  • The listeners are enabled

  • Your database URL is available in config/config.ini

Configure the listeners

Open config/config.ini and look for the [listeners] section. It should have the SQL listeners already. But if your config file is not newly generated, it might not. Copy the keys below into your file, if needed.

sql.file = from csvpath.managers.integrations.sql.sql_file_listener import SqlFileListener
sql.paths = from csvpath.managers.integrations.sql.sql_paths_listener import SqlPathsListener
sql.result = from csvpath.managers.integrations.sql.sql_result_listener import SqlResultListener
sql.results = from csvpath.managers.integrations.sql.sql_results_listener import SqlResultsListener

Then look for the groups key in [listeners] and add sql to it, separating with a comma, if needed.

Configure the database URI

Check for a [sql] section. If your config.ini isn't newly generated copy the [sql] section from below into your file.

[sql]
dialect = mysql
connection_string = mysql://csvpath:password1@192.168.1.1/csvpath

The dialect key value must be one of:

  • mysql

  • postgres

  • sqlite

As usual, you can use an ALL CAPS value to point to the environment var of that name. For e.g., connection_string = CONNECTION_URI would result in connection_string equaling the value of the CONNECTION_URI environment variable, if found.

That's all the configuration needed.

The database structure

Your data is now ready to flow into the database. CsvPath Framework will create the database for you in the background. The database is csvpath and the tables are:

  • instance_run

  • named_file

  • named_paths

  • named_paths_group_run

You are capturing:

  • NAMED_FILE: The same information as in the manifest.json created when you add a named-file

  • NAMED_PATHS: The same for named-paths

  • NAMED_PATHS_GROUP_RUN: The run manifest for the named-paths group run found in its run_dir (a.k.a. the run's home)

  • INSTANCE_RUN: The run manifest for each individual csvpath in the named-paths group being run

The database is quite simple. It is nothing more than an immutable record (unless you mutate it outside of CsvPath) of the assets and runs your CsvPath Framework performs. That's all it needs to be.

And there you have it. All your events captured to your favorite database.

Querying SQL is much more productive for some purposes, compared to going through the manifest.json files.

Setting up SQL events is not hard at all. Still, as you can tell, we would only suggest it if the benefit is worth running a server, managing the four tables, and distributing credentials. If you're just in need of better layout of your archive and/or named-paths, obviously that's the even easier and more correct thing to focus on. For most folks, that's all you need. And, if you're just in need of a way to query your results, . As you can tell, we're all about a low-friction .

Under the hood we're using , so you can .

Sqlite also solves for this locally.
a local Sqlite may be a more lightweight solution
DX
SQLAlchemy
look here for the connection string formats
You may already have this SQL section if you generated your config recently
the logos of mysql, postgres, and sqlite
what your config file should look like. same as the text below.
screenshot of a config file, same as the text below
screenshot of the mysql command line client
an ERD generated in Mysql Workbench showing the four tables.