Send events to MySQL or Postgres
Some multi-user archives may benefit from capturing events to the database
CsvPath Framework is great both for individual developers and for large groups of DataOps pros taking care of many data partnerships. The difference between these scenarios is important, however. You can read more about it here.
There are three obvious reasons to consider setting up a database for CsvPath Framework:
Querying SQL is much more productive for some purposes than going through the manifest.json files. Sqlite also solves this locally.
A SQL database is an immutable record of changes, whereas manifest.json files can be overwritten or deleted if the archive and staging areas are not protected
In some cases other applications can more easily use SQL to get CsvPath Framework data
In general, CsvPath Framework assumes it has privileged access to protected file system space for its archive, the source file staging area, and the named-paths area. That is absolutely the right way to go in production. In dev environments, though, things can certainly be a bit looser.
If you are working with your peers in a shared archive or a shared source file staging or named-paths area, you should consider using a SQL database to capture events. When you work on the same files you run the risk of overwrites. That is no different from any shared data system; you can have the same problem in JIRA, Git, or SharePoint, for example.
However, not all of CsvPath's filesystem data writes can be atomic. In certain cases there is a race condition around file system writes. That is typically not a consideration for automated runs. When CsvPath is automated there is typically a single Python process working on a particular run, not two Python processes competing over one run. Moreover, most CsvPath Framework automations will have their own protected area namespaced by named-paths name, archive name, or both.
Still, in development environments and for some corner cases you might want your audit trail to have more immutability. For this, the SQL databases have you covered. All metadata writes, whether to the manifest.json files in the file system, to a local Sqlite, or to a SQL database server, are handled by event listeners. You've probably seen the many optional listeners in your config/config.ini. When you write to the file system you overwrite any existing file in that location. But when your events go to SQL you are creating an even more immutable record that can disambiguate any cross-talk in the manifests due to a non-ideal setup of your automation jobs or a shared dev environment without namespacing.
Setting up SQL events is not hard at all. Still, we would only suggest it if the benefit is worth running a server, managing the four tables, and distributing credentials. If you just need a better layout of your archive and/or named-paths, that is the easier and more correct thing to focus on; for most folks, that's all you need. And if you just need a way to query your results, a local Sqlite may be a more lightweight solution. As you can tell, we're all about a low-friction DX.
Regardless, to configure for SQL you need two things:
The listeners are enabled
Your database URL is available in config/config.ini
Open config/config.ini and look for the [listeners] section. It should have the SQL listeners already, but if your config file is not newly generated it might not. Copy the keys below into your file, if needed.
Then look for the groups key in [listeners] and add sql to it, separating with a comma, if needed.
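As a rough illustration of the result, the [listeners] section ends up looking something like the sketch below. The sql listener keys and their values are not reproduced here and should be copied from a newly generated config.ini; the group name default is only an assumed example of what may already be in your groups key.

```ini
[listeners]
# Sketch only: keep whatever groups you already have and append "sql".
# The sql.* listener keys themselves should be copied from a newly
# generated config.ini rather than from this example.
groups = default, sql
```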
Check for a [sql] section. If your config.ini isn't newly generated, copy the [sql] section from below into your file.
The dialect key value must be one of:
mysql
postgres
sqlite
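For illustration only, a filled-in [sql] section might look roughly like the sketch below. The dialect and connection_string keys are the ones discussed here; the Postgres host, credentials, and database name are assumptions you would replace with your own, and a newly generated config.ini remains the source of truth for the section's exact contents.

```ini
[sql]
# Illustrative values only -- substitute your own dialect and connection details.
dialect = postgres
# SQLAlchemy-style connection string; see the note on formats below.
connection_string = postgresql://csvpath:changeme@localhost:5432/csvpath
```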
Under the hood we're using SQLAlchemy, so you can look here for the connection string formats.
As usual, you can use an ALL CAPS value to point to the environment variable of that name. For example, connection_string = CONNECTION_URI would result in connection_string equaling the value of the CONNECTION_URI environment variable, if found.
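Put together, keeping credentials out of the config file might look like this sketch, assuming CONNECTION_URI is exported in the environment that runs CsvPath and with dialect = postgres chosen purely as an example:

```ini
[sql]
# The ALL CAPS value names an environment variable; the real connection
# string is read from CONNECTION_URI at runtime, if that variable is set.
dialect = postgres
connection_string = CONNECTION_URI
```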
That's all the configuration needed.
Your data is now ready to flow into the database. CsvPath Framework will create the database for you in the background. The database is csvpath and the tables are:
instance_run
named_file
named_paths
named_paths_group_run
You are capturing:
NAMED_FILE: The same information as in the manifest.json created when you add a named-file
NAMED_PATHS: The same for named-paths
NAMED_PATHS_GROUP_RUN: The run manifest for the named-paths group run, found in its run_dir (a.k.a. the run's home)
INSTANCE_RUN: The run manifest for each individual csvpath in the named-paths group being run
The database is quite simple. It is nothing more than an immutable record (unless you mutate it outside of CsvPath) of the assets you register and the runs your CsvPath Framework performs. That's all it needs to be.
And there you have it. All your events captured to your favorite database.