Getting Started with CsvPath + SFTPPlus

Create an effective automated MFT solution in minutes



CsvPath exists to close the gap between managed file transfer (MFT) and the data lake. SFTPPlus is a managed service that provides a secure route into the organization and is manageable at scale.

Combining SFTPPlus and CsvPath enables you to land your data securely, automatically register its identity, check its validity, canonicalize it, and stage it for efficient downstream use. Likewise, channeling the outbound flow through CsvPath into SFTPPlus and out to external data partners is a robust solution for effective data partnerships.

Sounds great. How do we set it up? In fact, it's pretty simple. We'll work on the assumption you have an SFTPPlus server installed. (You can get a trial here.) We'll also assume you or your SFTPPlus admin are comfortable enough with Python to follow the simple steps that set up a CsvPath project. We'll use Poetry for that.

Requirements

The goal is to allow a CsvPath writer to create a landing zone for files from an external data partner. The requirements are:

  • The files are sent via SFTP to an account owned by the data partner who is sending them

  • On landing, the files are staged in CsvPath for processing

  • A named-paths group of csvpaths is applied to each file that arrives

  • In this processing, the arriving files are

    • Given a durable identity

    • Checked for validity

    • Canonicalized as needed

    • Staged for ETL in the Archive, possibly with copies transferred to S3, SFTP, local storage, etc.

  • Do all this in an automated way, mainly under the control of the csvpath writer, with a minimum of setup help

That's what we're going for. In concept, this is all pretty straightforward. There are a lot of details, of course. Getting it working will likely take a morning and a couple cups of coffee. The SFTPPlus Implementation Checklist gives a high-level list of the steps.

The Steps

Our steps are, for the most part, the same as for any CsvPath integration. However, in this case we are going to run two CsvPath projects. One is for the CsvPath user. The other is for the server automation.

On the csvpath writer's side the actions are:

  • Add SFTPPlus information to the user's config/config.ini file

  • Add SFTPPlus directives to the named-paths

And for the SFTPPlus admin, they will:

  • Set up the data partner's account

  • Create a mailbox transfer to receive instructions from CsvPath

  • Add information to the server-side CsvPath project's config/config.ini file

Who Knows What?

There are several pieces of information that need to be shared. For example, the csvpath writer needs to know the type(s) of files they will be receiving from the data partner. And the data partner needs to know the server and account details from the SFTPPlus admin.

To be clear, this information has to be exchanged by these same people regardless of whether CsvPath and SFTPPlus are used. We're just putting it into specific places to facilitate an automated process. Good communication is the bedrock for long-term data partnerships.

Who does what?

You, as the DataOps driver, just add a few directives to your csvpath files. The DevOps team sets up the server side of the integration. And the data partner starts dropping files in an SFTP directory.

Here's an image of who owns what. Green for the data person. Blue for the DevOps ownership. And red for the data partner's contribution.

And, just to be clear, these are setup steps. Once the SFTPPlus solution is in place any number of DataOps team members can work with any number of data partners without needing anything beyond account names and logins.

Setting Up the DataOps Side

Here you are setting up the client-side. This part is easy. You simply:

  • Receive mailbox connection information

  • Start adding SFTPPlus directives to your csvpaths

Let's assume you like Poetry as much as we do. Open a console and create a new Poetry project called sftpplus.

poetry new sftpplus

Change into your new project and add CsvPath:

poetry add csvpath

Next fire up the CsvPath CLI just for a moment to allow CsvPath to create your config file.

poetry run cli

When the CLI comes up, just select quit and hit enter. CsvPath generated your config.ini, so we're done with the CLI for now.

Now, open config/config.ini. We're going to add some information about our SFTPPlus server. Look for [sftpplus]. Keep in mind that if you didn't just create this project you may need to update your csvpath version.

Edit these four fields to have appropriate values, or if you don't see them in your config.ini, add them. The username and password values shown below are in ALL CAPS. That tells CsvPath's SFTPPlus integration that it should look in your environment variables for the values of those names. You don't have to use env vars, but that is a better approach than having passwords in config files.
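As an illustration only: the exact key names are whatever your generated config.ini contains, so match those rather than copying this verbatim. The values are presumably the server and mailbox connection details you received from your SFTPPlus admin, with MAILBOX_USER and MAILBOX_PASSWORD read from environment variables because they are in ALL CAPS.

[sftpplus]
server = sftpplus.example.com
port = 22
username = MAILBOX_USER
password = MAILBOX_PASSWORD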

You shouldn't need to add the SFTPPlus integration's listener itself, because CsvPath includes it in the generated config.ini. But you do have to indicate that you want to use it. Look for the [listeners] section and add sftpplus to its groups setting.

If you are using multiple integrations, just separate the names with commas.
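As a sketch, assuming the setting under [listeners] is named groups as described above, the result looks something like this (slack is shown only to illustrate enabling more than one integration):

[listeners]
groups = sftpplus

or

[listeners]
groups = slack, sftpplus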

You're done in config.ini. Painless!

Next, pick a csvpath from your named-paths group. It can be any of them. You're going to be running the whole group against every arriving named-file, but only the csvpaths you decorate with the SFTPPlus instructions are automated in SFTPPlus.

The directives you can add are:

  • sftpplus-active

  • sftpplus-named-file-name

  • sftpplus-run-method

  • sftpplus-account-name

  • sftpplus-execute-timeout

sftpplus-active and sftpplus-execute-timeout are optional. The other three are mandatory, just as you would think.

Directive: sftpplus-active
Values: yes, no, or delete (delete removes the SFTPPlus transfer automation)
Description: Each time you add your named-paths group, your transfer in SFTPPlus will be updated to be active or inactive. Active is the default. If you make the value delete, the SFTPPlus transfer will be deleted so no new files will be automatically processed. Nothing else will be deleted; only the SFTPPlus config changes.

Directive: sftpplus-named-file-name
Values: Any named-file name you like
Description: This is the name your inbound files will be referenced by. The physical file names themselves will likely change, but your automation will always run the current named-file against the current named-paths.

Directive: sftpplus-run-method
Values: One of collect_paths, collect_by_line, fast_forward_paths, or fast_forward_by_line
Description: These are the main methods for running a named-paths group against a named-file. (The next_paths and next_by_line methods are not an option because they are not intended for this kind of automation.)

Directive: sftpplus-account-name
Values: Any SFTP account name
Description: This is the account the data partner, or their automation, will log into to send their files.

Directive: sftpplus-execute-timeout
Values: An integer number of seconds; the default is 500
Description: This is the timeout for running two methods: add_named_file and the method named in sftpplus-run-method. If you are storing all files on the local server your run times will be very quick, depending on your file sizes, of course. But if you are configured to store files in S3, a remote filesystem, or SFTP, your timeout should be considerably longer.

Your csvpath should look something like this:
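As a sketch, the comment at the top of the decorated csvpath carries the directives. The account name tinpenny and named-file name orders are placeholders, and the match part stands in for your real validation rules:

~ sftpplus-active: yes
  sftpplus-named-file-name: orders
  sftpplus-run-method: collect_paths
  sftpplus-account-name: tinpenny ~
$[*][
    ...your validation rules...
]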

Setting up the DevOps infrastructure

Setting up the SFTP integration is straightforward. Again, the SFTPPlus Implementation Checklist gives the high-level steps. On the DevOps side of things (the server side) there are three main activities, assuming your SFTPPlus server is already installed:

  • Create a Python CsvPath project on the server

  • Create the mailbox account

  • Create the data partner's account

Setting up the server-side CsvPath project

Creating the CsvPath project is quite easy. It requires a Python 3.10.5 or greater environment. First we add Pipx and Poetry. Pipx keeps Python applications from getting in each other's way. Poetry is our Python project tool. For the official SFTPPlus Ubuntu docker container, the commands are as follows. If you are installing on Windows you could use Scoop, or on MacOS, Homebrew.

  • apt-get --no-install-recommends install -y pipx

  • pipx ensurepath

  • pipx install poetry

Next you may need a higher Python version. (Check the version by doing python3 --version). If you do, follow these steps. (Windows and Mac users will obviously have slightly different steps).

  • apt install software-properties-common

  • add-apt-repository ppa:deadsnakes/ppa

  • apt update

  • apt install python3.12

Creating the CsvPath integration project is simple:

  • cd to the directory where you plan to put your scripts. E.g. /opt/sftpplus/run or C:\sftp\run

  • poetry new transfers (or whatever project name you prefer). If you need to raise your Python level, make that change as you create your project.

  • cd into the new project and do poetry add csvpath

  • You may need to edit pyproject.toml to get the correct Python version for CsvPath and its dependencies. If Poetry complains, change the Python requirement to ">=3.10,<4.0" or follow Poetry's suggestion.

  • Download the following 4 simple scripts and drop them in the root directory of your new project. They live in CsvPath's Github: handle_auto_arrival.py, handle_auto_arrival.sh (or handle_auto_arrival.bat), handle_mailbox_arrival.py, and handle_mailbox_arrival.sh (or handle_mailbox_arrival.bat).

(The .bat files are for Windows users. They are in the same location in GitHub.)

These scripts are just trivial shims that allow SFTPPlus to call CsvPath.

On Linux, chmod the shell scripts to make them executable. chmod +x handle_auto_arrival.sh and chmod +x handle_mailbox_arrival.sh.

Update the paths in the scripts as needed. Check that the shell scripts run Poetry correctly by running them yourself. They will blow up because you won't be feeding them the right data, but you'll see if they run Poetry.

If not, change the path to Poetry used in the script. On Linux, do which poetry to see where poetry lives and update the .sh scripts with the right path. E.g. on my laptop, the scripts need to use /Users/sftpplus/.local/bin/poetry, rather than just poetry. So my handle_mailbox_arrival.sh looks like:

/Users/sftpplus/.local/bin/poetry install && /Users/sftpplus/.local/bin/poetry run python handle_mailbox_arrival.py "$1"

Setting up the mailbox and data partner accounts

The mailbox account is where the SFTPPlus integration sends instructions for creating automations. This activity is completely behind the scenes.

Open the SFTPPlus server admin console in your web browser. Create an account called mailbox. This account will be shared with all the csvpath writers.

Then, if it doesn't already exist, create an account for your data partner.

Make a storage folder for inbound content. In your storage area create the following directories:

Mailbox

  • ./mailbox

  • ./mailbox/handled

Data partner

  • ./<<data-partner-name>>
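For example, on a Linux server, assuming your storage root is the current directory and tinpenny is the data partner's account name, you could create the required directories like this:

mkdir -p ./mailbox/handled
mkdir -p ./tinpenny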

You don't have to create any other directories, but you should know that the integration will create a directory for each named-file name and two more directories within.

  • ./<<data-partner-name>>/<<named-file-name>>

  • ./<<data-partner-name>>/<<named-file-name>>/handled

  • ./<<data-partner-name>>/<<named-file-name>>/meta

Assign the root of these directories to the mailbox and <data partner> accounts as their respective storage areas. You do that in the SFTPPlus admin UI accounts page.

The final SFTPPlus step is to create a transfer for the mailbox account. CsvPath Framework users with the SFTPPlus integration configured are connected to SFTPPlus. Behind the scenes, the integration sends instructions to SFTPPlus when users load CsvPath Language files.

The instructions CsvPath sends go into the mailbox account. The mailbox transfer runs the handle_mailbox_arrival.sh (or handle_mailbox_arrival.bat) script and, on success, moves the incoming instructions to its handled directory. Those instructions tell SFTPPlus how to handle incoming data partner files.

To set this up, create a transfer called csvpath (or whatever you like). Set it to check the mailbox frequently. It should move files from ./mailbox to ./mailbox/handled. Before it moves files, make it run handle_mailbox_arrival.sh. Be sure to set the Overwriting rule to Overwrite existing file.

At this point you should be able to sftp into the mailbox and data partner accounts.

And you're done

Well, done but for testing, of course. The two manual tests you need to see working are:

  • Add one or more csvpaths to a named-paths group using the PathsManager.add_named_paths method and see a transfer created for the data partner and a metadata file show up in the partner's meta directory.

  • Drop a file in the partner's account and see it processed into the handled directory and its results show up in the CsvPath archive.

You can use the CLI for the first test to make it quick and code-free; you can read an example of how to do it here. In the examples here, tinpenny is the data partner, orders is the named-file name, and sftpplus is the named-paths group name.
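If you would rather script the first test with the API, a minimal sketch might look like the following. The file path and the exact add_named_paths keyword arguments are illustrative; check your CsvPath version's PathsManager for the precise signature.

from csvpath import CsvPaths

paths = CsvPaths()

# Read a csvpath that carries the sftpplus-* directives in its comment.
# The path below is a placeholder for wherever you keep your csvpaths.
with open("csvpaths/orders.csvpath", "r") as f:
    csvpath_text = f.read()

# Loading the named-paths group is what triggers the SFTPPlus integration:
# it sends instructions to the mailbox account, and the server-side transfer
# turns those instructions into an automation for the data partner's account.
paths.paths_manager.add_named_paths(name="sftpplus", paths=[csvpath_text])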

Questions?

  • Can multiple csvpath writers use the same inbound named-file for their separate named-paths group runs?

Yes, multiple DataOps people can use the same named-file with the SFTPPlus integration. You will see a separate JSON metadata file for each named-paths group that needs to run against an arriving file. The only constraint is that named-paths group names must be unique within the namespace (i.e. the combination of account + named-file name + named-paths name must be unique).

  • Are there other settings for the SFTPPlus server accounts and transfers?

Yes, possibly. You can configure your accounts and transfers pretty much any way that makes sense. You may change the configuration of the transfers that CsvPath creates automatically, within reason.

  • What information needs to be exchanged between the people setting this integration up?

The csvpath writer needs to know how to fill in the directives they add to the comment section of their csvpaths. So, they need server name, mailbox credentials, etc.

The DevOps person needs to know the data partner's information so they can create that account.

  • As a CsvPath writer, how do I know everything is working?

Great question. There are several ways to make sure things are going well.

  • If you have access to the SFTPPlus event log you will see the transfer creation and all its activity there

  • SFTPPlus allows you to set up email alerts that can help you understand when server actions happen.

  • Using an SFTP client you can easily watch the metadata file land in the mailbox and see the data partner's named-file directory, with its meta and handled subdirectories. Watching that progression and then sending a file and watching it be processed can take just a few seconds.

  • The CsvPath logs on both client and server will give you a good understanding of the step-by-step at each end.

  • As you are writing csvpaths and using the CLI don't forget to set CsvPath to raise exceptions and log on the DEBUG level during development. In the CLI there is an option to set those two configuration values, or just do it in config/config.ini.
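The CLI option writes these settings for you, so treat the following only as a sketch; the section and key names in config.ini may differ by CsvPath version.

[errors]
# raise exceptions during development so problems surface immediately
csvpath_errors_policy = raise, print
csvpaths_errors_policy = raise, print

[logging]
# log at the DEBUG level while writing and testing csvpaths
csvpath = debug
csvpaths = debug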

The first three bullets are available to CsvPath writers on a csvpath-by-csvpath basis, without DataOps support, once the configuration values are available. In general, though, the initial setup of an automated transfer is something that the CsvPath writer, their DataOps support, and the data partner will have to work together on to make sure the automation is buttoned up.

Once you have the automation in place you should consider using an observability tool to track when files arrive, whether you are seeing the correct amount of data, and whether the files process correctly. If your file arrivals are routine, an observability tool will also give you a way to get an alert if the expected processing didn't happen within a time period. For example:

  • Add a Slack notification to your csvpath so that you can see the processing happen when files arrive

  • Use Marquez or another OpenLineage server to track file arrival events

  • Use an observability tool like Grafana, New Relic, or another OpenTelemetry system to see the arrivals

  • As a CsvPath writer, what if I need to make changes to the setup?

That's no problem. You can change your SFTPPlus metadata fields and reload your named-paths at any time. The integration will update the transfer.

  • What if my data partner sends a file with an unexpected name?

From the CsvPath Framework's point of view, each file that arrives is named by the directory your data partner puts it in. If you have the named-file name orders the integration will create an orders directory. Your data partner will drop their files in that directory.

Within CsvPath, each file will be tracked as part of a well-identified sequence of versions of data arriving with the same physical filename, all within the scope of the single named file name. For you as a CsvPath writer, nothing changes because you're working against the CsvPath Framework's named-file name, which doesn't change.

So, to be clear, if you have orders as your named-file name an orders directory is created. Your data partner may drop Jan-orders.xlsx, Feb-orders.xlsx, Mar-orders.xlsx, and each of these will take its turn as the named-file orders. If your partner drops three versions of Feb-orders.xlsx, CsvPath Framework will collect, fingerprint, and make all three available in sequence as orders. Likewise, if your data partner sends spring-picnic-menu.csv to the orders directory, CsvPath will track that file in its own sequence of data within the scope of orders, even though it seems like that data probably doesn't belong.

You can see the progression of each named-file in its manifest.json. Look in (by default) inputs/named_files.
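For example, assuming the default inputs location and a named-file called orders, something like the following on the machine that registered the file will show its history (the exact layout under inputs/named_files may vary by version):

cat inputs/named_files/orders/manifest.json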

  • When do the handled files get deleted?

That's up to you. Once a file lands in ./mailbox/handled or ./<<data-partner-name>>/<<named-file-name>>/handled it can be deleted. The instructions JSON files are copied to the ./<<data-partner-name>>/<<named-file-name>>/meta directory and the inbound data files are copied into CsvPath's file storage area as part of file registration. (CsvPath's named-files area is configured in the server-side CsvPath project's config/config.ini). The original data files are no longer needed.

Leaving the inbound files in place for a relatively brief period of days, just in case the world blows up, would be prudent. But as soon as an inbound file is processed, the source of truth becomes CsvPath's data storage area.

  • Do we need a mailbox for every different data partner?

No, your SFTPPlus just needs one mailbox account for CsvPath to use no matter how many data partners or inbound files you have.

  • The configuration feels like a lot, can I get help?

Conceptually it's all pretty straightforward. But for sure there are lots of details. File and process permissions, in particular, can be a pain in the neck. We're happy to give help and advice; use the contact form here. And if there's a bigger need that goes beyond the SFTPPlus integration setup we can point you to partners.
