Getting Started with CsvPath + SFTPPlus

Create an effective automated MFT solution in minutes



CsvPath exists to close the gap between managed file transfer (MFT) and the data lake. SFTPPlus is a managed service that provides a secure route into the organization and is manageable at scale.

Combining SFTPPlus and CsvPath enables you to land your data securely, automatically register its identity, check its validity, canonicalize it, and stage it for efficient downstream use. Likewise, channeling the outbound flow through CsvPath into SFTPPlus and out to external data partners is a robust solution for effective data partnerships.

Sounds great. How do we set it up? In fact, it's pretty simple. We'll work on the assumption you have an SFTPPlus server installed. (You can get a trial here.) We'll also assume you or your SFTPPlus admin are comfortable enough with Python to follow the simple steps that set up a CsvPath project. We'll use Poetry for that.

Requirements

The goal is to allow a CsvPath writer to create a landing zone for files from an external data partner. The requirements are:

  • The files are sent via SFTP to an account owned by the data partner who is sending them

  • On landing, the files are staged in CsvPath for processing

  • A named-paths group of csvpaths is applied to each file that arrives

  • In this processing, the arriving files are

    • Given a durable identity

    • Checked for validity

    • Canonicalized as needed

    • Staged for ETL in the Archive, possibly with copies transferred to S3, SFTP, local storage, etc.

  • Do all this in an automated way, mainly under the control of the csvpath writer, with a minimum of setup help

That's what we're going for. In concept, this is all pretty straightforward. There are a lot of details, of course. Getting it working will likely take a morning and a couple cups of coffee. The SFTPPlus Implementation Checklist gives a high-level list of the steps.

The Steps

Our steps are, for the most part, the same as for any CsvPath integration. However, in this case we are going to run two CsvPath projects. One is for the CsvPath user. The other is for the server automation.

On the csvpath writer's side the actions are:

  • Add SFTPPlus information to the user's config/config.ini file

  • Add SFTPPlus directives to the named-paths

And for the SFTPPlus admin, they will:

  • Set up the data partner's account

  • Create a mailbox transfer to receive instructions from CsvPath

  • Add information to the server-side CsvPath project's config/config.ini file

Who Knows What?

There are several pieces of information that need to be shared. For example, the csvpath writer needs to know the type(s) of files they will be receiving from the data partner. And the data partner needs to know the server and account details from the SFTPPlus admin.

To be clear, this information has to be exchanged by these same people regardless of whether CsvPath and SFTPPlus are used. We're just putting it into specific places to facilitate an automated process. Good communication is the bedrock for long-term data partnerships.

Who does what?

You, as the DataOps driver, just add a few directives to your csvpath files. The DevOps team sets up the server side of the integration. And the data partner starts dropping files in an SFTP directory.

Here's an image of who owns what. Green for the data person. Blue for the DevOps ownership. And red for the data partner's contribution.

And, just to be clear, these are setup steps. Once the SFTPPlus solution is in place any number of DataOps team members can work with any number of data partners without needing anything beyond account names and logins.

Setting Up the DataOps Side

Here you are setting up the client-side. This part is easy. You simply:

  • Receive mailbox connection information

  • Start adding SFTPPlus directives to your csvpaths

Let's assume you like Poetry as much as we do. Open a console and create a new Poetry project called sftpplus.

poetry new sftpplus

Change into your new project and add CsvPath:

poetry add csvpath

Next fire up the CsvPath CLI just for a moment to allow CsvPath to create your config file.

poetry run cli

When the CLI comes up, just select quit and hit enter. CsvPath generated your config.ini, so we're done with the CLI for now.

Now, open config/config.ini. We're going to add some information about our SFTPPlus server. Look for [sftpplus]. Keep in mind that if you didn't just create this project you may need to update your csvpath version.

Edit these four fields to have appropriate values, or if you don't see them in your config.ini, add them. The username and password values shown below are in ALL CAPS. That tells CsvPath's SFTPPlus integration that it should look in your environment variables for the values of those names. You don't have to use env vars, but that is a better approach than having passwords in config files.
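As an illustration only: the exact key names are whatever your generated config.ini contains, so match those rather than copying this verbatim. The values are presumably the server and mailbox connection details you received from your SFTPPlus admin, with MAILBOX_USER and MAILBOX_PASSWORD read from environment variables because they are in ALL CAPS.

[sftpplus]
server = sftpplus.example.com
port = 22
username = MAILBOX_USER
password = MAILBOX_PASSWORD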

You shouldn't need to add the SFTPPlus integration's listener itself, because CsvPath includes it in the generated config.ini. But you do have to indicate that you want to use it. Look for the [listeners] section and add sftpplus to its groups setting.

If you are using multiple integrations, just separate the names with commas.
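As a sketch, assuming the setting under [listeners] is named groups as described above, the result looks something like this (slack is shown only to illustrate enabling more than one integration):

[listeners]
groups = sftpplus

or

[listeners]
groups = slack, sftpplus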

You're done in config.ini. Painless!

Next, pick a csvpath from your named-paths group. It can be any of them. You're going to be running the whole group against every arriving named-file, but only the csvpaths you decorate with the SFTPPlus instructions are automated in SFTPPlus.

The directives you can add are:

  • sftpplus-active

  • sftpplus-named-file-name

  • sftpplus-run-method

  • sftpplus-account-name

  • sftpplus-execute-timeout

sftpplus-active and sftpplus-execute-timeout are optional. The other three are mandatory, just as you would think.

Directive: sftpplus-active
Values: yes, no, or delete (delete removes the SFTPPlus transfer automation)
Description: Each time you add your named-paths group, your transfer in SFTPPlus will be updated to be active or inactive. Active is the default. If you make the value delete, the SFTPPlus transfer will be deleted so no new files will be automatically processed. Nothing else will be deleted; only the SFTPPlus config changes.

Directive: sftpplus-named-file-name
Values: Any named-file name you like
Description: This is the name your inbound files will be referenced by. The physical file names themselves will likely change, but your automation will always run the current named-file against the current named-paths.

Directive: sftpplus-run-method
Values: One of collect_paths, collect_by_line, fast_forward_paths, or fast_forward_by_line
Description: These are the main methods for running a named-paths group against a named-file. (The next_paths and next_by_line methods are not an option because they are not intended for this kind of automation.)

Directive: sftpplus-account-name
Values: Any SFTP account name
Description: This is the account the data partner, or their automation, will log into to send their files.

Directive: sftpplus-execute-timeout
Values: An integer number of seconds; the default is 500
Description: This is the timeout for running two methods: add_named_file and the method named in sftpplus-run-method. If you are storing all files on the local server your run times will be very quick, depending on your file sizes, of course. But if you are configured to store files in S3, a remote filesystem, or SFTP, your timeout should be considerably longer.

Your csvpath should look something like this:
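As a sketch, the comment at the top of the decorated csvpath carries the directives. The account name tinpenny and named-file name orders are placeholders, and the match part stands in for your real validation rules:

~ sftpplus-active: yes
  sftpplus-named-file-name: orders
  sftpplus-run-method: collect_paths
  sftpplus-account-name: tinpenny ~
$[*][
    ...your validation rules...
]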

Setting up the DevOps infrastructure

Setting up the SFTP integration is straightforward. Again, the SFTPPlus Implementation Checklist gives the high-level steps. On the DevOps side of things (the server side) there are three main activities, assuming your SFTPPlus server is already installed:

  • Create a Python CsvPath project on the server

  • Create the mailbox account

  • Create the data partner's account

Setting up the server-side CsvPath project

Creating the CsvPath project is quite easy. It requires a Python 3.10.5 or greater environment. First we add Pipx and Poetry. Pipx keeps Python applications from getting in each other's way. Poetry is our Python project tool. For the official SFTPPlus Ubuntu docker container, the commands are as follows. If you are installing on Windows you could use Scoop, or on MacOS, Homebrew.

  • apt-get --no-install-recommends install -y pipx

  • pipx ensurepath

  • pipx install poetry

Next you may need a higher Python version. (Check the version by doing python3 --version). If you do, follow these steps. (Windows and Mac users will obviously have slightly different steps).

  • apt install software-properties-common

  • add-apt-repository ppa:deadsnakes/ppa

  • apt update

  • apt install python3.12

Creating the CsvPath integration project is simple:

  • cd to the directory where you plan to put your scripts. E.g. /opt/sftpplus/run or C:\sftp\run

  • poetry new transfers (or whatever project name you prefer). If you need to raise your Python level, make that change as you create your project.

  • cd into the new project and do poetry add csvpath

  • You may need to edit pyproject.toml to get the correct Python version for CsvPath and its dependencies. If Poetry complains, change the Python requirement to ">=3.10,<4.0" or follow Poetry's suggestion.

  • Download the following 4 simple scripts and drop them in the root directory of your new project. They live in CsvPath's Github: handle_auto_arrival.py, handle_auto_arrival.sh (or handle_auto_arrival.bat), handle_mailbox_arrival.py, and handle_mailbox_arrival.sh (or handle_mailbox_arrival.bat).

(The .bat files are for Windows users. They are in the same location in GitHub.)

These scripts are just trivial shims that allow SFTPPlus to call CsvPath.

On Linux, chmod the shell scripts to make them executable. chmod +x handle_auto_arrival.sh and chmod +x handle_mailbox_arrival.sh.

Update the paths in the scripts as needed. Check that the shell scripts run Poetry correctly by running them yourself. They will blow up because you won't be feeding them the right data, but you'll see if they run Poetry.

If not, change the path to Poetry used in the script. On Linux, do which poetry to see where poetry lives and update the .sh scripts with the right path. E.g. on my laptop, the scripts need to use /Users/sftpplus/.local/bin/poetry, rather than just poetry. So my handle_mailbox_arrival.sh looks like:

/Users/sftpplus/.local/bin/poetry install && /Users/sftpplus/.local/bin/poetry run python handle_mailbox_arrival.py "$1"

Setting up the mailbox and data partner accounts

The mailbox account is where the SFTPPlus integration sends instructions for creating automations. This activity is completely behind the scenes.

Open the SFTPPlus server admin console in your web browser. Create an account called mailbox. This account will be shared with all the csvpath writers.

Then, if it doesn't already exist, create an account for your data partner.

Make a storage folder for inbound content. In your storage area create the following directories:

Mailbox

  • ./mailbox

  • ./mailbox/handled

Data partner

  • ./<<data-partner-name>>
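For example, on a Linux server, assuming your storage root is the current directory and tinpenny is the data partner's account name, you could create the required directories like this:

mkdir -p ./mailbox/handled
mkdir -p ./tinpenny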

You don't have to create any other directories, but you should know that the integration will create a directory for each named-file name and two more directories within.

  • ./<<data-partner-name>>/<<named-file-name>>

  • ./<<data-partner-name>>/<<named-file-name>>/handled

  • ./<<data-partner-name>>/<<named-file-name>>/meta

Assign the root of these directories to the mailbox and <data partner> accounts as their respective storage areas. You do that in the SFTPPlus admin UI accounts page.

The final SFTPPlus step is to create a transfer for the mailbox account. CsvPath Framework users with the SFTPPlus integration configured are connected to SFTPPlus. Behind the scenes, the integration sends instructions to SFTPPlus when users load CsvPath Language files.

The instructions CsvPath sends go into the mailbox account. The mailbox transfer runs the handle_mailbox_arrival.sh (or handle_mailbox_arrival.bat) script and, on success, moves the incoming instructions to its handled directory. Those instructions tell SFTPPlus how to handle incoming data partner files.

To set this up, create a transfer called csvpath (or whatever you like). Set it to check the mailbox frequently. It should move files from ./mailbox to ./mailbox/handled. Before it moves files, make it run handle_mailbox_arrival.sh. Be sure to set the Overwriting rule to Overwrite existing file.

At this point you should be able to sftp into the mailbox and data partner accounts.

And you're done

Well, done but for testing, of course. The two manual tests you need to see working are:

  • Add one or more csvpaths to a named-paths group using the PathsManager.add_named_paths method and see a transfer created for the data partner and a metadata file show up in the partner's meta directory.

  • Drop a file in the partner's account and see it processed into the handled directory and its results show up in the CsvPath archive.

You can use the CLI for the first test to make it quick and code-free; you can read an example of how to do it here. In the examples here, tinpenny is the data partner, orders is the named-file name, and sftpplus is the named-paths group name.
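If you would rather script the first test with the API, a minimal sketch might look like the following. The file path and the exact add_named_paths keyword arguments are illustrative; check your CsvPath version's PathsManager for the precise signature.

from csvpath import CsvPaths

paths = CsvPaths()

# Read a csvpath that carries the sftpplus-* directives in its comment.
# The path below is a placeholder for wherever you keep your csvpaths.
with open("csvpaths/orders.csvpath", "r") as f:
    csvpath_text = f.read()

# Loading the named-paths group is what triggers the SFTPPlus integration:
# it sends instructions to the mailbox account, and the server-side transfer
# turns those instructions into an automation for the data partner's account.
paths.paths_manager.add_named_paths(name="sftpplus", paths=[csvpath_text])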

Questions?

  • Can multiple csvpath writers use the same inbound named-file for their separate named-paths group runs?

Yes, multiple DataOps people can use the same named-file with the SFTPPlus integration. You will see a separate JSON metadata file for each named-paths group that needs to run against an arriving file. The only constraint is that named-paths group names must be unique within the namespace (i.e. the combination of account + named-file name + named-paths name must be unique).

  • Are there other settings for the SFTPPlus server accounts and transfers?

Yes, possibly. You can configure your accounts and transfers pretty much any way that makes sense. You may change the configuration of the transfers that CsvPath creates automatically, within reason.

  • What information needs to be exchanged between the people setting this integration up?

The csvpath writer needs to know how to fill in the directives they add to the comment section of their csvpaths. So, they need server name, mailbox credentials, etc.

The DevOps person needs to know the data partner's information so they can create that account.

  • As a CsvPath writer, how do I know everything is working?

Great question. There are several ways to make sure things are going well.

  • If you have access to the SFTPPlus event log you will see the transfer creation and all its activity there

  • SFTPPlus allows you to set up email alerts that can help you understand when server actions happen.

  • Using an SFTP client you can easily watch the metadata file land in the mailbox and see the data partner's named-file directory, with its meta and handled subdirectories. Watching that progression and then sending a file and watching it be processed can take just a few seconds.

  • The CsvPath logs on both client and server will give you a good understanding of the step-by-step at each end.

  • As you are writing csvpaths and using the CLI don't forget to set CsvPath to raise exceptions and log on the DEBUG level during development. In the CLI there is an option to set those two configuration values, or just do it in config/config.ini.
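The CLI option writes these settings for you, so treat the following only as a sketch; the section and key names in config.ini may differ by CsvPath version.

[errors]
# raise exceptions during development so problems surface immediately
csvpath_errors_policy = raise, print
csvpaths_errors_policy = raise, print

[logging]
# log at the DEBUG level while writing and testing csvpaths
csvpath = debug
csvpaths = debug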

The first three bullets are available to CsvPath writers on a csvpath-by-csvpath basis, without DataOps support, once the configuration values are available. In general, though, the initial setup of an automated transfer is something that the CsvPath writer, their DataOps support, and the data partner will have to work together on to make sure the automation is buttoned up.

Once you have the automation in place you should consider using an observability tool to track when files arrive, whether you are seeing the correct amount of data, and whether the files process correctly. If your file arrivals are routine, an observability tool will also give you a way to get an alert if the expected processing didn't happen within a time period. For example:

  • Add a Slack notification to your csvpath so that you can see the processing happen when files arrive

  • Use Marquez or another OpenLineage server to track file arrival events

  • Use an observability tool like Grafana, New Relic, or another OpenTelemetry system to see the arrivals

  • As a CsvPath writer, what if I need to make changes to the setup?

That's no problem. You can change your SFTPPlus metadata fields and reload your named-paths at any time. The integration will update the transfer.

  • What if my data partner sends a file with an unexpected name?

From the CsvPath Framework's point of view, each file that arrives is named by the directory your data partner puts it in. If you have the named-file name orders the integration will create an orders directory. Your data partner will drop their files in that directory.

Within CsvPath, each file will be tracked as part of a well-identified sequence of versions of data arriving with the same physical filename, all within the scope of the single named file name. For you as a CsvPath writer, nothing changes because you're working against the CsvPath Framework's named-file name, which doesn't change.

So, to be clear, if you have orders as your named-file name an orders directory is created. Your data partner may drop Jan-orders.xlsx, Feb-orders.xlsx, Mar-orders.xlsx, and each of these will take its turn as the named-file orders. If your partner drops three versions of Feb-orders.xlsx, CsvPath Framework will collect, fingerprint, and make all three available in sequence as orders. Likewise, if your data partner sends spring-picnic-menu.csv to the orders directory, CsvPath will track that file in its own sequence of data within the scope of orders, even though it seems like that data probably doesn't belong.

You can see the progression of each named-file in its manifest.json. Look in (by default) inputs/named_files.
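For example, assuming the default inputs location and a named-file called orders, something like the following on the machine that registered the file will show its history (the exact layout under inputs/named_files may vary by version):

cat inputs/named_files/orders/manifest.json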

  • When do the handled files get deleted?

That's up to you. Once a file lands in ./mailbox/handled or ./<<data-partner-name>>/<<named-file-name>>/handled it can be deleted. The instructions JSON files are copied to the ./<<data-partner-name>>/<<named-file-name>>/meta directory and the inbound data files are copied into CsvPath's file storage area as part of file registration. (CsvPath's named-files area is configured in the server-side CsvPath project's config/config.ini). The original data files are no longer needed.

Leaving the inbound files in place for a relatively brief period of days, just in case the world blows up, would be prudent. But as soon as an inbound file is processed, the source of truth becomes CsvPath's data storage area.

  • Do we need a mailbox for every different data partner?

No, your SFTPPlus just needs one mailbox account for CsvPath to use no matter how many data partners or inbound files you have.

  • The configuration feels like a lot, can I get help?

Conceptually it's all pretty straightforward. But for sure there are lots of details. File and process permissions, in particular, can be a pain in the neck. We're happy to give help and advice; use the contact form here. And if there's a bigger need that goes beyond the SFTPPlus integration setup we can point you to partners.
