Getting Started with CsvPath + SFTPPlus
Create an effective automated MFT solution in minutes
Last updated
Create an effective automated MFT solution in minutes
Last updated
CsvPath exists to close the gap between managed file transfer (MFT) and the data lake. SFTPPlus is a managed service that provides a secure route into the organization that is manageable at scale.
Combining SFTPPlus and CsvPath enables you to land your data securely, automatically register its identity, check its validity, and canonicalize it, and stage it for efficient downstream use. Likewise, channeling the outbound flow through CsvPath into SFTPPlus and out to external data partners is a robust solution for effective data partnerships.
Sounds great. How do we set it up? In fact, it's pretty simple. We'll work on the assumption you have an SFTPPlus server installed. (You can get a trial here). We'll also assume you or your SFTPPlus admin are comfortable enough with Python to follow the simple steps that setup a CsvPath project. We'll use Poetry for that.
The goal is to allow a CsvPath writer to create a landing zone for files from an external data partner. The requirements are:
The files are sent via SFTP to an account owned by the data partner who is sending them
On landing, the files are staged in CsvPath for processing
A named-paths group of csvpaths is applied to each file that arrives
In this processing, the arriving files are
Given a durable identity
Checked for validity
Canonicalized as needed
Staged for ETL in the Archive, possibly with copies transferred via S3, SFTP, locally, etc.
Do all this in an automated way mainly under the control of the csvpath writer with the minimum of setup help
That's what we're going for.
Our steps are, for the most part, the same as for any CsvPath integration. However, in this case we are going to run two CsvPath projects. One is for the CsvPath user. The other is for the server automation.
On the csvpath writer's side the actions are:
Add SFTPPath information to the user's config/config.ini file
Add SFTPPath directives to the named-paths
And for the SFTPPlus admin, they will:
Set up the data partner's account
Create a mailbox transfer to receive instructions from CsvPath
Add information to the server-side CsvPath project's config/config.ini
file
Add trivial scripts (downloadable from CsvPath's Github) to connect the wires
In concept, this is all pretty straightforward. There are a lot of details, of course. Getting it working will likely take a morning and a couple cups of coffee. Here's a high-level checklist of the steps.
There are several pieces of information that need to be shared. For example, the csvpath writer needs to know the type(s) of files they will be receiving from the data partner. And the data partner needs to know the server and account details from the SFTPPlus admin.
To be clear, here are the communication flows:
To be sure, this information has to be exchanged by these same people regardless of if CsvPath and SFTPPlus are used. We're just putting it into specific places to facilitate an automated process.
Setting up the SFTP integration is straightforward. Again, here is the high-level implementation checklist.
You, as the DataOps driver, just add a few directives to your csvpath files. The DevOps team sets up the server-side of the integration. And the data partner starts dropping files in an FTP directory.
Here's an image of who owns what. Green for the data person. Blue for the DevOps ownership. And red for the data partner's contribution.
And, just to be clear, these are setup steps. Once the SFTPPlus solution is in place any number of DataOps team members can work with any number of data partners without needing anything beyond account names and logins.
Here you are setting up the client-side. This part is easy. You simply:
Receive mailbox connection information
Start adding SFTPPlus directives to your csvpaths
Let's assume you like Poetry as much as we do. Open a console and create a new Poetry project called sftpplus.
Change into your new project and add CsvPath:
Next fire up the CsvPath CLI just for a moment to allow CsvPath to create your config file.
When the CLI comes up just select quit
and hit enter. CsvPath generated your config.ini
so we're done with it for now.
Now, open config/config.ini
. We're going to add some information about our SFTPPlus server. Look for [sftpplus]
. Keep in mind that if you didn't just create this project you may need to update your csvpath version.
Edit these four fields to have appropriate values, or if you don't see them in your config.ini
, add them. The username and password in the screenshot are in ALL CAPS. That tells CsvPath's SFTPPlus integration that it should look in your environment variables for the values of those names. You don't have to use env vars but that is a better approach than having passwords in config files.
You shouldn't need to add the SFTPPlus integration's listener, because CsvPath includes it in the generated config.ini
. But you do have to indicate that you want to use it. Look for [listeners] groups
and add sftpplus like this:
If you are using multiple integrations just separate the names with commas.
You're done in config.ini
. Painless!
Next, pick a csvpath from your named-paths group. It can be any of them. You're going to be running the whole group against every arriving named-file, but only the csvpaths you decorate with the SFTPPlus instructions are automated in SFTPPlus.
The directives you can add are:
sftpplus-active
sftpplus-named-file-name
sftpplus-run-method
sftpplus-account-name
sftpplus-execute-timeout
sftpplus-active
and sftp-execute-timeout
are optional. The other three are mandatory, just as you would think.
sftpplus-active
yes
or no
or delete
to delete the SFTPPlus transfer automation
Each time you add your named-paths group your transfer in SFTPPlus will be updated to be active or inactive. Active is the default.
If you make the value delete
the SFTPPlus transfer will be deleted so no new files will be automatically processed. Nothing else will be deleted; only the SFTPPlus config changes.
sftpplus-named-file-name
Any named-file name you like
This is the name your inbound files will be referenced by. The physical file names themselves will likely change, but your automation will always run the current named-file against the current named-paths.
sftpplus-run-method
One of:
collect_paths
collect_by_line
fast_forward_paths
fast_forward_by_line
These are the main methods to run a named-paths group against a named-file. (the next_paths
and next_by_line
methods are not an option because those are not intended for this kind of automation).
sftpplus-account-name
Any SFTP account name
This is the account the data partner, or their automation, will log into to send their files.
sftpplus-execute-timeout
An integer number of seconds. The default is 500
This is the timeout for running two methods: add_named_file
and the method named in sftplus-run-method
. If you are storing all files on the local server your run times will be very quick — depending on your file sizes, of course. But if you are configured to store files in S3, a remote filesystem or SFTP your timeout should be considerably longer.
Your csvpath should look something like this:
On the DevOps side of things — the server-side — there are three main activities, assuming your SFTPPlus server is already installed:
Create a Python CsvPath project on the server
Create the mailbox account
Create the data partner's account
Creating the CsvPath project is quite easy. It requires a Python 3.10.5 or greater environment.
First we add Pipx and Poetry. Pipx keeps Python applications from getting in each other's way. Poetry is our Python project tool. For the official SFTPPlus Ubuntu docker container, the commands are as follows. Otherwise, if you are installing on Windows you could use Scoop or on MacOS Homebrew.
apt-get --no-install-recommends install -y pipx
pipx ensurepath
pipx install poetry
Next you may need a higher Python version. (Check the version by doing python3 --version
). If you do, follow these steps. (Windows and Mac users will obviously have slightly different steps).
apt install software-properties-common
add-apt-repository ppa:deadsnakes/ppa
apt update
apt install python3.12
Creating the CsvPath integration project is simple:
cd to the directory wherever you plan to put your scripts. E.g. /opt/sftpplus/run
or C:\sftp\run
poetry new transfers
(or whatever project name you prefer)
If you need to raise your Python level, make that change as you create your project:
cd
to project's desired parent directory. e.g. /opt/sftpplus/run
poetry add csvpath
You may need to edit pyproject.toml
to get the correct Python version for CsvPath and its dependencies. If Poetry complains, change the Python requirement to:">=3.10,<4.0"
or follow Poetry's suggestion.
Download the following 4 simple scripts and drop them in the root directory of your new project
The 4 scripts live in CsvPath's Github here. They are:
(The .bat
files are for Windows users. They are in the same location in GitHub).
These scripts are just trivial shims that allow SFTPPlus to call CsvPath.
On Linux, chmod
the shell scripts to make them executable. chmod +x handle_auto_arrival.sh
and chmod +x handle_mailbox_arrival.sh
.
Update the paths in the scripts as needed. Check that the shell scripts run Poetry correctly by running them yourself. They will blow up because you won't be feeding them the right data, but you'll see if they run Poetry.
If not, change the path to Poetry used in the script. On Linux, do which poetry
to see where poetry lives and update the .sh scripts with the right path. E.g. on my laptop, the scripts need to use /Users/sftpplus/.local/bin/poetry
, rather than just poetry
. So my handle_mailbox_arrival.sh
looks like:
/Users/sftpplus/.local/bin/poetry install && /Users/sftpplus/.local/bin/poetry run python handle_mailbox_arrival.py "$1"
The mailbox account is where the SFTPPlus integration sends instructions for creating automations. This activity is completely behind the scenes.
Open the SFTPPlus server admin console in your web browser. Create an account called mailbox
. This account will be shared with all the csvpath writers.
Then, if it doesn't already exist, create an account for your data partner.
Make a storage folder for inbound content. In your storage area create the following directories:
./mailbox
./mailbox/handled
./<<data-partner-name>>
You don't have to create any other directories, but you should know that the integration will create a directory for each named-file name and two more directories within.
./<<data-partner-name>>/<<named-file-name>>
./<<data-partner-name>>/<<named-file-name>>/handled
./<<data-partner-name>>/<<named-file-name>>/meta
Assign the root of these directories to the mailbox
and <data partner> accounts as their respective storage areas. You do that in the SFTPPlus admin UI accounts page.
The final SFTPPlus step is to create a transfer for the mailbox
account. CsvPath Framework users with the SFTPPath integration configured are connected to SFTPPlus. Behind the scenes the integration sends instructions to SFTPPlus when users load CsvPath Language files.
The instructions CsvPath sends go into the mailbox
account. The mailbox
transfer runs the handle_mailbox_arrival.sh
(or handle_mailbox_arrival.bat)
script, and on success, will move the incoming instructions to its handled
directory. Those instructions tell SFTPPlus how to handle incoming data partner files.
To set this up, create a transfer called csvpath
(or whatever you like). Set it to check the mailbox
frequently. It should move files from ./mailbox
to ./mailbox/handled
. Before it moves files make it run handle_mailbox_arrival.sh
. Be sure to set Overwriting rule
to Overwrite existing file
.
At this point you should be able to sftp into the mailbox and data partner accounts.
Well, done but for testing, of course. The two manual tests you need to see working are:
Add one or more csvpaths to a named-paths group using the PathsManager.add_named_paths method
and see a transfer created for the data partner and a metadata file show up in the partner's meta
directory.
Drop a file in the partner's account and see it processed into the handled
directory and its results show up in the CsvPath archive.
You can use the CLI for the first test to make it quick and code-free. You can read an example of how to do it here. Assuming your accounts are on the local server, you should see something like the below. In this case tinpenny
is the data partner, orders
is the named-file-name, and sftpplus
is the named-paths group name:
Can multiple csvpath writers use the same inbound named-file for their separate named-paths group runs?
Yes, multiple DataOps people can use the same named-file with the SFTPPlus integration. You will see a separate JSON metadata file for each named-paths group that needs to run against an arriving file. The only constraint is that named-path group names must be unique within the namespace. (I.e. the combination of account + named-file name + named-paths name must be unique).
Are there other settings for the SFTPPlus server accounts and transfers?
Yes, possibly. You can configure your accounts and transfers pretty much any ways that make common sense. You may change the configuration of the automatically created transfers that CsvPath creates, within reason.
What information needs to be exchanged between the people setting this integration up?
The csvpath writer needs to know how to fill in the directives they add to the comment section of their csvpaths. So, they need server name, mailbox credentials, etc.
The DevOps person needs to know the data partner's information so they can create that account.
As a CsvPath writer, how do I know everything is working?
Great question. There are several ways to make sure things are going well.
Add a Slack notification to your csvpath so that you can see the processing happen when files arrive
Use Marquez or another OpenLineage server to track file arrival events
Use an observability tool like Grafana, New Relic, or another OpenTelemetry system to see the arrivals
If you have access to the SFTPPlus event log you will see the transfer creation and all its activity there
SFTPPlus allows you to set up email alerts that can help you understand when server actions happen.
Using an SFTP client you can easily watch the metadata file land in the mailbox and see the data partner's named-file directory, with its meta and handled subdirectories. Watching that progression and then sending a file and watching it be processed can take just a few seconds.
The CsvPath logs on both client and server will give you a good understanding of the step-by-step at each end.
As you are writing csvpaths and using the CLI don't forget to set CsvPath to raise exceptions and log on the DEBUG
level during development. In the CLI there is an option to set those two configuration values, or just do it in config/config.ini
.
The first three bullets are available to CsvPath writers on a csvpath-by-csvpath basis, without DataOps support once the configuration values are available. In general, though, the initial setup of an automated transfer is something that the CsvPath writer, their DataOps support, and the data partner will have to work together on to make sure the automation is buttoned up.
Once you have the automation in place you should consider using an observability tool to track when files arrive, if you are seeing the correct amount of data, and if the files process correctly. And if your file arrivals are routine, an observability tool will give you a way to get an alert if the expected processing didn't happen within a time period.
As a CsvPath writer, what if I need to make changes to the setup?
That's no problem. You can change your SFTPPlus metadata fields and reload your named-paths at any time. The integration will update the transfer.
What if my data partner sends a file with an unexpected name?
From the CsvPath Framework's point of view, each file that arrives is named by the directory your data partner puts it in. If you have the named-file name orders
the integration will create an orders
directory. Your data partner will drop their files in that directory.
Within CsvPath, each file will be tracked as part of a well-identified sequence of versions of data arriving with the same physical filename, all within the scope of the single named file name. For you as a CsvPath writer, nothing changes because you're working against the CsvPath Framework's named-file name, which doesn't change.
So, to be clear, if you have orders
as your named-file name an orders
directory is created. Your data partner may drop Jan-orders.xlsx
, Feb-orders.xlsx
, Mar-orders.xlsx
and each of these will take its turn as the the named-file orders
. If your partner drops three versions of Feb-orders.xlsx
, CsvPath Framework will collect, fingerprint, and make all three available in sequence as orders
. Likewise, if your data partner sends spring-picnic-menu.csv
to the orders
directory, CsvPath will track that file in its own sequence of data within the scope of orders
— even though it seems like that data probably doesn't belong.
You can see the progression of each named-file in its manifest.json
. Look in (by default) inputs/named_files
.
When do the handled files get deleted?
That's up to you. Once a file lands in ./mailbox/handled
or ./<<partner>>/handled
it can be deleted. The instructions JSON files are copied to the ./<<partner>>/meta
directory and the inbound data files are copied into CsvPath's file storage area as part of file registration. (CsvPath's named-files area is configured in the server-side CsvPath project's config/config.ini
). The original data files are no longer needed.
Leaving the inbound files for a relatively brief period of days just in case the World blows up would be prudent. But immediately on processing the inbound files, the source of truth becomes CsvPath's data storage area.
Do we need a mailbox for every different data partner?
No, your SFTPPlus just needs one mailbox
account for CsvPath to use no matter how many data partners or inbound files you have.
The configuration feels like a lot, can I get help?
Conceptually it's all pretty straightforward. But for sure there are lots of details. File and process permissions, in particular, can be a pain in the neck. We're happy to give help and advice — use the contact form here. And if there's a bigger need that goes beyond the SFTPPath integration setup we can point you to partners.