AWS S3
CsvPath's three file stores can be local or on S3. Mix and match!

All five backends work exactly the same. The main difference is in the protocol part of the URI:
azure://
sftp://
s3://
gs://
/ or c:\ (the local filesystem)
(HTTP is read-only, so it doesn't count as a storage backend; it is only for loading files.)
Each backend has its own way of storing credentials, of course. In the case of SFTP, there is a form in the config panel's integrations tab. The three cloud stores use env vars that you can set in the OS, in FlightPath Data, or in a FlightPath Server project.
There are three places CsvPath keeps files:
The archive — the place where results live
Named files — all the source Excel and CSV data files
Named paths — your named groups of csvpaths
Each of these can be placed independently. By default, the Archive is the folder at ./archive. Named-files and named-paths default to:
./inputs/named_files
./inputs/named_paths
All of these locations and names can be changed. Keep in mind that the Archive takes its name from the last part of its path. That means that if you put your archive at ./this/is/my/stuff your archive will be named stuff. In most cases that doesn't matter, but when we're tying into other systems, such as the CKAN or Marquez integrations, the archive name is meaningful.
How do you set up the three file storage locations? Easy, just change your settings. There are three settings in config.ini. By default config.ini is in ./config/config.ini. The settings you need are in the [results] and [inputs] sections. Archive is set under [results] with the archive key. Named-files and named-paths are set under [inputs] using the files and csvpaths keys, respectively.
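For example, a config.ini that spells out all three default locations might look like this (a sketch showing only the relevant sections; your file will have other settings as well):

[inputs]
files = ./inputs/named_files
csvpaths = ./inputs/named_paths

[results]
archive = ./archive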

If you'd like your results to go to an archive in S3, all you need to do is set the archive key to an S3 URI like:
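[results]
archive = s3://csvpath-example-1/archive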
Using this setting would send all results to the csvpath-example-1 bucket with keys beginning with archive. (S3 doesn't truly have directories, but in effect everything goes into the archive directory.)
Likewise, to store your source data in S3 you would set a key like:
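[inputs]
files = s3://csvpath-example-1/named_files

(The named_files path here is just illustrative; any S3 location you can write to will do. Named-paths can point to S3 the same way, using the csvpaths key.)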
As usual with AWS, you will need to authenticate at the command line. This can be as simple as exporting your access key and secret key as env vars. See AWS's docs.
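For example, on Linux or macOS you might export the standard AWS variables, which the AWS SDKs and CLI pick up automatically (placeholder values shown):

export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_DEFAULT_REGION=us-east-1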
Bear in mind that as soon as you separate your data from your compute you incur a network latency cost, as well as a real dollar cost. There are ways to mitigate the latency, and moving your compute to AWS along with your data is likely to be a big help. All in all, using S3 is sweet, but as with any work you do using CsvPath: try, test, automate, and iterate.