Add a file by https

You can register named-file content using HTTP or HTTPS in the same way that you would using S3, SFTP, or the local filesystem.

While adding files by HTTP is a snap, the remote file name in the URL may not be helpful. CSV and Excel files on the web are often generated by applications, and when they are, the URL may not include a regular file name. Luckily, there's an easy way to update the registered content with a name.

Happy path

First, let's create a simple harness. Our goal is to register a file from the web in CsvPath Framework. We are importing it, or staging it, as a named-file. For context we'll run a csvpath against our new content and access the results.

from csvpath import CsvPaths

class Main:
    def load_http_content(self):
        paths = CsvPaths()
        paths.file_manager.add_named_file(
            name="orders",
            path="https://drive.google.com/uc?id=1zO8ekHWx9U7mrbx_0Hoxxu6od7uxJqWw&export=download",
        )
        paths.paths_manager.add_named_paths(name="http_demo", paths=["$[*][yes()]"])
        paths.collect_paths(pathsname="http_demo", filename="orders")
        results = paths.results_manager.get_named_results("http_demo")

The main event is the method call starting on line 6, which adds a named-file called orders. The new version of orders comes from a Google Drive account by way of a long, opaque HTTPS URL. So far so good.

Let's fix that name

However, when we look at the registered file's manifest, there is a gotcha. Our manifest is at ./inputs/named_files/orders/manifest.json. (If you aren't using the default location for named-files, your path will be different.) When we open it we see:

Lots of things went right. Our time, uuid, from URL, and fingerprint are fine. But the file type should be csv, and the file name and file home are garbled because the HTTP URL didn't point to a physical file so much as identify an item of content held by the Google Drive application.
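If you would rather check the manifest from code than open it by hand, a minimal sketch along these lines may help. It is not part of the CsvPath API; it reads the JSON with the standard library. The helper name latest_manifest_entry is hypothetical, and the sketch assumes the manifest is either a JSON list of entries with the newest last or a single JSON object. The type and file_name keys are the ones used in the patch call later on this page.

```python
import json
from pathlib import Path


def latest_manifest_entry(manifest_path):
    """Return the most recent entry of a named-file manifest as a dict.

    Assumption: the manifest is a JSON list with the newest entry last,
    or a single JSON object. Adjust if your manifest is shaped differently.
    """
    data = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    if isinstance(data, list):
        return data[-1] if data else {}
    return data


if __name__ == "__main__":
    # Default named-files location from this page; yours may differ.
    manifest = Path("./inputs/named_files/orders/manifest.json")
    if manifest.exists():
        entry = latest_manifest_entry(manifest)
        # Print only the two fields this page goes on to patch.
        print("type:", entry.get("type"))
        print("file_name:", entry.get("file_name"))
    else:
        print(f"No manifest found at {manifest}")
```

Run it after registering the file and again after patching to confirm the two fields changed.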

Since we know the data we're downloading is CSV data and we know what it is about, we can easily update the named-file to add clarity. We'll use the patch_named_file method, which is on FileRegistrar. The FileManager that you use to add a named-file has a registrar to keep track of file metadata. It can help us easily compensate for HTTP's deficiencies.

from csvpath import CsvPaths

class Main:
    def load_http_content(self):
        paths = CsvPaths()
        paths.file_manager.add_named_file(
            name="orders",
            path="https://drive.google.com/uc?id=1zO8ekHWx9U7mrbx_0Hoxxu6od7uxJqWw&export=download",
        )
        paths.file_manager.registrar.patch_named_file(
            name="orders", patch={"type": "csv", "file_name": "download.csv"}
        )
        paths.paths_manager.add_named_paths(name="http_demo", paths=["$[*][yes()]"])
        paths.collect_paths(pathsname="http_demo", filename="orders")
        results = paths.results_manager.get_named_results("http_demo")

The fix is line 10. We're passing a "patch" that will change the type of the file to csv and the name of the file to download.csv. The FileRegistrar updates the manifest.json so everything ticks and ties. This is what you should see:

And you're good. The orders named-file is ready to work.

Of course, using HTTP to load content into a named-file doesn't always require the extra patching step. If you have a URL like https://csvpath.org/my-data-file.csv, you won't need to tell CsvPath the file name and file type, because it's obviously CSV data in a file called my-data-file.csv. But if you do need to make an adjustment, that's how you do it.
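To decide ahead of time whether a URL is likely to need the patch step, you can check whether its path ends in a recognizable file extension. The sketch below is a heuristic of our own, not how CsvPath itself infers names or types. The function name needs_patch and the KNOWN_EXTENSIONS set are assumptions made for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumption: the extensions we treat as "self-describing" for this check.
KNOWN_EXTENSIONS = {".csv", ".xlsx", ".xls"}


def needs_patch(url: str) -> bool:
    """Return True if the URL path does not end in a known data-file extension.

    Only the path component is inspected; the query string is ignored.
    """
    path = urlparse(url).path
    suffix = PurePosixPath(path).suffix.lower()
    return suffix not in KNOWN_EXTENSIONS


if __name__ == "__main__":
    urls = [
        "https://drive.google.com/uc?id=1zO8ekHWx9U7mrbx_0Hoxxu6od7uxJqWw&export=download",
        "https://csvpath.org/my-data-file.csv",
    ]
    for url in urls:
        print(needs_patch(url), url)
```

With the two URLs from this page, the Google Drive link is flagged as needing a patch and the csvpath.org link is not.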


Figure: The named-file manifest for a new item of CSV content.