
Creating a derived file

How would you create a new CSV file from an existing CSV file? Easy!

  • First set up a project

  • Create a CsvPaths instance

  • Load the file manager and paths manager with the original file and your csvpath

  • Create a simple csvpath, like the one below (or not so simple, if need be)

  • Run your csvpath

  • Check the archive folder to see your results

Here is the Python side of things:

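from csvpath import CsvPaths

if __name__ == "__main__":
    paths = CsvPaths()
    # Register the csvpath file under the named-paths name "derived".
    paths.paths_manager.add_named_paths_from_file(name="derived",
                                                  file_path="assets/derived.csvpath")
    # Register the original data file under the named-file name "data".
    paths.file_manager.add_named_file(name="data",
                                      path="assets/Medicare_Claims_data-550.csv")

    # Run the "derived" csvpaths against "data" and collect the matching lines.
    paths.collect_paths(pathsname="derived", filename="data")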

As usual, we create a CsvPaths instance and feed it a data file and a csvpath file. Then we call collect_paths, passing the named-paths name and the named-file name we registered.

Here's a simple csvpath we could run on some Medicare data. It limits the data collected, adds a column, replaces some text, and creates the new file. It isn't the simplest possible example, but the extra features give a sense of the possibilities.

~
  name: create derived file
  description: we're going to create a new csv file that has
               only the lines and headers we want. :
~
$[1*][
    #Question 
    append("Day", now() )
    regex(/Acute Myocardial Infarction/, #Topic) -> replace("Topic", "AMI")
    collect("Topic", "Question", "Day")
]

Let's break this down. Lines 1-5 are just comments. There are two metadata fields, name and description. The name field sets the identity of this csvpath. The identity would help us trace validation and syntax errors if we had multiple csvpaths. Since we have only one csvpath, it is really just documentation.

Line 6 says we are going to skip the header row. We will of course use the headers, we just don't want to treat them as data.

Line 7 is an existence test for values in the header we name. Only lines that have values in Question will be collected. Remember that we're ANDing all the match components together to figure out which lines to collect.

On line 8 we append a new header that always has the datetime value given by now(). This line has no effect on matching.

Line 9 does a replacement in the #Topic header. It changes Acute Myocardial Infarction into AMI. We're using a when/do expression. The left-hand side of a when/do impacts matching, unless you explicitly say it shouldn't using a nocontrib qualifier. Since we didn't use nocontrib, line 9 limits our results to lines where the topic header includes the words Acute Myocardial Infarction.

Finally, line 10 limits collection to only the named headers. Those will go into our output data file.
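To make the matching logic above concrete, here is a rough plain-Python equivalent that uses only the standard library. It is a sketch for intuition, not how CsvPath Framework works internally. The function names, the topic_contributes flag (standing in for the absence or presence of nocontrib), the choice to write a header row, and the sample rows are all invented for illustration. It also assumes replace() swaps the whole Topic value for AMI.

```python
import csv
import io
import re
from datetime import datetime

TOPIC_PATTERN = re.compile(r"Acute Myocardial Infarction")

# Invented sample rows, only for trying the sketch; not real Medicare data.
SAMPLE = """Topic,Question,Extra
Acute Myocardial Infarction,Was aspirin given?,x
Acute Myocardial Infarction,,y
Heart Failure,Was an assessment done?,z
"""


def derive_rows(reader, topic_contributes=True, now=None):
    """Yield dicts with the three collected headers: Topic, Question, Day.

    reader is a csv.DictReader, which consumes the header line for us,
    much like $[1*] keeps the header row out of the data.
    """
    for row in reader:
        # Existence test: lines without a Question value do not match.
        if not (row.get("Question") or "").strip():
            continue
        topic = row.get("Topic") or ""
        matched = TOPIC_PATTERN.search(topic) is not None
        # The left-hand side of the when/do limits matching unless it is
        # marked nocontrib (modeled here as topic_contributes=False).
        if topic_contributes and not matched:
            continue
        if matched:
            topic = "AMI"  # assumption: the whole value is replaced
        stamp = now if now is not None else datetime.now()
        yield {"Topic": topic, "Question": row["Question"], "Day": stamp.isoformat()}


def derive_csv(source_text, **kwargs):
    """Return the derived data as CSV text (the header row is this sketch's choice)."""
    reader = csv.DictReader(io.StringIO(source_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["Topic", "Question", "Day"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(derive_rows(reader, **kwargs))
    return out.getvalue()


print(derive_csv(SAMPLE))
```

Calling derive_csv(SAMPLE, topic_contributes=False) keeps the Heart Failure line as well, which is roughly the effect the nocontrib qualifier is described as having.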

This is what your project and results would look like:

The derived file you created is data.csv. And your new data would look like this:

It is also possible to use print() to create new files. That approach is flexible and may be valuable for certain cases. However, print() is a relatively slow function and for most purposes it doesn't add much value. collect(), along with replace() and append(), is usually the better way to go.
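If you would rather check the archive folder from Python than browse it by hand, a minimal sketch follows. It assumes the archive lives in a directory named archive under your project root and that the derived file is named data.csv. The exact layout beneath the archive is not assumed, so the sketch simply searches recursively. The helper names are hypothetical and not part of CsvPath.

```python
import csv
from itertools import islice
from pathlib import Path


def find_derived_files(archive_root="archive", filename="data.csv"):
    """Return matching files under archive_root, newest first."""
    root = Path(archive_root)
    if not root.is_dir():
        return []
    return sorted(root.rglob(filename),
                  key=lambda p: p.stat().st_mtime,
                  reverse=True)


def preview(path, limit=5):
    """Return up to `limit` parsed rows from the start of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(islice(csv.reader(f), limit))


for path in find_derived_files():
    print(path)
    for row in preview(path):
        print(row)
```

Sorting by modification time puts the most recent run's output first, which is handy once you have run the same csvpath several times.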


Last updated 7 months ago