
Well-formed, Valid, Canonical, and Correct

The world of data is messy. It's full of terms of art with squishy definitions. People use terms in ways that match their perspective or product, but that may not be commonly accepted.

This page tackles the definitions of a few data management terms that are often used loosely. I'll say two things up front: 1) this page is a stub to be filled in over time, and 2) you may not agree with my definitions.

Well-formed

Well-formed data first and foremost matches a physical specification and, secondly, has the correct "outline" for an item of data of the expected form.
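
As a rough illustration of what a well-formedness check looks like, here is a minimal sketch that asks only whether Python's standard `csv` module can read the text. The function name `is_well_formed_csv` is made up for this example and is not part of CsvPath. The sketch uses the reader's strict mode, which is an assumption about how picky you want to be; the default mode is more forgiving.

```python
import csv
import io


def is_well_formed_csv(text: str) -> bool:
    """Return True if the csv module can read every row of `text`.

    Hypothetical helper for illustration only. It checks the physical
    form of the data, not whether the values make sense.
    """
    try:
        # strict=True makes the reader raise csv.Error on bad quoting
        # instead of silently accepting it.
        for _ in csv.reader(io.StringIO(text, newline=""), strict=True):
            pass
    except csv.Error:
        return False
    return True
```

Note that this says nothing about the content of the rows. A file full of nonsense values can still be perfectly well-formed.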

Valid

Valid files contain data that has been checked against, and conforms to, a definition of what good data looks like. Data can be validated using rules or models. An XSD is primarily a model. A Schematron file is principally rules. In fact, a model is a shorthand way of writing rules. And, in this context, a set of rules is just a classification. But in practice it's simple: an item of data that doesn't match its schema is considered invalid.
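
To make the rules idea concrete, here is a minimal sketch in plain Python, not CsvPath syntax. It assumes a single, deliberately simple rule: every header must have a non-empty value in every row. The function name `invalid_rows` is hypothetical.

```python
import csv
import io


def invalid_rows(text: str) -> list[int]:
    """Return the 1-based numbers of data rows that break the rule
    'every header has a non-empty value'.

    Hypothetical helper for illustration; a real schema or rule set
    would check much more than this.
    """
    reader = csv.DictReader(io.StringIO(text, newline=""))
    bad = []
    for n, row in enumerate(reader, start=1):
        # Look only at the declared headers. DictReader fills a missing
        # trailing field with None, so short rows are caught here too.
        values = [row.get(header) for header in reader.fieldnames]
        if any(v is None or not v.strip() for v in values):
            bad.append(n)
    return bad
```

A file is valid under this rule when the returned list is empty. Extra, unnamed fields in a row are ignored by this sketch; whether they should count as invalid is a decision for your own definition of good data.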

Canonical

A canonical form is the form that is preferred over other possible forms of the same data. A simple example is the term IBM. Its canonical form may be IBM. It may also be seen as I.B.M. or International Business Machines. If we are canonicalizing data using this mapping to IBM and we see I.B.M., we substitute the canonical form. Note that if there are multiple equally accepted forms, any one of them can serve as the canonical form, as long as it is used consistently.
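
The IBM mapping can be sketched as a simple lookup table. This is plain Python for illustration, not a CsvPath feature; the names `CANONICAL_FORMS` and `canonicalize` are made up for the example.

```python
# Maps each known variant to its canonical form. The entries come from
# the IBM example; a real mapping would be maintained as reference data.
CANONICAL_FORMS = {
    "I.B.M.": "IBM",
    "International Business Machines": "IBM",
}


def canonicalize(name: str) -> str:
    """Return the canonical form of `name`, or the trimmed name itself
    if no mapping is known (hypothetical helper)."""
    cleaned = name.strip()
    return CANONICAL_FORMS.get(cleaned, cleaned)
```

Values that are already canonical, or that have no known variants, pass through unchanged.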

Correct

Correct data is more than well-formed + valid + canonicalized. Correct means that the semantic and business rule content of the data meets expectations. For example, imagine a CSV file that includes a list of companies. Each company has an area of commercial activity. We see that:

  • The file is readable as a CSV file, so it is well-formed

  • The file has values under all headers in all rows, so for our purposes we'll call it valid

  • The company name I.B.M. has been canonicalized to IBM, so we'll say that the data is in a canonical form

  • And the company listed as IBM is described as being in the business of Sunflower Farming

Because the last bullet is suspect (we don't think IBM grows sunflowers), we'll say that this data is incorrect.
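
A correctness check of this kind needs outside knowledge about what the data should say. Here is a minimal sketch in plain Python, assuming we hold some reference data about which industries each company plausibly belongs to. The `EXPECTED_INDUSTRIES` table and its contents are invented for this example, as is the function name `is_plausible`.

```python
# Hypothetical reference data: the industries we expect for each company.
# The values are placeholders for illustration, not an authoritative list.
EXPECTED_INDUSTRIES = {
    "IBM": {"Computing", "Information Technology"},
}


def is_plausible(company: str, industry: str) -> bool:
    """Return False only when we have reference data for the company
    and the stated industry is not in it (hypothetical helper)."""
    expected = EXPECTED_INDUSTRIES.get(company)
    if expected is None:
        # No reference data, so this check cannot say the row is wrong.
        return True
    return industry in expected
```

With this table, `is_plausible("IBM", "Sunflower Farming")` comes back false even though the row is well-formed, valid, and canonical. The design choice to pass rows with no reference data is just one option; a stricter check could flag them for review instead.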

Where CsvPath can help

CsvPath can help with validity, canonicalization, and checking correctness. It cannot help you with well-formedness checking. CsvPath treats any file that Python can read as CSV or Excel as being well-formed.

By contrast, most of the so-called CSV validators on the Internet are simple well-formedness checkers. A smaller number of them can check a structural definition of a file's headers. Vanishingly few go beyond that.


Last updated 5 months ago