Well-formed, Valid, Canonical, and Correct
The world of data is messy. It's full of terms-of-art with squishy definitions. People use terms in ways that match their perspective or product, but may not be commonly accepted.
This page tackles the definitions of a few data management terms that are often used loosely. I'll say up front two things: 1) this page is a stub to be filled in over time, and 2) you may not agree with my definitions.
Well-formed
Data that is well-formed first and foremost matches a physical specification, and, secondly, has the correct "outline" to be an item of data of the form expected.
Valid
Files that are valid have data that is compared against a definition of what good data looks like. Data can be validated using rules or models. An XSD
is primarily a model. A Schematron
file is principly rules. In fact, a model is a short-hand way of writing rules. And, in this context, a set of rules is just a classification. But in practice it's simple: an item of data that doesn't match its schema is considered invalid.
Canonical
A canonical form is the form that is preferred over other possible forms of the same data. A simple example is the term IBM
. Its canonical form may be IBM
. It may also be seen as I.B.M.
or International Business Machines
. If we are canonicalizing data using this mapping to IBM
and we see I.B.M.
we substitute the canonical form. Note that if there are multiple accepted forms the canonical form is any of them.
Correct
Correct data is more than well-formed + valid + canonicalized. Correct means that the semantic and business rule content of the data meets expectations. For example, imagine a CSV
file that includes a list of companies. Each company has an area of commercial activity. We see that:
The file is readable as a
CSV
file, so it is well-formedThe file has values under all headers in all rows, so for our purposes we'll call it valid
The company name
I.B.M
has been canonicalized toIBM
so we'll say that the data is in a canonical formAnd the company listed as
IBM
is described as being in the business ofSunflower Farming
Due to the last bullet having sketchy intelligence — we don't think IBM
grows sunflowers — we'll say that this data is incorrect.
Where CsvPath can help
CsvPath can help with validity, canonicalization, and checking correctness. It cannot help you with well-formedness checking. CsvPath treats any file that Python can read as CSV or Excel as being well-formed.
By contrast, most of the so-called CSV validators on the Internet are simple well-formedness checkers. A smaller number of them can check a structural definition of a file's headers. Vanishingly few go beyond that.
Last updated