> For the complete documentation index, see [llms.txt](https://www.csvpath.org/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://www.csvpath.org/topics/how-tos/ckan.md). # CKAN

## What is CKAN? [CKAN](https://ckan.org/) is the leading data portal. It powers massive data repositories, including the US, EU, and UK government's open data publishing, along with those of numerous other countries, provinces, cities, research centers, and NGOs. CKAN is used by private companies to host data assets internally for use across departments and divisions. Corporate data portal implementations include ones at LEGO, National Grid, Suncorp Bank, and many others. CKAN is a **data portal**. A data portal is a purpose-built catalog for data products in the form of downloadable data and metadata references to online or on-request sources. The goal of a data portal is to offer high-value, validated, versioned datasets with sufficient metadata to fully characterize their content and provenance. Data portals often provide known-good snapshots of datasets that support research, open government, AI development, BI development, and data mastering tools such as ontologies, reference datasets, and controlled vocabularies. ## Steps to using CKAN + CsvPath There are only a few steps to start using the CKAN integration: * Create or get access to a CKAN instance * Make two small changes to your CsvPath config.ini * Add CKAN directives to a csvpath * Load your named-paths group and run it The first one is the only heavy-lifting. If you have a CKAN instance you can skip it. The remaining three steps should take you about 15 minutes, using CsvPath's CLI and following the instructions below. First, a screenshot of CKAN + CsvPath and a word about why are we doing this. ## The gap between data lake and data portal **High quality data is useless unless it is known and accessible to solve high-value problems. Likewise, a high-value data portal connecting consumers to producers is useless if the data produced is untrustworthy.** Most organizations have piles of data. Many organizations have some form of a data lake. Regardless of tooling and investment, most data lakes quickly become a collection of messy, lossy, inexplicable silos. How can the right data get to the data portal in a validated, known-good form for clear presentation to consumers? CsvPath can help. Data products are an interface to a data operation. One of CsvPath's core use cases is in the automated validation, canonicalization, metadata management, and publishing of data products to data catalogs. It is the filter that guarantees that a known source presents known-good data in an expected form through a durable and explainable process. CsvPath does this by applying the [**Collect, Store, Validate Pattern**](broken://pages/OejPZGOrRCbYG54Vym2l) to the challenge of data departure. Collect, Store, Validate centralizes operations, makes processes highly consistent, keeps records in the form of immutable intermediate products and metadata, and verifies that data matches a schema and/or set of business validation rules. The details of how CsvPath does this are on every page of this site, so here we'll just focus on linking up CsvPath and CKAN. Our goal here is to get CsvPath to post valid data to CKAN. It will look something like this screenshot of files that came from running a named-paths group.

## Installing CKAN This section isn't a step-by-step how-to for installing CKAN. The [instructions for setting up CKAN are here](https://docs.ckan.org/en/2.9/maintaining/installing/install-from-package.html). This page may help too, but rely on CKAN's docs first and foremost. A CKAN implementation comprises: a Python web application, Python applications and APIs for data management, a Postgres database, and a Solr search engine. We used the package method CKAN's docs suggest. While we aren't Docker wizards, we can share a quickly-made dockerfile that helped us. No doubt you can improve on it! {% file src="/files/gF1SQMkU6Cv5GzvP9EfH" %} This is not a sophisticated solution, but feel free to use it as a getting started aid. {% endfile %} We installed CKAN on a Mac with Apple Silicon. That required this run command that accounts for the different architecture: ```bash docker run --platform linux/amd64 -p 80:80 -it ckan_local:latest ``` The dockerfile uses this script to do some setup work that we didn't bother to automate for our dev and test instances. Obviously this isn't how you'd do it in a regular dev or production setting. {% file src="/files/ZM46rgG546XBcZHKsL2Q" %} Remember to edit the ckan.ini file to have your server IPs and passwords. This file lives at: `/etc/ckan/default/ckan.ini`. With that dockerfile we had a basic CKAN server up and running in just a few minutes. Raising the [Solr](https://solr.apache.org/) instance using CKAN's Solr docker image was even more of a snap. You should be able to login at . If all is well you will see the CKAN frontpage.

You should see this page on http://localhost:80

CKAN isn't ugly, but it does look plain just out of the box. To see for yourself how beautiful CKAN can be take a look at [the CKAN showcase sites](https://ckan.org/showcase). ## Setting up CsvPath Next let's set up CsvPath to talk to CKAN. This part should be a snap, partly because it's simple and partly because you've probably done it already from other examples on this site. We'll use the example from [Another Example](/topics/how-tos/a-longer-example/another-example-part-2.md). First create a Poetry project. You can use Pip or any tool you like, but we like Poetry. In the terminal do: ```bash poetry new ckan ``` `cd` into the new `ckan` project directory. Then add CsvPath to your new project like this: ```bash >> poetry add csvpath Using version ^0.0.501 for csvpath Updating dependencies Resolving dependencies... (1.3s) Package operations: 31 installs, 0 updates, 0 removals - Installing certifi (2024.12.14) - Installing charset-normalizer (3.4.1) - Installing idna (3.10) - Installing packaging (24.2) - Installing six (1.17.0) - Installing typing-extensions (4.12.2) ... ``` You'll see a few more dependencies installed than I'm pasting in here, but otherwise, that's it. Now we'll copy the files from the [Another Example](/topics/how-tos/a-longer-example/another-example-part-2.md) pages. The ones attached here are slightly updated so use them, even if you did the example and have your own. Put the csv file in: `assets/csvs` and the csvpaths in `assets/csvpaths`. You *can* put these files anywhere within the project, really, because we are going to use CsvPath's CLI to import them, but for now, still with those directories. Put the JSON file in the project root directory; again, it *could* go anywhere you like within the project. {% file src="/files/gm6PPSe5OpMampGu6sZd" %} {% file src="/files/GQohaVzwJfUmCV3kBS1k" %} {% file src="/files/64YoSU7a6njg77I6Z5nz" %} {% file src="/files/aLb9vLz1RTGyRoGowDp2" %} {% file src="/files/Fm011sPckDhG9MqZ8E6p" %} {% file src="/files/wrYwbt9WAVPZZ1kpDHR7" %} {% file src="/files/oa5dKaETatIm8qEzcMar" %} {% file src="/files/t5f1UzEabIH8v2ZWXiZW" %} {% file src="/files/lpjg1WxPBaTjIu7A5qM7" %} That makes: * 7 csvpath files * 1 csv file * 1 json file ## Config file update The integration works by adding a CKAN listener to `results` events. A `results` event is generated when a run starts or completes. To tell CsvPath to include the CKAN listener we need to make a small change to `config/config.ini`. Since it's much simpler to have CsvPath create a default config file, let's fire up the CLI to give CsvPath a chance to generate it. In the terminal do: ``` poetry run cli ``` If you aren't using poetry have a look in the pyproject.toml to see the command we're running so you can run it yourself. When the CLI comes up you should see:

Check to make sure the config directory was created. If it was, select `quit`. Next open `config/config.ini`. Check the CKAN listener configuration under the \[listeners] section. If it is commented out, remove the `#` comment marker. Then add ckan to the listener `groups`. Your file's `[listeners]` section should look like:

Create an API token in CKAN in your profile page. It's [a quick task described here](https://docs.ckan.org/en/2.9/api/index.html?#authentication-and-api-tokens). Add your API token to the `api_token` key in the `[ckan]` section. If your CKAN server is at a different address, change the server key to point to it.

While you are in CKAN, create an organization called `Archive`. Click on the `Organizations` tab and then click the `Add Organization` button.

CsvPath's `Archive` will map to CKAN's `Archive` organization. You can change the name of the archive to anything you like (the setting is in `config.ini`), but for now, stick with `Archive`. Now CsvPath will send named-group run `results` events to the CKAN integration so that it can post metadata and files to CKAN using CKAN's API. ## Adding CKAN metadata directives The last part of connecting CKAN and CsvPath is to add instructions for how the events should be handled. The instructions will be in the form of metadata directives, similar to CsvPath's [modes settings](/topics/practical-stuff/the-modes.md). Metadata directives are instructions you put in the external comments of a csvpath. They are special metadata fields that the CKAN integration looks for. Metadata fields are created by keywords followed by colons, like: ``` description: this is a user defined metadata field named description. lunch-menu: this is another user defined metadata field, it is named lunch-menu. ``` In this example `lunch-menu` starts a new metadata field because it has a colon. That means that the `description`-keyed metadata is: `this is a user defined metadata field named description.` Metadata goes in external comments. An external comment is one that is outside the csvpath; above it or below. Here are the possible directives with possible values and/or examples. You can [learn more about CKAN directives here](/topics/how-tos/ckan/ckan-csvpath-setup.md). * `ckan-publish`: `always` | `on-valid` | `on-all-valid` | `never` * `ckan-group`: `use-archive` | `use-named-results` | any alphanum string * `ckan-dataset-name`: `use-instance` | `use-named-results` | `var-value:name` | a literal * `ckan-dataset-title`: a metadata field name | `var-value:name` * `ckan-visibility`: `public` | `private` * `ckan-tags`: any alphanum | `instance-identity` | `instance-home` | `var-value:name` * `ckan-show-fields`: e.g. `line_number`, `identity`, `validation-mode`.... * `ckan-send`: `all` | `printouts`, `data`, `metadata`, `unmatched`, `vars`, `errors`, `manifest` * `ckan-printouts-title`: e.g. `Background` * `ckan-data-title`: e.g. `Orders` * `ckan-unmatched-title`: e.g. `Orders` * `ckan-vars-title`: e.g. `Orders` * `ckan-meta-title`: e.g. `Orders` * `ckan-errors-title`: e.g. `Orders` * `ckan-split-printouts`: `split` | `no-split` Yes, that's a lot! You won't use them all, and very likely you will come to appreciate the flexibility. When you are first getting started you may want to have the [docs page](/topics/how-tos/ckan/ckan-csvpath-setup.md) at hand. Here's how we updated the `sku_upc.csvpath` file with CKAN directives. You don't have to use all of these, but it doesn't hurt to try them.

That's a lot of CKAN instructions to apply to your csvpaths. Most likely you'll usually use fewer than this.

{% file src="/files/qzIq1OCx2nhFmkR41pLt" %} ## Let's see it working! To see the CKAN integration in action we have to run a csvpath using a `CsvPaths` instance, of course. The fastest way for us to do that is using CsvPath's minimalist CLI. Creating a small Python driver script is also super simple, but the CLI allows us to even skip that little bit of Python. Fire up the CLI again using `poetry run cli`. You will again see:

We're going to stage our data file and load our csvpaths. That essentially means we're going to import those assets into the CsvPath Library's workspace so we can run our named-paths group. A named-path group is simply a collection of csvpaths that are run as a single group and known by a name. Hit return with `named-paths` selected.

Then select `add named-paths`. Give your paths a name. Call them `Orders`.

Next we're going to tell the CsvPath Library where the csvpaths that will go into `Orders` are. We'll do that with our JSON file. Select JSON.

And pick your `orders.json` file that you downloaded a moment ago.

We're done setting up our named-paths group. Now let's stage our data as a named-file. The process is the same. Click on named-files.

Then click on add-named file.

Call your file `March-2024`.

We have our file handy and it's just the one we stuck in the assets directory, so let's just pick it specifically. Our other options would be a list of named files in a JSON or adding all the files in a directory named by their filenames. Select `file` and hit `return`. Follow along with the next three screenshots.

After selecting the CSV file it should import and take you back to the top menu. Now we're good. Time to run our named-paths group and see the results in the `Archive` directory and also automatically promoted into CKAN. Running a named-paths group is easy. Select `run`.

The CLI asks you for a data file first. Pick the one you are offered. It's what we just staged.

Next we pick the named-paths group. In this case we have two choices because there were two groups in the JSON we used to define and load the groups. `top_matter_import` is used by `orders`. `orders` is the group we want to run. Select `orders` and hit `return.` Lastly, the CLI wants to know what run strategy you want to take, `collect` or `fast-forward`. The `collect` approach captures all the lines that match your csvpaths' rules. `fast-forward` simply runs the csvpaths without capturing matches. Doing a `fast-forward` run gets you variables, errors, validations, etc. so it is quite useful, and also lightweight. But for our purposes here let's use `collect` to capture matches.

And away we go! Your run will produce validation messages and informational printouts. We intentionally fed the csvpaths data with problems. And we created some output that is going to a separate `Printer` instance so we can see how multiple printouts can be created. What you see should look like:

Your run produced lots of assets in your new archive directory. Let's have a look. Open the project's root directory and drill into the `archive` folder. The Archive is where the CsvPath Library stores results. You can name it anything you like — `archive` is just the default. This is what you should see:

Your archive after one run of the orders named-paths group, as seen in the MacOS Finder

In this image you're looking at the `order` group result flies for the `2024-12-25_04-46-55` run (in my case; your run identifier will be different, of course) in the results of the csvpath identified as `upc-sku`. These files are: * The `data.csv` of matched lines * The `unmatched.csv` of lines that did not match your csvpath's rules * The `manifest.json` that gives metadata about the `upc_sku.csvpath` part of the named-paths run * `meta.json`, a file of any user-defined metadata and the runtime metadata and stats * Our `printouts.txt` containing all the printed statements from the run * The `vars.json` file that contains all the variables that were created during the run. That's all standard CsvPath stuff. We haven't looked at anything specific to CKAN yet. Keep in mind that the files you see in the screenshot are for just one of the six csvpaths in the `orders` named-paths group. All six csvpaths were run against the input data file. Each has its own outputs. In this example we are only sending results to CKAN for the `upc-sku` csvpath. ## The CKAN results What should we see in CKAN? Looking at our `upc_sku.csvpath`'s metadata you can see what we're asking for: * `ckan-group: A Big Test`. This says we want to have our results associated with a CKAN group. If the group doesn't exist, it will be created. * `ckan-dataset-name: orders_march`. We're explicitly giving our CKAN dataset a name. The name will become a slug in the website and an identifier that can be used like an ID in some cases. The dataset will also have an autogenerated ID. * `ckan-dataset-title: Orders March 2024`. Setting a title gives the dataset a prettier name than if we just used the actual name. * `ckan-visibility: public`. As you would guess, we're making this dataset immediately visible to anyone with access. * `ckan-send: data, printouts, unmatched`. This is the big one. Here we say what data we want to send to CKAN. In this case, we send three of the standard files CsvPath generates to CKAN. * `ckan-split-printouts: split`. Printouts come from calling `print()` on `Printer` instances. Each printer is separate and handles print statements in its own way. The `upc_sku.csvpath` uses both the default printer and also a different named printer, called `Headers by line`, for some printouts. In the `printouts.txt` the default and `Headers by line` printouts are separated by a delimiter so they can be easily extracted. With `ckan-split-printouts` we can split the printouts into one file per `Printer` instance. This makes it easy to create user-friendly focused reports that are delivered in CKAN in a way that is clear for report readers who don't know CsvPath. In this example, our default printouts are the validation errors. The `Headers by line` printouts report the headers in effect at each line in the file. In CKAN these reports will be separated into two files and given titles that make clear what each file contains. Because we have two sets of printouts and are splitting them we will send CKAN four files total, not three. * The remaining CKAN directives assign more helpful names to the files we're sending to CKAN. Again, we don't want to assume that all CKAN users know what CsvPath's standard files contain. Our assumption is that CKAN users should not have to know the details of CsvPath Language or the CsvPath Library's workflow. What we get is a dataset in the `Archive` organization associated with the `A Big Test` group titled `Orders March 2024`. When you open the new dataset it looks similar to this screenshot.

Each time we rerun our named-paths group we will get new data and metadata files in a new run directory. And each run's events will be forwarded to CKAN. The result will be that this page is updated, new versions are captured, and the activity stream is updated *(Side note: if for any reason you want to delete your group or dataset and start again, remember that you have to log in as admin and empty the trash at* [*http://localhost/ckan-admin/trash*](http://localhost/ckan-admin/trash) *to hard delete your assets. CKAN uses soft deletes. Simply deleting as a regular user doesn't clear assets out of the CKAN system.)*

Meanwhile, back in CsvPath, the archive and the inputs directories will capture each change of all your artifacts for every run, in perpetuity with clear identities and hash codes to help you pin down exactly what happened if you should ever be asked about the lineage or chain of custody. And should there be a validation failure, that problem will never get to CKAN — instead you'll be able to handle it at the source and only promote trustworthy data to CKAN and its data customers. There's a lot going on in this integration. At a high-level it's quite simple. Of course, the configuration details and use cases will settle in gradually. Spend a bit of time exploring. You'll be impressed with what CKAN offers and how well its mission fits with CsvPath's.