Tutorial: How to build an Intake catalog

Intake is a new library from Anaconda that takes the pain out of loading common datasets into your Python analysis. It allows you to package a dataset so that it can be installed via Conda and imported into your Python session using the intake library.

This tutorial will walk through writing an Intake catalog, building a conda package and using it within your Python session.

Requirements

For this tutorial you will need a functioning Python environment and the Conda package manager. If you don’t have this I recommend you follow the docs to get set up.

We will also need to install Intake, the Conda build tools and the Anaconda Cloud tools.

$ conda install -c intake intake conda-build anaconda-client

You will need to create an Anaconda Cloud account and login using the command line. We will also configure conda to automatically upload all packages we build to Anaconda Cloud.

$ anaconda login
Using Anaconda API: https://api.anaconda.org
Username: <username>
<username>'s Password:
login successful
$ conda config --set anaconda_upload yes

Getting some data

For this tutorial we need some data to work with. To keep this tutorial simple we are going to steer clear of large, complex, multidimensional weather data and work with a smaller and more tabular dataset. Luckily here in Exeter we have a data portal called the Exeter Data Mill which contains datasets about the city. One of those datasets is historic ticket sales in our car parks which has been shared by Exeter City Council.

We are going to work with the full raw dataset, which is sales per hour for 28 car parks over four years. The data is stored as a CSV file containing just under 1 million rows and is around 88MB.

The data is hosted on AWS S3 and can be found at the following URL:

https://s3-eu-west-1.amazonaws.com/files.datapress.com/exeter/dataset/car-park-tickets-sold/2018-07-26T10%3A23%3A33.34/TickSalesbySiteByDateByHour_20140305-20180717.csv

Writing an Intake manifest

The Intake library automatically populates its catalog from YAML manifest files which are placed in $PREFIX/share/intake/ where PREFIX is the Conda environment path. This means other Conda packages can drop YAML files into that directory and Intake will pick them up.
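Conceptually, this discovery step is just a scan of that shared directory for YAML files. The sketch below is my own illustration of the idea, not Intake's actual implementation; the function name is hypothetical.

```python
from pathlib import Path

def find_catalog_files(prefix):
    """Return all YAML manifests under <prefix>/share/intake/,
    mimicking how Intake discovers catalogs dropped in by Conda packages."""
    share = Path(prefix) / "share" / "intake"
    if not share.is_dir():
        return []
    return sorted(share.glob("*.yaml"))

# In a real environment you would pass sys.prefix as the prefix.
```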

For our car park dataset we are going to create a minimal manifest file which will include a description of the data, the Intake driver to use, the URL to download the CSV from and some metadata about the canonical location.

# car-park-tickets-sold.yaml
sources:
  car_park_tickets_sold:
    description: Data about the number of tickets sold in Exeter car parks from Exeter City Council (https://exeterdatamill.com/dataset/car-park-tickets-sold)
    driver: csv
    args:
      urlpath: 'https://s3-eu-west-1.amazonaws.com/files.datapress.com/exeter/dataset/car-park-tickets-sold/2018-07-26T10%3A23%3A33.34/TickSalesbySiteByDateByHour_20140305-20180717.csv'
    metadata:
      origin_url: 'https://exeterdatamill.com/download/car-park-tickets-sold/cf542d64-0dea-4370-9006-a9e5f965ce1a/TickSalesbySiteByDateByHour_20140305-20180717.csv'
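As a sanity check it can be useful to confirm the manifest has the fields the csv driver needs before packaging it. The sketch below is a hypothetical helper of my own; the dict literal mirrors what a YAML parser such as PyYAML's safe_load would produce for the manifest above (the urlpath is abbreviated here).

```python
# Hypothetical parsed form of the manifest above (urlpath abbreviated).
manifest = {
    "sources": {
        "car_park_tickets_sold": {
            "description": "Tickets sold in Exeter car parks",
            "driver": "csv",
            "args": {"urlpath": "https://s3-eu-west-1.amazonaws.com/..."},
        }
    }
}

def check_manifest(manifest):
    """Check every source declares a driver and a urlpath argument."""
    for name, source in manifest.get("sources", {}).items():
        assert source.get("driver"), f"{name}: missing driver"
        assert source.get("args", {}).get("urlpath"), f"{name}: missing urlpath"
    return True
```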

Note here that we are using the csv Intake driver, which means Intake will give us a Pandas DataFrame by default when we load this data, but it can also provide a Dask DataFrame if we wish. We could also install Intake plugins to extend this and allow us to get the data in other ways.

Writing a Conda package

Now that we have our manifest file we need to package it so that anyone can install our Intake catalog.

A Conda package requires two files: a meta.yaml file which describes the package and a build.sh file which will be executed at build time.

To keep all of this together I’ve created an example repository on GitHub which you can have a look at. Within the repository we will create a directory called car-park-tickets-sold to put our Conda build files in.

meta.yaml

The meta.yaml file only needs to contain the package.name and package.version properties to be a valid package, but we will also add a few other things. We will specify that this package runs on a generic architecture, as it is only a config file and the CPU architecture doesn’t matter. We will specify that the package needs intake to be installed at run time; it wouldn’t be much use without it! Finally we will include some information about data licensing, which I’ve copied from the original data on the Exeter Data Mill.

# meta.yaml
package:
  version: '1.0.3'
  name: 'data-exeter-car-park-tickets-sold'

build:
  number: 0
  noarch: generic

requirements:
  run:
    - intake
  build: []

about:
  description: Data about the number of tickets sold in Exeter car parks from Exeter City Council (https://exeterdatamill.com/dataset/car-park-tickets-sold)
  license: OGL v3
  license_family: OTHER
  summary: Data about Exeter car park ticket sales

build.sh

The build.sh script will simply make sure the shared Intake directory exists and copy our Intake manifest into it, so that Intake picks it up when the package is installed.

#!/bin/bash

mkdir -p $PREFIX/share/intake
cp $RECIPE_DIR/car-park-tickets-sold.yaml $PREFIX/share/intake/

car-park-tickets-sold.yaml

We also need to copy the Intake manifest we wrote before into the package.

Building and publishing the package

Now that we have these three files in our car-park-tickets-sold directory we need to build it with the Conda build tools. Note that we need to specify -c intake to ensure that Conda can find the intake package.

conda build -c intake car-park-tickets-sold

This will build the package, output a file called data-exeter-car-park-tickets-sold-1.0.3-0.tar.bz2 and upload it to the Anaconda repository thanks to the credentials we set up earlier. It will then be listed under our username as a new package.

Installing our package

Now that our package exists and is publicly accessible, anyone can install it with conda install, passing our Anaconda Cloud username as the channel with -c. For example anyone can install my demo version of the car parking dataset with the following command:

conda install -c jacobtomlinson data-exeter-car-park-tickets-sold

Using our package

Now that we have installed our dataset we should be able to import intake in a Python session and see our data in the catalog list.

Python 3.6.3 |Anaconda, Inc.| (default, Nov  9 2017, 00:19:18)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import intake
>>> list(intake.cat)
['car_park_tickets_sold']

We can then read the dataset into a Pandas DataFrame and view the first five rows with head().

>>> df = intake.cat.car_park_tickets_sold.read()
>>> df.head()
   Year Month        Date   Hour                                        Site  Tickets                    SiteSub
0  2014   Mar  2014-03-05  08:00  Purchase Count - Bampfylde Street Car Park     10.0  Bampfylde Street Car Park
1  2014   Mar  2014-03-05  16:00  Purchase Count - Bampfylde Street Car Park      2.0  Bampfylde Street Car Park
2  2014   Mar  2014-03-06  08:00  Purchase Count - Bampfylde Street Car Park      NaN  Bampfylde Street Car Park
3  2014   Mar  2014-03-06  09:00  Purchase Count - Bampfylde Street Car Park      NaN  Bampfylde Street Car Park
4  2014   Mar  2014-03-06  10:00  Purchase Count - Bampfylde Street Car Park      NaN  Bampfylde Street Car Park
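A natural next step is to combine the separate Date and Hour columns into a single timestamp for time-series analysis. Here is a minimal standard-library sketch; the column formats come from the output above and the helper name is my own.

```python
from datetime import datetime

def parse_timestamp(date_str, hour_str):
    """Combine the dataset's Date ('2014-03-05') and Hour ('08:00')
    columns into a single datetime object."""
    return datetime.strptime(f"{date_str} {hour_str}", "%Y-%m-%d %H:%M")
```

With a real DataFrame you would apply this row-wise (or use pd.to_datetime on the concatenated columns) to build a proper datetime index.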

Conclusion

That’s it! We have packaged a very simple CSV dataset from an open data catalog into an Intake Conda package, installed it into our environment and loaded the data.

This is just scratching the surface of what Intake can do and I urge you to explore the documentation to learn more about what you can do.

Pangeo
Jacob Tomlinson