Creating a 0.5Tb hyper cube - without reading one byte

This blog post is accompanied by this notebook which creates a data cube representing half a terabyte of data without reading a single byte from the source data.

A recurring theme in the Informatics Lab is how to create useful representations of large data sets that span many thousands of files. A representation we think is useful is a ‘lazy hypercube’. A lazy hypercube is an object that looks and acts like a n-dimensional array but which doesn’t need to read any of the underlying data until and explicit request to view or calculate something is submitted. What’s more, when such a request is made only the minimum amount of data to satisfy that request is read.

A useful lazy hyper-cube has all the metadata needed to perform any operation on the hypercube (such as subsetting, combine, etc) without needing to read any of the underlying data. However to get this metadata in the first place usually requires reading the 100s of 1000s of individual files that constitute the dataset. This is a slow process and so many approaches solve this by doing it ‘offline’ and saving the resulting aggregated metadata into a file that can be later quickly and easily read and used. While this approach has many merits I was keen to explore a different option, extracting the metadata from the other place it often lives, someones head.   


The idea behind this is that the person working with the dataset knows what the shape of the hypercube that would best represent the data or suit the task they have in mind. If we provide tools to define and build a hypercube from knowledge rather than from data files we might be able to speed up the process of hypercube generation and enable individuals to generate novel cubes bespoke to their problem rather than mandating some curated set of hypercubes.

The Process

The process is fairly simple:

  1. Understand your dataset. How many files? What naming convention? What fields in a file? What’s the shape of the data in the file? What quirks are there (first file contains and extra time step).

  2. Decide on the shape you’d like your hypercube to take. Think about what dimensions and what order of those dimensions. Be deferential to your source data and the task you want to complete if your datafiles are made up of fields with shape `[time][lat][lon]` it will likely be more performant  if you keep those as the bottom dimensions and stack additional dimensions on top i.e `[realisation][model_run][time][lat][lon]`

  3. Travers your hypercube space working from the bottom dimensions/chunks (likely one file) up. Your bottom chunk will be a proxy object representing access to a datafile but not reading any data until required. (More on this later)

  4. Aggregate up through the higher dimensions using concatenation or stacking operations until you have lazy n-dimensional array representing your whole dataset.

  5. Associate the relevant metadata to this object using a higher order library like Iris.

Hey presto, you have a hypercube that takes seconds to generate but represents many thousands of files!

Risks and mitigations

The main risk with this method is that you make an incorrect assumption about what’s in the file or how the files are laid out and build a hyper-cube that looks correct but fails when you try to access the data. Worse is that you assumptions lead to a cube that works correctly but is actually returning different data than you were expecting.

The work around to this is a just in time checking mechanism that checks the metadata of the file is as expected when data is accessed and not before. The implementation also has the advantage that you can dynamically mock out missing data or even files by returning null data if the file doesn’t exist. This is done by creating a proxy object that looks like an array but on the first request for some data (which comes into the `__getitem__`) it performs checks like does the file exist, is the data the correct shape, etc. It stores the result of these checks for future data requests. If any of the tests fail it returns null/masked data rather than data from the file. In this prototype the checks we’ve created are very minimal and further checking could be done but there would be trade off with performance.

One of the nice features of this is it gives you a method have handling missing data or files even data that is there in theory space but not in practical space.

This general approach works for regular, clean datasets but has problems when things get messy. For example in the mogreps dataset the air temperature at ground level has a different variable name `air_temperature` or `air_temperature0` up to `air_temperature3` in each file with the other variables present but representing different data such as the max, min, at height, etc. In this scenario the amount and complexity of checking you need to do on first data access is likely enough to prohibit this approach.

Another caveat is that files must be named such that given a point in theoretical space you can identify the url/file path that data lives in. So far my experience indicates this is true for many data sets.

Conclusions

This is an interesting way of approaching the problem, however in the end it is still serialising all the metadata about the full data set but is doing so in code rather than some other format. It may be that the speed advantage could out way the fragility for datasets an analyst knows and trusts but for robust shareable datasets its likely not the way. Serializing as code seems an odd choice and I think that the operations being done here would be better represented in some other format for sharing/reuse. NcML is one interesting option for storing this aggregation metadata as is CF aggregation. These are both options we are keen to explore further.  One aspect of this work, the ability of proxy objects to replace and or map missing data feels like an exciting tool to wrangle messy datasets into shape and warrants keeping in mind.