We’ve been working on cloud based suite of technology that we hope will let people work with huge data sets in a way that’s user friendly but powerful.
For a while now, we’ve been thinking about how we tackle the problem of “too-big-data”. The Met Office data archive is many tens of Petabytes and it’s growing ever faster as our computers get more powerful. Our current data analysis techniques can’t keep up with these data volumes. We need to think about analysis in a fundamentally different way so we can continue to find the useful information in data. We aren’t alone in this: “big-data” has become central to lots of industries, from health care to engineering and commerce.
Over the last couple of weeks, Jacob, I and the rest of the Lab dived into Infrastructure as Code, tonnes of Docker, Jupyter, Dask and more to see what we could come up with.
What’s wrong with the status quo?
In the past, computer chip manufacturers seemed to be able to make ever faster processors, meaning computers could do their sums faster, and we could generate and analyse more and more data.
We’re making a lot of data
At the Met Office, we have been using supercomputers to generate data such as our weather forecasts since the 1950s. Supercomputers are super because they comprise lots of different mini computers (or nodes). This allows us to divide jobs up into lots of chunks, send each chunk to one node, and do them all at the same time - a technique known as parallelisation. But currently we only use supercomputers to generate data, and not to analyse it.
Traditionally processor speed has been fast enough to let us analyse this data efficiently on a single chip (usually the analyst’s desktop computer). However, the relative power of the supercomputer, and therefore the data volume generated, has grown faster than these chip speeds. A few years ago, data analysis became compute bound, meaning desktop processors weren’t fast enough to analyse this data anymore.
Additionally, chip manufacturers have announced that we can’t expect chip speeds to increase in the future - we can’t just rely on ever faster chips to cope with these increasing data volumes.
Parallel processing to the rescue?
People have been dealing with this by parallelising their analysis calculations, which, for a while made-up for the stall in chip speed. However, the data volumes have now become so big that getting the data from storage to your nodes has become the new bottle-neck: the problem’s now data-transfer bound.
So, if it takes ages to get your data to your calculation, what’s the answer? Well, to take your calculation to your data, of course!
We split the data base into chunks and add a compute node to each chunk. Instead of just having parallelised compute, you now have parallelised data as well, which removes the final bottle-neck of retrieving data to your nodes.
A lot of very powerful systems, such as (Hadoop, Spark, Dask) have sprung up over that last few years to try and do something like this.
What have you done?
Meet J.A.D.E. - the Jupyter and Dask Environment, (or Jupyter Analysis of Data Environment, Just Another Data Environment, Jolly Advances Data Engine…I dunno, you decide). It lets analysts use Jupyter Notebooks (interactive data analysis in a web page) to write big distributed data analysis calculations.
The guiding principle is this:
Reduce the data as much as possible as soon as possible
We’ve prototyped a system using Amazon Web Services, but the principles of the design are platform agnostic. Here’s the system architecture diagram, with some example data volumes at each stage of the process. (WARNING: if you aren’t interested in architecture diagrams don’t be put off. I’ll point out the cool bits after).
The cool bits
Infrastructure as Code
JADE is made up of lots of separate servers all carefully wired together (we are using AWS for this prototype). We’ve automated this process using a language called Terraform. We can now start, stop, tweak, reproduce, share and move this system at the click of a button - an approach called Infrastructure as Code.
Efficient data transmission
If you look at the flow on the right of the diagram you can see some example data volumes at different stages in the analysis process. Our imaginary example analyst wants to analyse 3TB of data. In the past, this would mean pulling the data to their desktop for analysis in manageable chunks. Using Jade, you can load each node with data from S3, which is highly optimised. (Ultimately, these nodes could be preloaded with data, such as in an HDFS). All the data is analysed to reduced it to ~0.5 GB at the data archive (i.e. on AWS, in this prototype), and all that is transferred across this internet to the client is data visualisation, the ultimately reduced data, weighing in at ~1 MB. This keeps data-transfer time to a minimum.
Jupyterhub + Docker Swarm
This lets us automatically create new work environments for users as they log on to the system. Each user works in their own Docker scientific environment (ours is here), meaning they can install any software they want, and generally tweak their system as much as they like without being a danger to other users.
We got a lot of great advice from this blog post.
Because deployment of Notebook servers and data node servers is standardised, it is easy to start extra services if the need arises. We’ve used AWS auto-scaling so that new servers will start and stop to pick up any spike in demand for Notebooks. We hope to extend this kind of functionality to the data nodes as well.
We’ve tried many of the different big-data engines out there. Currently, we’re really excited by new-kid-on-the-block: Dask, which looks like it might be more suitable for us than the two really well established platforms, Hadoop and Spark.
Our scientists are always dreaming of new involved ways to process our data. As such, we wanted a platform which is really flexible and adaptable. It also needs to be able to work with a wide variety of esoteric atmospheric data formats. Ideally we want to allow analysts to explore the data interrogatively i.e. in an interactive session where they can adapt the analysis based on intermediate results. In addition, our scientists mostly work with the Python language, so any solution needs to be able to play nicely with Python.
Hadoop’s primary aim seems to be to perform relatively standard batch jobs on text data. It’s not designed to apply arbitrary code to arbitrary data. Spark is far more interactive than Hadoop, letting you apply successive analyses. Spark can work with Python, but is naturally geared towards Java.
Dask seems to tick all the boxes: its written in Python and allows really in-depth bespoke analyses.
We think that systems like this will be the essence of how we analyse data in the future. We hope to bake Dask type power into the tools our scientists already use. If you’re interested in working with us then get in contact.