OpenStack was initially created by NASA and Rackspace. Today it’s powering modern IT infrastructure in public and private clouds. Giants like AT&T are building their future networks on top of OpenStack, while China Mobile and T-Systems (a subsidiary of Deutsche Telekom) are building massive public clouds on it. OpenStack is running the world. I would go so far as to say that OpenStack is the Linux of the infrastructure and cloud world.
It’s not just businesses using OpenStack; the scientific community is a heavy consumer too. One of the most ambitious science projects in human history, the Square Kilometer Array Project (SKA), is looking at OpenStack to solve its problems.
Dr. Rosie Bolton, SKA Science Data Processor Consortium Project Scientist and Fellow at Selwyn College, Cambridge, delivered a keynote speech at the OpenStack Summit, where she talked about the amazing work that is going to happen in astrophysics and the technical challenges it poses for open source projects like OpenStack.
SKA is a billion-dollar-plus project to build a radio observatory with a 50-year lifetime. It is expected to be built in Australia and South Africa.
The first phase of SKA is expected to be finished in 2023. It will consist of two separate instruments and one observatory. These instruments (interferometers) will be arrays of hundreds of dish antennas spread out in clusters in the Western Australian and South African deserts, collecting radio signals from deep space.
SKA will be much more sensitive than any radio telescope we have today, which will allow us to look even deeper into space. Dr. Bolton explained that if we can collect photons from very far away, we can see far back toward the beginning of the universe. SKA will be able to push that view right back to the time when the first stars were ‘switching on’. Because SKA is going to be designed with 65,000 frequency channels, scientists will be able to distinguish emissions from different parts of the universe very clearly.
An overall goal of SKA is to build up a survey of around a billion objects, in 3 dimensions, so we can look at how the structure of the universe has evolved over time and compare it to our cosmological models to see if it’s behaving as we expected and, if not, amend our models.
With that kind of sensitivity, you can also look quickly. Dr. Bolton said that pulsars are a nice tool for doing that. For those who don’t know, a pulsar is a dead star, with about the same mass as the sun, spinning very rapidly. It often has a magnetic field misaligned with the spin axis, which creates a beam of radio emission similar to a lighthouse beam. There are many pulsars with beams lined up with Earth, and we see them like a clock, ticking along.
Dr. Bolton predicted that with SKA we will find every single pulsar in our galaxy that is pointing towards us. We can choose the best ones that are spread out across the galaxy and then look at how their timing pips come in regularly. “If a gravitational wave works its way across the fabric of our galaxy, some of the pulsars from one side will have the metric of spacetime squashed between us and them. On the other side, they’ll have the metric of space time stretched. That means that there will be an offset in the time delays of the pips coming in one side to the other side. We’ll see one half of the sky coming in early, whilst the other half is coming in late. We’ll be able to infer a gravitational wave rippling through the galaxy. I think that’s pretty nice science,” she said.
OpenStack at the center of the universe?
These instruments, installed in the remote deserts of Australia and South Africa, will generate a massive amount of data. Jonathan Bryce, Executive Director of the OpenStack Foundation, said that once SKA goes online it will be generating over 5,000 petabytes of data per day. That’s 5,242,880,000 GB. An HD movie is around 5 GB in size; imagine how many HD movies SKA could be creating every single day.
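To make the scale concrete, here is a rough back-of-the-envelope sketch of that conversion. It assumes binary units (1 PB = 1,024 × 1,024 GB), which is what the 5,242,880,000 GB figure implies, and takes the 5 GB movie size as given; the function names are made up for illustration.

```python
# Binary units, matching the article's figure: 1 PB = 1024 * 1024 GB.
GB_PER_PB = 1024 * 1024


def petabytes_to_gigabytes(petabytes):
    """Convert petabytes to gigabytes using binary units."""
    return petabytes * GB_PER_PB


def hd_movies_per_day(petabytes_per_day, movie_size_gb=5):
    """How many whole HD movies fit into one day's worth of data."""
    return petabytes_to_gigabytes(petabytes_per_day) // movie_size_gb


daily_gb = petabytes_to_gigabytes(5000)
movies = hd_movies_per_day(5000)
print(f"{daily_gb:,} GB per day")        # 5,242,880,000 GB per day
print(f"{movies:,} HD movies per day")   # 1,048,576,000 HD movies per day
```

In other words, at 5 GB apiece that is a little over a billion HD movies' worth of data every day.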
To put that in perspective, Bryce quoted Google Chairman Eric Schmidt, who once said that from the dawn of time until 2003, humanity had created over 5,000 PB of data. That is the collective data created over 5 or 6 thousand years of human civilization. SKA will be generating that magnitude of data by itself, through a single application, every single day.
Generating that much data is not the only challenge. This data has to be processed, downsampled, then shipped around for research. It will be shared with all institutions that are participating in the SKA project.
These arrays will be installed in remote deserts, where the cost of power makes it really expensive to run HPC (high performance computing) centers on site. Instead, SKA will set up Science Data Processor (SDP) centers in Perth and Cape Town, one for each site. The raw data generated each day will be trimmed down to 50 PB per day and then transferred to the SDP centers for processing.
That’s where OpenStack enters the SKA gravitational field. Lauren Sell, VP of Marketing & Community Services at the OpenStack Foundation, wrote in an OpenStack blog post that, all in all, SKA will need to build a 250 petaFLOP system to analyze and store the data, and is looking to OpenStack as a framework to support the computing power locally, as well as potentially supporting RSCs too.
Bryce is excited about the possibility of OpenStack being used by SKA. He said that SKA is basically a distributed, software-defined telescope, so the infrastructure and compute components are a vital piece of the scientific instrument itself. This is the kind of app-cloud area Bryce is excited about.
It’s not really about running something on a server that’s not in your datacenter, or about whether it’s automated or not, which is the general discussion we hear around OpenStack. The real deal is tying infrastructure and software together so that such capabilities can work on these kinds of immense problems.
OpenStack is not the only open source project that SKA is considering; it is also looking at other open source technologies such as Apache Spark. Sell is excited about the proposition, as it will create opportunities for OpenStack to collaborate with those open source communities to achieve what SKA is aiming for.
SKA has not yet picked OpenStack, and the decision depends heavily on whether the OpenStack community is willing to take up these challenges. “Ultimately, for economics, flexibility and speed, the SKA team wants to rely on a distributed system on commodity gear, not a converged appliance. OpenStack has a huge opportunity to power this research, but the community will need to continue to incorporate recommendations of the Scientific Working Group, and others, as we plan the development road map over the next few years,” blogged Sell.
The challenges for the OpenStack community are huge. “The first is complexity,” Dr. Bolton said. “We have multi-axis data sets. We have intuitive converging pipelines that need to run. We have to be able to predict how much time they will take to run. We need about half an exaflop of compute to do this. That’s quite big. We have to orchestrate the ingest, the processing, and the control, the preservation and delivery of these data products. We have to keep up with the incoming data.”
As a publicly funded science project, SKA has its own financial constraints. The first phase has a budget of 650 million euros, and around 10% of that will be used for the science data processor. The system also has to be power efficient, so the data centers need to be really efficient.
“We can’t afford to switch on all of the compute that we might need. We need to make things much more efficient than our current assumption of a 25% efficiency. We need to find ways of making things scale better. We also have to design a system that has to allow for software and hardware refreshes over the 50 year lifetime. When we think about the regional center, and the delivery of the products to the scientists, we have to consider which facilities might be available in national infrastructure projects as well, and how we build a federated system for that.”
If this challenge is not enough to get the OpenStack community excited, Dr. Bolton added that “We have 400 gigabytes per second of data to ingest into the science data processor. Each graph of tasks for a 6 hour observation would have around 400 million tasks in it. We require around half an exaflop, in total, of peak. We need 1.3 zettabytes of intermediate data products for each 6 hour data set. These are data that get created and then destroyed every 6 hours. In terms of final products, we need a petabyte a day of science data products to deliver to the rest of the world.”
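A quick sketch shows what those quoted rates add up to over one observation. It assumes decimal units (1 PB = 1,000,000 GB) and a rate sustained for the whole 6 hours, neither of which is stated in the talk; the function names are made up for illustration.

```python
# Figures quoted by Dr. Bolton.
INGEST_GB_PER_SECOND = 400
OBSERVATION_HOURS = 6
TASKS_PER_OBSERVATION = 400_000_000

# Assumption: decimal units, 1 PB = 1,000,000 GB.
GB_PER_PB = 1_000_000
SECONDS_PER_HOUR = 3600


def ingest_volume_pb(rate_gb_per_second, hours):
    """Total data ingested, in PB, at a constant rate over the given hours."""
    return rate_gb_per_second * hours * SECONDS_PER_HOUR / GB_PER_PB


def average_tasks_per_second(tasks, hours):
    """Average task throughput needed to finish a task graph in the given hours."""
    return tasks / (hours * SECONDS_PER_HOUR)


per_observation = ingest_volume_pb(INGEST_GB_PER_SECOND, OBSERVATION_HOURS)
per_day = ingest_volume_pb(INGEST_GB_PER_SECOND, 24)
task_rate = average_tasks_per_second(TASKS_PER_OBSERVATION, OBSERVATION_HOURS)

print(f"{per_observation:.2f} PB ingested per 6 hour observation")  # 8.64
print(f"{per_day:.2f} PB ingested per day")                         # 34.56
print(f"{task_rate:,.0f} tasks per second on average")              # 18,519
```

Under those assumptions, a scheduler would have to retire roughly 18,500 tasks every second just to finish one observation's graph before the next one arrives.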
The OpenStack Foundation has a Science Working Group that works closely with the scientific community to represent and advance its needs. SKA is posing some really big challenges to OpenStack, and it will be interesting to see whether OpenStack will be able to scale beyond the private cloud and reach out to the stars and galaxies.