by Thor Olavsrud

Data mining the stars: The virtualized telescope that transformed astronomy

Jan 10, 2019 | 8 mins
Data Mining, Databases

More than 15TB of queryable data generated by the Sloan Digital Sky Survey is allowing astronomers to shave years off their research projects.

In the 1990s, astrophysicist Dr. Alex Szalay and computer scientist Dr. Jim Gray had a brainstorm: What if a database could be turned into a virtual telescope that could then be data-mined? Open access to such data could revolutionize the field of astronomy.

With time, the idea would become the Sloan Digital Sky Survey (SDSS), an international collaboration of hundreds of scientists at dozens of institutions.

The goal was to index the sky using a dedicated 2.5-meter telescope at Apache Point Observatory in New Mexico. Equipped with a 120-megapixel camera, the telescope would image more than one-quarter of the night sky, 1.5 square degrees at a time. The project used Microsoft SQL Server as the back-end database.
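For a rough sense of scale, the short Python sketch below estimates how many 1.5-square-degree fields it would take to cover one-quarter of the sky. It is a back-of-the-envelope lower bound only: it assumes non-overlapping fields and ignores how the telescope actually scanned, and the function name `min_fields` is invented for illustration.

```python
import math

# Area of the full celestial sphere in square degrees:
# 4*pi steradians times (180/pi)^2 square degrees per steradian.
FULL_SKY_SQ_DEG = 4 * math.pi * (180 / math.pi) ** 2

FIELD_SQ_DEG = 1.5   # area imaged at a time, as stated in the article
SKY_FRACTION = 0.25  # "more than one-quarter", so the result is a lower bound


def min_fields(sky_fraction: float = SKY_FRACTION,
               field_sq_deg: float = FIELD_SQ_DEG) -> int:
    """Lower bound on the number of non-overlapping fields needed."""
    return math.ceil(sky_fraction * FULL_SKY_SQ_DEG / field_sq_deg)


if __name__ == "__main__":
    print(f"Full sky: {FULL_SKY_SQ_DEG:,.0f} square degrees")
    print(f"Fields needed (lower bound): {min_fields():,}")
```

Under these assumptions the survey area works out to several thousand fields, which hints at why a dedicated telescope was needed.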

From 1998 to 2009, the telescope operated in both imaging and spectroscopic modes. SDSS retired the imaging camera in 2009, but the telescope continues to observe in spectroscopic mode. The data is openly available through SkyServer, an online portal to the database. Today, the database holds a 15TB queryable public dataset and about 150TB of additional raw and calibrated files.

Digitizing the stars

“In traditional astronomy, an astronomer had an idea for a project, but first, he or she needed to find targets,” explains Szalay, Bloomberg Distinguished Professor of Physics and Astronomy and Computer Science at the Johns Hopkins University School of Arts and Sciences and Whiting School of Engineering.

Sloan Digital Sky Survey 2.5-meter telescope at Apache Point Observatory

Before the SDSS, that was a time-consuming process. The astronomer would have to write a proposal and select a large area of the sky to explore possible targets to test the idea. If the proposal was accepted, the astronomer could book time at a telescope.

“In six months, if you got the time, you went out to the mountain top. If you are lucky, it’s not raining and not cloudy and you get some data to take back with you,” Szalay says.

From there, Szalay says, an astronomer could spend several months processing the images to find perhaps a couple of hundred targets. With targets in hand, the astronomer would submit a proposal for time on a bigger telescope to explore those targets in detail. After getting the telescope time and collecting the data, the astronomer would spend a few more months reducing that data.

The SDSS’s map of the Universe. Each dot is a galaxy; the color is the g-r color of that galaxy.

“After two-and-a-half years, you’re at the point where you can actually test your ideas,” Szalay says.

The SDSS has changed all that. Astronomers must now learn to write SQL queries, but doing so can speed up their research enormously.

“Now you can go into the website, point this virtual telescope at any part of the sky — you don’t need to do any reduction — and just pick out the targets you want,” Szalay says. “In five minutes you can dial up the sky and fit [the targets] into a bigger telescope right away. It takes several years out of the loop.”
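As a rough illustration of the kind of target-selection query Szalay describes, the Python sketch below runs SQL against a small in-memory SQLite table. Everything in it is an assumption made for illustration: the `galaxy` table, its columns, the sample rows, and the `select_targets` helper are hypothetical and are not the SkyServer schema, and SQLite stands in for the SQL Server back end.

```python
import sqlite3

# Hypothetical, simplified catalog table. The table and column names are
# assumptions for illustration, not the actual SkyServer schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE galaxy ("
    "obj_id INTEGER PRIMARY KEY, ra REAL, dec REAL, g REAL, r REAL)"
)
conn.executemany(
    "INSERT INTO galaxy VALUES (?, ?, ?, ?, ?)",
    [
        (1, 150.10, 2.20, 18.9, 17.9),   # inside the box, red
        (2, 150.35, 2.05, 19.4, 19.1),   # inside the box, too blue
        (3, 150.60, 2.45, 17.8, 16.7),   # inside the box, red
        (4, 210.00, -5.00, 18.0, 16.9),  # red, but outside the box
    ],
)


def select_targets(conn, ra_min, ra_max, dec_min, dec_max, min_g_r):
    """Return (obj_id, ra, dec, g_r) rows for red galaxies in a sky box."""
    query = """
        SELECT obj_id, ra, dec, g - r AS g_r
        FROM galaxy
        WHERE ra BETWEEN ? AND ?
          AND dec BETWEEN ? AND ?
          AND g - r >= ?
        ORDER BY g_r DESC
    """
    params = (ra_min, ra_max, dec_min, dec_max, min_g_r)
    return conn.execute(query, params).fetchall()


# Pick galaxies in a one-degree box whose g-r color is at least 0.8.
targets = select_targets(conn, 150.0, 151.0, 2.0, 3.0, 0.8)
for obj_id, ra, dec, g_r in targets:
    print(obj_id, ra, dec, round(g_r, 2))
```

The point of the sketch is the shape of the work: a position cut plus a color cut replaces months of image processing with a single declarative query.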

Prior to SDSS, astronomers had data for fewer than 200,000 galaxies. Today, SDSS has data covering more than 220 million galaxies.

Gray, a Microsoft Technical Fellow who won the Turing Award in 1998 for his seminal contributions to database and transaction processing research, worked closely with Szalay and the SDSS until his disappearance while sailing in 2007. He was a major contributor to SkyServer and TerraServer-USA (which would become Microsoft Research Maps before its closure in 2016). Gray and Szalay developed spatial indexing techniques to perform data mining on the SDSS archive. Szalay notes that the spatial index he and Gray created would become integral to Microsoft SQL Server.
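The article does not describe how that spatial index works, so the Python sketch below is only a simple stand-in and not the index Gray and Szalay built. It shows one basic way to make "what is near this position?" queries cheap: bucket objects into horizontal declination zones, then scan only the zones a search circle can touch. The zone height, function names, and sample objects are all assumptions for illustration.

```python
import math
from collections import defaultdict

ZONE_HEIGHT_DEG = 0.5  # hypothetical zone height; an assumption for this sketch


def zone_of(dec: float) -> int:
    """Map a declination in [-90, 90] degrees to an integer zone number."""
    return int(math.floor((dec + 90.0) / ZONE_HEIGHT_DEG))


def build_index(objects):
    """Group (obj_id, ra, dec) tuples by declination zone."""
    index = defaultdict(list)
    for obj_id, ra, dec in objects:
        index[zone_of(dec)].append((obj_id, ra, dec))
    return index


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cone_search(index, ra, dec, radius_deg):
    """Return ids within radius_deg of (ra, dec), scanning only nearby zones."""
    lo = zone_of(max(-90.0, dec - radius_deg))
    hi = zone_of(min(90.0, dec + radius_deg))
    hits = []
    for zone in range(lo, hi + 1):
        for obj_id, o_ra, o_dec in index.get(zone, []):
            if angular_sep_deg(ra, dec, o_ra, o_dec) <= radius_deg:
                hits.append(obj_id)
    return sorted(hits)


# Made-up sample objects: (obj_id, ra, dec) in degrees.
objects = [(1, 150.10, 2.20), (2, 150.12, 2.21),
           (3, 150.10, 5.00), (4, 330.00, 2.20)]
index = build_index(objects)
print(cone_search(index, 150.10, 2.20, 0.05))
```

Because an object's angular separation from the search center is never smaller than its difference in declination, skipping zones outside the search band cannot miss a match. A production index would also prune by right ascension, which this sketch leaves out.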

The SDSS “Orange Spider”: This illustrates the wealth of information on scales both small and large available in the SDSS I/II and III imaging. The picture in the top left shows the SDSS view of a small part of the sky, centered on the galaxy Messier 33 (M33). The middle and right top pictures are further zoom-ins on M33. The figure at the bottom is a map of the whole sky derived from the SDSS image. Visible in the map are the clusters and walls of galaxies that are the largest structures in the entire universe.

“While building applications to study the correlation properties of galaxies, Szalay and his team have discovered that many of the patterns in their statistical analysis involved tasks that were much better performed inside the database engine than outside, on flat files,” write Joseph Sirosh, Microsoft corporate vice president, and Rimma V. Nehme, principal software engineer at the Data Group at Microsoft. “The Microsoft SQL Server gave them high-speed sequential search of complex predicates using multiple CPUs, multiple disks and large main memories. It also had sophisticated indexing and data joining algorithms far outperforming hand-written programs against flat files. Many of the multi-day batch files were replaced with database queries that ran in minutes thanks to a sophisticated query optimizer.”

Astronomy at scale

The SDSS has also democratized astronomy in a way. Before the project, only leading scientists and astronomers had access to telescopes and other instruments to collect data. Everyone else had to make do with whatever data those researchers chose to make available. Sirosh and Nehme note that over the past 14 years, SkyServer has logged more than 1.6 billion web hits and has generated scientific discoveries ranging from the measurements of thousands of asteroids to maps of the merger history of the outer Milky Way. The data produced by SDSS has supported 5,800 papers and more than 245,000 citations. Szalay says about two-thirds of the world’s professional astronomy community makes use of SkyServer every day.

Today, scientists and astronomers are beginning to leverage machine learning and neural networks on the wealth of SDSS data to assist with tasks such as scrubbing noise from images.

The SDSS project is ongoing, but it may soon have a successor. The Large Synoptic Survey Telescope (LSST) is currently under construction in Chile. The plan is for the wide-field survey reflecting telescope to photograph the entire available sky every few nights for 10 years, beginning in January 2022. The images will be recorded by a 3.2-gigapixel CCD imaging camera. At 5.5 feet by 9.8 feet (about the size of a small car), it is the largest digital camera ever constructed.

Szalay, who is on the Science Advisory Council of the LSST, says that the LSST will be able to do in three nights what it took SDSS eight years to do. It will generate a database of about 60 petabytes.
