One of the ways companies often run into trouble with data lakes is trying to use them as a data warehouse. It’s a “terrible idea, unless it works,” says Merv Adrian, research vice president at Gartner.
It’s an issue that Mark Stange-Tregear, vice president of analytics at Ebates, knows all too well. When Stange-Tregear joined Ebates a little over four years ago, the company didn’t have much of a business intelligence (BI) infrastructure beyond a single SQL server and a handful of data engineers taking a replica of the main production database. They were struggling with the extract, transform, and load (ETL) process.
“ETL cycles were running 28 hours. Team members couldn’t get the reports or information they needed on a regular basis. We were hitting concurrency limits. It was clearly becoming unstable,” Stange-Tregear says.
A data lake built on a Hadoop cluster looked like the right solution, both from a cost standpoint and in light of Ebates’ vision for the future. The company would be able to land all its data in one place and make it available without having to reblend it or manage multiple silos.
Stange-Tregear’s team wrote its core ETL processes in Python and, within a few months, was able to get core executive reporting out of the new data lake.
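The article doesn’t detail those Python jobs, but a core ETL step of this kind typically has the same shape: extract rows from the production replica, apply a light transformation, and load them into the lake. A minimal, hypothetical sketch of that shape, using in-memory SQLite databases as stand-ins (the table and column names are invented; a real job would read from the replica and write to the Hadoop cluster):

```python
import sqlite3

# Hypothetical stand-ins: "source" plays the production replica,
# "lake" plays the data lake. Schema and values are illustrative.
source = sqlite3.connect(":memory:")
lake = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 1250), (2, 899), (3, 15000)])

# Extract: pull the rows from the replica.
rows = source.execute("SELECT id, amount_cents FROM orders").fetchall()

# Transform: a light cleanup step -- normalize cents to dollars.
rows = [(order_id, cents / 100.0) for order_id, cents in rows]

# Load: write the cleaned rows into the lake-side copy.
lake.execute("CREATE TABLE orders (id INTEGER, amount_dollars REAL)")
lake.executemany("INSERT INTO orders VALUES (?, ?)", rows)
lake.commit()
```

Chaining many steps like this over large tables is also where a 28-hour ETL cycle comes from when everything runs on one overloaded server.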
“From there, we got executive buy-in,” Stange-Tregear says. “We were getting their reporting in a much more quick and efficient manner. We started moving everything else over. It was a long tail, but eventually we shut all of our SQL Servers down.”
The ETL bottleneck
Within its single Hadoop cluster, Ebates has two distinct areas of data, Stange-Tregear explains. One is what the company refers to as the data lake: a very clean copy of the production databases. The team doesn’t do much in the way of transformation or cleanup on this data: The tables in the data lake look almost exactly like the data in the production databases, Stange-Tregear says.
The other part of the cluster is what Ebates refers to as the data warehouse, which is the cleaned, rationalized, joined data it uses for most of the heavy reporting.
“We pre-aggregate data from the data lake into the data warehouse,” Stange-Tregear says. “We do report directly from the data lake, but you can improve performance if you do some preprocessing.”
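That preprocessing can be as simple as rolling detail rows up into summary tables that reports query instead of the raw copies. A toy sketch of the idea, with SQLite standing in for the cluster and an invented schema (Ebates’ actual tables and tooling aren’t described in the article):

```python
import sqlite3

db = sqlite3.connect(":memory:")

# "Data lake" side: a near-verbatim copy of a production table.
db.execute("CREATE TABLE lake_orders (order_date TEXT, amount REAL)")
db.executemany("INSERT INTO lake_orders VALUES (?, ?)", [
    ("2018-01-01", 10.0), ("2018-01-01", 5.0), ("2018-01-02", 7.5),
])

# "Data warehouse" side: pre-aggregated daily totals, so heavy
# reporting scans a few summary rows instead of the raw detail.
db.execute("""
    CREATE TABLE dw_daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue, COUNT(*) AS order_count
    FROM lake_orders
    GROUP BY order_date
""")

for row in db.execute("SELECT * FROM dw_daily_revenue ORDER BY order_date"):
    print(row)
```

The trade-off is the one Stange-Tregear describes: you can still report straight off the lake-side tables, but the pre-aggregated warehouse tables answer the common questions far more cheaply.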
For a time, this BI infrastructure was the answer to the challenges Stange-Tregear identified when he first took his position with Ebates. But as use of the Hadoop cluster soared, new challenges started to mount.
“At a certain point, if you’re doing multiple different types of work on the same machines, you run into competition for resources. You can only do so much reading and writing until you have competition at the disk level,” Stange-Tregear says.
One of the benefits of a Hadoop cluster is you can add more machines to boost the processing power. You can mask the problem for a time that way, Stange-Tregear says, but ultimately the problem persists.
“Because we allow what we would think of as ad hoc workloads against the cluster, that’s relatively uncontrolled and unpredictable,” he explains. “If someone does a crazy number of joins, the cluster will accept that and start working on it, potentially interfering with your ETL workload, scheduled email sends, routine reporting, to the point where it ultimately takes the cluster down.”
Stange-Tregear notes that today’s Hadoop technology is very stable: “You have to do something completely obnoxious to the cluster to send it into a tailspin. But we now do that a couple of times a day.”
There are two obvious ways to try to fix this. One is to identify the offending types of workloads and get people not to submit them.
“That is extremely difficult to do while also trying to make business intelligence available to everyone,” Stange-Tregear says. “You don’t really want to control it, but I want to control it. We haven’t been able to eliminate those problematic inserts. My analytic teams are probably the chief offenders, and it’s unpredictable when you’re going to cause that kind of load.”
The other way, and the one Gartner’s Adrian recommends, is splitting the ETL processing and the ad hoc query workload onto different sets of hardware. That’s a bitter pill to swallow, Stange-Tregear concedes, because it feels like abandoning the original vision and returning to a siloed data infrastructure. But he notes that with a separate ETL cluster and data lake cluster, Ebates can copy the data from one to the other without adding too much ETL overhead.
ETL in the cloud
The decision facing Ebates now is whether to stand up a second on-premises Hadoop cluster for ad hoc reporting, or to put that second cluster in the cloud.
“The cloud question is often a bigger one than the data lakes question,” Gartner’s Adrian adds. “It often makes a lot more sense to put it in the cloud. It’s OpEx instead of CapEx. Someone is backing it up and patching it. That’s a big, big advantage. For data science workloads, perhaps the biggest advantage is in the cloud I’m able to leverage the separation of compute and store.”
For production jobs, the benefits of cloud are not as clear-cut, Adrian says. Production environments run hot most of the time, which means the meter is always running and the cost advantages of cloud are not as great.
One of the things holding back Hadoop in the cloud using HDFS is that it hasn’t really been cost-effective, says David Mariani, founder and vice president of technology at AtScale, formerly vice president of development, user data, and analytics at Yahoo.
“Data is stored on the nodes themselves. Compute and storage are coupled. You can’t ever turn off those nodes. They’re running 24-7, which means the meter is running 24-7,” Mariani says.
AtScale’s answer is to store the data directly in Amazon S3, with AtScale’s middleware connecting the data to compute and BI tools like Tableau as needed. That, Stange-Tregear says, is what’s really opening up the cloud as a contender for solving Ebates’ needs.
“We want to be able to query directly off the data sets,” he explains. “Can we query directly off the S3 buckets without having to do an additional load?”
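Engines that query object storage in place avoid that extra load step largely by reading only the objects a query’s filters actually touch. A toy illustration of the idea in plain Python (the bucket layout and key names are invented, and real query engines do this pruning against partition directories and file metadata rather than key strings):

```python
# Illustrative sketch of partition pruning over an S3-style bucket.
# Everything here is hypothetical -- a stand-in for how a query
# engine narrows down which objects to read, with no load step.

def prune_keys(keys, date_prefix):
    """Return only the keys whose partition matches the query's
    date filter; the engine reads just these objects."""
    return [k for k in keys if f"order_date={date_prefix}" in k]

bucket_keys = [
    "datalake/orders/order_date=2018-01-01/part-0000.parquet",
    "datalake/orders/order_date=2018-01-02/part-0000.parquet",
    "datalake/orders/order_date=2018-02-01/part-0000.parquet",
]

# A query filtered to January touches only the January partitions.
january_keys = prune_keys(bucket_keys, "2018-01")
print(january_keys)
```

Because compute only rents for the objects it actually scans, this is the separation of compute and storage Adrian points to as the cloud’s big advantage for ad hoc work.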
“It’s a matter of crawl, walk, run,” Stange-Tregear says. “We have our processes built on an on-premise solution. What do we do first? The obvious solution is to shift the reporting solution into the cloud first. If that works, do we then do a second step of moving all our ETL processing to the cloud? If the reporting side of things works, we may well work in the direction of moving our ETL into the cloud.”