High Performance Data Analytics: A Q&A with Subject Matter Experts
High performance data analytics (HPDA) is an emerging field born from the convergence of high performance computing (HPC) and data analytics in the cloud era. To explore the HPDA concept, Walker Stemple, an enterprise account manager for Intel, recently sat down for an interview with subject matter experts Bill Magro, an Intel Fellow and chief technologist for HPC software at Intel, and Jimmy Pike, a fellow at Dell EMC who focuses on HPC, machine learning and computing at the edge. This interview has been edited for length and clarity.
Walker: What are the implications for the convergence of big data, analytics, HPC and cloud? And how are customers going to leverage these in the coming years with the digital landscape and how that’s changing?
Jimmy: People are confused by the convergence of HPC, machine learning and all these things. I’m confused about it, too. In reality, a lot of the things we’ve talked about in the past as HPC or as predictive analytics have all run together. High-performance computing really is about high performance and how it’s applied (whether it be data analytics, machine modeling or simulations), and more importantly the application of the technology to a specific vertical or a specific outcome. I think that’s what people are really interested in understanding.
We’ve seen high performance computing grow, certainly, in the supercomputer and the college and university space. But, really, many of the departmental-size HPC or divisional-size HPC clusters are where the work is done. For example, by the end of this year, we’ll be able to do human genome sequencing in under an hour, or something like that, in your doctor’s office. Three years ago, that was unheard of.
By the way, I have one quip. “Do you know what high-performance guys call big data?” … The answer: Data. It’s always been big, and there’s always been a problem in how you deal with the massive size of data. We’re going to see nothing less than a gigantic explosion of data in the future. Hence, all these things are going to be a real problem for us.
Bill: You made a great point there, Jimmy, that big data has just been data in HPC, but analytics has risen up out of the business and the IT arena. As the data grew, people developed new tools to tackle that, like Hadoop and Spark. What happened is that it grew up in a way that was largely disjointed from the advancements and techniques used in HPC. As a result, we’re now at a point where we’ve got big data on both sides, but the systems are largely incompatible. The folks who have been doing analytics with these newer methods, like Hadoop and Spark, actually are not getting access to the high performance infrastructure and the techniques that have been available for those doing simulation and modeling for years and years. So that’s actually a big challenge, but also a great opportunity, to bring these worlds together. And that’s what we mean when we talk about convergence — bringing those things together on common systems, taking advantage of high-performance infrastructure that powers all those different workloads, and starting to support workflows that actually accelerate insights and lead to better business results for enterprises.
Walker: I think another challenge of that revolves around best practices and learnings from others in the industry on how to best incorporate those tools to solve problems. Community meetings tend to be a great way for people to share. What other ways have you seen best practices and sharing information on solving these problems?
Bill: I’ve seen both communities — the big data community and the traditional HPC community — initially coming in very comfortable with and very proud of what they’ve accomplished, and they have good reason to be proud. There was some early skepticism that Jimmy touched on — HPC people just called big data “data.” But I think people are finding that there really are opportunities by cross-pollinating. The HPC users doing simulation and modeling can be scientists, engineers, product development engineers. They are now seeing the advantages of the new tools that came out of big data, and the faster insights they can get by applying analytics to their simulation data. Similarly, we see people who’ve traditionally done analytics realizing that their performance is limited by the tools they are applying — by the storage systems, by the fabrics, by the messaging systems, by the programming models. We now see them wanting to take advantage of the capabilities of HPC to accelerate their insights. So these worlds are coming together.
Jimmy: One of the things that I’ve seen the most is the preponderance of open source tools people are providing and the blurring back and forth between what happens in high performance computing and what happens in big data. Things are specifically developed for a unique problem, then we’re pretty good in this industry in saying, “Hey, look what that guy did. I wonder if we can apply it here.” And more times than not, the answer is yes you can.
Walker: Jimmy, how is Dell EMC defining high performance data analytics?
Jimmy: High performance data analytics is applying those high performance computing techniques we’ve developed to big data. The explosion of data is undeniable. IDC predicts that we will have 44 zettabytes of data by 2020, and by 2025 they predict it will be 185–200 zettabytes. That’s amazing growth. The only way you can deal with that kind of data growth is by applying high performance computing techniques. For example: In the scientific community, you can use machine learning to minimize the amount of simulation you have to do based on some aspect of what a problem presents — to try to basically eliminate a large piece of the computation that would otherwise have to be done if you had to go through every step, one after another.
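Jimmy’s idea of using machine learning to skip computation is often implemented as a surrogate model: a cheap predictor answers wherever nearby simulation results already exist, and the expensive simulation runs only where the prediction is untrustworthy. The sketch below is a minimal illustration of that pattern; the toy simulation function and the 0.1 distance threshold are illustrative assumptions, not anything from the interview.

```python
import math
import random

def expensive_simulation(x):
    # Stand-in for a costly physics run (hypothetical toy model).
    return math.sin(3 * x) + 0.5 * x

def nearest_neighbor_surrogate(samples, x):
    # Predict from the closest previously simulated point; also
    # report how far away that point is (a crude confidence measure).
    xk, yk = min(samples, key=lambda s: abs(s[0] - x))
    return yk, abs(xk - x)

random.seed(0)
# Seed the surrogate with a handful of real simulation results.
samples = [(x, expensive_simulation(x)) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]

full_runs = 0
for x in [random.uniform(0, 2) for _ in range(100)]:
    pred, dist = nearest_neighbor_surrogate(samples, x)
    if dist > 0.1:
        # Too far from any known result: pay for the real simulation
        # and fold the new result back into the surrogate's data.
        samples.append((x, expensive_simulation(x)))
        full_runs += 1

print("full simulations run:", full_runs, "of 100 queries")
```

Real deployments replace the nearest-neighbor lookup with a trained regressor (Gaussian processes and neural networks are common choices), but the control flow — predict cheaply, simulate only when uncertain — is the same.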
Walker: Bill, to Jimmy’s point, have you seen better uptake in your discussions with customers around leveraging data analytics to formulate new use cases and problems in the scientific community?
Bill: Absolutely. There’s tremendous interest. I think people are seeing there is something new there. Some of the analysis techniques are new. The emergence and maturing of machine learning has been tremendous, and everywhere you go there is interest in that. Scientists are eager to do exactly what Jimmy said. In the past, you would run a wide array of simulations and study that data to look for trends. Why not use the machine to look for the trends? Why not use it to prune the search tree? Cull out some of the calculations and focus your computing power, which really is a scientific instrument, at the highest probability and most valuable simulation. So there’s definite interest from that side. From the other side, the folks doing analysis with these new techniques want to take advantage of the high performance infrastructure. Analysis tools and problems now can benefit from high performance infrastructure, so that’s what HPDA is about. It’s about bringing them all together.
Jimmy: With the data that we’re trying to see and the analytics around data, the problems are extraordinarily parallelizable, and the same techniques can be applied to lots and lots of data. I think that we’re going to see continued growth of that, of how we can do more and more things at the same time, with larger clusters providing the results.
Bill: There is a tremendous amount of parallelism. As the data grows, the parallelism is going to grow, and the opportunities for parallelism are going to grow.
Jimmy: I’m sometimes amused by the people who quote the part of Amdahl’s law that says parallelism gives you scalability, which it does, but there are two parts: the serial part and the parallelizable part. We tend to focus on the parallelizable part and ignore the serial part, the time it takes to basically dispatch all this work and, at the end of the day, or the end of the period, the time it takes to coalesce those results. After a while, those pieces become larger than the advantage you get through parallelization, and that’s where we need these intricate solutions, like a deep learning network, or like a multi-level network, where you can infer things in this data and potentially improve performance.
Walker: It really is. It’s all about the data. And one thing that is not slowing down is the rate of generation of data and the desire to consume that data in intelligent ways. IoT is a really good example, with all these edge devices generating tons of data. HPCwire recently wrote that 77 percent of HPC end users report that data and storage are now the most strategic part of HPC data centers. Can you elaborate on why this might be the case, based on your collaboration with IT leaders?
Bill: I think the reason people are seeing data as so strategic is because there’s so much opportunity to collect it, there’s so much opportunity to generate it, and the reality is you can’t store all of it. And so the strategic piece of it is deciding what to keep, and in the past it was just based on experience, it was based on judgment — I’m going to keep this, I’m going to throw that away. You can’t keep all the data from your calculations. People know now that there’s information being lost, and so they want to use the techniques to do analysis on the fly, to figure out if they are heading in the right direction — did something interesting just happen over here that I need to pursue? If so, they can save the data that they otherwise would have thrown away. That’s really, really important. The other thing, of course, is once you put data down and decide to save it, it’s critically important where you put it, because it’s going to be there for a long, long time, and it’s tremendously expensive in terms of dollars, energy and time to move it. So it’s very strategic to think about what data you keep and where you put it so you can do analysis later.
Walker: The auto industry is one of the areas where we’ve seen a lot of interest recently in data analytics and high performance computing. Manufacturing has always been a heavy consumer of HPC to differentiate products, but now manufacturers are leveraging analytics in these new processes to innovate and drive real changes in the industry. So what other ways in other industries have you seen decision making change with respect to how they’re consuming analytics and HPC?
Bill: We see things in financial services, where fraud detection has really been able to be amped up. That’s not entirely new — they’ve been doing analytics all the way back to the early 90s — but some of the techniques are new. They’re able to detect fraudulent credit card uses, outliers, unusual behaviors, and trigger much faster, so that’s bringing business value. In the area of medicine, doctors are not going to abandon the techniques they’ve used in the past, but they are gaining the ability to do faster reconstructions on medical imaging data, the ability to use machine learning to spot things that maybe a radiologist would have difficulty spotting. The same is true with a lot of infectious diseases. Being able to augment what they do today, and bring in the power of HPC on high performance infrastructure, is really causing breakthroughs and advancements in these fields.
Walker: What industries are ripe for disruption by the convergence of big data and high performance computing?
Bill: Retail is one for sure. If you look at what Amazon is doing now, piloting with their grocery stores where you can just walk in and walk out, that’s video analytics being applied in a new way. People have had video cameras in retail establishments forever, recording and discarding. Well, now people are applying these high-performance machine learning algorithms directly to the video in real time.
Jimmy: The question is, “Is the industry that has serviced society going to realize that our sociology has moved on?” I think that’s one of the reasons why Amazon has been so successful in the retail space, because they provide an easy way for you to see what you want. You don’t have to go anywhere, and they provide an easy way for you to get it. So it’s all about how easy it is for the human to interact. The people who take advantage of these tools and make them easy, they’re going to win.
Walker: Let’s finish with a few final thoughts. What is really getting you out of bed to come in and think about the future of technology and how it applies to the human challenges that we see in the future?
Bill: Well for me, I still feel we’re really far away from having taken advantage of and maximizing this opportunity. The worlds of HPC and data analytics developed independently, and so there’s a lot of technological baggage that’s preventing things from coming together. I visit customers, and they say, “Why do I have to buy this one machine for Hadoop and a separate machine for my simulation and modeling?” And the answer is that you shouldn’t have to. The problem is that the technology is developing so fast that nobody can slow down to make the changes and agree on some conventions that would allow us all to go faster. Just the coexistence of traditional HPC workloads with big data workloads on a common machine means you have to have some conventions in resource management. It’s easy to sit around and talk about the opportunities and trends, but there’s a lot of hard work that needs to be done to actually realize the potential. And that’s what gets me up in the morning, going out trying to drive those changes, so we can see this stuff meet its potential.
Jimmy: I agree. As technologists, we always assume that technology will transition before it actually does. But, on the good side of that, I have never seen as much change as there is right now. Virtually everything is changing, and with change comes great chaos, and with great chaos comes opportunity. It’s really just our imagination that is going to limit us.