High Performance Data Analytics: A Q&A with Subject Matter Experts

High performance data analytics (HPDA) is an emerging field born from the convergence of high performance computing (HPC) and data analytics in the cloud era. To explore the HPDA concept, Walker Stemple, an enterprise account manager for Intel, recently sat down for an interview with subject matter experts Bill Magro, an Intel Fellow and chief technologist for HPC software at Intel, and Jimmy Pike, a Dell EMC fellow who focuses on HPC, machine learning and computing at the edge. This interview has been edited for length and clarity.

Walker: What are the implications of the convergence of big data, analytics, HPC and cloud? And how are customers going to leverage these in the coming years as the digital landscape changes?

Jimmy: People are confused by the convergence of HPC, machine learning and all these things. I’m confused about it, too. In reality, a lot of the things we’ve talked about in the past as HPC or as predictive analytics have all run together. High-performance computing really is high performance and how it’s applied (whether it be data analytics, machine modeling or simulations), and more importantly the application of the technology for a specific vertical or a specific outcome.
I think that’s what people are really interested in understanding.

We’ve seen high performance computing grow, certainly, in the supercomputer and the college and university space. But, really, many of the departmental-size HPC or divisional-size HPC clusters are where the work is done. For example, by the end of this year, we’ll be able to do human genome sequencing in under an hour, or something like that, in your doctor’s office. Three years ago, that was unheard of.

By the way, I have one quip. “Do you know what high-performance guys call big data?” … The answer: Data. It’s always been big, and there’s always been a problem in how you deal with the massive size of data. We’re going to see nothing less than a gigantic explosion of data in the future. Hence, all these things are going to be a real problem for us.

Bill: You made a great point there, Jimmy, that big data has just been data in HPC, but analytics has risen up out of the business and the IT arena. As the data grew, people developed new tools to tackle that, like Hadoop and Spark. What happened is that it grew up in a way that was largely disjointed from the advancements and techniques used in HPC. As a result, we’re now at a point where we’ve got big data on both sides, but the systems are largely incompatible.
The folks who have been doing analytics with these newer methods, like Hadoop and Spark, actually are not getting access to the high performance infrastructure and the techniques that have been available for those doing simulation and modeling for years and years. So that’s actually a big challenge, but also a great opportunity, to bring these worlds together. And that’s what we mean when we talk about convergence: bringing those things together on common systems, taking advantage of high-performance infrastructure that powers all those different workloads, and starting to support workflows that actually accelerate insights and lead to better business results for enterprises.

Walker: I think another challenge of that revolves around best practices and learnings from others in the industry on how to best incorporate those tools to solve problems. Community meetings tend to be a great way for people to share. What other ways have you seen of sharing best practices and information on solving these problems?

Bill: I’ve seen both communities, the big data community and the traditional HPC community, initially coming in very comfortable with and very proud of what they’ve accomplished, and they have good reason to be proud. There was some early skepticism that Jimmy touched on: HPC people just called big data “data.” But I think people are finding that there really are opportunities in cross-pollinating. The HPC users doing simulation and modeling can be scientists, engineers, product development engineers.
They are now seeing the advantages of the new tools that came out of big data, and the faster insights they can get by applying analytics to their simulation data. Similarly, we see people who’ve traditionally done analytics realizing that their performance is limited by the tools they are applying: by the storage systems, by the fabrics, by the messaging systems, by the programming models. We now see them wanting to take advantage of the capabilities of HPC to accelerate their insights. So these worlds are coming together.

Jimmy: One of the things that I’ve seen the most is the preponderance of open source tools people are providing and the blurring back and forth between what happens in high performance computing and what happens in big data. Things are developed specifically for a unique problem, and then we’re pretty good in this industry at saying, “Hey, look what that guy did. I wonder if we can apply it here.” And more times than not, the answer is yes, you can.

Walker: Jimmy, how is Dell EMC defining high performance data analytics?

Jimmy: High performance data analytics is applying those high performance computing techniques we’ve developed to big data. The explosion of data is undeniable. IDC predicts that we will have 44 zettabytes of data around in 2020. By 2025, they predict it will be 185 to 200 zettabytes of data. That’s amazing growth. The only way you can deal with that kind of data growth is by applying high performance computing techniques.
For example: In the scientific community, you can use machine learning to minimize the amount of simulation you have to do based on some aspect of what a problem presents, to try to basically eliminate a large piece of the computation that would otherwise have to be done if you had to go through every step, one after another.

Walker: Bill, to Jimmy’s point, have you seen better uptake in your discussions with customers around leveraging data analytics to formulate new use cases and problems in the scientific community?

Bill: Absolutely. There’s tremendous interest. I think people are seeing there is something new there. Some of the analysis techniques are new. The emergence and maturing of machine learning has been tremendous, and everywhere you go there is interest in that. Scientists are eager to do exactly what Jimmy said. In the past, you would run a wide array of simulations and study that data to look for trends. Why not use the machine to look for the trends? Why not use it to prune the search tree? Cull out some of the calculations and focus your computing power, which really is a scientific instrument, on the highest-probability and most valuable simulations. So there’s definite interest from that side. From the other side, the folks doing analysis with these new techniques want to take advantage of the high performance infrastructure. Analysis tools and problems now can benefit from high performance infrastructure, so that’s what HPDA is about.
It’s about bringing them all together.

Jimmy: With the data that we’re trying to see and the analytics around data, the problems are extraordinarily parallelizable, and the same techniques can be applied to lots and lots of data. I think that we’re going to see continued growth of that, of how we can do more and more things at the same time, with larger clusters providing the results.

Bill: There is a tremendous amount of parallelism. As the data grows, the parallelism is going to grow, and the opportunities for parallelism are going to grow.

Jimmy: I’m sometimes amused by the people who quote the part of Amdahl’s law that says parallelism gives you scalability, which it does, but there are two parts: the serial part and the parallelizable part. We tend to focus on the parallelizable part and ignore the serial part, or the time it takes to basically dispatch all this work and, at the end of the day, or the end of the period, the time it takes to coalesce those results. After a while, those pieces become larger than the advantage you get through parallelization, and that’s where we need these intricate solutions, like a deep learning network, or like a multi-level network, where you can infer things in this data and potentially improve performance.

Walker: It really is. It’s all about the data. And one thing that is not slowing down is the rate of generation of data and the desire to consume that data in intelligent ways.
IoT is a really good example, with all these edge devices generating tons of data. HPCwire recently wrote that 77 percent of HPC end users report that data and storage is now the most strategic part of HPC data centers. Can you elaborate on why this might be the case, based on your collaboration with IT leaders?

Bill: I think the reason people are seeing data as so strategic is because there’s so much opportunity to collect it, there’s so much opportunity to generate it, and the reality is you can’t store all of it. And so the strategic piece of it is deciding what to keep, and in the past it was just based on experience, it was based on judgment: I’m going to keep this, I’m going to throw that away. You can’t keep all the data from your calculations. People know now that there’s information being lost, and so they want to use the techniques to do analysis on the fly, to figure out if they are heading in the right direction: did something interesting just happen over here that I need to pursue? If so, they can save the data that they otherwise would have thrown away. That’s really, really important. The other thing, of course, is once you put data down and decide to save it, it’s critically important where you put it, because it’s going to be there for a long, long time, and it’s tremendously expensive in terms of dollars, energy and time to move it. So it’s very strategic to think about what data you keep and where you put it so you can do analysis later.

Walker: The auto industry is one of the areas where we’ve seen a lot of interest recently in data analytics and high performance computing.
Manufacturing has always been a heavy consumer of HPC to differentiate products, but now manufacturers are leveraging analytics in these new processes to innovate and drive real changes in the industry. So what other ways in other industries have you seen decision making change with respect to how they’re consuming analytics and HPC?

Bill: We see things in financial services, where fraud detection has really been able to be amped up. That’s not entirely new (they’ve been doing analytics all the way back to the early ’90s), but some of the techniques are new. They’re able to detect fraudulent credit card uses, outliers, unusual behaviors, and trigger much faster, so that’s bringing business value. In the area of medicine, doctors are not going to abandon the techniques they’ve used in the past, but they are gaining the ability to do faster reconstructions on medical imaging data, the ability to use machine learning to spot things that maybe a radiologist would have difficulty spotting. The same is true with a lot of infectious diseases. Being able to augment what they do today, and bring in the power of HPC on high performance infrastructure, is really causing breakthroughs and advancements in these fields.

Walker: What industries are ripe for disruption by the convergence of big data and high performance computing?

Bill: Retail is one for sure. If you look at what Amazon is doing now, piloting with their grocery stores where you can just walk in and walk out, that’s video analytics being applied in a new way.
People have had video cameras in retail establishments forever, recording and discarding. Well, now people are applying these high-performance machine learning algorithms directly to the video in real time.

Jimmy: The question is, “Is the industry that has serviced society going to realize that our sociology has moved on?” I think that’s one of the reasons why Amazon has been so successful in the retail space, because they provide an easy way for you to see what you want. You don’t have to go anywhere, and then they provide an easy way for you to get it. So it’s all about how easy it is for the human to interact. The people who take advantage of these tools and make them easy, they’re going to win.

Walker: Let’s finish with a few final thoughts. What is really getting you out of bed to come in and think about the future of technology and how it applies to the human challenges that we see in the future?

Bill: Well, for me, I still feel we’re really far away from having taken advantage of and maximizing this opportunity. The worlds of HPC and data analytics developed independently, and so there’s a lot of technological baggage that’s preventing things from coming together. I visit customers, and they say, “Why do I have to buy this one machine for Hadoop and a separate machine for my simulation and modeling?” And the answer is that you shouldn’t have to. The problem is that the technology is developing so fast that nobody can slow down to make the changes and agree on some conventions that would allow us all to go faster.
Just the coexistence of traditional HPC workloads with big data workloads on a common machine means you have to have some conventions in resource management. It’s easy to sit around and talk about the opportunities and trends, but there’s a lot of hard work that needs to be done to actually realize the potential. And that’s what gets me up in the morning, going out trying to drive those changes, so we can see this stuff meet its potential.

Jimmy: I agree. As technologists, we always assume that technology will transition before it actually does. But, on the good side of that, I have never seen as much change as there is right now. Virtually everything is changing, and with change comes great chaos, and with great chaos comes opportunity. It’s really just our imagination that is going to limit us.
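
Editor's aside: Jimmy's remark about Amdahl's law earlier in the conversation can be made concrete with a small sketch. The first function below is the standard form of the law, speedup = 1 / (s + (1 - s) / n), where s is the serial fraction and n is the number of workers. The second function is only an illustrative assumption, not part of Amdahl's law and not something either speaker proposed: it adds a hypothetical per-worker cost (`overhead_per_worker`) standing in for the dispatch and coalescing time Jimmy mentions, and assumes that cost grows linearly with the number of workers.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Classic Amdahl's law: speedup = 1 / (s + (1 - s) / n)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def speedup_with_overhead(
    serial_fraction: float, workers: int, overhead_per_worker: float
) -> float:
    """Hypothetical extension (an assumption for illustration only).

    Adds a dispatch/coalesce cost that grows linearly with the number of
    workers. The cost is expressed as a fraction of the single-worker runtime.
    """
    if overhead_per_worker < 0.0:
        raise ValueError("overhead_per_worker must be non-negative")
    base_time = 1.0 / amdahl_speedup(serial_fraction, workers)  # also validates inputs
    return 1.0 / (base_time + overhead_per_worker * workers)


if __name__ == "__main__":
    # Example values are arbitrary: 5% serial work, 0.01% overhead per worker.
    for n in (1, 8, 64, 512, 4096):
        plain = amdahl_speedup(0.05, n)
        with_cost = speedup_with_overhead(0.05, n, 1e-4)
        print(f"workers={n:5d}  amdahl={plain:6.2f}  with_overhead={with_cost:6.2f}")
```

Under the classic formula, a 5 percent serial fraction caps the speedup below 20 no matter how many workers are added. With the assumed overhead term, the speedup rises, peaks, and then falls as workers are added, which is the effect Jimmy describes when the dispatch and coalescing pieces outgrow the benefit of parallelization.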