In 2012, Geoffrey Moore tweeted, "Without big data analytics, companies are blind and deaf, wandering out onto the Web like deer on a freeway."

Fast forward a decade, and a lot happened in the 2010s to deliver that sight and sound. The storage industry brought innovation to solve the petabyte-plus data challenge, the analytics software/toolkit ecosystem rapidly matured, and chip manufacturers delivered accelerated compute to glean insights from the ever-growing troves of data.

But the quest for better insights is never over. In fact, the constantly increasing volume of data is forcing us to take analytics into hyperdrive. To stay competitive in 2021, enterprises must continue to innovate. Below I describe four big data analytics trends I'm seeing, along with some suggested solution features to look for.

Apache Spark will continue to dominate the big data world

The classic data scientist is known as a badass; give her Apache Spark with a Jupyter notebook and get out of her way. Apache Spark, a unified analytics engine for large-scale data processing, is now the Kleenex of big data analytics and data engineering. It's ubiquitous: universities offer classes on it, every Hadoop deployment leverages it, and the new Spark 3 operator brings native GPU capabilities plus S3 integration. Everyone needs to gear up for the Spark tsunami.

However, a fair amount of thrash in this space causes confusion. Major vendors are pushing businesses to shift to the cloud and dump the Hadoop Distributed File System (HDFS) for object storage, and a host of other dedicated solutions are sprouting up to deliver engineered Spark offerings.

The real challenge is figuring out how to easily bridge from Spark on YARN to a next-generation Spark on Kubernetes implementation -- without major disruptions to the existing environment.
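To make that YARN-to-Kubernetes bridge concrete, here is a minimal, hypothetical sketch of submitting a Spark 3 job directly to a Kubernetes cluster with S3-compatible object storage instead of HDFS. The API server address, container image, namespace, and S3 endpoint are all placeholders you would substitute for your own environment; this is an illustration of the configuration surface, not a drop-in command.

```shell
# Hypothetical sketch: running Spark 3 on Kubernetes instead of YARN.
# <k8s-apiserver>, <registry>, and <s3-endpoint> are placeholders.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name example-analytics-job \
  --conf spark.executor.instances=4 \
  --conf spark.kubernetes.namespace=analytics \
  --conf spark.kubernetes.container.image=<registry>/spark:3.1.1 \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.endpoint=<s3-endpoint> \
  local:///opt/spark/examples/src/main/python/pi.py
```

The point of the sketch: the same application code runs unchanged, while the scheduler (YARN vs. Kubernetes) and storage layer (HDFS vs. S3A object storage) are swapped out entirely through configuration.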
Businesses must also take into account that Spark is just one of many applications they need to support in their analytics pipeline.

What to look for? The goal is a solution that simultaneously improves efficiency, agility, and elasticity while cutting costs and improving data exploitation capabilities. Ideally, this solution will let data scientists tap into existing data stores without having to move to the cloud or re-platform the data. On the application front, businesses will look to avoid vendor lock-in with multi-version, open-source Kubernetes support free of dependencies on Hadoop or YARN.

Stateful application modernization

App modernization is still red hot, and people's minds usually go straight to microservices-based, cloud-native apps. But over the past 18 months, I've seen a radical shift in the open-source, ISV, and even monolithic analytics vendor space (think Splunk, Cloudera, and SAS). Businesses are now choosing to modernize these applications for deployment on container-native infrastructure. These traditionally stateful, data-centric workloads are looking to become more cloud-like by improving the efficiency of at-scale deployments and by gaining the elasticity and agility needed to deploy anywhere -- in minutes.

The challenge is figuring out the right modern home for these stateful applications. Data science and analytics are a team sport, so these applications will need to share data and models while orchestrating hand-offs across the analytics lifecycle.

What to look for? Businesses will quickly need staff who can do more than just spell Kubernetes, but there are 'no-coding' answers to this problem. They should look for a container platform that can support (and ideally is validated with) all these applications and can deliver data at petabyte scale.
Businesses will also need to make sure their solution is based on open-source Kubernetes with proven hybrid-cloud capabilities, so they can quickly move these workloads between on-premises infrastructure and the public cloud.

Solving for app dev and data-intensive workloads

When I go camping, my Swiss Army knife is always on my belt, but as the adage goes, a jack of all trades is a master of none. So I also pack a hammer and a hatchet for when a specialty need arises. I'm noticing the same thing with container offerings. You may have already invested in a technology that is particularly good from the app developer's perspective and are now trying to stretch that tool into new spaces.

The challenge is that we all want to minimize the number of solution providers, so we optimistically believe our vendors when they advocate using their tools for things those tools weren't natively designed to do. Stateful apps are a different beast -- running petabyte-scale analytics is very different from running microservices for web search. The scale of hundreds or thousands of clusters, and/or hosts per cluster, imposes fundamentally different requirements.

What to look for? Use the right tool for the right job. Don't be afraid to run multiple platforms side by side to complement your existing solutions and address your varied use cases for scale, performance, and data gravity. On the data side, validated CSI drivers are a great start, but you may also need a dedicated or integrated high-performance, scale-out data store.

The edge is here, and you need to solve for both data AND security

We've been reading about billions of edge devices and IoT trends for years now, and I'm seeing more solutions that have actually operationalized data analytics from edge to cloud.
In its simplest form, organizations are bridging their data centers with the public cloud; others have brought tens of geographic locations together; and still others are collecting data from millions of streaming devices -- even in orbit. Following this trend, analytics are becoming ever more automated and distributed as they move toward the edge points of data creation. This creates a complex matrix of analytic edges that are themselves composed of interconnected workloads that come and go, interacting with each other across physical and logical boundaries...much like today's web interactions.

Businesses face two inherent challenges in edge analytics. First, how do organizations seamlessly bring together data from the many edges, multiple clouds, and on-premises systems -- while still providing a single, no-silo view of all the data? Second, how do businesses liberate analytics to exploit the data across a secure matrix that has no intrinsic attested identity?

What to look for?

Data: A solution that delivers a common data fabric for all the enterprise's data on a global scale means faster time to value, better governance, and lower cost. Look for data platforms with proven petabyte scale, a hardened enterprise feature set, and proven capabilities (like a global namespace and automatic data tiering) to deliver data from edge to cloud.

Security: A solution that can establish trust in this fluid, interconnected data landscape. Yesterday's strategies for developing trust among workloads, such as perimeter-based secrets management, are just a band-aid that works in the near term but won't scale.
That strategy leaves the business vulnerable to attacks on an application estate that spans beyond the four walls of the data center. Instead, businesses should look for technologies that employ Zero Trust security to fully unlock their analytics over the next decade.

Take analytics to hyperdrive in the 2020s

Data will continue to be nothing without insights. Businesses can't stand still -- they will look to the 2020s as the decade to take their analytics to hyperdrive.

If you're looking to learn more on this topic, please check out HPE's on-demand videos from our popular event, HPE Ezmeral Analytics Unleashed. Numerous insightful videos from the event are now available, including interviews with analysts, live demos, and a discussion with three of our clients about their analytics journeys. They reveal solutions such as a virtual wallet program, a robotic drive for ADAS (advanced driver-assistance systems), and data science as a service.

@geoffreyamoore. Twitter, 12 Aug. 2012, 7:29 p.m., https://twitter.com/geoffreyamoore/status/234839087566163968?s=20

____________________________________

About Matthew Hausmann

Matt's passion is figuring out how to leverage data, analytics, and technology to deliver transformative solutions that improve business outcomes. Over the past decades, he has worked for innovative start-ups and information technology giants in roles spanning business analytics consulting, product marketing, and application engineering. Matt has been privileged to collaborate with hundreds of companies and experts on ways to constantly improve how we turn data into insights.