To the Cloud and Back (Part 1)

BrandPost By Keith Manthey
Oct 04, 2016
Analytics Big Data Hadoop

Imagine a world where analytics and computing resources are only bound by your imagination.

The cloud is a much-praised, and also much-maligned, term these days. On one hand, it lets businesses burst their infrastructure capacity with just a few keystrokes. On the other, business users are turning to the cloud to avoid having to request resources and services from their IT partners. Whichever side a business user falls on, almost everyone agrees that the cloud is here to stay.

Businesses are quickly adopting the cloud for many workloads. Workloads that involve websites or a software-as-a-service offering can be ported quickly, with the cloud becoming their means of delivery. Workloads that involve large data sets can be harder to move. Analytics is a popular workload, and one that often requires a large set of data. The larger the data set needed for analytics, the more problematic the migration to the cloud becomes. The reason data mobility for analytics is such a challenge boils down to a concept popularized as “data gravity.”

Data Gravity

The earth is bound by a series of unbending limits and temporary limits. Gravity is an unbending limit, as described by Newton’s law of universal gravitation. Network speeds and storage access speeds are temporary limits whose throughput and capacity regularly increase. Despite expanding network capabilities, data gravity remains a current technological limit.

To understand what we mean by data gravity, we can look at a data set’s size, its geographical location and the network capacity emanating from it. A 1 petabyte data set (or 1 million gigabytes of data) housed in a highly connected data center with a 100-gigabit-per-second network connection will take roughly 22 hours, or nearly a full day, of totally dedicated networking to move. If the data set does not change often, and does not have a data currency need, then a day or two to move the entire data set might be acceptable.
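The arithmetic behind that estimate can be sketched in a few lines of Python. This is a back-of-the-envelope calculation, not a benchmark: the function name and the efficiency parameter are illustrative assumptions, and it uses decimal units (1 petabyte = 10^15 bytes) with a link that is fully dedicated to the transfer.

```python
PETABYTE = 10**15   # bytes, decimal units (1 million gigabytes)
GBPS = 10**9        # bits per second in one gigabit per second


def transfer_time_hours(data_bytes: float,
                        link_bits_per_second: float,
                        efficiency: float = 1.0) -> float:
    """Hours needed to move data_bytes over a link.

    efficiency is an assumed fraction (0 < efficiency <= 1) of the
    nominal link speed that the transfer actually achieves.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    bits = data_bytes * 8  # network speeds are quoted in bits, storage in bytes
    seconds = bits / (link_bits_per_second * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    ideal = transfer_time_hours(1 * PETABYTE, 100 * GBPS)
    print(f"1 PB over a dedicated 100 Gb/s link: {ideal:.1f} hours")
```

Under these assumptions the ideal case works out to about 22 hours. Any protocol overhead or shared traffic lowers the efficiency and pushes the transfer past a full day.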

In some cases, 1 petabyte of data is generated on a daily basis. If it takes a day to move a day’s worth of data, then the process may never complete. Data gravity refers to this effect: the larger the data set, and the more demanding the network requirements to move it, the harder it is to relocate. If a data scientist needs to move the data set between data centers, or between a data center and the cloud, data gravity limits the data’s mobility. The effect of data gravity is often data silos.
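To see why such a transfer may never complete, consider a rough sketch of the backlog that builds up when new data arrives about as fast as the link can drain it. The function name and the 80 percent efficiency figure below are hypothetical values chosen for illustration, not measurements.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 10**15   # bytes, decimal units
GBPS = 10**9        # bits per second in one gigabit per second


def backlog_after_days(daily_generated_bytes: float,
                       link_bits_per_second: float,
                       days: int,
                       efficiency: float = 1.0) -> float:
    """Bytes still waiting to be moved after the given number of days.

    Each day the source produces daily_generated_bytes and the link
    drains at most one day's worth of capacity. efficiency is an
    assumed fraction of the nominal link speed actually achieved.
    """
    daily_capacity = link_bits_per_second * efficiency / 8 * SECONDS_PER_DAY
    backlog = 0.0
    for _ in range(days):
        # The backlog cannot go negative: spare capacity is simply unused.
        backlog = max(0.0, backlog + daily_generated_bytes - daily_capacity)
    return backlog


if __name__ == "__main__":
    for eff in (1.0, 0.8):  # 0.8 is a hypothetical real-world efficiency
        left = backlog_after_days(1 * PETABYTE, 100 * GBPS, days=30, efficiency=eff)
        print(f"efficiency {eff:.0%}: {left / PETABYTE:.2f} PB behind after 30 days")
```

A fully dedicated 100 Gb/s link can carry about 1.08 petabytes per day, so it only just keeps pace with 1 petabyte of new data. At the assumed 80 percent efficiency the link falls further behind every day, and the copy in the cloud is never current.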

Data Silos

As companies start to use the cloud, data is often migrated through manual processes, or through automated scripts and other mechanical tools. Often, the data in the cloud is then used to create more data via analytic methods. The end result is a larger data set in the cloud. Without effective data management and data movement techniques, data silos are created. Data silos are pockets of data that remain disconnected from the rest of the business’s data. Centralized management or centralized usage of the data is difficult with a data silo.

Data silos are more prevalent in the cloud because of its asymmetric model for access. The network capacity to ingest data into the cloud at rapid speed is one of the reasons for the rapid growth of cloud computing usage. The cost of extracting data, by contrast, is meant to deter customers from taking data back out. It is easy to get data into the cloud and harder to get it out. This asymmetry is the main contributor to data silos in the cloud.

Data Mobility

Given the challenges of data gravity and data silos, data movement, and more specifically data movement for analytics, is attracting renewed interest. In many conversations that I have on a daily basis, customers struggle with moving data around their own worldwide data centers. The advent of external data centers for co-location and of the cloud is only exacerbating the data mobility challenges. I have spent the last quarter discussing data mobility both with clients and internally among our product and engineering teams. Data mobility is a challenge that will become more prevalent as the usage of cloud computing grows.

If network capacity were the only limitation, then the problem would be far more straightforward. The ability to enact policy-based data movement into and out of the cloud is a must, as is the ability to delete data in the cloud remotely. Remote delete from outside the cloud brings its own “legalese” challenges as well.

All of this points to a more strategic discussion around how software-defined storage, effective policy-based data movement and a robust data mobility strategy come into play. In the next edition of this blog, I will describe in more detail the efforts Dell EMC and the Isilon product are making to enable customers and their usage of hybrid analytics.

Keith Manthey is the CTO for Analytics at Dell EMC | Emerging Technologies Division.