Disk storage is a lot like closet space—you can never have enough. Nowhere is this truer than in the world of big data. The very name—"big data"—implies more data than a typical storage platform can handle. So where exactly does this leave the ever-vigilant CIO? With a multitude of decisions to make and very little information to go by.
However, wading through the storage options for big data does not have to be an impossible journey. It all comes down to combining some basic understanding of the challenge with a little common sense and a sprinkle of budgetary constraint.
What Makes Big Data a Big Deal
First of all, it is important to understand how big data differs from other forms of data and how the associated technologies (mostly analytics applications) work with it. By itself, big data is a generic term that simply means there is too much data to deal with using standard storage technologies. However, there is much more to it than that: big data can consist of terabytes (or even petabytes) of information that combines structured data (databases, logs, SQL and so on) with unstructured data (social media posts, sensor readings, multimedia). What's more, much of that data can lack indexes or other organizational structures, and may consist of many different file types.
That circumstance greatly complicates dealing with big data. The lack of consistency eliminates standard processing and storage techniques from the mix, while the operational overhead and sheer volume of data make it difficult to efficiently process using the standard server and SAN approach. In other words, big data requires something different: its own platform, and that is where Hadoop comes into the picture.
Hadoop is an open source project that offers a way to build a platform that consists of commodity hardware (servers and internal server storage) formed into a cluster that can process big data requests in parallel. On the storage side, the key component of the project is the Hadoop Distributed File System (HDFS), which has the capability to store very large files across multiple members in a cluster. HDFS works by creating multiple replicas of data blocks and distributing them across compute nodes throughout a cluster, which facilitates reliable, extremely rapid computations.
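To make the replication idea concrete, here is a minimal Python sketch of how a file might be cut into blocks and each block copied to several nodes. It is an illustration only, not HDFS code: the function name place_blocks and the node names are hypothetical, the round-robin placement is a simplification (real HDFS placement also takes rack topology into account), and the 128 MB block size and replication factor of 3 are assumed here as commonly cited defaults that any given cluster may override.

```python
import math

def place_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Return {block_index: [nodes holding a replica]} for one file.

    Simplified round-robin placement (an assumption for illustration);
    it only guarantees that replicas of a block land on distinct nodes.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed node count")
    # A file is stored as fixed-size blocks; the last one may be partial.
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Start at a different node for each block so load spreads out.
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement

# Hypothetical four-node cluster storing a 1,000 MB file.
layout = place_blocks(1000, ["node1", "node2", "node3", "node4"])
for block, holders in layout.items():
    print(f"block {block}: {', '.join(holders)}")
```

Because every block lives on more than one node, a job can usually be scheduled on a server that already holds the data it needs, and the loss of a single server does not lose any block.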
All things considered so far, it would seem that the easiest way to build an adequate storage platform for big data would be to purchase a set of commodity servers and equip each with a few terabyte-level drives and then let Hadoop do the rest. For a few smaller enterprises, it may be just as simple as that. However, once processing performance, algorithm complexity and data mining enter the picture, a commodity approach may not be sufficient to guarantee success.
The Fabric of Your Storage
It all comes down to the fabric involved and the performance of the network. For organizations frequently analyzing big data, a separate infrastructure may be warranted, simply because as the number of compute nodes in a cluster grows, so does the communication overhead. Typically, a multinode compute cluster using HDFS will create a great deal of traffic across the network backbone while processing big data. That occurs because Hadoop spreads the data (along with the compute resources) across the member servers of the cluster.
In most cases, server-based local storage is not the picture of efficiency, which is why many organizations turn to SANs that use a high-speed fabric to maximize throughput. However, the SAN approach might not lend itself well to big data implementations, especially those using Hadoop. A SAN centralizes the data on its own spindles, which means that every compute server must reach across the fabric to the same SAN to retrieve data that Hadoop would normally distribute across the cluster.
In fact, when comparing local server storage to SAN-based storage for Hadoop, local storage wins in two very important ways: cost and overall performance. Simply put, raw disks without RAID placed in each compute member will collectively outperform a SAN when processing requests under HDFS. However, there is a downside to server-based disks, and that comes in the form of scalability.
The question becomes how to add capacity when the servers rely on local storage. Typically, there are two ways to handle that dilemma. The first is to add servers, each bringing its own local storage. The second is to increase the capacity of the existing member servers. Both options require the purchase and provisioning of hardware, which can introduce downtime and may require a redesign of the architecture in place. Nonetheless, either approach should prove significantly cheaper than adding capacity to a SAN, and that is a notable benefit.
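The arithmetic behind that decision can be sketched in a few lines of Python. This is a rough planning aid under stated assumptions, not a sizing tool: the function names are hypothetical, the replication factor of 3 is assumed, and the calculation ignores space needed for the operating system, temporary job output and free-space headroom, all of which reduce usable capacity further.

```python
import math

def usable_capacity_tb(nodes, disks_per_node, disk_tb, replication=3):
    """Usable capacity once every block is stored `replication` times."""
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb / replication

def nodes_needed(target_tb, disks_per_node, disk_tb, replication=3):
    """Smallest node count whose usable capacity reaches target_tb."""
    raw_per_node_tb = disks_per_node * disk_tb
    # Each usable terabyte consumes `replication` raw terabytes.
    return math.ceil(target_tb * replication / raw_per_node_tb)

# Hypothetical cluster: 10 servers, each with four 2 TB drives.
print(round(usable_capacity_tb(10, 4, 2.0), 1))  # 26.7 (TB usable from 80 TB raw)
# Servers of the same build required to hold 100 TB of data.
print(nodes_needed(100, 4, 2.0))                 # 38
```

The same functions cover both scaling options: raise the nodes argument to model adding servers, or raise disks_per_node or disk_tb to model upgrading the servers already in place.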
However, there are other options for storage when it comes to Hadoop. For example, several leading storage vendors are building storage appliances specifically designed for Hadoop and big data analytics. That list includes EMC, which is now offering Hadoop solutions such as the Greenplum HD Data Computing Appliance. Oracle is looking to take it one step further with the Exadata series of appliances, which offer compute power as well as high-speed storage.
Finally, another option exists for those looking to leverage big data, and that comes in the form of the cloud. Companies such as Cloudera, Microsoft, Amazon and many others are offering cloud-based big data solutions, which provide processing power, storage and support.
Making a decision about a big-data storage solution comes down to how much space is needed, how frequently analytics will be performed and what type of data is to be processed. Those factors, as well as security, budget and processing time, should all be considered before investing in big data.
That said, it is probably safe to say that a pilot project is a good starting point, and commodity hardware is a low-cost investment for a big-data pilot.
Frank J. Ohlhorst is a New York-based technology journalist and IT business consultant.