As enterprises seek to move into the big data world, digitizing paper documents and saving email communications, Word docs, Excel files and all sorts of other unstructured data in the hope of mining them for actionable business intelligence, they need to address a big problem up front: storage.
“Enterprises have suddenly accumulated petabytes of information,” says Nick Kirsch, director of product management for EMC Isilon. “They’re faced with a similar challenge: They’ve got all this information, how do they make use of it and how do they store it in a scalable architecture?”
One possibility is to scale vertically (scale up). The idea is to make your existing storage nodes larger, faster and/or more powerful by replacing your existing storage devices with new, higher-capacity devices. Consolidating storage infrastructure in this way is attractive, since it simplifies management and reduces the floor space and power consumed. But it's not without problems: It can't easily span multiple locations, it offers little inherent resiliency, and large, high-performance storage devices get expensive in a hurry. And when dealing with the ever-increasing flood of information, the biggest problem is that today's storage devices can get only so big.
“You can build a bigger and bigger single unit controller,” says Kirsch. “But at some point you can’t build that system any bigger; you have to add a second system. You could end up with hundreds of separate units you need to manage.”
Instead, Kirsch says scaling horizontally (scale out) with NAS is the way to go. A scale-out NAS architecture forgoes expensive, high-capacity storage devices for commodity storage components combined into an aggregate storage pool. Instead of making nodes bigger, you add nodes as necessary. The downside is that you can very quickly wind up with a much more complex management environment. But it can span multiple locations and it has a great deal of inherent resiliency. And, perhaps most important from the perspective of managing big data, you can add storage rapidly and cheaply.
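The difference between the two approaches can be captured in a few lines. This is a hypothetical sketch, not any vendor's API: the `StoragePool` class and the node sizes are illustrative. The point is that a scale-out pool grows its aggregate capacity by adding commodity nodes, rather than by replacing one device with a bigger one.

```python
# Hypothetical model of scale-out aggregation (illustrative only):
# capacity grows by adding commodity nodes to a pool instead of
# replacing a single controller with a larger one.

class StoragePool:
    """Aggregate pool of commodity storage nodes."""

    def __init__(self):
        self.nodes = []  # capacity of each node, in TB

    def add_node(self, capacity_tb):
        """Scale out: bolt another commodity node onto the pool."""
        self.nodes.append(capacity_tb)

    @property
    def total_tb(self):
        return sum(self.nodes)

pool = StoragePool()
for _ in range(12):      # a rack of twelve 6TB commodity nodes
    pool.add_node(6)

print(pool.total_tb)     # 72 TB today; add more nodes as data grows
```

The trade-off the article describes shows up even in this toy: each `add_node` call is cheap and incremental, but something (here, the `StoragePool` object) has to present all those nodes as one manageable unit, which is exactly where the management complexity moves.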
“I think the biggest thing that we see, the biggest complaint when it comes to storage is that it’s really easy to manage a single unit, but when you have two or more units it becomes complicated,” Kirsch says.
For big data, NAS is preferable to SAN, Kirsch says, because SAN is not built for unstructured data and file sharing. To use a SAN with file-sharing protocols like NFS or CIFS/SMB, you would have to deploy file servers in front of it, adding management complexity and limiting scalability.
The Five Tenets of Scale-Out NAS
Simplicity comes first in Kirsch’s five tenets of what CIOs should look for in scale-out NAS architecture:
- Simple to scale. “This next generation architecture that they’re looking to move to needs to be simple to scale,” Kirsch says. “If I have a 1TB drive, that’s a volume that I can manage, I can protect and I can replicate. Why can’t I manage 15 petabytes with that same simplicity? It shouldn’t be more complicated just because it’s bigger.” Scale-out NAS architectures can tackle this problem with software management and a virtualization/abstraction layer that makes the nodes behave like a single system.
- Predictable. “The performance needs to be predictable,” Kirsch says. “If I add 6TB this week and 6TB next week, I want that same linear scalability in terms of performance. I don’t want to have to re-architect my application or re-educate my users. It should just scale in a predictable fashion. I want it to be pay as you grow. Don’t make me overinvest today. I know that Moore’s Law is going to give me faster computing next month and that drives are going to get denser over time. Let me take advantage of that in my storage infrastructure. And please, let this be shared symmetric architecture. Don’t force me to understand differences in your architecture. Allow me to scale this system as I need it.”
- Efficient. “Let me leverage all the resources in my storage system, regardless of where they are,” Kirsch says. “Let me get great utilization out of my physical disk drives, not 50 or 55 percent, but over 80 percent of that storage should be utilized for my data. Regardless of where the CPU is or the compute or the cache, let me take advantage of that. Whether the application over here is hot or the application over there, I want the storage system to maximize the performance of that application. And please, integrate tiering into this system.” In other words, you shouldn’t have to move data around to optimize performance or capacity. Scale-out NAS for big data needs to be intelligent enough to automate that for you.
- Available. “This has to be available all the time,” Kirsch says. “Take advantage of an N-way architecture. Allow me to survive more than two failures. Allow me to survive when a rack goes down in my environment. I want this to be on all the time. And let it be flexible. Let me align the availability of the protection of the system with the needs of my business units. If they’re willing to invest more, I can give them greater availability. If the data is less valuable, I can give them less availability.” Boiled down, since a scale-out NAS storage infrastructure is built on commodity hardware, there’s an assumption that hardware will fail and the system has to be designed to deal with a higher rate of hardware failure.
- Enterprise-proven. “As the technology has matured, it’s no longer this side project that’s outside of IT,” Kirsch says. “It’s a key part of IT. It’s got to have snapshots, replication, quotas and all the other traditional IT features. This technology evolved out of HPC roots, but if you’re going to build a scale-out system, ultimately you’ve got to build it to fit into an enterprise environment.”
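The tenets above all lean on software that makes many nodes behave like a single system. A minimal sketch of how that can work, assuming nothing about Isilon's actual implementation: files are assigned to nodes deterministically by hashing their paths, and each file is written to more than one node, so the pool presents one namespace and survives a node failure. The node names and replica count here are hypothetical.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical pool
REPLICAS = 2  # each file lives on two nodes, so one node can fail

def place(path, nodes=NODES, replicas=REPLICAS):
    """Pick `replicas` distinct nodes for a file by hashing its path.

    Deterministic placement means any node can answer "where does this
    file live?" without consulting a central lookup table, which is one
    way an abstraction layer keeps N nodes manageable as one system
    (illustrative only).
    """
    h = int(hashlib.md5(path.encode()).hexdigest(), 16)
    start = h % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

targets = place("/finance/q3-report.xlsx")
assert len(set(targets)) == REPLICAS  # two distinct nodes hold a copy
print(targets)
```

Adding a node to `NODES` grows capacity, and raising `REPLICAS` trades capacity for availability, which is the "align protection with the needs of my business units" flexibility Kirsch describes. Real systems use more sophisticated placement (and erasure coding rather than plain replication) so that adding a node doesn't reshuffle every file, but the principle is the same.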
Thor Olavsrud is a senior writer for CIO.com. Follow him @ThorOlavsrud.