by Thor Olavsrud

How to Secure Big Data in Hadoop

Nov 08, 2012 | 6 mins
Big Data | Business Intelligence | Data and Information Security

The promise of big data is enormous, but it can also become an albatross around your neck if you don't make security of both your data and your infrastructure a key part of your big data project from the beginning. Here are some steps you can take to avoid big data pitfalls.

Big data promises to help organizations better understand their businesses, their customers and their environments to a degree that could previously only be imagined.

The potential is enormous—as businesses transform into data-driven machines, the data held by your enterprise is likely to become the key to your competitive advantage. As a result, security for both your data and your infrastructure becomes more important than ever before.

Big Data Could Be Toxic Data If Lost

In the case of data that provides a competitive advantage, the need for security should be obvious. If you lose that data, or it winds up in the hands of a competitor, your advantage is lost. But worse, it could become a liability.

In many cases, organizations will wind up with what Forrester Research calls “toxic data.” For instance, imagine a wireless company that is collecting machine data—who’s logged onto which towers, how long they’re online, how much data they’re using, whether they’re moving or staying still—that can be used to provide insight into user behavior.

That same wireless company may have lots of user-generated data as well: credit card numbers, social security numbers, data on buying habits and patterns of usage—any information that a human has volunteered about their experience. The capability to correlate that data and draw inferences from it could be valuable, but it is also toxic because if that correlated data were to go outside the organization and wind up in someone else’s hands, it could be devastating both to the individual and the organization.

With Big Data, Don’t Forget Compliance and Controls

“Most of the big data projects we’ve been exposed to seem kind of frenetic,” says Larry Warnock, CEO of Austin, Texas-based Gazzang, which specializes in data security solutions and operational diagnostics. “There seems to be a mad dash to access this data, and some of the old-school compliance and controls have sort of been left for phase two of the project. If you go so fast that you lose sight of basic best practices, companies may get themselves into a bit of a bind.”

“Hadoop and similar NoSQL data stores enable any organization—large or small—to collect, manage and analyze immense data sets, but these nascent technologies were not necessarily designed with comprehensive security in mind,” adds Dustin Kirkland, CTO of Gazzang. “As these repositories grow in popularity and size, the potential for sensitive data to get swept up and stored is significant.”

9 Tips for Securing Big Data

Here are some specific steps you can take to secure your big data:

  1. Think about security before you start your big data project. You don’t lock your doors after you’ve already been robbed, and you shouldn’t wait for a data breach incident before you secure your data. Your IT security team and others involved in your big data project should have a serious data security discussion before installing and feeding data into your Hadoop cluster.
  2. Consider what data may get stored. If you’re planning to use Hadoop to store and run analytics against data subject to regulation, you will likely need to comply with specific security requirements. Even if the data you’re storing doesn’t fall under regulatory jurisdiction, assess your risks—including loss of good will and potential loss of revenue—if data like personally identifiable information (PII) is lost.
  3. Centralize accountability. Right now, your data probably resides in diverse organizational silos and data sets. Centralizing the accountability for data security ensures consistent policy enforcement and access control across these silos.
  4. Encrypt data both at rest and in motion. Add transparent data encryption at the file layer. SSL encryption can protect big data as it moves between nodes and applications. “File encryption addresses two attacker methods for circumventing normal application security controls,” says Adrian Lane, analyst and CTO of security research and advisory firm Securosis. “Encryption protects in case malicious users or administrators gain access to data nodes and directly inspect files, and it also renders stolen files or disk images unreadable. It is transparent to both Hadoop and calling applications and scales out as the cluster grows. This is a cost-effective way to address several data security threats.”
  5. Separate your keys and your encrypted data. Storing your encryption keys on the same server as your encrypted data is similar to locking your front door and then leaving the keys dangling from the lock. A key management system allows you to store your encryption keys safely and separately from the data you’re trying to protect.
  6. Use the Kerberos network authentication protocol. You need to be able to govern which people and processes can access data stored within Hadoop. “This is an effective method for keeping rogue nodes and applications off your cluster,” Lane says. “And it can help protect web console access, making administrative functions harder to compromise. We know Kerberos is a pain to set up, and (re-)validation of new nodes and applications takes work. But without bi-directional trust establishment, it is too easy to fool Hadoop into letting malicious applications into the cluster, or into accepting the introduction of malicious nodes—which can then add, alter or extract data. Kerberos is one of the most effective security controls at your disposal, and it’s built into the Hadoop infrastructure, so use it.”
  7. Use secure automation. You’re dealing with a multi-node environment, so deployment consistency can be difficult to ensure. Automation tools like Chef and Puppet can help you stay on top of patching, application configuration, updating the Hadoop stack, and managing trusted machine images, certificates and platform discrepancies. “Building the scripts takes some time up front but pays for itself in reduced management time later, and additionally ensures that each node comes up with baseline security in place.”
  8. Add logging to your cluster. “Big data is a natural fit for collecting and managing log data,” Lane says. “Many web companies started with big data specifically to manage log files. Why not add logging onto your existing cluster? It gives you a place to look when something fails, or if someone thinks perhaps you’ve been hacked. Without an event trace you are blind. Logging MR requests and other cluster activity is easy to do and increases storage and processing demands by a small fraction, but the data is indispensable when you need it.”
  9. Implement secure communication between nodes and between nodes and applications. To do this, you’ll need an SSL/TLS implementation that protects all network communications rather than just a subset. Some Hadoop providers, like Cloudera, already do this, as do many cloud providers. If your setup doesn’t have this capability, you’ll need to integrate the services into your application stack.
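For tips 6 and 9, the Hadoop configuration itself is where these controls are switched on. As an illustrative sketch—property names follow the Apache Hadoop documentation, but exact settings vary by distribution and version—`core-site.xml` can require Kerberos authentication and encrypted (“privacy”) RPC, while `hdfs-site.xml` can encrypt block data as it moves between nodes:

```xml
<!-- core-site.xml: require Kerberos and encrypt RPC traffic -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>

<!-- hdfs-site.xml: encrypt block data in transit between nodes -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
```

Kerberos itself also requires a KDC and per-service keytabs, which is the setup work Lane refers to; the configuration above only tells Hadoop to demand and use those credentials.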
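The key-separation pattern in tip 5 can be sketched in a few lines of Python. This is a toy illustration, not a real implementation: the `KeyServer` class and the XOR keystream are hypothetical stand-ins for a proper key-management service and a vetted cipher such as AES. The point is the structure—the data store holds only the ciphertext and a *wrapped* data-encryption key, while the key-encryption key never leaves the separate key server.

```python
import hashlib
import secrets

def xor_keystream(key: bytes, data: bytes) -> bytes:
    """Toy keystream cipher for illustration ONLY -- not real encryption.
    In production, use AES via a vetted crypto library."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

class KeyServer:
    """Stands in for a separate key-management service: it holds the
    key-encryption key (KEK) on a different machine than the data."""
    def __init__(self):
        self._kek = secrets.token_bytes(32)   # never leaves the key server
    def wrap(self, dek: bytes) -> bytes:
        return xor_keystream(self._kek, dek)
    def unwrap(self, wrapped: bytes) -> bytes:
        return xor_keystream(self._kek, wrapped)

def encrypt_for_storage(key_server: KeyServer, plaintext: bytes):
    dek = secrets.token_bytes(32)              # per-file data-encryption key
    ciphertext = xor_keystream(dek, plaintext)
    wrapped_dek = key_server.wrap(dek)         # only the wrapped key is
    return ciphertext, wrapped_dek             # stored next to the data

def decrypt_from_storage(key_server: KeyServer, ciphertext: bytes, wrapped_dek: bytes):
    dek = key_server.unwrap(wrapped_dek)
    return xor_keystream(dek, ciphertext)

ks = KeyServer()
ct, wrapped = encrypt_for_storage(ks, b"ssn=078-05-1120")
assert ct != b"ssn=078-05-1120"                # data at rest is unreadable
```

An attacker who steals the data node gets the ciphertext and the wrapped key, but without the key server’s KEK neither is usable—exactly the “keys dangling from the lock” failure that tip 5 warns against.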
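The logging advice in tip 8 amounts to emitting one structured event per request so that there is an event trace to consult after a failure or suspected breach. A minimal sketch, assuming a hypothetical `log_request` helper and JSON-lines output (a real deployment would hook into Hadoop’s own audit logs or a log shipper rather than hand-roll this):

```python
import json
import logging
import sys
import time

# One JSON object per line: easy to grep, and easy to feed back into the
# cluster itself for analysis, as tip 8 suggests.
audit = logging.getLogger("cluster.audit")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(message)s"))
audit.addHandler(handler)
audit.setLevel(logging.INFO)

def log_request(user: str, operation: str, path: str, allowed: bool) -> str:
    """Record one cluster request as a JSON line; returns the line logged."""
    line = json.dumps({
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "user": user,
        "op": operation,
        "path": path,
        "allowed": allowed,
    })
    audit.info(line)
    return line

log_request("etl_batch", "SUBMIT_JOB", "/data/cdrs/2012-11", True)
log_request("unknown", "READ", "/data/pii", False)
```

The denied `READ` on a PII path is exactly the kind of event that is invisible without such a trace—and, as Lane notes, storing and querying these lines is a natural job for the cluster you already have.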

Thor Olavsrud covers IT Security, Big Data, Open Source, Microsoft Tools and Servers. Follow Thor on Twitter @ThorOlavsrud, and follow @CIOonline on Twitter and Facebook.