At Dell SecureWorks, big data is getting bigger all the time. We process more than 185 billion logs per day. Our data volumes are growing 66% every year. We are just a few years away from processing trillions of logs per day and needing to store hundreds of petabytes of data. These enormous and ever-growing data volumes were one of the drivers for our move to the Dell | Cloudera Hadoop big data platform.
In this first post in a two-part blog series, I will summarize our journey to Hadoop, in the hope of providing insight to others who are exploring or starting down the same path. But, first, let’s start with a little background on Dell SecureWorks, for those who are not familiar with what we do.
Dell SecureWorks is the world leader in cyber security solutions, operating in 61 countries. Among other services, we provide an early warning system for evolving cyber threats; this enables clients to prevent, detect, rapidly respond to, and predict cyber-attacks. Our security monitoring solution uses data gathered from client devices to detect threat activity within their environments. Our researchers use the data we collect in Hadoop to understand threat actors' tactics, determine their motives, develop countermeasures, and update our intelligence.
Now let’s get to our journey to Hadoop. When we were starting out, like many companies, we chose a traditional RDBMS for our data needs. Our data volumes were low enough that this initially worked fine. Within a few years, however, we started to hit the limits of the technology: it became very challenging and expensive to get the performance we required. We knew we needed a different solution, but in the mid-2000s there really was not much to choose from. We chose to build our own solution, modeled after the concepts in Google’s white paper “Bigtable: A Distributed Storage System for Structured Data.”
The technology started off very simple, but over time developed into a sophisticated big data solution with advanced features, including high-performance writes, data storage optimized for query performance and disk space, a SQL-like query interface, and even high availability. The technology was serving us well at the time, but in 2009, as we looked toward a 10x, 50x, and 100x future, we decided to migrate to an industry-supported stack rather than continue to invest in our proprietary system.
One year later we began our journey with Cloudera Hadoop. The Hadoop platform offered the extreme scalability that we needed, and it gave us the confidence that comes with a proven technology backed by many large enterprises. Hadoop has the added advantage that it is maintained by a large community of developers, many of whom are solving very similar problems.
Today, we collect log data from our clients and extract normalized events from the log data. Both the log data and the normalized events are sent back to our data center using proprietary technology found in our counter-threat appliance, and then stored in our 3 PB Hadoop cluster. We then use a series of MapReduce and Spark jobs to aggregate the data for faster querying, to enrich the data, and to perform incident detection. Once the data is at rest we use a combination of MapReduce, Hive, Impala, and Spark to correlate related data, generate models for our machine learning systems, generate analytics, further enrich the data, and perform forensic analysis. Hadoop has given us a rich set of tools capable of tackling almost any problem we can imagine, and it can do this at extreme scales.
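The core of the pipeline above is simple to state: raw logs are parsed into normalized events, and the events are pre-aggregated so that later queries are cheap. Here is a minimal plain-Python sketch of those two steps. The log format, field names, and aggregation key are hypothetical, invented for illustration; in production, collection uses our proprietary appliance technology and these steps run as MapReduce and Spark jobs over petabytes, not as in-memory functions.

```python
from collections import Counter
from datetime import datetime, timezone

def normalize(raw_line):
    """Parse one raw log line into a normalized event dict.

    The pipe-delimited format and the field names are hypothetical,
    chosen only to illustrate the normalization step.
    """
    ts, client_id, event_type, detail = raw_line.split("|", 3)
    return {
        "timestamp": datetime.fromtimestamp(int(ts), tz=timezone.utc),
        "client_id": client_id,
        "event_type": event_type,
        "detail": detail,
    }

def aggregate(events):
    """Count events per (client, event type) key.

    This is the kind of pre-aggregation that makes downstream
    querying fast: queries hit small rollups instead of raw logs.
    """
    return Counter((e["client_id"], e["event_type"]) for e in events)

# Toy input standing in for the raw log stream.
raw_logs = [
    "1450000000|client-a|login_failure|bad password",
    "1450000005|client-a|login_failure|bad password",
    "1450000010|client-b|port_scan|tcp/445",
]

counts = aggregate(normalize(line) for line in raw_logs)
```

In a distributed setting the same shape appears as a map step (`normalize`) followed by a keyed reduce (`aggregate`), which is why the pattern translates directly to MapReduce or to Spark's `map` and `reduceByKey`.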
Looking ahead, we are constantly assessing the state of the Hadoop ecosystem and considering whether to adopt any of the newest technologies. We are actively looking at Kafka for messaging, Spark Streaming for stream processing, HBase to improve our enrichment capabilities and certain query response times, and Spark MLlib for our machine learning systems. The great thing about Hadoop is that the ecosystem is constantly improving and introducing exciting new solutions to address the latest challenges.
In a follow-on post, I will explore five key lessons from our Hadoop journey. These lessons cover some things to keep in mind should you embark on your own Hadoop journey. In the meantime, you can learn more about the Hadoop environment at Dell SecureWorks by reading the case study “Helping customers stay secure.”
Jim Birmingham is director of engineering at Dell SecureWorks. He is responsible for event processing, big data analytics, cloud security, health monitoring, machine learning systems, and data science.
©2016 Dell Inc. All rights reserved. Dell, the DELL logo, the DELL badge and PowerEdge are trademarks of Dell Inc. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims proprietary interest in the marks and names of others.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.