What is big data? A quick Google search reveals it has many definitions – mostly including a bunch of “V”s: the volume (amount) of data, the velocity (speed) at which it is collected, and the variety of the information. Unfortunately, these definitions are ambiguous, and they can lead to more questions than they answer.
I generally prefer to think of big data as anything that can’t be solved with known technologies – an expanding universe of componentry, data and services that can be assembled in countless ways to solve problems. What matters most, however, is identifying what big data means to you. To help you answer this question, let’s take a look at an essential part of the enterprise big data universe.
You do have use cases, don’t you?
Use cases drive the “why” of big data. A best practice is to cultivate your use cases, triage them, rank them in order of priority, and then design solutions for them. A few of the many common use cases include:
- Customer 360
- Next best offer
- Customer defection
- Fraud detection
Each of these use cases can be addressed using a variety of methods.
Big data solution architectures
The big data universe holds a very large number of components and many ways to assemble them – some good, some bad, with many variations in between. Often, combinations of components will favor certain use cases over others: a good solution architecture solves current problems, minimizes cost, expedites time to market and/or remains pliable to future unknown needs. As a result, arriving at a custom solution architecture design takes a unique combination of technical skills, design, requirements, tradeoffs and many other factors.
Hadoop, an open-source software framework for distributed storage and processing of big data that allows for massively parallel computing, is at the center of many big data solution architectures. It is a Swiss army knife of perhaps 50 components that can be assembled in innumerable ways. These components include:
- Hadoop Distributed File System (HDFS) – provides persistent storage with sequential access
- Apache Hive – a SQL-like query language for Hadoop, albeit one that executes as batch MapReduce jobs underneath
- Apache Pig – allows construction of data flows that read from and write to a variety of sources, such as HDFS and the HBase NoSQL open-source database
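Since Hive compiles queries down to batch MapReduce, it can help to see the map/shuffle/reduce pattern in miniature. Here is a minimal Python sketch of a MapReduce-style word count – no Hadoop involved; it only illustrates the processing model, and all names are mine:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (key, 1) pairs, as a Hadoop mapper would."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values into a final result."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big pipelines", "big queries"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

A Hive query like `SELECT word, COUNT(*) ... GROUP BY word` gets translated into essentially these three phases, distributed across the cluster.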
What are NoSQL databases?
NoSQL databases are designed for distributed scale-out, along with a flexible, non-static schema. Generally, their loose consistency model allows for enormous volumes of reads and writes across many distributed nodes. Data resiliency is achieved by replicating data across multiple nodes, typically three.
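Here is a toy sketch of how a NoSQL store might place three replicas of each key across a set of nodes. The hash-to-ring placement below is illustrative only, not any specific product’s algorithm, and all node names are hypothetical:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]
REPLICATION_FACTOR = 3  # the "typically three" replicas mentioned above

def replica_nodes(key, nodes=NODES, n=REPLICATION_FACTOR):
    """Pick n distinct nodes for a key: hash the key to a starting
    position, then take the next n-1 neighbors on the ring."""
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(n)]

placement = replica_nodes("customer:42")
```

The point of ring-style placement is that losing one node leaves two surviving copies of every key, and adding a node only moves a fraction of the keys.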
Most NoSQL databases are open source. While they sacrifice many features found in relational database management systems (RDBMS) – things like joins, complex queries and management tools – they are important because they provide fast, random access to key subsets of data. They also enable what is known as “polyglot” persistence: solving a complex big data problem by breaking it into segments, applying the database model best suited to each segment, and then aggregating the results. This type of hybrid environment is often used as part of a data pipeline.
NoSQL databases follow a taxonomy of four basic types. These types, along with some examples, are:
- Key Value – Redis, Riak
- Wide Column – HBase, Cassandra, Azure Cosmos DB, DataStax Enterprise
- Document – MongoDB, Couchbase, DynamoDB, Azure Cosmos DB, MarkLogic
- Graph – Neo4j, TigerGraph, DataStax Graph (formerly Titan)
All these databases build on a data model known as “key value,” which differs from the model used by default in relational databases and can make some operations faster in NoSQL. Some products also support multiple data models across the basic taxonomy types; these are termed “multi-model.” (Beware of vendors that claim multi-model support when they don’t have a product that genuinely fits.)
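To make the key-value model concrete, here is a minimal sketch. Instead of joining normalized tables, the application fetches an entire value (here, a customer record) by a single key – that single-lookup access path is what makes point reads fast. The store, keys and record fields are all hypothetical:

```python
# A plain dict stands in for the store; real products layer
# distribution, persistence and replication on the same get/put idea.
store = {}

def put(key, value):
    """Write the whole value under one key."""
    store[key] = value

def get(key):
    """Read the whole value back with one lookup - no join."""
    return store.get(key)

put("customer:42", {"name": "Ada", "segment": "gold", "offers": ["upgrade"]})
record = get("customer:42")
```

The trade-off, as noted above, is that anything resembling a join or an ad hoc complex query becomes the application’s problem rather than the database’s.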
There are hundreds of NoSQL databases out there, and products die – and are born – practically every week. A really good list of 225+ NoSQL databases (and counting) may be found at NOSQL: Your Ultimate Guide to the Non-Relational Universe! Usefully, the site lists each database’s category, implementation language, interfaces, links and more.
Putting it all together
As an example, I’ll build on a use case I implemented in a previous life. The need is to build a Customer 360 view, and sources include a variety of databases, such as Pivotal Greenplum, PostgreSQL, IBM Db2, Oracle Database and Microsoft SQL Server. In addition,
- The targeted persistence spans three different data stores.
- Ingested, non-transformed data is stored in HDFS.
- A different directory is used for each load of each source.
So, five sources, loaded for seven days, gives us 35 directories.
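A quick sketch of how that raw-zone directory layout might be generated – the base path, source names and start date are hypothetical:

```python
from datetime import date, timedelta

sources = ["greenplum", "postgresql", "db2", "oracle", "sqlserver"]
start = date(2024, 1, 1)  # hypothetical start of the seven-day load window

# One directory per source per daily load: 5 sources x 7 days = 35.
dirs = [
    f"/data/raw/{source}/{(start + timedelta(days=d)).isoformat()}"
    for source in sources
    for d in range(7)
]
```

Partitioning raw landings by source and load date like this keeps each ingest idempotent: a failed load can be re-run into its own directory without touching any other day’s data.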
The next level of persistence is HBase, which takes a subset of each source and loads it into HBase for queries, downstream processing and other analytics.
The final form of persistence is egesting a householded, matched, quality-checked, enhanced customer record into a relational enterprise data warehouse (EDW) – Oracle in this case.
- Apache Sqoop is used to periodically read data from an RDBMS source and land it in HDFS.
- Apache Hive queries are run for some reporting.
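The Sqoop step above is typically just a CLI invocation run on a schedule. Here is a hedged sketch of assembling one in Python – the flags are standard `sqoop import` options, but the connection string, table and target directory are placeholders, and the command is built rather than executed:

```python
import shlex

def sqoop_import_cmd(jdbc_url, table, target_dir, num_mappers=4):
    """Build a 'sqoop import' command line. Flag names are real Sqoop
    options; all values here are illustrative placeholders."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,        # JDBC URL of the RDBMS source
        "--table", table,             # source table to pull
        "--target-dir", target_dir,   # HDFS landing directory
        "--num-mappers", str(num_mappers),  # parallel extract tasks
    ]

cmd = sqoop_import_cmd(
    "jdbc:oracle:thin:@//dbhost:1521/ORCL",  # placeholder connection
    "CUSTOMERS",
    "/data/raw/oracle/2024-01-01",
)
printable = " ".join(shlex.quote(part) for part in cmd)
```

Pointing `--target-dir` at the per-source, per-day directory from the raw zone keeps each scheduled pull isolated, matching the 35-directory layout described earlier.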
Ultimately, most end users will consume the data via standard tools, like Tableau and IBM Cognos, from the EDW. Only power users have access to part of the Hadoop ecosystem.
Challenges in big data come in many forms. Take your most pressing challenges and brainstorm solutions using Hadoop, NoSQL, Apache Kafka and more. Build it, test it, implement it. Rinse and repeat for the next use case.
To learn more
For more perspectives on managing and harnessing the value of big data analytics, visit Dell Technologies’ Ready Solutions for Data Analytics page and join the Dell EMC Everything Big Data Community.