by Thor Olavsrud

8 tips to get more bang for your big data convergence bucks

Feature
Aug 04, 2016
Analytics | Big Data | Business Intelligence

CIOs can achieve cost savings and productivity gains from the convergence of development, IT ops and BI strategy.

How technology convergence can help CIOs do more with less

CIOs and other IT decision-makers are used to having to do more with less. In the world of big data, they may be able to achieve orders-of-magnitude cost savings and productivity gains due to the convergence of development, IT ops and business intelligence (BI) strategy, exploiting advancements in open source software, distributed computing, cloud economics and microservices development.

“CIOs have been told to ‘do more with less’ so often that it has become part of their DNA, and not really open to much debate,” says Jack Norris, senior vice president of Data and Applications at MapR Technologies. “So instead, let’s consider the next best thing: getting twice the bang for your buck by taking advantage of converging technologies and skill sets; in other words, getting your data and applications platform to do double or triple duty in order to reduce costs, complexity and effort.”

Norris offers eight tips to help you identify how technology convergence can help.

Recycle enterprise protocols

Yes, new tools, techniques and APIs will inevitably be part of your plans, but Norris says CIOs and enterprise architects should make sure to seek out linkages between new approaches and established enterprise standards like SQL, NFS, LDAP and POSIX.

“You’ve paid for the expertise and these standards have endured for decades,” he says. “It’s not time to toss them out for the shiny new thing until it’s blindingly obvious that you should do so. Chances are, there is either an Apache project or an enterprising software vendor who can help you bridge the old and new worlds.”
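
A minimal sketch of what that bridging can look like in practice, assuming PySpark is available and using invented paths and column names: files exposed through a familiar NFS/POSIX mount point are queried with the ANSI SQL the team already knows, via Spark SQL.

```python
# Hypothetical example: plain SQL over files reachable through an NFS mount.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-over-nfs").getOrCreate()

# Data exposed through a familiar POSIX/NFS path (placeholder location)
orders = spark.read.parquet("/mnt/nfs/warehouse/orders")
orders.createOrReplaceTempView("orders")

# The team keeps writing the SQL it already knows
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")
top_customers.show()
```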

Spark and Hadoop, together or apart

Apache Hadoop helped jumpstart the revolution in modern big data analytics, but Apache Spark has begun stealing the show when it comes to powering today’s data-driven applications.

“Developed well after Hadoop, Spark can be run on top of Hadoop, but it can also be run as a standalone cluster,” Norris says. “Spark is now the preferred development platform over Hadoop’s MapReduce model, but the data management capabilities of Hadoop may convince you to keep the two together. Whatever your choice, protection of the data is paramount. Applications can be restarted, but lost or corrupted data is just lost.”
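
To make the “together or apart” choice concrete, here is a hedged PySpark sketch in which the cluster URLs and paths are placeholders: the same job can run as a standalone, local Spark application, or be submitted to a Hadoop cluster and read from HDFS, where Hadoop handles placement and replication of the data.

```python
# Illustrative only: the same code, standalone or on a Hadoop cluster.
from pyspark.sql import SparkSession

# Option 1: standalone/local Spark, no Hadoop cluster required
spark = SparkSession.builder.master("local[*]").appName("standalone").getOrCreate()
events = spark.read.json("/data/events/*.json")

# Option 2: the same job submitted to a Hadoop cluster
# (e.g. `spark-submit --master yarn job.py`) simply points at HDFS,
# letting Hadoop manage data placement and replication.
# events = spark.read.json("hdfs://namenode:8020/data/events/*.json")

print(events.count())
```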

Avoid cluster sprawl

The IT function is no stranger to computing clusters, but today’s environment can easily lead to “clusters of clusters.” Spark and Hadoop are often deployed in separate clusters, and Kafka streaming, clustered file systems for managing files, a Node.js front end and more can all add to the sprawl.

“Scale-out clustering is arguably one of the fundamental underpinnings of big data,” Norris says. “But each cluster may have its own security model, administrative interface, data format, rules for persistence, and, oh yeah, separate hardware! This can quickly lead you back to the technology silos you are trying to avoid. Look to implementations that enable you to consolidate or converge clusters into a single platform, or at least the minimum number of platforms.”
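
One way to act on that advice, sketched here with assumed broker addresses, topic names and paths: let a single Spark deployment cover both the Kafka stream ingestion and the batch analytics that might otherwise end up on separate clusters. This assumes the spark-sql-kafka connector package is on the classpath.

```python
# Illustrative consolidation: one Spark deployment for streaming and batch.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("converged-cluster").getOrCreate()

# Streaming side: land Kafka events continuously as Parquet files
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clickstream")
          .load())
query = (stream.selectExpr("CAST(value AS STRING) AS event")
         .writeStream
         .format("parquet")
         .option("path", "/data/clickstream")
         .option("checkpointLocation", "/checkpoints/clickstream")
         .start())

# Batch side: the same cluster queries whatever the stream has landed so far
print(spark.read.parquet("/data/clickstream").count())
```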

Data warehouse on a lake

Despite some suggestions to the contrary, the data warehouse isn’t dead, but data lakes have become an attractive alternative — often the first, most common big data use case a given IT organization tackles.

“One of the first benefits that customers realize from data lakes is simply better visibility into what the company ‘knows,'” Norris says. “The immediate windfall of that visibility is a more complete and nuanced customer 360 model. This often translates into either better, more informed marketing and selling, or a more accurate model for predicting and preventing fraud, waste or abuse.”
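
As a purely hypothetical illustration of that customer 360 payoff, with invented paths and columns, a data lake lets you join raw datasets from different source systems in place rather than waiting on a warehouse load.

```python
# Hypothetical customer 360 join across raw data-lake datasets.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("customer-360").getOrCreate()

crm = spark.read.parquet("/lake/crm/customers")
web = spark.read.json("/lake/web/clickstream")
support = spark.read.csv("/lake/support/tickets.csv", header=True)

# One view of the customer, assembled where the data already lives
profile = (crm.join(web, "customer_id", "left")
              .join(support, "customer_id", "left")
              .groupBy("customer_id", "segment")
              .count())
profile.show()
```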

Consider HTAP

Hybrid Transaction/Analytical Processing (HTAP) is a term coined by research firm Gartner to refer to next-generation data platforms capable of both online transaction processing (OLTP) and online analytical processing (OLAP) without the need for data duplication.

“Hadoop and new analytics are already laying siege to data warehouses and are even starting to replace relational databases for some types of transactional workloads,” Norris says. “Some organizations find the path to HTAP is through using document database technology, which enables OLTP and OLAP operations without a costly data transformation step. Don’t weep for Oracle just yet, but the logical and physical separation of OLTP and [data warehouse] workloads will continue to be challenged and eroded by new data management and analytical methods as time passes.”
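
This is not Norris’s specific stack, just an assumed illustration of the HTAP idea using a document database: the same collection accepts transactional writes and answers analytical aggregations with no ETL hop in between. It assumes pymongo and a reachable MongoDB instance; the names are made up.

```python
# Illustrative HTAP-style pattern on a document database.
from datetime import datetime, timezone
from pymongo import MongoClient

orders = MongoClient("mongodb://localhost:27017")["shop"]["orders"]

# OLTP-style write path
orders.insert_one({"customer_id": "c-42", "amount": 129.0,
                   "ts": datetime.now(timezone.utc)})

# OLAP-style aggregation over the very same documents, no transformation step
pipeline = [
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
    {"$limit": 10},
]
for row in orders.aggregate(pipeline):
    print(row)
```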

Event streams as a system of record

With the demand for data in motion continuing to expand each day, organizations are increasingly focusing on event streams.

“A lot of the talk sensibly centers around streaming analytics, triggers and alerting and complex event processing (CEP),” Norris says. “But some companies are beginning to look at data streams as a way to capture a time-stamped record of data interactions between systems and companies. However, for the slightly less sexy topics of data provenance, lineage, persistence and lifecycle, creating an immutable record of all data interactions can be highly valuable.”
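
A minimal sketch of that system-of-record idea, with a placeholder broker and topic: every cross-system interaction is appended to a Kafka topic as a time-stamped event that is never updated in place. The example uses the kafka-python client.

```python
# Append-only, time-stamped events as a record of cross-system interactions.
import json
from datetime import datetime, timezone
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "ts": datetime.now(timezone.utc).isoformat(),
    "source": "billing",
    "target": "crm",
    "action": "invoice.created",
    "payload": {"invoice_id": "inv-1001"},
}
producer.send("system-of-record", value=event)  # appended, never rewritten
producer.flush()
```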

Hybrid cloud

Hybrid clouds have been around for years at this point, but the concept is taking on even more significance in a big data world.

“One of the basic tenets of Hadoop and distributed computing is the notion of moving the compute to the data, rather than the reverse,” Norris says. “The sheer volume of data now being collected is enough of a reason, but another is the growing preponderance of commercial data sources and the increasing chance that companies will rely on external sources of data for their analytics and applications. This suggests that you look for data and application platforms that can operate cooperatively both in the cloud and behind the firewall.”
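
A sketch, under assumed paths and connector configuration, of a platform working both in the cloud and behind the firewall: one Spark job joins an external dataset held in object storage with on-premises data in HDFS. It assumes the hadoop-aws (s3a) connector is configured; the bucket and cluster names are placeholders.

```python
# Illustrative hybrid read: external cloud data joined with on-prem data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hybrid-cloud").getOrCreate()

external = spark.read.parquet("s3a://vendor-bucket/market-data/")        # cloud-side source
internal = spark.read.parquet("hdfs://namenode:8020/warehouse/trades/")  # behind the firewall

external.join(internal, "ticker").groupBy("ticker").count().show()
```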

Analyzing in place

Transforming and moving data often requires an enormous amount of time and effort. There are situations, Norris says, where you can cut out that time and cost.

“Using Spark, Apache Drill or other in-memory processing technologies provides an opportunity to avoid data movement, ETL operations and other data transformations while still exploiting the schema-on-read approach to analytics that is a hallmark of the Hadoop platform,” he says. “Note that as always, there are network and/or disk latencies that come into play when reading data into memory. However, if you have invested in an enterprise-grade distributed file system, it is another interesting weapon in your analytics arsenal.”
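
A hedged sketch of that schema-on-read, analyze-in-place approach, shown here with PySpark rather than Drill and using invented paths and fields: raw JSON is queried where it sits, with the structure discovered at read time and no ETL or copy step.

```python
# Illustrative schema-on-read query over raw files, with no ETL step.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analyze-in-place").getOrCreate()

# No upfront schema and no transformation jobs: structure is inferred on read
logs = spark.read.json("/data/raw/app-logs/*.json")
logs.createOrReplaceTempView("logs")

spark.sql("""
    SELECT level, COUNT(*) AS events
    FROM logs
    GROUP BY level
    ORDER BY events DESC
""").show()
```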