10 hot data analytics trends — and 5 going cold

Big data, machine learning, data science — the data analytics revolution is evolving rapidly. Keep your BA/BI pros and data scientists ahead of the curve with the latest technologies and strategies for data analysis.

Data analytics are fast becoming the lifeblood of IT. Big data, machine learning, deep learning, data science — the range of technologies and techniques for analyzing vast volumes of data is expanding at a rapid pace. To gain deep insights into customer behavior, systems performance, and new revenue opportunities, your data analytics strategy will benefit greatly from being on top of the latest data analytics trends.

Here is a look at the data analytics technologies, techniques and strategies that are heating up and the once-hot data analytics trends that are beginning to cool. From business analysts to data scientists, everyone who works with data is being impacted by the data analytics revolution. If your organization is looking to leverage data analytics for actionable intelligence, the following heat index of data analytics trends should be your guide.

Heating up: Self-service BI

Who: BI/BA Pros, Managers

With self-service BI tools, such as Tableau, Qlik Sense, Power BI, and Domo, managers can obtain current business information in graphical form on demand. While a certain amount of setup by IT may be needed at the outset and when adding a data source, most of the work in cleaning data and creating analyses can be done by business analysts, and the analyses can update automatically from the latest data any time they are opened.

Managers can then interact with the analyses graphically to identify issues that need to be addressed. In a BI-generated dashboard or “story” about sales numbers, that might mean drilling down to find underperforming stores, salespeople, and products, or discovering trends in year-over-year same-store comparisons. These discoveries might in turn guide decisions about future stocking levels, product sales and promotions, and even the building of additional stores in under-served areas.
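As a rough illustration of the drill-down described above, the following sketch computes year-over-year same-store changes and lists the underperforming stores. The store names, sales figures, and the -5% threshold are all invented for the example; a BI tool would run the equivalent query against live data.

```python
# Hypothetical annual sales per store: {store: (last_year, this_year)}.
sales = {
    "Store A": (120_000, 126_000),
    "Store B": (95_000, 85_500),
    "Store C": (0, 40_000),       # opened this year, so no same-store comparison
    "Store D": (150_000, 147_000),
}

def same_store_changes(sales):
    """Return {store: fractional YoY change} for stores with sales in both years."""
    return {
        store: (this_year - last_year) / last_year
        for store, (last_year, this_year) in sales.items()
        if last_year > 0 and this_year > 0
    }

def underperformers(changes, threshold=-0.05):
    """Stores whose YoY change is at or below the assumed threshold, worst first."""
    flagged = [(store, change) for store, change in changes.items()
               if change <= threshold]
    return sorted(flagged, key=lambda item: item[1])

changes = same_store_changes(sales)
for store, change in underperformers(changes):
    print(f"{store}: {change:+.1%}")
```

Excluding stores without a full prior year is what makes this a same-store comparison: growth that comes only from new openings does not mask a decline at existing locations.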

Heating up: Mobile dashboards

Who: BI/BA Pros, Managers, Developers

In a world where managers are rarely at their desks, management tools need to present mobile-friendly dashboards to be useful and timely. Most self-service BI tools already have this feature, but not every key business metric necessarily goes through a BI tool.

For example, a manufacturing plant is likely to have a dedicated QA system monitoring all production lines. Plant managers need to know within minutes whether any line has drifted out of tolerance. That’s easily done with an app that queries the QA database every minute, updates and displays a Shewhart control chart, and optionally sounds an alarm when a line goes out of spec.
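The core check behind such an app might look something like the sketch below. It is a simplified version of a Shewhart chart that takes its center line and three-sigma limits from the sample standard deviation of a baseline run (production charts often estimate sigma from moving ranges or subgroups instead). The readings are made up, and the database polling and the chart display are left out.

```python
import statistics

def control_limits(baseline, sigmas=3.0):
    """Lower limit, center line, and upper limit from in-control baseline data."""
    center = statistics.mean(baseline)
    spread = statistics.stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center, center + sigmas * spread

def out_of_control(measurements, lcl, ucl):
    """Indices of measurements that fall outside the control limits."""
    return [i for i, x in enumerate(measurements) if x < lcl or x > ucl]

# Hypothetical baseline readings for one production line (say, part width in mm).
baseline = [10.01, 9.98, 10.02, 9.99, 10.00, 10.03, 9.97, 10.00, 10.01, 9.99]
lcl, center, ucl = control_limits(baseline)

# Hypothetical latest readings; in the scenario above these would come from
# the QA database once a minute.
latest = [10.00, 10.02, 9.99, 10.12, 10.01]
alarms = out_of_control(latest, lcl, ucl)
if alarms:
    print(f"Out of control at sample(s) {alarms}; limits {lcl:.3f}..{ucl:.3f}")
```

Here the fourth reading (index 3) falls above the upper limit, so it is the one that would trigger the alarm.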

Cooling down: Hadoop

Who: Data scientists

Hadoop once seemed like the answer to the question “How should I store and process really big data?” Now it seems more like the answer to the question “How many moving parts can you cram into a system before it becomes impossible to maintain?”

The Apache Hadoop project includes four modules: Hadoop Common (utilities), Hadoop Distributed File System (HDFS), Hadoop YARN (scheduler) and Hadoop MapReduce (parallel processing). On top of or instead of these, people often use one or more of the related projects: Ambari (cluster management), Avro (data serialization), Cassandra (multi-master database), Chukwa (data collection), HBase (distributed database), Hive (data warehouse), Mahout (ML and data mining), Pig (execution framework), Spark (compute engine), Tez (data-flow programming framework intended to replace MapReduce), and ZooKeeper (coordination service).

If that isn’t complicated enough, factor in Apache Storm (stream processing) and Kafka (message transfer). Now consider the value added by vendors: Amazon (Elastic Map Reduce), Cloudera, Hortonworks, Microsoft (HDInsight), MapR, and SAP Altiscale. Confused yet?

Heating up: R language

Who: Data scientists with strong statistics

Data scientists have a number of options for analyzing data using statistical methods. One of the most convenient and powerful is the free R programming language. R is one of the best ways to create reproducible, high-quality analyses, since unlike a spreadsheet, R scripts can be audited and re-run easily. The R language and its package repositories provide a wide range of statistical techniques, data manipulation, and plotting, to the point that if a technique exists, it is probably implemented in an R package. R is almost as strong in its support for machine learning, although it may not be the first choice for deep neural networks, which require higher-performance computing than R currently delivers.

R is available as free, open source software and is embedded in dozens of commercial products, including Microsoft Azure Machine Learning Studio and SQL Server 2016.

Heating up: Deep neural networks

Who: Data scientists

Some of the most powerful deep learning algorithms are deep neural networks (DNNs), which are neural networks constructed from many layers (hence the term "deep") of alternating linear and nonlinear processing units, and are trained using large-scale algorithms and massive amounts of training data. A deep neural network might have 10 to 20 hidden layers, whereas a typical neural network may have only a few.

The more layers in the network, the more characteristics it can recognize. Unfortunately, the more layers in the network, the longer it will take to calculate, and the harder it will be to train. Packages for creating deep neural networks include Caffe, Microsoft Cognitive Toolkit, MXNet, Neon, TensorFlow, Theano, and Torch.
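To make the "alternating linear and nonlinear" structure concrete, here is a minimal forward pass written in plain Python rather than any of the packages listed above. The weights are random and untrained, and the choice of ReLU as the nonlinearity and the layer sizes are assumptions made for the sketch.

```python
import random

def relu(vector):
    """Nonlinear unit: replace negative values with zero."""
    return [max(0.0, v) for v in vector]

def linear(vector, weights, biases):
    """Linear unit: weights is a list of rows, one row per output value."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]

def make_layer(n_in, n_out, rng):
    """Randomly initialized (untrained) weights and zero biases."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(vector, layers):
    """Apply a linear step followed by a nonlinear step for every layer."""
    for weights, biases in layers:
        vector = relu(linear(vector, weights, biases))
    return vector

rng = random.Random(0)          # fixed seed so runs are repeatable
sizes = [4] + [8] * 10 + [3]    # 4 inputs, 10 hidden layers of 8 units, 3 outputs
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
output = forward([0.5, -1.2, 3.0, 0.7], layers)
print(len(layers), len(output))
```

Each extra entry in `sizes` adds another weight matrix to multiply through, which is one way to see why deeper networks take longer to compute. Training, the expensive part, is not shown.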

Cooling down: IoT

Who: BI/BA pros, data scientists

The Internet of Things (IoT) may be the most-hyped set of technologies, ever. It may also be the worst thing that happened to Internet security, ever.

IoT has been touted for smart homes, wearables, smart cities, smart grids, industrial internet, connected vehicles, connected health, smart retail, agriculture, and a host of other scenarios. Many of these applications would make sense if the implementation were secure, but by and large that hasn’t happened.

In fact, the manufacturers have often made fundamental design errors. In some cases, the smart devices only work if they are connected to the Internet and can reach the manufacturers’ servers. That becomes a significant point of failure when the manufacturer ends product support, as happened with the Sony Dash and the early Nest thermostat. Putting a remote Internet-connected server in a control loop also introduces a significant and variable lag, which can make the loop unstable.

Even worse, in their rush to connect their “things” to the Internet, manufacturers have exposed vulnerabilities that have been exploited by hackers. Automobiles have been taken over remotely, home routers have been enlisted into a botnet for carrying out DDoS attacks, the public power grid has been brought down in some areas…

What will it take to make IoT devices secure? Why aren’t the manufacturers paying attention?

Until security is addressed, the data analytics promise of IoT will be more risk than reward.

Heating up: TensorFlow

Who: Data scientists

TensorFlow is Google’s open source machine learning and neural network library, and it underpins most if not all of Google’s applied machine learning services. The Translate, Maps, and Google apps all use TensorFlow-based neural networks running on our smartphones. TensorFlow models are behind the applied machine learning APIs for Google Cloud Natural Language, Speech, Translate, and Vision.

Data scientists can use TensorFlow once they get over the considerable barriers to learning the framework. TensorFlow boasts deep flexibility, true portability, the ability to connect research and production, auto-differentiation of variables, and the ability to maximize performance by prioritizing GPUs over CPUs. Point your data scientists toward my tutorial or have them look into the simplified Tensor2Tensor library to get started.

Heating up: MXNet

Who: Data scientists

MXNet (pronounced “mix-net”) is a deep learning framework similar to TensorFlow. It lacks the visual debugging available for TensorFlow but offers an imperative language for tensor calculations that TensorFlow lacks. The MXNet platform automatically parallelizes symbolic and imperative operations on the fly, and a graph optimization layer on top of its scheduler makes symbolic execution fast and memory efficient.

MXNet currently supports building and training models in Python, R, Scala, Julia, and C++; trained MXNet models can also be used for prediction in Matlab and JavaScript. No matter what language you use for building your model, MXNet calls an optimized C++ back-end engine.

Cooling down: Batch analysis

Who: BI/BA pros, data scientists

Running batch jobs overnight to analyze data is what we did in the 1970s, when the data lived on 9-track tapes and “the mainframe” switched to batch mode for third shift. In 2017, there is no good reason to settle for day-old data.

In some cases, one or more legacy systems (some dating back to the 1960s) can only run analyses or back up their data at night, when they are not otherwise in use. In other cases there is no technical reason to run batch analysis, but “that’s how we’ve always done it.”

You’re better than that, and your management deserves up-to-the-minute data analysis.

Heating up: Microsoft Cognitive Toolkit 2.0

Who: Data scientists

The Microsoft Cognitive Toolkit, also known as CNTK 2.0, is a unified deep-learning toolkit that describes neural networks as a series of computational steps via a directed graph. It has many similarities to TensorFlow and MXNet, although Microsoft claims that CNTK is faster than TensorFlow, especially for recurrent networks; that its inference support is easier to integrate into applications; and that it has efficient built-in data readers that also support distributed learning.

There are currently about 60 samples in the Model Gallery, including most of the contest-winning models of the last decade. The Cognitive Toolkit is the underlying technology for Microsoft Cortana, Skype live translation, Bing, and some Xbox features.

Heating up: Scikit-learn

Who: Data scientists

Scikits are Python-based scientific toolboxes built around SciPy, the Python library for scientific computing. Scikit-learn is an open source project focused on machine learning, and it is careful to avoid both scope creep and jumping on unproven algorithms. On the other hand, it has quite a nice selection of solid algorithms, and it uses Cython (the Python-to-C compiler) for functions that need to be fast, such as inner loops.

Among the areas Scikit-learn does not cover are deep learning, reinforcement learning, graphical models, and sequence prediction. It is defined as being in and for Python, so it doesn’t have APIs for other languages. Scikit-learn doesn’t support PyPy, the fast just-in-time compiling Python implementation. Nor does it support GPU acceleration, for which, neural networks aside, it has little need.

Scikit-learn earns the highest marks for ease of development among all the machine learning frameworks I’ve tested. The algorithms work as advertised and documented, the APIs are consistent and well-designed, and there are few “impedance mismatches” between data structures. It’s a pleasure to work with a library in which features have been thoroughly fleshed out and bugs thoroughly flushed out.

Cooling down: Caffe

Who: Data scientists

The once-promising Caffe deep learning project, originally a strong framework for image classification, seems to be stalling. While the framework has strong convolutional networks for image recognition, good support for CUDA GPUs, and decent portability, its models often need excessively large amounts of GPU memory, the software has year-old bugs that haven’t been fixed, and its documentation is problematic at best.

Caffe finally reached its 1.0 release mark in April 2017 after more than a year of struggling through buggy release candidates. And yet, as of July 2017, it has over 500 open issues. An outsider might get the impression that the project stalled while the deep learning community moved on to TensorFlow, CNTK and MXNet.

Heating up: Jupyter Notebooks

Who: Data scientists

The Jupyter Notebook, originally called IPython Notebook, is an open-source web application that allows data scientists to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.

Jupyter Notebooks have become the preferred development environment of many data scientists and ML researchers. They are standard components on Azure, Databricks, and other online services that include machine learning and big data, and you can also run them locally. “Jupyter” is a loose acronym meaning Julia, Python, and R, three of the popular languages for data analysis and the first targets for Notebook kernels, but these days there are Jupyter kernels for about 80 languages.

Heating up: Cloud storage and analysis

Who: BI/BA pros, data scientists
