Big data is a term used in the information technology industry for combining multiple sources of information into a data lake, a data repository built on relatively inexpensive, high-performing computer hardware. The value of your data can be extracted from a data lake through existing reporting and business analytics systems. Furthermore, the advent of machine learning capabilities for big data solutions provides additional analytical capabilities. Machine learning can derive meaningful insights that support business development and organizational growth.
Machine learning and big data
Big data in its current form will reduce your operational and infrastructure costs, but it will not provide any additional value for your business over what enterprise data warehouses provide. Why is that? The machine learning employed within today’s big data solutions has no more capability than the statistical packages already in use within enterprise data warehouse solutions.
This may be true of big data’s current incarnation. However, the future holds a new set of machine learning tools that integrate with big data. These new machine learning tools fall under the heading of neural networks. It might be helpful to first understand what neural networks can and cannot do.
Neural network capabilities
First and foremost, neural networks cannot think! More about that later. Neural networks are capable of classification, regression analysis and forecasting. Given those capabilities, here is a small but amazing sample of tasks that neural networks excel at:
- Object and image recognition
- Facial recognition
- Speech and video recognition
- Natural language processing
- Sentiment analysis
- Medical & radiology diagnosis
- Drug discovery
- Financial trading and long-term investing
- Digital advertising
- Driverless cars
- Remote robots
- Marketing and sales (customer intel)
- Agriculture & environmental conditions
- Fraud detection, regulation compliance and adherence
There are more spectacular uses, but the above list will give you some idea that big data platforms are only in their infancy.
It is important to understand that not all neural networks are created equal. Picking a neural network that doesn't align with the specific problem that you're trying to solve will result in poor accuracy and performance.
To get a sense of how different neural networks are used, below is a small sampling of problems and the neural networks whose capabilities best fit each one. Note: It is beyond the scope of this article to go into detail about how each of these networks works.
- Image recognition will use a deep belief network (DBN) or a convolutional neural network (CNN).
- Speech recognition or time-dependent problems like driverless cars will use a recurrent neural network (RNN).
- Natural language processing, sentiment analysis and named entity recognition will use a recursive neural tensor network (RNTN) or a recurrent network.
- Object recognition will use a CNN or an RNTN.
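The pairings listed above could be captured in a simple lookup table. The following is a minimal sketch only: the names `CANDIDATE_NETWORKS` and `candidate_networks` are illustrative assumptions, not part of any library, and the table reflects this article's list rather than an exhaustive guide.

```python
# Hypothetical lookup table built from the pairings listed in this article.
CANDIDATE_NETWORKS = {
    "image recognition": ["DBN", "CNN"],
    "speech recognition": ["RNN"],
    "natural language processing": ["RNTN", "RNN"],
    "sentiment analysis": ["RNTN", "RNN"],
    "named entity recognition": ["RNTN", "RNN"],
    "object recognition": ["CNN", "RNTN"],
}


def candidate_networks(problem):
    """Return the network families suggested for a problem, or an empty list."""
    # Normalize the key so "Image Recognition " matches "image recognition".
    return CANDIDATE_NETWORKS.get(problem.strip().lower(), [])
```

Calling `candidate_networks("Image Recognition")` returns the two families named for that problem, while an unlisted problem returns an empty list, which is a prompt to research the fit rather than guess.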
Training neural networks
Neural networks need to be trained. The most common process for training a neural network is called “back propagation.” Training with back propagation takes a lot of time on conventional CPUs. This is why the neural network community has turned to graphics processing units (GPUs), which can be as much as 250 times faster at training a neural network. At that ratio, it's the difference between one day of training and over eight months using conventional CPUs. Who would have thought that legacy mainframes were going to be replaced by Xboxes and PlayStations!
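To make back propagation a little more concrete, here is a minimal sketch in plain Python of a tiny network with two inputs, two hidden units and one output. Everything in it is an assumption chosen for illustration: the function names, the sigmoid activation, the squared-error loss, the learning rate and the toy logical-OR data set. Real projects would use a GPU-backed framework rather than loops like these.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_out):
    """Forward pass: two inputs -> sigmoid hidden layer -> one sigmoid output."""
    # Each hidden row holds [weight_x0, weight_x1, bias].
    hidden = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    # w_out holds one weight per hidden unit followed by a bias term.
    out = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + w_out[-1])
    return hidden, out


def mean_squared_error(data, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - t) ** 2 for x, t in data) / len(data)


def train(data, w_hidden, w_out, lr=0.5, epochs=2000):
    """Update the weights in place, one training example at a time."""
    for _ in range(epochs):
        for x, target in data:
            hidden, out = forward(x, w_hidden, w_out)
            # Error signal at the output (squared error through a sigmoid,
            # up to a constant factor).
            delta_out = (out - target) * out * (1.0 - out)
            # Chain rule: push the error back through the output weights.
            delta_hidden = [
                delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                for j in range(len(hidden))
            ]
            # Gradient-descent updates for the output layer.
            for j, h in enumerate(hidden):
                w_out[j] -= lr * delta_out * h
            w_out[-1] -= lr * delta_out
            # Gradient-descent updates for the hidden layer.
            for j, w in enumerate(w_hidden):
                w[0] -= lr * delta_hidden[j] * x[0]
                w[1] -= lr * delta_hidden[j] * x[1]
                w[2] -= lr * delta_hidden[j]
    return w_hidden, w_out


# Toy data set (logical OR) and fixed, arbitrary starting weights.
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
START_HIDDEN = [[0.5, -0.4, 0.1], [-0.3, 0.6, -0.2]]
START_OUT = [0.7, -0.5, 0.05]
```

Comparing `mean_squared_error` before and after calling `train` on copies of the starting weights shows the error shrinking as the weights are adjusted. Even this toy runs thousands of small multiply-and-add steps, which hints at why highly parallel hardware matters for networks with millions of weights.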
Neural networks work best at identifying patterns. If you were to train a neural network to identify things like wolf, dog, cat and cow, would it be able to see any relationship patterns? Specifically, would it be able to identify that the wolf is a wild animal, while the dog, cat and cow are domesticated? Or that the wolf, dog and cat are carnivorous and the cow is herbivorous? It would, if your neural network is trained for these patterns. These types of relationships are known as knowledge representation.
This is why neural networks don’t think: they fundamentally lack the concepts behind the patterns they are being trained to identify. To train a network well, you need to provide just the right amount of information to generalize the pattern to the concept that you are trying to learn. Provide too little training information and the pattern will not be identified. Provide too much and you will get poor performance and miss the pattern, because there is not enough input data to meet the training set specification.
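As a rough illustration of the animal example, the relationships could be written down explicitly as labels attached to each animal. This is a sketch under assumptions: `ANIMAL_FACTS` and `group_by` are hypothetical names, and the labels are simply the ones used in the paragraph above.

```python
# Hypothetical labels describing each animal, taken from the example above.
ANIMAL_FACTS = {
    "wolf": {"status": "wild", "diet": "carnivore"},
    "dog": {"status": "domesticated", "diet": "carnivore"},
    "cat": {"status": "domesticated", "diet": "carnivore"},
    "cow": {"status": "domesticated", "diet": "herbivore"},
}


def group_by(facts, relation):
    """Group animal names by the value they hold for one relation."""
    groups = {}
    for animal, labels in facts.items():
        groups.setdefault(labels[relation], []).append(animal)
    return groups
```

Grouping by `"status"` separates the wolf from the other three, while grouping by `"diet"` separates the cow. The same four animals fall into different patterns depending on which relationship the labels describe, which is the point: a network can only learn the relationships that its training data encodes.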
Finding the “Goldilocks” zone of your model is where ontologies come into play. Ontologies not only help identify all the relevant relationship patterns among your business concepts, they also provide the means for validating your model and inferring new relationships. As in the above example of grouping the animals into different concepts, it is the relationships that make your data smart.
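Inferring new relationships can be sketched with a handful of subject-relation-object triples and one rule. The example below is an assumption-laden toy: the triples, the `is_a` relation name and the `infer_is_a` function are invented for illustration, and real ontology tooling supports far richer rules than this single transitivity check.

```python
# Hypothetical facts stated explicitly, as (subject, relation, object) triples.
TRIPLES = {
    ("dog", "is_a", "domesticated animal"),
    ("wolf", "is_a", "wild animal"),
    ("domesticated animal", "is_a", "animal"),
    ("wild animal", "is_a", "animal"),
}


def infer_is_a(triples):
    """Return the triples plus every is_a fact implied by transitivity."""
    known = set(triples)
    changed = True
    # Keep applying the rule until a full pass adds nothing new.
    while changed:
        changed = False
        for a, rel1, b in list(known):
            for c, rel2, d in list(known):
                if rel1 == "is_a" and rel2 == "is_a" and b == c:
                    new_fact = (a, "is_a", d)
                    if new_fact not in known:
                        known.add(new_fact)
                        changed = True
    return known
```

Nothing in `TRIPLES` says that a dog is an animal, yet `infer_is_a(TRIPLES)` contains `("dog", "is_a", "animal")` because the rule chains two stated facts together.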
In today’s globally connected world, does your company have well-defined business patterns describing how your business concepts are being used today? In my earlier post, "Don't do what I say, do what I mean," I articulated how the banking industry is using ontologies to develop knowledge models that describe business concepts and features in order to mitigate global risks. This is why the banking industry has co-developed the Financial Industry Business Ontology (FIBO) with the Enterprise Data Management Council (EDM Council). Once their business concepts have been generalized to features, they will be able to use neural networks to go through large amounts of data and identify any potential risk exposures.
This article has described advanced computer science concepts that cannot simply be mastered in a week by reading a few web tutorials. If your company is not committing real dollars to R&D and training, then it is simply prescribing a road map of irrelevance for your shareholders and customers.
Building big data solutions with these advanced capabilities will give you the advantage of managing new threats quickly and provide the foundation to leapfrog your competition.
This article is published as part of the IDG Contributor Network.