To make decisions more quickly and accurately, enterprises are increasingly turning to machine learning, arguably today’s most practical application of AI. Machine learning systems apply algorithms to data to glean insights into that data without explicit programming: It’s about using data to answer questions. As such, companies are applying machine learning to a wide array of issues, from customer purchasing patterns to predictive maintenance.
But before a machine learning system can answer questions, it must first be trained on data and outcomes. That’s because, while not explicitly programmed, machine learning systems need to develop and hone their ability to make predictions through experience with the same kind of data they will use to answer questions. For example, to predict whether a component is about to fail, a machine learning system must first be trained to do so by being fed sets of sensor readings from both functional and failing components.
This apparently prosaic stage between choosing your machine learning algorithm and deploying your data model is actually a key step in getting machine learning right: Get it wrong and you’ll end up with a system that doesn’t deliver what you want. There are some common mistakes that often happen when training machine learning systems; there are also decisions that need to be made early on, long before a machine learning system is deployed, that will be challenging and costly to address later.
Here’s what to look out for.
Ensure data quality
Before you can set about training your machine learning system, you have to prep your data.
“Getting the data straight is hugely time consuming and often underestimated: setting up a modern data infrastructure, identifying what data you can ingest or generate and doing cleanup can take a long time,” says Eric Gardner, director of sales enabling in Intel’s AI Products Group.
This includes dealing with duplicate, corrupt and missing data to ensure the data you will be feeding your system during training is of good quality and relevant to the task at hand.
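As a rough illustration of that cleanup step, the sketch below drops exact duplicates and rows with missing values from a small list of records. The field names (`sensor_id`, `temp_c`, `failed`) and the `clean_records` helper are hypothetical, and real pipelines usually need more nuanced handling, such as imputing values rather than discarding rows.

```python
import math


def _is_missing(value):
    # Treat None and NaN as missing values.
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


def clean_records(records, required_fields):
    """Drop exact duplicates and rows missing any required field."""
    seen = set()
    cleaned = []
    for row in records:
        if any(_is_missing(row.get(field)) for field in required_fields):
            continue
        # Dict keys are unique, so sorting the items only compares keys.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical sensor readings: one duplicate and two rows with missing data.
readings = [
    {"sensor_id": "a1", "temp_c": 71.5, "failed": 0},
    {"sensor_id": "a1", "temp_c": 71.5, "failed": 0},
    {"sensor_id": "b2", "temp_c": None, "failed": 1},
    {"sensor_id": "c3", "temp_c": float("nan"), "failed": 0},
    {"sensor_id": "d4", "temp_c": 88.0, "failed": 1},
]
cleaned = clean_records(readings, ["sensor_id", "temp_c", "failed"])
```

Only the first `a1` row and the `d4` row survive here. Whether dropping incomplete rows is acceptable depends on how much data you can afford to lose.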
“Before getting started, CIOs and IT leaders should involve business stakeholders early to determine the business problem to solve,” says Eric Boyd, corporate vice president for Microsoft’s AI Platform. That way data scientists can design the right experiments and concentrate on the data needed for those experiments.
“When teams take the opposite tack — starting with a lot of data and then seeing what kind of new insights result by applying ML — their approach rarely returns viable results,” he warns. “It’s not about big data; it’s about the right type of data.”
Adjust for potential bias
In addition to pre-processing your data, you also need to know whether there are outliers or biases in your data and how real-world data may differ from your training data. Bias, which can result in systems trained on faulty assumptions, should in particular be considered early on, because machine learning can amplify any bias present in your training set or even introduce it.
“Quality, bias and risks/integrity threats need governance and knowledge about how the data will be used from the outset,” Boyd warns. “Model interpretability and fairness tools such as fairlearn and LIME, along with established corporate governance practices, can help IT leaders manage their ML training projects and accelerate the model validation process, saving time and adjusting for potential bias.” Knowing where your data comes from is critical to this.
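One simple check you can run yourself is to compare how often a model predicts the positive outcome for each group in your data. The sketch below is hand-rolled for illustration and is not the API of fairlearn or LIME. The group labels, predictions and helper names are all hypothetical, and a large gap is a prompt for investigation rather than proof of unfairness.

```python
from collections import defaultdict


def selection_rates(predictions, groups):
    """Share of positive predictions (1) per group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        if pred == 1:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def parity_gap(rates):
    """Difference between the highest and lowest group selection rate."""
    return max(rates.values()) - min(rates.values())


# Hypothetical model outputs for two groups of four people each.
preds = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = selection_rates(preds, groups)
gap = parity_gap(rates)
```

In this made-up example group "a" is selected three times out of four and group "b" once out of four, which is the kind of gap worth tracing back to the training data.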
Involve domain experts
To train many machine learning systems, training data must be labelled. Here, human judgment comes into play for picking the right label and the right examples of that label.
For supervised machine learning, getting the labels right on your training data is key, so make sure domain experts are involved early on. You will also need expert knowledge for emerging machine learning tools such as causal models, which use Bayesian techniques to try to determine which variables represent causes and which represent effects.
“Traditional models tend to focus on correlation without considering causality, so while they may identify a relationship between variables they won’t define how significantly they influence each other,” says Sam Bourton, CTO of QuantumBlack. Spurious correlations are common: A non-causal model aimed at mitigating drought might highlight the relationship between ice cream sales going up and the drought getting worse — and might conclude that banning ice cream would also get rid of the drought. An expert in the field can provide hints to the model and check during the training phase that its interpretation of how variables relate to each other makes sense.
Because domain experts probably aren’t also expert in machine learning, Bourton suggests using visualizations that make the model clearer for them. The open source CausalNex toolkit for building causal models can create graphs showing the relationships of variables in the model.
Get the right split of data for training and validation
Training alone won’t result in sound, operational machine learning. You need to test and validate the system to ensure that its training is sufficient to deliver dependable results. This means you will need more training data than you might expect, to ensure you have enough reserved for testing. Gardner suggests a ratio of 70 percent of data for training and 30 percent for testing. “Train to what you think is a good level of accuracy and then go back and test it regularly on the control data set,” he says.
You may also need to reserve some data for validation, where you tune the model to confirm it fits your data and to ensure you haven’t overfitted so that the model can’t cope with new situations beyond the training data. Aristotelis Kostopoulos, vice president of AI product solutions at Lionbridge, suggests using 60 percent to 80 percent of data for training with 20 percent or 10 percent each for testing and validation. “Use your validation set to run your hyperparameters to nail down things like how many neurons and how many layers you have, and the learning rate.”
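A three-way split along those lines can be sketched in a few lines. The `split_dataset` helper, the 60/20/20 default and the fixed seed are illustrative assumptions. Time-ordered or grouped data usually needs a more careful split than a random shuffle.

```python
import random


def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle rows reproducibly and split into train, validation and test."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave some data for testing")
    shuffled = list(rows)
    # A seeded generator makes the split repeatable between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out
    return train, val, test


train, val, test = split_dataset(range(100))
```

With 100 rows this yields 60 for training, 20 for validation and 20 for testing, with no row appearing in more than one set.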
Use synthetic data
If you don’t have enough training data, or if it would be impossible to gather a diverse enough range of data because data capture is onerous, you can use synthetic data — either transforming the data you do have so it becomes multiple examples (the Snow Leopard Trust flips images of snow leopards horizontally to help train image recognition to spot the animals whichever direction they’re facing) or generating the data whole cloth. Microsoft couldn’t motion capture enough people of different sizes and shapes in enough different poses to get all the data it needed to train its Kinect camera, so it synthesized millions of images showing hundreds of thousands of depth poses.
“Synthetic data is good when you don’t have the real data: You can create really good sets and you can use other techniques like rotating images, but you have to be careful not to introduce bias there,” Kostopoulos warns.
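The image-flipping approach is easy to picture in code. This minimal sketch represents an image as a list of pixel rows. The helper names and the tiny example image are hypothetical, and a real project would more likely use an image library.

```python
def flip_horizontal(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def augment_with_flips(labelled_images):
    """Return the original examples plus a mirrored copy of each one."""
    augmented = []
    for image, label in labelled_images:
        augmented.append((image, label))
        flipped = flip_horizontal(image)
        # Skip symmetric images so the set gains no exact duplicates.
        if flipped != image:
            augmented.append((flipped, label))
    return augmented


# A made-up 2x3 "image" and a left-right symmetric one.
examples = [
    ([[1, 2, 3], [4, 5, 6]], "leopard"),
    ([[7, 0, 7], [7, 0, 7]], "rock"),
]
augmented = augment_with_flips(examples)
```

Flipping is only safe when the label still holds for the mirrored image. Mirrored text or road signs, for instance, would introduce exactly the kind of bias Kostopoulos warns about.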
The machine learning method you choose affects how much data you will need to train with, so if you have a small data set you might choose semi-supervised active learning where you can use a smaller data set that’s annotated.
Version your data sets as well as your machine learning models
Machine learning requires constant re-evaluation of models that you are very likely to need to retrain. “In some cases when you train a model and reach a certain level of accuracy, maybe you’re done, but in many cases you’re wanting to bring in new data and retrain the model and get the accuracy a little bit higher,” Gardner notes.
To be sure you understand a machine learning model and can explain what it does, you need to be able to test and validate it on various data sets, so your machine learning pipeline has to be reproducible and auditable. To effectively bring in new data to retrain, retest and redeploy the model, you need an effective and at least partially automated MLOps process using tools such as Kubeflow. Keeping track of various training sets, with details such as provenance, data cleaning, transformation and validation is an important part of that. You could do that in a spreadsheet, through Jupyter notebooks or by adopting standardized datasheets for data sets.
Dataset and training metadata is very important for explaining machine learning results further down the line, says David Aronchick, co-founder of Kubeflow and now head of open source machine learning strategy at Microsoft.
“Maybe you want to explain the model, or you want to test for bias or to test against a whole set of different data, not to retrain it but to test against real production data to make sure it doesn’t fail,” he says. “Explainability relies on metadata about training. You need to know things like how your data was trained and what outliers did you exclude; do the statistics neglect a particular population for historical reasons? Was it trained naively or did you do an analysis of the population?”
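One lightweight way to keep that metadata is to store a content hash of each training set alongside notes on where it came from and how it was prepared. The sketch below is an illustration only: the record fields, the file name and the helper names are hypothetical, and it is not a Kubeflow feature or a standard datasheet format.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Stable SHA-256 hash of a JSON-serialisable data set."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_dataset_record(rows, source, cleaning_steps, excluded_outliers):
    """Bundle the fingerprint with provenance and preparation notes."""
    return {
        "sha256": dataset_fingerprint(rows),
        "row_count": len(rows),
        "source": source,
        "cleaning_steps": cleaning_steps,
        "excluded_outliers": excluded_outliers,
    }


rows_v1 = [{"temp_c": 71.5, "failed": 0}, {"temp_c": 88.0, "failed": 1}]
rows_v2 = rows_v1 + [{"temp_c": 90.2, "failed": 1}]

record_v1 = make_dataset_record(
    rows_v1,
    source="plant-a-sensors.csv",  # hypothetical file name
    cleaning_steps=["dropped duplicates", "dropped rows with missing temp_c"],
    excluded_outliers="readings above 150 C",
)
record_v2 = make_dataset_record(rows_v2, "plant-a-sensors.csv", [], "none")
```

Because the hash changes whenever the data does, a model can be tied to the exact training set it saw, which helps when you later need to explain or audit its results.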
Secure your training data against data poisoning
Don’t forget the basics. Even with machine learning, you need to think about threat modelling and security during training. As with any important data, training data must be stored securely, not on a personal laptop. But you also need to be able to trace where your data comes from, where it goes, and who can access it along the way.
“Even though this can be difficult, collect your data from multiple sources, if possible,” says Kenny Daniel, CTO and co-founder of Algorithmia. “If your model overfits, it can be open to adversarial attacks, like giving a pixelated blurry image and getting a cat prediction in return.”
That’s not just a theoretical problem. In 2013 attackers injected fragments of virus signatures into uninfected files that they uploaded to an online virus scanning site, which trained it to ignore some malicious files because they matched files that had been declared clean. In 2017, a high volume of fake traffic used to game the reputation of a digital certificate alerted Microsoft to the fact that attackers had figured out how its machine learning system calculated the reputation of certificates and were tricking it into trusting malicious code.
Unsupervised machine learning is more susceptible to data poisoning, and even simple manipulation of images with the kind of filters found on smartphones can produce something that looks like a good photo to a human but throws off your model. According to Microsoft, compromising 3 percent of training data can result in an 11 percent drop in the accuracy of the model. This is hard to defend against, but attackers will only be able to control a small fraction of training samples. That means you can validate and sanitize data, monitor the training loop for anomalies, look for feature drift day by day, use multiple models, and have both automated and human evaluation of the quality of newly trained models before deploying them.
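Watching for feature drift can start very simply, by comparing each day’s values for a feature with a trusted baseline. The sketch below measures how far the day’s mean sits from the baseline mean in standard errors. The helper name, the temperature figures and the threshold of 3 are assumptions for illustration, and production systems generally use more robust statistical tests.

```python
import statistics


def drift_score(baseline, current):
    """Distance of the current mean from the baseline mean, in standard errors."""
    base_mean = statistics.fmean(baseline)
    base_std = statistics.stdev(baseline)
    current_mean = statistics.fmean(current)
    if base_std == 0:
        # A constant baseline: any change at all counts as drift.
        return 0.0 if current_mean == base_mean else float("inf")
    standard_error = base_std / len(current) ** 0.5
    return abs(current_mean - base_mean) / standard_error


# Hypothetical temperature readings.
baseline = [70.1, 69.8, 70.3, 70.0, 69.9, 70.2, 70.0, 69.7]
today_normal = [70.0, 70.1, 69.9, 70.2]
today_shifted = [74.8, 75.1, 75.3, 74.9]

DRIFT_THRESHOLD = 3.0  # assumed cut-off; tune for your own data
normal_flag = drift_score(baseline, today_normal) > DRIFT_THRESHOLD
shifted_flag = drift_score(baseline, today_shifted) > DRIFT_THRESHOLD
```

A flagged batch is not necessarily an attack, since sensors age and customers change, but it is a reason to inspect the data before it reaches the next training run.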
“If companies want machine learning systems and products to be robust to present and future threats, then focusing on securing the data processing pipeline is key,” says Ariel Herbert-Voss, a researcher at OpenAI who specialises in adversarial machine learning and security.
Balance your data set
Making sure a small number of images don’t account for a large percentage of training helps against data poisoning and is also important for getting a useful machine learning model in the first place. “Trained algorithms are so data-oriented that if you pass in bad data, they will learn bad things,” Kostopoulos says. “In order to be able to train successfully you need a well-balanced corpus. If you’re doing classification, you don’t want to have the training set be 1 percent one example and 99 percent another example. You have to be careful when collecting your data to be balanced.”
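Checking balance before training takes only a few lines. In this sketch the `min_share` cut-off of 10 percent and the helper names are arbitrary assumptions. What counts as balanced enough depends on the problem and the algorithm.

```python
from collections import Counter


def class_shares(labels):
    """Fraction of the data set that each label accounts for."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def is_imbalanced(labels, min_share=0.1):
    """True if any class falls below the assumed minimum share."""
    return min(class_shares(labels).values()) < min_share


# The 99-to-1 case from the quote above, and an even split for comparison.
skewed = ["ok"] * 99 + ["fail"]
even = ["ok"] * 50 + ["fail"] * 50

skewed_shares = class_shares(skewed)
```

On the skewed set a model could reach 99 percent accuracy by always predicting "ok", which is why the shares are worth checking before looking at any accuracy figure.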
Check inputs and outputs
If you use a machine learning platform, it will often have built-in tools for monitoring models and training. UiPath AI Fabric, for example, can compare model output to data input, and Azure Machine Learning (and the AI Builder tool in Power BI based on that) will show metrics such as false positives and negatives, precision (how many of the items classified as positive really are positive) and recall (how many of the actual positives are correctly classified as positive).
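If your platform doesn’t report these metrics, they are straightforward to compute from true and predicted labels. The following is a minimal sketch for a binary classifier. The function name and sample labels are made up, and returning 0.0 when a denominator is zero is one convention among several.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the chosen positive class."""
    tp = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # true positive
        elif predicted == positive:
            fp += 1  # false positive
        elif actual == positive:
            fn += 1  # false negative
    # Precision: of everything flagged positive, how much was right.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: three hits, one false alarm, one miss.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

Here both values come out at 0.75: three of the four positive predictions were right, and three of the four actual positives were found.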
If you’re working on your own machine learning models, use a real-time visualization tool such as TensorWatch, so that you can see the errors and the training loss of machine learning models on the fly, without having to stop and restart training; that’s important when a 50-layer ResNet for image recognition could take 14 days to train on the ImageNet data set.
Choose the right hardware for training
GPUs are ideal for training deep learning systems, which require deep data parallelism, but they may not be the right choice for other workloads.
“Most enterprise customers won’t see a benefit from deploying GPUs or specialised AI acceleration hardware when they’re getting started with machine learning,” Gardner suggests. CPUs that are used for other workloads during business hours can be repurposed for training overnight if you’re only training a handful of layers, so you can save hardware acceleration for models with more than ten layers or places where faster training has an impact on the bottom line, like making a recommendation engine more accurate.
CPUs are also a better fit for models that work on high-definition images, as in medical imaging, astronomy and oil and gas exploration, he says: “Those images are massive with thousands of pixels by thousands of pixels and they don’t fit the small memory footprint of GPUs; training on CPUs means you don’t have to slice or downsample images and you can run them at full resolution.”
To get the best performance with CPU training, look for frameworks and tools that have been optimised for CPU rather than GPU.
Take advantage of transfer learning
Don’t train from scratch when you don’t need to, many of our experts advised.
“Check to see if there’s publicly available models out there, and do transfer-learning instead. It’ll be cheaper, it’s more likely to be more accurate, and it’s better for the environment,” Daniel suggested.
This works by taking a large trained model, like a 150-layer ResNet image recognition model, slicing off the top couple of layers and retraining just those layers with your own data. For example, Azure Cognitive Services Custom Vision retrains Microsoft’s image recognition service on as few as 50 examples of each object you want to recognise in a matter of minutes. That gets you the convenience of a cloud service but custom trained to your scenario.
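The principle can be shown with a deliberately tiny stand-in: a fixed function plays the role of the pretrained lower layers, and only a small final layer is trained on new examples. Everything here is a toy assumption, including the `frozen_base` function and the six data points. It is not how Custom Vision or any ResNet is actually implemented.

```python
import math


def frozen_base(x):
    """Stand-in for pretrained lower layers; never updated during training."""
    x0, x1 = x
    return [x0 + x1, x0 - x1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(examples, epochs=200, lr=0.5):
    """Fit only the final logistic layer on features from the frozen base."""
    features = [frozen_base(x) for x, _ in examples]
    labels = [y for _, y in examples]
    weights, bias = [0.0, 0.0], 0.0
    n = len(examples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for f, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, f)) + bias) - y
            grad_w = [g + error * v for g, v in zip(grad_w, f)]
            grad_b += error
        # Full-batch gradient descent step on the head parameters only.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(x, weights, bias):
    score = sum(w * v for w, v in zip(weights, frozen_base(x))) + bias
    return 1 if score > 0 else 0


# Hypothetical "new task" data: a handful of labelled points.
examples = [((2, 1), 1), ((1, 2), 1), ((3, -1), 1),
            ((-1, -2), 0), ((-2, -1), 0), ((-3, 1), 0)]
weights, bias = train_head(examples)
accuracy = sum(predict(x, weights, bias) == y for x, y in examples) / len(examples)
```

Because the base is never touched, training involves just three parameters, which is why retraining the top of a large model can work with few examples and little compute.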
Machine learning is even less predictable than other software development, and the training phase can take longer than you expect, because training is part of developing the system not part of operationalizing it.
“It’s a little bit more unpredictable,” Gardner warns. “You can’t exactly always have a fixed timeline for when you’ll solve the problem or say, ‘In six months, we’ll have an answer.’ There’s a lot of trial and error in this space: It’s really a matter of hours or days or weeks spent on training to pick up subtle changes. It really takes management buy-in to cope with that level of uncertainty because although progress can come out of nowhere, more often it takes much longer than you expected. It’s not a linear straightforward process.”