by Mary Branscombe

12 data science mistakes to avoid

Feature
May 09, 2018
Analytics | Data Science | IT Strategy

Well-managed analytics initiatives can reap organizational gold. But succumb to one of these common mistakes, and your data science operations can quickly go wrong.

AI, machine learning and analytics aren’t just the latest buzzwords; organizations large and small are looking at AI tools and services in hopes of improving business processes, customer support and decision making with big data, predictive analytics and automated algorithmic systems. IDC predicts that 75 percent of enterprise and ISV developers will use AI or machine learning in at least one of their applications in 2018.

But expertise in data science isn’t nearly as widespread as the interest in using data to make decisions and improve results. If your business is just getting started with data science, here are some common mistakes that you’ll want to avoid making.

1. Assuming your data is ready to use — and all you need

You need to check both the quality and volume of the data you’ve collected and are planning to use. “The majority of your time, often 80 percent of your time, is going to be spent getting and cleaning data,” says Jonathan Ortiz, data scientist and knowledge engineer at data.world. “That’s assuming that you’re even tracking what you need to be tracking for a data scientist to do their work.”

If you’re tracking the right data, you might not be recording it correctly, or the way you record it might have changed over time, or the systems you’ve collected it from might have changed while you were collecting data. “If there are incremental changes from month to month, then you can’t use that entire month of data when you perform an analysis or build a model,” cautions Ortiz, because the system itself has changed.
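
As a rough sketch of what that pre-flight check can look like, a few lines of pandas will surface missing values, duplicates and month-to-month shifts in how the data was collected. The file name, column names and frequencies below are hypothetical placeholders, not a prescription:

```python
import pandas as pd

# Hypothetical event log; the file and column names stand in for your own schema.
events = pd.read_csv("events.csv", parse_dates=["timestamp"])

# Basic quality checks: missing values, duplicate rows, unexpected types.
print(events.isna().mean().sort_values(ascending=False))  # share of nulls per column
print(events.duplicated().sum(), "duplicate rows")
print(events.dtypes)

# Look for collection changes over time: a sudden jump or drop in monthly volume,
# or a column that only starts (or stops) being populated mid-stream, often means
# the source system changed while you were collecting data.
monthly = events.set_index("timestamp").resample("MS")
print(monthly.size())                                        # records per month
print(monthly["revenue"].apply(lambda s: s.isna().mean()))   # null rate per month
```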

Even if you’re collecting the right data, low data volumes and large numbers of independent variables make it hard to create predictive models for business areas like B2B marketing and sales, explains John Steinert, chief marketing officer at TechTarget. “Data science gets better and better the more data you have; predictive models are more powerful the more data you have. Because transaction rates are low and the independent variables affecting transactions are many, you’ve got small data sets and complex interactions, and these weaken the power of predictive models.”

One option is to buy data sets like purchase-intent data, as long as you can find one that applies to your business segment. Another is to simulate the data, but that must be done carefully, warns Chintan Shah, senior consultant data scientist at Avanade. “In reality, the data may not behave according to the assumption you made in the beginning,” Shah says.

2. Not exploring your data set before starting work

You may have theories and intuitions about what your data set will show, but data teams should take the time to look into data in detail before using it to train a data model.

“If you see something counterintuitive it’s possible that your assumptions are incorrect or that the data is,” Ortiz says. “The most important thing I do is simply looking at the data, plotting it and doing exploratory analysis. A lot of people go through that too quickly or bypass it altogether but you need to understand what the data looks like. You can ascertain whether the data is telling you the proper story based on subject matter expertise and business acumen more quickly by doing some exploration beforehand.”
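
A minimal exploratory pass along the lines Ortiz describes might look like the sketch below. The file and column names are invented; the point is the habit of summarizing and plotting before modeling, not these particular charts:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset; swap in your own file and column names.
df = pd.read_csv("orders.csv", parse_dates=["order_date"])

print(df.describe(include="all"))     # summary statistics for every column
print(df["region"].value_counts())    # does the category breakdown match expectations?

# A handful of quick plots often reveals skew, outliers and data-entry errors
# long before a model would.
df["order_value"].hist(bins=50)
plt.title("Distribution of order value")
plt.show()

df.plot.scatter(x="items_in_basket", y="order_value", alpha=0.3)
plt.show()

df.set_index("order_date").resample("W")["order_value"].sum().plot()
plt.title("Weekly revenue")
plt.show()
```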

3. Expecting too much

The hype around AI has too many people convinced that “if we throw data at a computer algorithm, it will figure out everything by itself,” Shah warns. “Although companies have lots of data, human expertise is still required to bring the data into a usable format.”

Looking only at what your company has done before won’t uncover new opportunities, just ways to be more efficient at the same things you’ve already done, Steinert points out. “The more you use the past as the sole predictor of the future, the less open you are to looking for new avenues,” Steinert says. Even if you bring in third-party data to find demand for your products or services, it doesn’t guarantee you’ll be able to make those sales. “A data model can tell you a company is a good match for what you offer but it can’t tell you if that company has a need right now,” he adds.

“People are starting to invest and put trust in data scientists in ways they haven’t placed trust in different fields before, and they’re throwing resources at them and expecting a silver bullet to answer all their questions. There’s a lot of faith being placed right now in this romantic view of data scientists and using data to answer questions and drive decisions,” Ortiz says.

Ortiz suggests data scientists should prove they can deliver by starting with small projects and quick wins to show the value to the organization. “Pluck that low-hanging fruit; don’t start by going down a technological rabbit hole and spending a month on a big project you think is going to be of tremendous value,” he says.

4. Not using a control group to test your new data model in action

If you’ve spent time and money building a data model, you want to use it everywhere you can to get the most out of your investment. But if you do that, you can’t measure how well the model actually works. On the other hand, if users don’t trust the model they may not use it and then you can’t test it, Steinert says. The solution? A change management program to ensure the model is adopted, and a control group that isn’t using it, Steinert adds. Have a random group pursuing opportunities identified by the model, and a control group “pursuing things the way they have always been done, self-empowered, experientially.”
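
One simple way to set this up, sketched here with invented account data, is to randomly hold a slice of opportunities out of the model-driven workflow and compare outcomes later. The 80/20 split and the placeholder outcomes are illustrative assumptions, not recommendations:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical list of accounts the sales team could pursue.
accounts = pd.DataFrame({"account_id": range(1000)})

# Randomly assign roughly 80% to the model-driven workflow and 20% to a control
# group that keeps working the way it always has.
accounts["group"] = np.where(rng.random(len(accounts)) < 0.8, "model", "control")

# ...later, once real outcomes are recorded (1 = converted, 0 = not), compare the groups.
accounts["converted"] = rng.integers(0, 2, len(accounts))  # placeholder outcomes only
print(accounts.groupby("group")["converted"].agg(["mean", "count"]))
```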

5. Starting with targets rather than hypotheses

It’s tempting to look for a data model that can offer specific improvements, like getting 80 percent of customer support cases closed in 48 hours or winning 10 percent more business in a quarter, but those metrics aren’t enough to work from.

“It’s better to start with a hypothesis when you can,” says Ortiz. “Often there’s a curve or a line you’re looking at as an overall metric and you want to move that; that can be a great business goal but it’s hard to imagine what levers you need to pull to do that.” Test your hypothesis about what will improve things, either with a control group or by exploring the data. “If you can run a test where you have a split test with a control group and both are representative samples, you can actually ascertain whether the method you’re using actually impacted what you wanted it to impact. If you’re just looking at data after the fact, beginning with the hypothesis can help narrow the scope. I need to increase this metric by 10 percent: What are my hypotheses for what might impact that and [then I can] do exploratory data analysis tracking just those in the data. Getting crystal clear on the question you’re asking and the hypothesis you’re testing can help reduce the amount of time you spend on it.”
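
When both groups really are representative samples, a standard significance test tells you whether the difference you see is signal or noise. The sketch below uses a two-proportion z-test from statsmodels on made-up conversion counts:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical split-test results: conversions and group sizes
# for the control group and the group using the new approach.
conversions = np.array([48, 73])      # control, treatment
group_sizes = np.array([1000, 1005])

# Two-proportion z-test: did the treatment group convert at a genuinely
# different rate, or could the gap be chance?
z_stat, p_value = proportions_ztest(conversions, group_sizes)
print(f"control rate:   {conversions[0] / group_sizes[0]:.3f}")
print(f"treatment rate: {conversions[1] / group_sizes[1]:.3f}")
print(f"p-value: {p_value:.4f}")
```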

6. Letting your data model go stale

If you have a data model that works well for your problem, you might think you can keep using it forever, but models need updating and you may need to build additional models as time goes on.

“Features will change over time,” warns Ortiz. “You’ll continuously need to understand the validity and update your model.”

There are plenty of reasons why models get out of date; the world changes and so does your company (especially if the model proves useful). “Models shouldn’t be viewed as static; the market certainly isn’t static,” Steinert points out. “If the market’s preferences are evolving away from your history, your history will put you on a diverging path. Model performance does decay. Or the competition learns from your company’s activity in the marketplace. Keep out a set for experimentation that says, ‘How will I be adding to the model over time?’ You have to have a set of experiments running that will surface new opportunities to differentiate.”
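
One lightweight way to catch that decay is to keep scoring the production model against each new batch of labeled outcomes and flag it when performance slips. The toy example below simulates a drifted relationship on synthetic data; the model, features and the 0.05 threshold are all placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy stand-in for production: train on last year's data...
X_old = rng.normal(size=(2000, 3))
y_old = (X_old[:, 0] + 0.2 * rng.normal(size=2000) > 0).astype(int)
model = LogisticRegression().fit(X_old, y_old)
baseline_auc = roc_auc_score(y_old, model.predict_proba(X_old)[:, 1])

# ...then score each new batch as its real outcomes arrive. Here the world has
# drifted: a different feature now drives the outcome, so performance collapses.
X_new = rng.normal(size=(2000, 3))
y_new = (X_new[:, 1] + 0.2 * rng.normal(size=2000) > 0).astype(int)
recent_auc = roc_auc_score(y_new, model.predict_proba(X_new)[:, 1])

print(f"baseline AUC {baseline_auc:.3f}, recent AUC {recent_auc:.3f}")
if recent_auc < baseline_auc - 0.05:   # threshold is an arbitrary placeholder
    print("Performance has decayed -- time to retrain or revisit the features.")
```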

7. Automating without monitoring the final outcome

The other half of using a control group is measuring how good the output of the model is, and you need to track that all the way through your processes, or you end up optimizing for the wrong goal.

“Companies do things like apply a bot to your telephone service and you don’t continuously check whether the bot is leading to greater customer satisfaction, you just congratulate yourself on using less labour,” Steinert points out. If customers are closing support cases because the bot can’t give them the right answer rather than because it solved their problem, customer satisfaction will drop dramatically.

8. Forgetting the business experts

It’s a mistake to think that all the answers you need are in the data and a developer or data scientist can find them on their own. Make sure someone who understands the business problem is involved.

“While a knowledgeable and expert data scientist will be able to figure out the problem at hand eventually, it will be much easier if the business and data scientists are on the same page,” Shah explains. “The success of any data science algorithm lies with successful feature engineering. To derive better features, a subject matter expert always adds more value than a fancy algorithm.”

Start projects by having a conversation between the data team and the business stakeholder to make sure everyone is clear on what the project is trying to achieve, Ortiz suggests — even before you look at the data. “Then you can do exploratory data analysis to see if you can achieve it, and if not, you may have to go back and rephrase the question in a new way or get a different data source.” But it’s the domain expert who should be helping decide what the goal is and whether the project is delivering it.

9. Picking too complex a tool

The cutting edge of machine learning is exciting and new techniques can be very powerful, but they can also be overkill. “It may turn out that a simple method such as logistic regression or a decision tree will do the work,” Shah points out, and Ortiz agrees.

“It’s tempting to throw immense resources of computer power and sophisticated models at problems. Maybe I get intellectually curious about an aspect of a project and I want to test out a brand-new algorithm that will do more than was requested, or I just want to try it out. The job is to find a simple approach that answers the question. The simplest methods should be exhausted before you go on to more sophisticated options,” says Ortiz, noting that overfitting is more likely to happen with sophisticated algorithms like deep learning: “You get an extremely accurate model on the data you currently have that does not perform well at all with new information.”
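
In practice that often means fitting the simple candidates first and checking the gap between training and test accuracy before reaching for anything heavier. A minimal scikit-learn sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a business dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree", DecisionTreeClassifier(max_depth=4, random_state=0))]:
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    # A large gap between training and test accuracy is the overfitting
    # warning sign Ortiz describes.
    print(f"{name}: train {train_acc:.3f}, test {test_acc:.3f}")
```

If a simple baseline like this already answers the business question, the more sophisticated algorithm has to earn its extra complexity.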

Working with the business expert to decide what question needs answering should guide your choice of techniques. “A lot of data scientists focus on machine learning and a lot of machine learning is focused on prediction but not every question you answer will be a prediction question. ‘We need to look at sales from last quarter’ can mean a lot of different things. Do we need to predict sales amount for new customers or maybe you just need to know why sales seemed to stall in one particular week of last quarter,” Ortiz says.

10. Reusing implementations that don’t fit your problem

There are plenty of data science and machine learning examples that you can learn from and adapt. “One of the reasons behind the exponential growth in data science is the availability of open source implementation of almost all the algorithms, which makes it easy to develop a quick prototype,” explains Shah. But those implementations are often developed for specific use cases. If what you need from the system is different, it’s better to build your own version, he says. “Implement your own data cleaning and feature building routines,” he suggests. “It gives you more control.”
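
A “build your own” routine doesn’t have to be elaborate; it mostly means writing down the cleaning rules and features that encode your domain knowledge. The functions below are a hypothetical sketch for invented order data, not a general-purpose recipe:

```python
import pandas as pd

def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Cleaning rules specific to this (hypothetical) order data --
    rules a generic open source implementation can't know about."""
    out = df.copy()
    out = out[out["order_value"] > 0]                      # negative values are refunds, handled elsewhere
    out["region"] = out["region"].str.strip().str.upper()  # inconsistent manual entry
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce")
    return out.dropna(subset=["order_date"])

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Feature building that encodes domain knowledge about the business."""
    out = df.copy()
    out["is_weekend"] = out["order_date"].dt.dayofweek >= 5
    out["value_per_item"] = out["order_value"] / out["items_in_basket"].clip(lower=1)
    return out
```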

11. Misunderstanding fundamentals like causation and cross validation

Cross validation helps you estimate the accuracy of a prediction model when you don’t have enough data to hold out a separate test set. For cross validation, you split the data set up several times, using different parts to train and then test the model each time, to see whether you get the same accuracy no matter which subset of your data you train with. But you can’t use that to prove your model is always as accurate as its cross-validation score, Ortiz explains. “A generalizable model is one that reacts in an accurate way to new incoming data but cross validation can never prove that.” Because it only uses the data you already have, it just shows that your model is as accurate as possible for that data.
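
For concreteness, a basic five-fold run in scikit-learn looks like the sketch below; the scores it prints describe performance on resamples of the data you already have, not on genuinely new data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic placeholder data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Five-fold cross validation: the data is split five times, and each fold takes
# a turn as the held-out evaluation set while the model trains on the rest.
scores = cross_val_score(model, X, y, cv=5)
print(scores)                         # accuracy on each held-out fold
print(scores.mean(), scores.std())    # summarizes fit to this data, not generalization
```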

Just as fundamentally, “correlation is not causation; seeing two things that are correlated does not mean that one impacts the other,” he points out. (Check out Spurious Correlations for some amusing correlations of unconnected data.) The exploratory plotting you do with your data set will give you a sense for what it can predict and which data values are correlations that don’t tell you anything. If you’re tracking customer behaviour on your ecommerce site to predict which customers will return and when, recording that they logged in doesn’t tell you anything because they’ve already come back to your site to do that. “Logging in is going to be highly correlated with returning but it would be a mistake to incorporate that into the model.”
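
A quick way to spot that kind of leaky feature is to check how strongly each candidate correlates with the target; anything suspiciously close to 1.0 deserves scrutiny rather than celebration. The customer table below is fabricated to make the point:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Hypothetical customer table. `logged_in` is only recorded when a customer
# comes back, so it leaks the very outcome we want to predict.
returned = rng.integers(0, 2, n)
customers = pd.DataFrame({
    "emails_opened": rng.poisson(2, n) + returned,    # weakly related to the outcome
    "days_since_last_order": rng.integers(1, 90, n),  # unrelated noise here
    "logged_in": returned,                            # leaky: identical to the target
    "returned": returned,
})

# Correlation of each candidate feature with the target: a near-perfect value
# usually signals leakage rather than a genuinely great predictor.
print(customers.drop(columns="returned").corrwith(customers["returned"]).sort_values())
```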

12. Underestimating what users can understand

Business users might not be able to perform statistical analysis themselves, but that doesn’t mean they don’t understand margins of error or statistical significance and validity, Ortiz points out.

“Often when an analysis is going to business teams it will end up as just one slide with just one number, whether it’s an accuracy figure or an estimate or a prediction or a forecast; but the margin of error is very important when you provide that one value,” Ortiz says.
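
For a simple accuracy figure, the margin of error takes one extra line to compute and report alongside that single number. The counts below are invented, and the normal-approximation interval is just one common choice:

```python
import math

# Hypothetical result: the model got 870 of 1000 held-out cases right.
# Reporting "87% accurate" alone hides the uncertainty in that estimate.
correct, n = 870, 1000
p_hat = correct / n

# Normal-approximation 95% confidence interval for a proportion.
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"accuracy: {p_hat:.1%} +/- {margin:.1%} "
      f"(95% CI: {p_hat - margin:.1%} to {p_hat + margin:.1%})")
```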

If business decisions are being made on the basis of data analysis, make it clear how much confidence to put in the result or decision makers will find it hard to trust the system — and don’t assume they’re not technically sophisticated enough to understand that.
