Data science is having its 15 minutes of fame.
Everyone from John Oliver of HBO’s “Last Week Tonight” to famed election statistician Nate Silver of 538.com is getting on a soapbox about the perils of believing data-based findings that lead to seemingly crazy conclusions.
John Oliver noted one particularly dodgy finding that a glass of wine was as healthy as an hour at the gym. Another “study” supposedly proved the benefits of a chocolate diet for pregnant moms. And other studies have found that the number of suicides by hanging, strangulation and suffocation is highly correlated with U.S. spending on science, space and technology.
As those of us working in business and data analytics know only too well, what these strange-but-unfortunately-true studies have in common is a failure to differentiate between data that shows correlations between variables (a statistician’s bread and butter) and data that establishes causality, meaning data-tested conclusions that one thing actually causes another.
And while such confusion may not matter much if it leads to a pregnant mom eating an extra Hershey bar or two, it could be deadly to your company’s bottom line.
It may seem obvious, but as professors who study and teach data analytics, we see this problem repeatedly. Some business examples:
One common marketing tool is sending coupons by mail, email or mobile device to entice customers to buy more products. To evaluate the marketing impact of this tool, companies commonly build a simple statistical model, instructing their statistics department to measure the difference between the purchases by customers who used the coupons and those who didn’t.
Such an analysis (unwisely left to statisticians who know little about actual marketing strategy) often fails to take into account that coupons tend to be sent most heavily to the company’s loyal customers. So unless the company controls for the level of sales it would have received from these loyal customers in the absence of coupons, the analysis will overstate the actual “coupon effect.”
The company, using the findings to make projections and set coupon strategy, could go “coupon crazy” and lose money unnecessarily.
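To make the coupon problem concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration, not data from any real company: the share of loyal customers, the targeting rule, the baseline spending levels and the true coupon effect of $5. It also compares customers who received a coupon rather than those who redeemed one, to keep the example short. It contrasts the naive comparison with one that compares customers within the same loyalty group.

```python
import random
from statistics import mean

def simulate_customers(n=20000, true_coupon_effect=5.0, seed=0):
    """Simulate hypothetical customers; loyal ones are more likely to get coupons."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        loyal = rng.random() < 0.3
        # Assumed targeting rule: loyal customers receive coupons far more often.
        coupon = rng.random() < (0.8 if loyal else 0.2)
        baseline = 60.0 if loyal else 20.0  # assumed spend with no coupon at all
        spend = baseline + (true_coupon_effect if coupon else 0.0) + rng.gauss(0, 5)
        rows.append((loyal, coupon, spend))
    return rows

def naive_effect(rows):
    """Difference in average spend: coupon customers minus everyone else."""
    with_coupon = [s for _, c, s in rows if c]
    without_coupon = [s for _, c, s in rows if not c]
    return mean(with_coupon) - mean(without_coupon)

def stratified_effect(rows):
    """Compare coupon vs. no coupon within each loyalty group, weighted by group size."""
    total = 0.0
    for group in (True, False):
        members = [(c, s) for l, c, s in rows if l == group]
        diff = mean(s for c, s in members if c) - mean(s for c, s in members if not c)
        total += diff * len(members) / len(rows)
    return total

rows = simulate_customers()
naive = naive_effect(rows)
adjusted = stratified_effect(rows)
print(f"naive estimate: {naive:.2f}  loyalty-adjusted estimate: {adjusted:.2f}")
```

Under these assumptions the naive estimate comes out several times larger than the $5 effect built into the simulation, while the loyalty-adjusted estimate lands close to it. The adjustment only works here because loyalty is the sole driver of who gets a coupon; a real targeting rule would have to be learned from the marketing team.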
Pricing is another area where correlation/causation confusion can be deadly. Imagine a coffee brand that is trying to estimate its price elasticity (i.e., how sensitive its customers are to price changes). To do that, its analysts grab weekly sales and price data from a grocery store where multiple coffee brands are sold. At this particular store, the data for this single brand shows, somewhat surprisingly, that when the price is increased, sales increase.
So should that lead to a corporate strategy of nonstop price increases? Of course not. The marketing team could probably explain that in grocery stores product prices often are managed at the category level (all brands of coffee together) rather than brand by brand. So in this case, when the focal brand’s price was increased, the competitors’ prices may have been increased even more, making the focal brand look cheaper by comparison.
Without information about what is happening in the market beyond the focal brand’s price, simple statistical models often produce misleading and confusing results.
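The coffee example can be sketched the same way. The simulation below is hypothetical: the demand equation, the price levels and the assumption that competitors’ prices move three times as much as the focal brand’s are all invented for illustration. It works in price levels rather than logs, so the slopes are price sensitivities, not elasticities in the strict sense. The point is only to show how leaving out the competitors’ price can flip the sign of the estimate.

```python
import random
from statistics import mean

def simulate_weeks(n=520, seed=1):
    """Simulate weekly data where coffee prices move together at the category level."""
    rng = random.Random(seed)
    weeks = []
    for _ in range(n):
        category_move = rng.gauss(0, 1.0)  # store-wide coffee price change
        own = 8.0 + 0.5 * category_move + rng.gauss(0, 0.3)
        comp = 8.0 + 1.5 * category_move + rng.gauss(0, 0.3)
        # Assumed true demand: sales fall with own price, rise with competitor price.
        sales = 100.0 - 10.0 * own + 8.0 * comp + rng.gauss(0, 2.0)
        weeks.append((own, comp, sales))
    return weeks

def cov(xs, ys):
    """Population covariance of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def simple_slope(x, y):
    """Least-squares slope of y on x alone."""
    return cov(x, y) / cov(x, x)

def two_var_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2, from the 2x2 normal equations."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return b1, b2

weeks = simulate_weeks()
own = [w[0] for w in weeks]
comp = [w[1] for w in weeks]
sales = [w[2] for w in weeks]

naive = simple_slope(own, sales)
b_own, b_comp = two_var_slopes(own, comp, sales)
print(f"own-price slope, ignoring competitors: {naive:.2f}")
print(f"own-price slope, controlling for competitors: {b_own:.2f}")
```

With these made-up numbers, the model that sees only the focal brand’s price reports that higher prices go with higher sales, while the model that also sees the competitors’ price recovers a negative own-price effect near the assumed value of -10.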
In the world of digital marketing, companies spend ever-increasing millions of dollars in keyword search advertising. To decide which keywords to budget for most aggressively, companies typically measure click-through rate (CTR) — i.e., how many times the ads were clicked relative to how often the ads appeared in search results. The higher the CTR, the larger the budget for that keyword.
But this can be quite misleading if data is the only starting point.
For one thing, the company might have started with a small budget for a certain keyword, say “adorable baby clothes.” Thanks to the bidding system designed by Google AdWords, when a user searches for the keyword phrase “adorable baby clothes,” only the companies who bid above a certain algorithmic baseline for that keyword will get their ads shown to the user. If your company didn’t spend enough on “adorable baby clothes,” you’d have zero information about how many clickthroughs it might have generated for your company.
What’s more, most people do not click on ads beyond the first page of search results. So even if your ad does show up when someone searches “adorable baby clothes,” if you didn’t pay enough to get first-page positioning, you’d be similarly in the dark about how many clicks (and sales) you could have garnered by paying for top billing.
It would probably be smart to hold off analyzing the data and have the marketing team do more selective and careful experimentation, bidding more for keyword phrases like “adorable baby clothes.” That could generate better data to help determine the best search strategies.
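One way to keep this blind spot visible is to refuse to report a click-through rate when nothing was observed, and to flag keywords with too little first-page exposure. The sketch below is only an illustration: the `KeywordStats` record, the `needs_experiment` helper, the 1,000-impression threshold and all of the numbers are hypothetical, and only “adorable baby clothes” comes from the example above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KeywordStats:
    keyword: str
    impressions: int             # times the ad was shown at all
    first_page_impressions: int  # of those, times shown on the first results page
    clicks: int

def ctr(stats: KeywordStats) -> Optional[float]:
    """Click-through rate, or None when the ad was never shown (nothing observed)."""
    if stats.impressions == 0:
        return None
    return stats.clicks / stats.impressions

def needs_experiment(stats: KeywordStats, min_first_page: int = 1000) -> bool:
    """Flag keywords whose data says little about first-page performance."""
    return stats.first_page_impressions < min_first_page

# Hypothetical figures, invented for illustration only.
keywords = [
    KeywordStats("baby clothes", 50000, 48000, 1500),
    KeywordStats("adorable baby clothes", 0, 0, 0),
    KeywordStats("cute baby outfits", 4000, 200, 8),
]

for k in keywords:
    rate = ctr(k)
    shown = "no data" if rate is None else f"{rate:.3%}"
    flag = "run a bidding experiment" if needs_experiment(k) else "enough first-page data"
    print(f"{k.keyword}: CTR {shown} ({flag})")
```

A keyword that was never shown gets “no data” rather than a CTR of zero, and a keyword with a low CTR but almost no first-page exposure is marked for experimentation instead of being cut from the budget.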
What to do about it?
The single most effective remedy for eliminating each of these problems is to understand the exact process from which the data are generated — the Data Generating Process (DGP). Each data set is a set of records describing part of what has happened — a search term was chosen, clicks did or did not happen. In marketing, most data sets describe the behavior of consumers. But the data sets discussed above do not present a complete picture of the decision-making process by consumers — what would they have done if you’d had the top search term? How much would they have bought if you didn’t offer them a coupon?
The result: marketing decisions are made as if a correlation were actually causality.
There are several things you can do to understand the DGP and avoid these all-too-common pitfalls:
1. Graph the data
Data visualization has attracted much attention in the world of big data, in part because it helps to interpret the data and present it to non-data-savvy audiences. Applying even the simplest data plots often dramatically helps analysts understand the data and find anomalous points in the sample. Only after plotting the data and gaining a thorough understanding of the data can the analyst find the modeling approach that best suits the data and the question he/she is trying to address.
2. Include both marketing brainpower and statistical brainpower when it comes time to analyze data.
Although graphing will help you understand the DGP, it is ideal to talk to people who were directly involved in the DGP. For example, if the marketing team decides to whom to send coupons, they could explain the exact decision rules that were used to select recipients. This information is critical for the analysts who develop the statistical models.
3. When possible, evaluate each data point and see whether a compelling story can be told about why it takes on the value it does, rather than another value.
For example, if you see a very low clickthrough rate for a keyword that you believe is highly popular and relevant to your ads, type it into a search engine and see what shows up, especially where your ad is shown. As a quick test of whether the analyst has a good understanding of the DGP, pick any data point and see whether the analyst can tell a complete story about it in layman’s terms.
4. When feasible, insist that your statisticians factor in behavioral-economic theory before finalizing their study.
Statisticians may insist that “we let the data choose the model,” but that can lead to mayhem if there are flaws in the data-generating process. Having a data analytics professional look over their shoulders and explain the reality of the marketing landscape will help shape their analysis.
As John Oliver can attest, these common goofs are feeding a boom in demand for smart people with both business-analytic backgrounds and an understanding of econometrics.
Knowing where the statistical land mines lurk can at least be a first step in avoiding a data disaster.
Xiaojing Dong is associate professor of marketing and business analytics and John Heineke is professor of economics and business analytics at Santa Clara University's Leavey School of Business.