There's a lot of excitement around big data these days in all industries, from retail and marketing to manufacturing and even information security. Everyone is excited by the prospect of turning easily accessible data into a goldmine. But data doesn't have to be big to be valuable.
For retailers and marketers in particular, sentiment data of the kind often captured on social networks like Facebook and Twitter can be especially valuable. Extracting actionable information from it often requires expensive data analytics and data visualization packages. But if you don't need to correlate many disparate streams of data, simpler tools may give you exactly what you need.
A group of researchers from the University of Rochester this year published a paper, "Modeling Spread of Disease from Social Interactions," which showed how they used the native Twitter Search API and support vector machine (SVM) algorithms to study the spread of infectious diseases.
Researchers Use Twitter to Study Contagion
"Imagine Joe is about to take off on an airplane and quickly posts a Twitter update from his phone," the authors—Adam Sadilek and Henry Kautz of the Department of Computer Science, and Vincent Silenzio of the School of Medicine and Dentistry—write in their paper.
"He writes that he has a fever and feels awful. Since Joe has a public Twitter profile, we know who some of his friends are, and from his GPS-tagged messages we see some of the places he has recently visited. Additionally, we can infer a large fraction of the hidden parts of Joe's social network and his latent locations by applying the results of previous work, as we discuss below," Sadilek, Kautz and Silenzio write.
"In the same manner, we can identify other people who are likely to be at Joe's airport, or even on the same flight. Using both the observed and inferred information, we can now monitor individuals who likely came into contact with Joe, such as the passengers seated next to him. Joe's disease may have been transmitted to them, and vice versa, though they may not exhibit any symptoms yet. As people travel to their respective destinations, they may be infecting others encountered along the way. Eventually, some of the people will tweet about how they feel, and we can observe at least a fraction of the population that actually contracted the disease."
Sadilek, Kautz and Silenzio conducted their work using the Twitter Search API, which allowed them to collect a sample of public tweets from the New York metropolitan area. They collected the tweets for a month beginning on May 18, 2010. They used a Python script to periodically query Twitter for all recent tweets within 100 kilometers of the city center, and they distributed the queries over a number of machines with different IP addresses that asynchronously queried the server to avoid exceeding Twitter's query rate limits. They merged the results and then concentrated on the 6,237 users who posted more than 100 GPS-tagged tweets during the month.
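The geographic filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' actual script: the city-center coordinates, the `(user, latitude, longitude)` tuple format and the function names are all assumptions made for the example.

```python
import math
from collections import Counter

# Assumed values for illustration; the article gives only the radius and the tweet threshold.
NYC_CENTER = (40.7128, -74.0060)  # approximate city-center latitude and longitude
RADIUS_KM = 100.0
MIN_TWEETS = 100  # keep users with MORE than this many GPS-tagged tweets


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def active_users(tweets, center=NYC_CENTER, radius_km=RADIUS_KM, min_tweets=MIN_TWEETS):
    """Return the users with more than min_tweets GPS-tagged tweets inside the radius.

    tweets is an iterable of (user, latitude, longitude) tuples (a hypothetical format).
    """
    counts = Counter()
    for user, lat, lon in tweets:
        if haversine_km(center[0], center[1], lat, lon) <= radius_km:
            counts[user] += 1
    return {user for user, n in counts.items() if n > min_tweets}
```

Calling `active_users` on the merged results from all query machines would yield the kind of narrowed population the researchers describe, assuming each tweet record carries a user ID and GPS coordinates.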
Once they had narrowed the population to users they could reliably follow geographically, they still needed to deal with class imbalance: health-related tweets are relatively scarce compared with other types of messages, so classifying them reliably is tricky. To do so, they trained two binary SVM classifiers (the SVM is an established classification model in machine learning) to distinguish between tweets that indicated the tweeter was sick and all other tweets. One SVM classifier was heavily penalized for producing a false positive (labeling a normal tweet as one about sickness), while the other was heavily penalized for producing a false negative (labeling a tweet about sickness as a normal tweet).
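One common way to penalize the two kinds of error differently is to scale the SVM's hinge loss by a per-class cost. The sketch below shows only that idea. The cost values, names and the +1/-1 label convention are assumptions for illustration, and the article does not say how the researchers set their penalties.

```python
def weighted_hinge_loss(score, label, fp_cost=1.0, fn_cost=1.0):
    """Hinge loss with a separate cost for each kind of mistake.

    score: raw classifier output (positive means "sick").
    label: +1 for a "sick" tweet, -1 for any other tweet.
    """
    violation = max(0.0, 1.0 - label * score)
    # Margin violations on "sick" tweets count as false negatives;
    # violations on all other tweets count as false positives.
    cost = fn_cost if label == 1 else fp_cost
    return cost * violation


# Hypothetical cost settings for the two classifiers described above.
CAUTIOUS = {"fp_cost": 10.0, "fn_cost": 1.0}  # strongly avoids false positives
EAGER = {"fp_cost": 1.0, "fn_cost": 10.0}     # strongly avoids false negatives
```

Under these assumed settings, a normal tweet scored at 0.5 costs the cautious classifier ten times as much as it costs the eager one, so training pushes the two models toward different decision boundaries.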
Part of that process involved weighting "features"—essentially keywords—to help the SVMs distinguish between "sick" and normal tweets. For instance, the feature "sick" in a message received a positive weight of 0.9579. However, the feature "sick of" received a negative weight of -0.4005, indicating a lower likelihood that the tweeter was ill.
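To see how those two weights interact, consider a toy linear scorer that sums the weights of every word and two-word phrase in a message. Only the two weights quoted above come from the research. The tokenization, the function names and the treatment of all other features as zero are simplifying assumptions.

```python
# The two weights reported in the article; every other feature is assumed to weigh 0.
WEIGHTS = {"sick": 0.9579, "sick of": -0.4005}


def features(text):
    """Split a message into lowercase words and adjacent two-word phrases."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def score(text, weights=WEIGHTS):
    """Sum the weights of all features found in the message."""
    return sum(weights.get(feature, 0.0) for feature in features(text))
```

With this sketch, "I am sick of homework" scores lower than "I feel sick" because the phrase "sick of" offsets part of the weight of "sick". The naive whitespace split does not strip punctuation, so a real pipeline would need a proper tokenizer.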
At the other end, they were able to extract more than 700,000 "sick" messages. The researchers then studied the movements of the users who posted these messages, using their Twitter friendships to gain deeper insight into how the contagion spread:
"To quantify the effect of social ties on disease transmission, we leverage users' Twitter friendships," they wrote. "Clearly, there are complex events and interactions that take place 'behind the scenes,' which are not directly recorded in online social media. However, this work posits that these latent events often exhibit themselves in the activity of the sample of people we can observe. For instance, as we will see, having social ties to infected people significantly increases your chances of becoming ill in the near future.
"However, we do not believe that the social ties themselves cause or even facilitate the spread of infection. Instead, the Twitter friendships are proxies and indicators for a complex set of phenomena that may not be directly accessible. For example, friends often eat out together, meet in classes, share items and travel together. While most of these events are never explicitly mentioned online, they are crucial from the disease transmission perspective. However, their likelihood is modulated by the structure of the social ties, allowing us to reason about contagion."
Marketers Use Twitter to Find Potential Customers
These techniques aren't just useful to researchers. Cold-remedy maker Cold-EEZE and social marketing firm Refine+Focus built Cold-EEZE's social marketing strategy around the research. Refine+Focus founder and CEO Zach Braiker explains that a Cold-EEZE community manager monitors Twitter for cold symptom indicators and then reaches out to form a connection with users tweeting about symptoms.
"We look for people who are expressing cough and cold symptoms," Braiker says. "We respond to nearly everyone that meets those certain criteria and often it creates a meaningful interaction. In some cases, it results in a real friendship."
He notes that this doesn't require Hadoop clusters or expensive data visualization solutions, just Twitter's Search API and a competent community manager.
"For our needs, we're able to use the Twitter interface directly because we have very specific searches that we have pregenerated," he says. "For the most part, what helps with the process at this level is to have a competent community manager that's constantly looking at the feeds and making a human decision to interact with somebody."
One example involves an athlete who was expressing concerns about a cough before participating in an Ironman competition. Using that information, Cold-EEZE sent a care package to help the athlete overcome the cough ahead of the race.
Genuine Interactions Are Essential
The key, Braiker says, is to create real, genuine interactions.
Whether you're looking at Twitter, Facebook or some other social network, he recommends identifying the people who will be most receptive to your message and then engaging them with quality conversations. Remember details about them and what they've said in the past. Use names and talk about things of substance.
"Sometimes companies make really big mistakes because they just start promoting themselves nonstop," he says. "It's almost like being on a date and not going through the process of trying to know someone first. It really undermines how those tools are best used. You have to genuinely care about their interests and create a true connection with real conversation."
Thor Olavsrud covers IT Security, Big Data, Open Source, Microsoft Tools and Servers for CIO.com.