Social Data Doesn't Have to Be Big Data to Be Useful

Mention 'social data' or 'sentiment' these days and a conversation about big data is sure to follow, but you don't necessarily need a Hadoop cluster to leverage unstructured data--sometimes all you need is Twitter's native Search API.

Tue, September 25, 2012

CIO — There's a lot of excitement around big data these days in all industries, from retail and marketing to manufacturing and even information security. Everyone is excited by the prospect of turning easily accessible data into a goldmine. But data doesn't have to be big to be valuable.

For retailers and marketers in particular, sentiment data, the kind often captured in social networks like Facebook and Twitter, can be especially valuable. Extracting actionable information from it often requires expensive data analytics and data visualization packages. But if you don't need to correlate many disparate streams of data, simpler tools may give you exactly what you need.

A group of researchers from the University of Rochester this year published a paper, Modeling Spread of Disease from Social Interactions, which showed how they used the native Twitter Search API and support vector machine (SVM) algorithms to study the spread of infectious diseases.

Researchers Use Twitter to Study Contagion

"Imagine Joe is about to take off on an airplane and quickly posts a Twitter update from his phone," the authors—Adam Sadilek and Henry Kautz of the Department of Computer Science, and Vincent Silenzio of the School of Medicine and Dentistry—write in their paper.

"He writes that he has a fever and feels awful. Since Joe has a public Twitter profile, we know who some of his friends are, and from his GPS-tagged messages we see some of the places he has recently visited. Additionally, we can infer a large fraction of the hidden parts of Joe's social network and his latent locations by applying the results of previous work, as we discuss below," Sadilek, Kautz and Silenzio write.

"In the same manner, we can identify other people who are likely to be at Joe's airport, or even on the same flight. Using both the observed and inferred information, we can now monitor individuals who likely came into contact with Joe, such as the passengers seated next to him. Joe's disease may have been transmitted to them, and vice versa, though they may not exhibit any symptoms yet. As people travel to their respective destinations, they may be infecting others encountered along the way. Eventually, some of the people will tweet about how they feel, and we can observe at least a fraction of the population that actually contracted the disease."

Sadilek, Kautz and Silenzio conducted their work using the Twitter Search API, which allowed them to collect a sample of public tweets from the New York metropolitan area. They collected the tweets for a month beginning on May 18, 2010. They used a Python script to periodically query Twitter for all recent tweets within 100 kilometers of the city center. To avoid exceeding Twitter's query rate limits, they distributed the queries over a number of machines with different IP addresses, each of which queried the server asynchronously. They merged the results and then concentrated on the 6,237 users who posted more than 100 GPS-tagged tweets during the month.
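
The researchers' script is not reproduced in the article, but the merge-and-filter step described above can be sketched in a few lines of plain Python. This is an illustration only, not the authors' code: the tweet record layout (`id`, `user`, `coords`), the function names and the city-center coordinates in the usage note are all assumptions, and the sketch makes no calls to Twitter itself.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def merge_batches(batches):
    """Merge result batches from several collectors, keeping one copy per tweet id."""
    seen = {}
    for batch in batches:
        for tweet in batch:
            # Hypothetical record layout: {"id": ..., "user": ..., "coords": (lat, lon) or None}
            seen.setdefault(tweet["id"], tweet)
    return list(seen.values())


def active_geo_users(tweets, center, radius_km=100.0, min_tweets=100):
    """Return users with more than min_tweets GPS-tagged tweets inside the radius."""
    counts = Counter()
    for tweet in tweets:
        coords = tweet.get("coords")
        if coords is None:
            continue  # skip tweets without a GPS tag
        if haversine_km(center[0], center[1], coords[0], coords[1]) <= radius_km:
            counts[tweet["user"]] += 1
    return {user for user, n in counts.items() if n > min_tweets}
```

Called as `active_geo_users(merge_batches(batches), center=(40.7128, -74.0060))`, where the coordinates are an assumed approximation of the New York city center, the sketch returns the set of users who clear the 100-tweet threshold. Deduplicating by tweet id matters because independent collectors polling the same area will often return overlapping results.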
