Analyzing Twitter data to reveal interesting trends, or even to predict events, has been gaining popularity for some time. However, one aspect of Twitter (and other social media) data is rarely discussed: whether the sample is representative of the population, and how that affects the reliability of the analytical results. Let’s look at two recent examples.
The first is a claim that tweets can be used to predict fluctuations in the stock market three days in advance. Two Indiana University-Bloomington researchers took a standard psychology tool that measures six mood states using 72 different adjectives and applied it to almost 10 million tweets posted by 2.7 million tweeters between February and December 2008. The researchers compared the national mood to the Dow Jones Industrial Average (DJIA) and “found that one emotion, calmness, lined up surprisingly well with the rises and falls of the stock market – but three or four days in advance.” To test whether Twitter could be used to tell the future, they “trained a machine-learning algorithm to predict whether the stock market would go up or down, first using only the [DJIA] from the past three days, then including emotional data. The algorithm did pretty well using stock market data alone, predicting the shape of the stock market with 73.3 percent accuracy. But it did even better when the emotional information was added, reaching up to 86.7 percent accuracy.”
While acknowledging that their algorithm is highly simplified, the researchers noted it’s reasonable to assume that people’s moods will have some effect on their investments, although more research is needed to figure out exactly how. Critics were skeptical, noting that not everyone on Twitter plays the stock market, or even lives in the U.S. One said he would like to see the algorithm used on tweets over a longer span of time.
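To make the setup in this first example concrete, here is a minimal sketch of how the two feature sets being compared (recent market data alone, then market data plus mood) could be laid out. It is not the researchers' code: the function name `build_rows`, the use of daily changes in the closing value, and the sample numbers are all assumptions made for illustration.

```python
from typing import List, Tuple


def build_rows(
    closes: List[float],
    calm: List[float],
    lags: int = 3,
    use_mood: bool = False,
) -> Tuple[List[List[float]], List[int]]:
    """Build (features, labels) for next-day up/down prediction.

    closes[i] is a hypothetical DJIA close on day i and calm[i] a
    hypothetical daily "calmness" score; both are illustrative inputs.
    """
    if len(closes) != len(calm):
        raise ValueError("closes and calm must cover the same days")

    features: List[List[float]] = []
    labels: List[int] = []
    # Day t needs `lags` earlier daily changes, so start at lags + 1.
    for t in range(lags + 1, len(closes)):
        # Market-only features: daily changes on days t-lags .. t-1.
        row = [closes[d] - closes[d - 1] for d in range(t - lags, t)]
        if use_mood:
            # Optionally append the mood scores for the same days.
            row += calm[t - lags:t]
        features.append(row)
        # Label is 1 if the market closed higher on day t than on day t-1.
        labels.append(1 if closes[t] > closes[t - 1] else 0)
    return features, labels


# Made-up numbers, purely to show the shape of the data.
closes = [100.0, 101.0, 99.0, 102.0, 103.0, 101.0]
calm = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

X_market, y = build_rows(closes, calm)
X_with_mood, _ = build_rows(closes, calm, use_mood=True)
```

Any classifier could then be trained once on `X_market` and once on `X_with_mood`, and the two accuracies compared. The critics' concerns apply to the inputs rather than to this layout: the mood scores come only from people who tweet, and the number of days available for testing limits how much weight an accuracy figure can bear.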
Our second example is a recently released map that portrays “The United States of Swearing” using geo-coded tweets as its data set. The cartographer received criticism for some of his design choices, particularly the way the legend represents the distribution of swearing across the U.S. He responded mainly by pointing out that a large degree of generalization was necessary because of the sampling limitations of geo-referenced Twitter data. The map is useful only for examining broad spatial trends, not the characteristics of specific localities, especially in rural areas where tweets are sparser in both spatial distribution and number. He has also published an explanation of his decisions on generalization and data binning.
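The point about sparse rural tweets can be illustrated with a small calculation. The sketch below is not the cartographer's method: it swaps in a standard Wilson score interval to show how much less certain a rate is when it rests on a handful of tweets. The function names, the 0.05 width threshold, and the tweet counts are assumptions chosen for illustration.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def is_reliable(hits: int, n: int, max_width: float = 0.05) -> bool:
    """Treat an estimate as usable only if its interval is narrow enough.

    max_width is an arbitrary illustrative threshold.
    """
    low, high = wilson_interval(hits, n)
    return (high - low) <= max_width


# Two hypothetical areas with the same observed rate (7.5%) of tweets
# containing a swear word, but very different sample sizes.
rural = (3, 40)        # 3 of 40 geo-coded tweets
urban = (750, 10_000)  # 750 of 10,000 geo-coded tweets

for name, (hits, n) in {"rural": rural, "urban": urban}.items():
    low, high = wilson_interval(hits, n)
    print(f"{name}: {hits}/{n}, interval {low:.3f} to {high:.3f}, "
          f"reliable={is_reliable(hits, n)}")
```

Both areas show the same raw rate, yet the interval for the sparse area is many times wider. Merging small areas into larger bins, as the map does, is one way to trade local detail for estimates that rest on enough tweets.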
Although the two stories have nothing in common in terms of subject matter, they highlight important, and often overlooked, caveats that come with using social media data. The disclosure by Huffman, the cartographer behind the swearing map, of the limitations of sampling Twitter data shows that although social media-derived data are a promising source of raw information, much work must be done before we can be confident that we are accurately identifying trends or predicting event outcomes.
Digital Sandbox is the leader in public safety risk management, providing analytic tools and information products to government agencies and large enterprises, for optimizing risk-based strategic, policy, and budgetary decisions.