DSBlog

Analyzing Twitter data to reveal interesting trends or even predict events has been gaining popularity for some time now. However, one characteristic of Twitter (and other social media) data is not often discussed: whether the sample is representative of the population, and how that affects the reliability of the analytical results. Let’s look at two recent examples.

First is a claim that tweets can be used to predict fluctuations in the stock market three days in advance. Two Indiana University-Bloomington researchers took a standard psychology tool that measures six mood states with 72 different adjectives and applied it to almost 10 million tweets from 2.7 million tweeters between February and December 2008. The researchers compared the national mood to the Dow Jones Industrial Average and “found that one emotion, calmness, lined up surprisingly well with the rises and falls of the stock market – but three or four days in advance.” To test whether Twitter could be used to tell the future, they “trained a machine-learning algorithm to predict whether the stock market would go up or down, first using only the [DJIA] from the past three days, then including emotional data. The algorithm did pretty well using stock market data alone, predicting the shape of the stock market with 73.3 percent accuracy. But it did even better when the emotional information was added, reaching up to 86.7 percent accuracy.”
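To make the setup in that quote concrete, here is a minimal sketch of how such a comparison could be framed. It is not the researchers' code or model: the function names, the `calm` series and all the numbers are made up for illustration. It only shows how "the past three days" become input rows, with or without a mood score, and how up/down accuracy is counted.

```python
from typing import List, Optional, Sequence, Tuple


def make_lagged_rows(
    djia: Sequence[float],
    calm: Optional[Sequence[float]] = None,
    lags: int = 3,
) -> Tuple[List[List[float]], List[int]]:
    """Build one feature row per day from the previous `lags` days.

    Each label is 1 if the index closed higher than the day before, else 0.
    If `calm` is given, its previous `lags` values are appended to the row.
    """
    if calm is not None and len(calm) != len(djia):
        raise ValueError("djia and calm must cover the same days")
    rows, labels = [], []
    for t in range(lags, len(djia)):
        features = list(djia[t - lags:t])
        if calm is not None:
            features += list(calm[t - lags:t])
        rows.append(features)
        labels.append(1 if djia[t] > djia[t - 1] else 0)
    return rows, labels


def directional_accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of days on which the predicted direction matched the actual one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Made-up numbers purely for illustration; not the researchers' data.
djia = [100.0, 101.0, 100.5, 102.0, 101.0, 103.0]
calm = [0.2, 0.4, 0.1, 0.5, 0.3, 0.6]

price_only_rows, labels = make_lagged_rows(djia)
with_mood_rows, _ = make_lagged_rows(djia, calm)
```

Any classifier could then be trained once on `price_only_rows` and once on `with_mood_rows`, with `directional_accuracy` used to score both on held-out days. Whether the extra mood columns help is exactly the question the representativeness of the tweet sample bears on.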

While acknowledging that their algorithm is highly simplified, the researchers noted it’s reasonable to assume that people’s moods will have some effect on their investments, although more research is needed to figure out exactly how. Critics were skeptical, noting that not everyone on Twitter plays the stock market, or even lives in the U.S. One said he would like to see the algorithm used on tweets over a longer span of time.

Our second example is a recently released map that portrays “The United States of Swearing” using geo-coded tweets as the data set. The cartographer who made it drew criticism for some of his choices, particularly the way he designed the legend to represent the distribution of swearing across the U.S. He responded mainly by pointing out that a large degree of generalization was necessary because of the sampling limitations of geo-referenced Twitter data. The map is useful only for examining broad spatial trends, not the characteristics of specific localities, especially in rural areas where tweets are sparser, both in spatial distribution and in number. The explanation of his decisions on generalization and data binning can be found here.
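The trade-off between detail and sample size can be illustrated with a small sketch. This is not the cartographer's method; the function name, the grid size and the `min_tweets` cutoff are assumptions chosen for illustration. Geo-coded tweets are grouped into coarse latitude/longitude cells, and any cell with too few tweets reports no rate at all rather than a misleading one.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Cell = Tuple[int, int]


def swear_rate_by_cell(
    tweets: Iterable[Tuple[float, float, bool]],
    cell_degrees: float = 2.0,
    min_tweets: int = 30,
) -> Dict[Cell, Optional[float]]:
    """Share of tweets containing a swear word, per grid cell.

    Each tweet is (latitude, longitude, contains_swear_word).
    Cells with fewer than `min_tweets` tweets map to None.
    """
    totals: Dict[Cell, int] = defaultdict(int)
    swears: Dict[Cell, int] = defaultdict(int)
    for lat, lon, has_swear in tweets:
        cell = (math.floor(lat / cell_degrees), math.floor(lon / cell_degrees))
        totals[cell] += 1
        swears[cell] += int(has_swear)
    return {
        cell: (swears[cell] / n if n >= min_tweets else None)
        for cell, n in totals.items()
    }


# Made-up points: two tweets fall in one urban cell, one in a sparse rural cell.
sample = [
    (40.1, -75.2, True),
    (40.9, -74.1, False),
    (46.0, -110.0, True),
]
rates = swear_rate_by_cell(sample, cell_degrees=2.0, min_tweets=2)
```

Larger cells and a higher cutoff give steadier numbers but blur local differences, which is why a map built this way supports only broad regional comparisons.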

Although they have nothing to do with each other in terms of subject matter, these two stories highlight important – and often overlooked – caveats that come with using social media data. The disclosure by Huffman, the map’s cartographer, of the limitations associated with sampling Twitter data makes the point: although social media-derived data are a promising source of raw information, much work must be done before we can be confident we are accurately identifying trends or predicting event outcomes.

––––––––––––––––

Digital Sandbox is the leader in public safety risk management, providing analytic tools and information products to government agencies and large enterprises, for optimizing risk-based strategic, policy, and budgetary decisions.

Last Sunday’s print edition of The New York Times features a full-page visualization of 2010 fatalities in Iraq and Afghanistan. Like many Times graphics, it does a good job of conveying a lot of detailed information in a confined space – notably how combat- and non-combat-related deaths among U.S. and coalition forces in Iraq have dropped as troops have been withdrawn and the focus has shifted to Afghanistan.

In Iraq, where troop levels fell by half, the 2010 death toll was 56, down from 141 in 2009. In Afghanistan, where a U.S. surge added 40% more troops, deaths totaled 696 last year, compared with 498 in 2009.
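For readers who want the year-over-year change behind those totals, a quick calculation using only the four figures quoted above (the helper name is ours):

```python
def percent_change(old: float, new: float) -> float:
    """Change from `old` to `new`, as a percentage of `old`."""
    return (new - old) / old * 100


iraq = percent_change(141, 56)          # 2009 -> 2010
afghanistan = percent_change(498, 696)  # 2009 -> 2010

print(f"Iraq: {iraq:+.1f}%")
print(f"Afghanistan: {afghanistan:+.1f}%")
```

That works out to a drop of roughly 60% in Iraq and a rise of roughly 40% in Afghanistan.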

The graphic draws on data from a variety of sources, including the Pentagon, with research from the Brookings Institution and design by Brooklyn-based mgmt.design.

Read the Times article here.

A similar version of the mgmt.design chart showing 2009 fatalities can be viewed here.

––––––––––––––––

There is no doubt the U.S. federal government has an interest in protecting information that, although unclassified, is sensitive enough that strict controls over its use and distribution are essential. This information, known as Sensitive But Unclassified (SBU), is a broad category that includes, but isn’t limited to, material covered by such designations as For Official Use Only (FOUO), Law Enforcement Sensitive (LES), Sensitive Homeland Security Information (SHSI), Sensitive Security Information (SSI) and Critical Infrastructure Information (CII).

According to the National Archives and Records Administration (NARA), there are currently over 100 ways of characterizing SBU information, each with its own policies and procedures for protecting it. The result is inconsistent categorization, which can leave sensitive information inadequately protected and/or unreasonably restrict access to benign information.

In an effort to eliminate this “inefficient” and “confusing patchwork” of policies and procedures for protecting SBU, the Obama Administration on November 4, 2010, issued Executive Order 13556, which established a Controlled Unclassified Information (CUI) program.

According to the order, the CUI program standardizes and simplifies the way the executive branch handles unclassified information that requires safeguarding or dissemination controls. At its core, the CUI program consolidates the characterization of SBU information under a universal set of guidelines.

Federal departments and agencies have been put on the fast track to implement the requirements of the order. By this spring each agency head must submit a catalog to NARA, the designated Executive Agent, with its proposed categories and subcategories of CUI. Within the same time frame, NARA is required to issue initial directives for implementation of the order. Once these directives are issued, agency heads will have 180 days to submit an implementation plan to NARA.

Time will tell how each agency will respond to the NARA directives and how this change will impact the designation process of CII, LES, and SSI, if at all. However, it is already clear that this order will dramatically change the way the U.S. federal government protects its information.
