Carnegie Mellon University researchers have shown that Tweets ("Twitter data", as they call them) can replicate poll-based opinion estimates for two major U.S. indicators, with some correlations "as high as 80%", suggesting that readily available text from social sites could complement and even replace organic polls in the future.
In a study titled "From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series," the researchers found that analyzing online social conversation could also "capture important large-scale trends."
The aim was to compare how two measures, a) consumer confidence and b) political opinion (specifically presidential job approval), were reflected in 2008 and 2009 by Twitter messages on the one hand and by traditional opinion polls on the other.
The researchers collected 1 billion Tweets between 2008 and 2009, through Twitter's API as well as the real-time flow. They estimate the number of daily messages at 100,000 to 7 million, with the variation due mainly to Twitter's own growth.
The "organic" polls that were used to compare consumer confidence were the Index of Consumer Sentiment (ICS) from the Reuters/University of Michigan Surveys of Consumers and the Gallup Organization's "Economic Confidence" index.
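To make the comparison concrete, here is a minimal sketch of how a daily Twitter sentiment series could be checked against a poll index. It assumes that the "correlation" quoted above is a Pearson correlation coefficient, which is this article's reading and not something confirmed here; the function name and the sample numbers are purely illustrative and are not data from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two series around their respective means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up numbers for illustration only (assumption, not study data):
# one Twitter sentiment value and one poll index value per period.
twitter_sentiment = [1.2, 1.4, 1.1, 1.6, 1.8, 1.7]
poll_index = [55.0, 57.0, 54.0, 60.0, 63.0, 61.0]

r = pearson(twitter_sentiment, poll_index)
print(f"r = {r:.2f}")
```

A value of r close to 1 would mean the two series rise and fall together; a correlation "as high as 80%" would correspond to r of about 0.8 under this reading.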
The study findings will be presented May 25 at the Association for the Advancement of Artificial Intelligence's International Conference on Weblogs and Social Media in Washington, D.C.
The Role Of Search
If mining Tweets becomes a faster and less expensive alternative to traditional polls, search will play a bigger role in those fields too: it will be badly needed to dig targeted messages out of the noise of billions of pieces of "freely available text content", as the University puts it.
Twitter's Search tool already supports searching for negative and positive sentiment on its advanced query page and offers filters called "Twitter Search operators" that allow users to sift through the buzzing noise... and information. Geolocation filtering is also available, either at a specific location or within up to 1,000 miles or kilometres of a given point.
Ethics is the key word in polls and surveys -- when they are done properly. Of course, exploiting data such as social conversation content means making some critical editorial calls. For instance, how should retweets be included in the analysis, especially since some of them are not clearly marked as such? How should they be weighted, given that they are repeats but still reflect people's opinions and, oftentimes, sentiment (through emoticons)?
That's the second point: how should emoticons be integrated, since they are signs rather than ordinary language? What about acronyms?
Finally, only a fraction of Twitter accounts are verified, and one has to be considered a "somebody" to get verified in the first place. So what about the fact that it is extremely easy to create multiple accounts without any verification? It would be extremely easy to rig such polls...
The study's authors also urged caution about "temporal smoothing", which they deem "crucial": they explain that, given the nature of real-time information, such smoothing - which relies on an elaborate calculation formula - "causes the sentiment ratio to respond more slowly to recent changes, thus forcing consistent behavior to appear over longer periods of time."
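The effect the authors describe can be illustrated with a small sketch. It rests on two assumptions that the quotes above do not confirm: that the daily "sentiment ratio" is the count of positive messages divided by the count of negative ones, and that the smoothing is a simple trailing moving average over a fixed window of days. The function names and the daily counts are hypothetical.

```python
def sentiment_ratio(positive, negative):
    """Ratio of positive to negative message counts for one day (assumed definition)."""
    if negative == 0:
        raise ValueError("ratio is undefined when there are no negative messages")
    return positive / negative


def moving_average(values, window):
    """Trailing moving average: each output averages a day and the window-1 days before it."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    # The first window-1 days have no full history, so they produce no output.
    for i in range(window - 1, len(values)):
        chunk = values[i - window + 1 : i + 1]
        smoothed.append(sum(chunk) / window)
    return smoothed


# Hypothetical (positive, negative) message counts for five consecutive days.
daily_counts = [(120, 100), (130, 100), (90, 100), (200, 100), (110, 100)]

ratios = [sentiment_ratio(p, n) for p, n in daily_counts]
smoothed = moving_average(ratios, window=3)

print("raw:     ", ratios)
print("smoothed:", smoothed)
```

In this sketch the one-day spike on the fourth day is spread across several smoothed values instead of appearing as a single jump. A longer window would flatten it further, at the price of reacting even more slowly to a genuine shift in opinion.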
With more than 105 million users, Twitter is growing at a mind-blowing rate of 1,382%. This alone has a major impact on opinion analyses: the demographics of the user population change, communication habits both differ and evolve, and languages multiply as the site's geographical spread increases.
The issue of the reference lexicon also arises: in English alone, tweets are "written in an informal social media dialect of English, with different and alternately spelled words", the researchers said. Now imagine the same problem in all the other languages. Using advanced Natural Language Processing (NLP) techniques in the future is one option, but adapting to the changing nature of language as social sites grow and mature remains a challenge.
Besides, the Carnegie Mellon study is based on Twitter but suggests the use of "text-based social media" in general as "millions of people broadcast their thoughts and opinions on a great variety of topics". Other such social sites are also expanding at an exponential rate and such sites, by definition, have a worldwide reach.
Save Money, Save Time, Go Global - Use Social Conversation
So is this a challenge or rather a huge opportunity for politicians to tap into on a global scale and adapt their decisions and moves accordingly?
If this trend is confirmed, expensive and time-consuming 20th-century survey and polling methodologies could indeed soon be replaced by the study of publicly available data on social sites -- if, and only if, the issues raised above are solved first.
And once that is done it will be a great opportunity.
Search professionals, it might soon be your time to shine in the political arena... and your help might well be needed long before that to sort through the data!