Sunday, January 15, 2012

Twitter Sentiment Analysis

In this day and age of instant communication, both the currency and the correctness of the data exchanged or published have increased in importance. In the old days, clocks used to have just the hour hand... now, to some, even the second hand seems to move too slowly. We have advanced to a state where every second, sometimes every millisecond, and even every microsecond matters.

As newer generations grow up with ubiquitous communication and become less inhibited about sharing personal details on the web, commercial consumers of such information want the ability to parse and analyze this content in aggregate in order to profit from it.

For example, on Twitter - the popular micro-blogging site - people spread across the world can report on local happenings almost instantaneously, well before news channels pick up the stories. And since large numbers of people tend to micro-blog about any event (and the number of people doing this will only increase over time), consumers can rely on "the wisdom of the crowds" to aggregate correct information - these tweets "go viral" - while false stories dissipate without really catching on.

We use Twitter merely as an example of a dynamic and evolving medium that quickly tracks public sentiment. We could just as easily have used Facebook or Google+ or any such similar medium for that matter, to similar effect.

So what are some applications of Twitter sentiment analysis? Some firms use sentiment analysis as a signal for investing in emerging-market or frontier-market economies. Others use it as a means of active marketing, or to determine the effectiveness of a particular marketing strategy (think micro- as well as macro-targeting). Political pundits might use it as a basis for predicting popular sentiment going into an election. Strategists might use it to determine which issues really touch the masses, and focus on positioning their candidates better in those areas. If geo-location data were attached to each tweet (currently it is an optional field, and only very few tweets have this data available), one could even determine how geographical factors play into people's opinions. In addition, one could estimate the "distance" between a tweeter on the ground and the incident s/he is reporting on, lending more credibility to eye-witnesses. Several other applications of tweets exist, and these will only become more obvious over time as the micro-blogging medium matures.

An interface to Twitter that can collect tweets in a timely manner and then analyze them in aggregate - to gauge the sentiment of the environment, or of the people as a whole, with regard to a particular subject - therefore becomes useful. This is what is popularly referred to as "sentiment analysis". We implement a very simple Twitter-feed sentiment analyzer in what follows. As with all our other examples, we build this in Python 2.x. We employ a very simple algorithm for sentiment analysis, as described below. More detailed discussions of what people are doing in this field are here, here, here, and here. And of course, while you're at it, it is not a bad idea to look at the Wikipedia article on the same.

Let us now consider what an implementation involves.
  1. First, we need to connect to and read a Twitter feed, and filter it for a particular topic of interest. To do this, we utilize the excellent open-source Twython API, available for free download on the Internet. Twython has extensive documentation with examples that enable us to easily connect to a Twitter feed and search on particular terms.
  2. Next, we generate a list of sentiment indicators. To do this, we may look at a sample of tweets, decide which words indicate positive and negative sentiment, and collect them for use in checks. Tweets that have more negative words than positive words may be considered negative tweets, and vice-versa (our analysis ignores satirical or sarcastic comments; these are mis-classified in this first implementation).
  3. We read the sentiment indicators and use them to parse and classify the tweets from (1) above. Next, we compute the number of negative and positive tweets as a percentage of the total. In the current implementation we only have a set of negative keywords and treat all non-negative tweets - i.e. tweets without at least one negative word in them - as positive tweets. In later versions we will expand on this to classify our tweets into three buckets: positive, negative, and neutral.
  4. Finally, we write our computed percentages of positive and negative tweets to file so we can study the temporal evolution of sentiment (i.e. evolution of sentiment regarding that topic over time).
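The classification logic of steps (2) through (4) can be sketched in isolation before we wire it up to a live feed. The word list and sample tweets below are illustrative stand-ins, not real Twitter data:

```python
# Sketch of the word-list classifier: a tweet containing any negative
# word is "negative"; everything else is "positive" (matching the
# two-bucket scheme of the first implementation).

NEG_WORDS = ["bad", "awful", "hate"]  # hypothetical sentiment indicators

def classify(tweet, neg_words):
    """Return 'negative' if the tweet contains any negative word, else 'positive'."""
    text = tweet.lower()
    for w in neg_words:
        if w in text:
            return "negative"
    return "positive"

def sentiment_split(tweets, neg_words):
    """Return (% positive, % negative) over a list of tweets."""
    total = len(tweets)
    if total == 0:  # guard: no tweets collected this round
        return (0.0, 0.0)
    neg = sum(1 for t in tweets if classify(t, neg_words) == "negative")
    return ((total - neg) * 100.0 / total, neg * 100.0 / total)

tweets = ["what an awful game", "great show tonight", "loved it", "I hate queues"]
pos_pct, neg_pct = sentiment_split(tweets, NEG_WORDS)
# pos_pct == 50.0, neg_pct == 50.0
```

Substring matching is deliberately crude (it would flag "hate" inside "whatever"); a word-boundary check is one obvious later refinement.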
Code follows below, with output underneath:
# twitter sentiment analysis 

# (c) Phantom Phoenix, The Medusa's Cave Blog 2012
# Description: This file performs simple sentiment analysis on global twitter feeds
#              Later versions may improve on the sentiment analysis algorithm

import os, sys, time; # import standard libraries
from twython import Twython; # import the Twitter Twython module
twitter = Twython(); # initialize it

f=open("neg.txt","r"); # open the file with keywords for sentiment analysis
fc=f.readlines(); # read it
wrds=[i.replace("\n","") for i in fc]; # save it for use
g=open("twit.txt","w"); # open the output file
g.write("@time, #tweets, sentiment index: % positive, % negative\n"); # write header to it
while True: # forever...
 search_results = twitter.searchTwitter(q=sys.argv[1], rpp="500");
 # search twitter feeds for the specified topic of interest with 
 # max results per page of 500.
 x=search_results["results"]; # grab results from search
 r=[]; # create placeholder to hold tweets
 for i in x: 
  txt="".join([j for j in i["text"] if ord(j)<128]); # strip non-ASCII characters from the tweet text
  # parse tweets and gather those not directed to particular users
  if txt.find("@")==-1: r+=[txt]; 

 neg=0; # set counter for negative tweets
 for i in r:
  for j in wrds: # check against the negative-word list, calculate how many negatives
   if i.lower().find(j)>-1:
    neg+=1; # at least one negative word: count this tweet as negative
    #print "neg,",i;
    break;
  else:
   #print "pos,",i; # treating non-negatives as positives in this iteration
   pass;            # this may change later as we evolve to support three categories

 #print "number of negatives: ",neg;
 if len(r)>0: # guard against rounds where no tweets were collected
  #print "sentiment index: positive: %5.2f%% negative: %5.2f%%" %((len(r)-neg)/float(len(r))*100,neg/float(len(r))*100);
  #print ",".join([time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),str(len(r)),str((len(r)-neg)/float(len(r))*100)[:5]+"%",str(neg/float(len(r))*100)[:5]+"%"]);
  g.write(",".join([time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),str(len(r)),str((len(r)-neg)/float(len(r))*100)[:5]+"%",str(neg/float(len(r))*100)[:5]+"%"])+"\n");
  g.flush(); os.fsync(g.fileno()); # write output to file, flush it, then sync it to disk immediately.
 time.sleep(180); # sleep for 3 mins then try again

g.close(); # close file after forever and exit program
sys.exit(0); # these two lines never reached but kept for completeness


@time, #tweets, sentiment index: % positive, % negative
2012-01-16 01:40:38,69,55.07%,44.92%
2012-01-16 01:43:38,58,60.34%,39.65%
2012-01-16 01:46:39,59,57.62%,42.37%
2012-01-16 01:49:40,48,64.58%,35.41%
2012-01-16 01:52:40,52,59.61%,40.38%
2012-01-16 01:55:41,52,67.30%,32.69%
2012-01-16 01:58:42,50,72.0%,28.0%
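To study the temporal evolution of sentiment from the log above, each data line of twit.txt can be parsed back into typed fields. A minimal sketch, assuming the line format shown in the sample output:

```python
# Parse one data line of the twit.txt log written by the script:
# "YYYY-mm-dd HH:MM:SS,count,pos%,neg%"

def parse_log_line(line):
    """Return (timestamp string, tweet count, % positive, % negative)."""
    stamp, count, pos, neg = line.strip().split(",")
    return stamp, int(count), float(pos.rstrip("%")), float(neg.rstrip("%"))

sample = "2012-01-16 01:40:38,69,55.07%,44.92%"
stamp, count, pos, neg = parse_log_line(sample)
# count == 69, pos == 55.07, neg == 44.92
```

From here the series can be fed to any plotting tool to visualize how sentiment on the topic drifts over time.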