Saturday, January 21, 2012

On the design of Recommender Systems

Recommender Systems or Recommendation Systems (henceforth RS) are used by many popular websites today to drive up traffic, sales, and overall customer satisfaction. They are used to great effect by Internet services such as Amazon and Netflix to generate more revenue. When implemented and used correctly, these systems can massively impact business metrics, driving up growth, revenue, and yes, even profitability (though this last metric also depends on other business parameters). In this blog post, we study some design factors involved in the construction of such systems.


Disclaimer: we do not claim any of these ideas below are original. We will try and add references later for some of the more important ideas, as time permits, to help the interested reader along a voyage of discovery.


1. Frequent Itemsets
This idea comes from the field of marketing analytics. Did you know that beer and baby diaper sales in the US are highly correlated? Apparently when dads go (or are sent) to the nearby store to pick up baby diapers, they take it upon themselves to restock on beer to make better use of the trip, help save the environment (drive to the store once, buy more things ...), etc. Stores that have customer loyalty programs can pick up on these kinds of trends more easily, since they know the contents of most "shopping baskets" and can determine which sets of items (called frequent itemsets) go together most often and are purchased most frequently. They can then decide upon appropriate marketing strategies to maximize revenue - e.g. run sales on both beer and diapers at the same time, or give out larger discounts on diapers than neighboring stores but hike up the price of beer to compensate.


Finding and tracking frequent itemsets is something of an art form, and requires data mining and statistical analysis. Firms do this with different levels of sophistication, and more complex methods are not necessarily better all the time...
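

To make the basket idea concrete, here is a minimal sketch (toy data only, not the method any particular store uses) of counting frequent item pairs and the confidence of the "people who bought A also bought B" rule. The baskets, item names, and support threshold below are invented for illustration.

from itertools import combinations
from collections import Counter

# toy shopping baskets (made-up data for illustration only)
baskets = [
    {"diapers", "beer", "wipes"},
    {"diapers", "beer"},
    {"diapers", "formula"},
    {"beer", "chips"},
    {"diapers", "beer", "chips"},
]

pair_counts = Counter()   # how often each pair of items appears together
item_counts = Counter()   # how often each single item appears
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

min_support = 2  # keep only pairs seen together in at least 2 baskets
frequent_pairs = dict((p, c) for p, c in pair_counts.items() if c >= min_support)

for (a, b), c in frequent_pairs.items():
    # confidence of the rule "a -> b": of the baskets containing a, how many also contain b
    print("%s -> %s  support=%d  confidence=%.2f" % (a, b, c, float(c) / item_counts[a]))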


Online stores can capitalize on this idea in other interesting ways. An RS for Amazon or any other online store can, for example, direct people who have purchased certain items to other items that people with similar purchasing histories have bought or considered buying (e.g. by tracking their page views). Here, the website is using information about customers and prospective customers to guide others along the same path.


The logic goes: "Buying a toy? Perhaps batteries are needed. 95% of people buying this toy also bought batteries at the same time. And oh, 40% of those buying this toy wanted this cool rubber casing with it too. And did you consider how much nicer this expensive toy would be with this mod so many others like to buy? ... " and so on. 


Or... like "Cool Hand Luke"? Maybe you will like other movies that people who liked this movie also liked? Perhaps you are more likely to enjoy movies that others liked if those people also liked most of the movies you say you liked, and few of the movies you say you do not like. (This is where user ratings come in handy at Netflix: your "basket" is the set of all movies you have rated, and it can be compared with other baskets to find ones with similar characteristics, locate the items yours does not have, and recommend those as movies to watch.)
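

As a rough illustration of the "baskets like yours" idea, the sketch below compares users' sets of liked movies with a simple Jaccard overlap and suggests titles from the closest basket. The user names, titles, and single-neighbor shortcut are made-up simplifications; this is not how Netflix actually computes its recommendations.

# toy data: each user's "basket" of liked (or highly rated) movies
liked = {
    "alice": set(["Cool Hand Luke", "The Sting", "Papillon"]),
    "bob":   set(["Cool Hand Luke", "Papillon", "Bullitt"]),
    "carol": set(["Mamma Mia!", "Grease"]),
}

def jaccard(a, b):
    # overlap between two baskets: |A intersect B| / |A union B|
    return len(a & b) / float(len(a | b))

def recommend(user, k=1):
    mine = liked[user]
    # rank the other users by how similar their basket is to this user's
    others = sorted((u for u in liked if u != user),
                    key=lambda u: jaccard(mine, liked[u]), reverse=True)
    # suggest titles the most similar user(s) liked that this user has not seen yet
    suggestions = set()
    for u in others[:k]:
        suggestions |= liked[u] - mine
    return suggestions

print(recommend("alice"))  # bob has the closest basket, so this suggests "Bullitt"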


Not too intrusive (not overtly anyway), but helpfully plugging just the right products for you to find. And of course, there is a search facility that lets you find not just what you need, but multiple options for similar things, all neatly indexed in decreasing order of potential relevance, with the option for you to sort things along other parameters you might like.


2. The Wisdom of Crowds
Ever wonder what makes Amazon.com so much more successful than other book stores with an online presence? A key factor is the facility that enables customers to share their experiences about a product or service sold through the store-front. For instance, if on site A, I can read all about a book I want to buy, see other customers' reviews of the same, see customer posted product pictures, also see other vendors' prices with the ability to comparison shop, and share my feedback either positive or negative with the larger body of consumers, why would I not pick this site over competitors', especially if I can also get better deals there?


MBAs are usually taught that some platforms are "multi-sided markets" (think credit cards, or development APIs, for example). Developing such platforms from the ground up can be a challenge, because you need to grow all the communities around the platform simultaneously. But once there is sufficient interest and a core ecosystem becomes functional, a virtuous cycle kicks in and the business can grow very rapidly. Of course, in an environment with increasing "clock-speed" (reduced cycle times), successes and failures typically become evident very quickly as well. Agile companies can just as quickly increase investment in ideas that seem to work, and kill those that are doing badly. But we digress...


With the advent of media like Facebook and Twitter, many websites now permit users to share their purchases with friends from those and other communities. This creates a following... "I am more likely to like things my friends like, so perhaps I should check out the things they are buying..." and so on. This is kind of a subliminal recommendation system. Works on the sly. See model (5) below for more.


Wikipedia is, to date, perhaps the best positive example of the wisdom of crowds. But even there, the truly wise are very few... and there are far, far more readers than writers. The same applies to product ratings. Most people will, at best, only rate a product. The people who write reviews are a much smaller number. And those who do so without a hidden motive (there have been several reported instances of restaurants reviewing themselves on Yelp, of authors glowingly reviewing their own books on Amazon, etc.) are an even smaller subset.


3. Free Samples for All
This builds on ideas from (1) above. So much better to give people the option to "try before you buy". Consuming books, movies, songs, etc. requires an investment of time, money, and effort on the part of the consumer. Can you make them understand that the 2 hrs it takes to watch this movie and the $12.95 it takes to buy it are actually worth it, as opposed to, say, doing something else with that money and time? Give them a free preview. Tell them "no pressure, we think this is something you'd like, and as you can see, it is really not that expensive. Check out how much other people are selling it for. And look how much other people like you seem to like it! Just buy it now from us, and don't worry about it till it is time to pay your credit card bill." That's what free samples do. Of course, occasionally you will have free-loaders who take your samples and never buy, but that's a chance you have to take.


4. A Thematic Match
Sometimes, it is also convenient and useful for systems to generate a thematic match. If you like certain movies, for example, rather than look for movies that others who liked your favorites have watched (which requires gobs of data), why not recommend other movies with similar thematic elements that will in all likelihood interest you? Or perhaps you liked the movies you liked because of the actor or actors in them? So other movies with similar "elements" - actors, story lines, and so on - might interest you? Of course, finding exactly what makes someone like a certain movie requires some deep statistical analysis of what similar elements exist across their viewing history, or (here's a novel idea!) asking them to tell you what kinds of movies they like. "Do you like to watch mysteries frequently? Sometimes? Never? What about dark, gritty thrillers?"
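

One simple way to sketch a thematic match is to tag each title with weighted theme "elements" and compare the resulting vectors with cosine similarity. The titles, tags, and weights below are invented purely for illustration.

import math

# toy theme vectors: titles and weights are made up for illustration
themes = {
    "movie_a": {"mystery": 1.0, "gritty": 0.8, "thriller": 0.6},
    "movie_b": {"mystery": 0.9, "thriller": 0.7},
    "movie_c": {"romance": 1.0, "comedy": 0.8},
}

def cosine(u, v):
    # cosine similarity between two sparse theme vectors stored as dicts
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def thematic_match(liked_title):
    # recommend the title whose theme vector is closest to the one the user liked
    candidates = [t for t in themes if t != liked_title]
    return max(candidates, key=lambda t: cosine(themes[liked_title], themes[t]))

print(thematic_match("movie_a"))  # movie_b shares the mystery/thriller elements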


Thematic matching is an interesting scheme that can get results quickly with information from the public domain and some patient customer prodding. It is easier to get started this way: build a community of loyal users, then exploit the data they leave behind as they engage more actively with your website. That gives you more data to mine with models (1) and (2) above, which gives you more sales, which encourages your content suppliers to give you more content for free samplers - model (3) - now that your ecosystem is humming along, and so on...


5. Link to Social Media
Think also of social media used to create a buzz. This is a key new development in our age. "OMG! I just read Girl With The Dragon Tattoo and loved it!" someone tweets or posts on Facebook. Immediately her network of 132 people sees this, and some of them will buy or borrow the book and read it. Things go viral fast. It used to be said that it takes years to build a reputation and only a moment to destroy it. These days, it seems you can build and destroy reputations (ok, marketing buzz) in minutes... and you can amplify this effect if there are groups set up around certain themes, because then the uptake for anything posted into such a forum is very high, and you can get a large spike in volumes (sales or whatever else, depending on the medium) very quickly.


6. The Human Unchained
Most importantly of all, unleash the human within your consumer. Exploit their need to feel heard, to talk, to feel good about what they have found that makes them smarter, wiser, more beautiful, special in their own way when they compare themselves with those around them. Encourage them to "like" products. Encourage them to describe their experiences using these products. Then mine these descriptions for statistically improbable phrases (SIPs - see the post on tf-idf for more) that more likely and more readily associate themselves with particular products. Use these words or phrases to help you with your thematic match (model 4 above).
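

As a rough sketch of this kind of mining, the snippet below scores review words with a tf-idf style weight so that terms frequent in one product's reviews but rare elsewhere float to the top. The reviews are toy strings, and true SIP extraction works on phrases rather than single words.

import math
from collections import Counter

# toy review text per product (made up for illustration)
reviews = {
    "toy_robot": "great robot fun robot batteries not included",
    "novel":     "gripping plot great characters slow start",
    "blender":   "powerful motor loud great smoothies",
}

docs = dict((k, v.split()) for k, v in reviews.items())
n_docs = len(docs)
df = Counter()   # document frequency: how many products' reviews mention each word
for words in docs.values():
    df.update(set(words))

def distinctive_words(product, top=3):
    tf = Counter(docs[product])   # term frequency within this product's reviews
    scores = dict((w, tf[w] * math.log(float(n_docs) / df[w])) for w in tf)
    return sorted(scores, key=scores.get, reverse=True)[:top]

print(distinctive_words("toy_robot"))  # "robot" scores highest: frequent here, absent elsewhere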


Captchas were a great invention to prevent bots from harvesting Internet information for unscrupulous use. However, crafty websites can leverage human intelligence by requiring humans to solve a "flow through" captcha to unlock information from the original website for bot consumption. These and related ideas can be used both for good and bad... so caution is warranted. 


Good implementations will mine data in the aggregate, or, if they use data tied to particular individuals, will irrevocably destroy the link between the identifying characteristics of the individual in question and the data set that pertains to them. Either of these conditions must be met for safe implementations of data mining for user-centric analytics.
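

A minimal sketch of the "aggregate or de-identify" point, using a made-up purchase log: either keep only per-item aggregates, or replace user ids with random pseudonyms and then discard the lookup table so the link back to individuals is gone.

import uuid
from collections import Counter

# toy purchase log (made-up data): (user id, item)
purchases = [("user_1", "beer"), ("user_1", "diapers"), ("user_2", "beer")]

# option 1: keep only aggregates - no user ids survive at all
item_totals = Counter(item for _, item in purchases)

# option 2: replace user ids with random pseudonyms; discarding the mapping
# afterwards removes the link between individuals and their records
mapping = {}
def pseudonym(user_id):
    if user_id not in mapping:
        mapping[user_id] = uuid.uuid4().hex
    return mapping[user_id]

deidentified = [(pseudonym(u), item) for u, item in purchases]
mapping.clear()  # once cleared, the pseudonyms cannot be traced back to real users

print(item_totals)
print(deidentified)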


Other examples of human factors use in similar systems: 


In India, the police have taken to using Facebook to permit people on the road to report traffic violations involving other people's vehicles. Photographs or video of offenders can be posted online via Facebook or other social media sites, and the law enforcement machinery will take action as and where deemed necessary. Be careful, Big Brother is watching you, and your big brothers are your fellow citizens - in a more than figurative fraternal sense.


Want to hear a weather report for free? Call us, then hear an ad and tell us what you think, service is "free".


7. Putting it all together
As we have remarked over the course of this discussion, some of these elements require less data and less user participation than others, and can be built more easily. Then, as the business grows, one can leverage the larger data set gleaned from the user base to build more sophisticated analytics: better exploit customer information to provide better service and increase loyalty, lock in customers by raising barriers to switching, or offer deals tailored to consumer segments that make it attractive for them to return time and again.


Sadly, building these kinds of systems, even rudimentary ones, requires a rather substantial amount of data, and this data is not easily available to people like us who are interested in exploring the leading edges of technology. So for now, we close this post without a full sample implementation, while expressing the fond hope that when such data becomes available more freely (after all, it is only aggregate data, and doesn't really violate privacy), we might revisit the discussion and provide a working implementation of at least some of the above.

Monday, January 16, 2012

Genetic Algorithms

Conway's Game of Life simulates the behavior of a colony of living beings, represented by pixels in a plane, using a set of simple rules. It is amazing what complex interactions result from such simple implementations. Langton's Ant is another such creation (for those interested, there is a not-too-simple Project Euler problem involving it). These examples of what we might generally interpret as "artificial life" are interesting, but can ideas like this be used to solve real problems?


Genetic Algorithms (GA for short henceforth) emulate the propagation of life in an attempt to leverage randomness and simple rules of reproduction and mutation to solve difficult problems, in some cases more easily than more traditional techniques can. They belong to a class of what are commonly referred to as Randomized Algorithms, which also includes mathematical analogues of physical phenomena such as simulated annealing ("annealing" is the process of heating a material such as glass or metal and then cooling it gradually to room temperature so that it does not fracture or retain internal stresses).


Genetic Algorithms have various applications, for example in solving complex dynamic optimization problems, but to illustrate the principles and have some fun in the process, we use them to solve simple Sudoku puzzles in our implementation below. As almost everyone knows, Sudoku is a puzzle in which one is presented with a 9x9 grid of cells, some with numbers already filled in; the grid is also divided into nine 3x3 blocks. The goal is to fill in the entire grid with numbers subject to the condition that every 3x3 block uses the numbers 1 through 9 exactly once, with the same condition applying to each row and each column of the 9x9 grid as well. There are thus a total of 9+9+9=27 constraints, and the puzzle can be solved effectively through constrained optimization or through creative use of GAs.


One issue with GAs is that, being a randomized process, their time to convergence to a complete solution cannot be guaranteed. We see this with our implementation - simple Sudoku puzzles are solved easily by our program, but for the more devious or "evil" level ones, the program thrashes around trying to find a solution, and convergence is very slow. Performance can be improved a lot by building a hybrid method - i.e. employing mathematical or other logical heuristics to fill in some of the cells in the harder problems before handing them off to the GA, to ensure faster convergence - but we do not do that here because our focus is on building a sample implementation of GAs, not boiling the ocean. So, let's stick to the idea at hand, and get right to solving the problem.


How to solve Sudoku with GAs? Here is the approach we will use:

  1. generate a "board" in string form. this is a string that is 81 characters long, with numbers where there are numbers, and "x" elsewhere to denote empty spaces. a correctly solved board should have all digits from 1 to 9 exactly 9 times, so the sum of the digits in the string in a solved board must always equal (1+2+...+9)*9 =45*9=405.
  2. "unfold" the string to gather all constraints into a sequential string - this string now has 81+81+81 characters (the first 81 indicates the row constraints, the second 81, the column constraints, and the last 81, the block constraints) for a total of 243 characters.
  3. generate random "chromosomes" - say a few thousand at a time (5,000 in our code) - for our base gene pool. these chromosomes fill in the "x" locations of the board with numbers from 1 through 9, chosen at random. collect these chromosomes, keeping only those that meet the requirements of a proper solution - say, the sum of all digits should be 405. more stringent conditions here lead to better chromosomes that take longer to compute; lenient conditions generate worse chromosomes that are computed quickly.
  4. rank the chromosomes in the gene pool based on how well they approximate what might be a correct solution to the Sudoku problem - are all digits used 9 times each, none more, none less? is the grid sum close to 405? what conditions we use here in the "fitness" constraint depend on what conditions we used in (3) above in the gene pool generation process.
  5. after the gene pool is created and chromosomes ranked for fitness, use Darwin's idea of "survival of the fittest" to keep only the fittest few (the top 5% in our implementation), discarding the rest. then, use mutation and crossover to generate the next version of the pool.
  6. at each iteration of the pool, check to see if any chromosome meets all the criteria of the correct solution. if it does, then that chromosome is the correct solution to the puzzle and we stop there. if none does, keep iterating with more generations of the gene pool till a complete solution is found.
  7. print the final solution and exit.


Before going too far though, we first have to explain what mutation and crossover are. Mutation is a small change in the chromosome. Some small subset of bits can be changed arbitrarily - i.e. some digits may be randomly changed to other digits in one or more locations in the chromosome. Crossover is a simulation of sexual reproduction. We take two chromosomes (call them "parents") and generate two off-spring from them by slicing the parents into parts and splicing these parts together keeping in mind the required length constraints of the generated off-spring. Recall from (5) above that we only allow the top 5% of the gene pool to reproduce. What % of the gene pool is newly randomly generated, vs. generated by mutation or by crossover is a design choice, and some tweaking here with a view to optimizing run time, could greatly enhance the performance of the GA implementation.


Code follows along with a sample input and output file. A couple of optimizations can be made to the code below to make it even more efficient. For instance, each new chromosome can be generated by first filling in the empty spaces with the smallest number of options available, then moving on to the harder empty cells, just like humans solve the puzzle. This is left as an exercise to the reader :-)



import os, sys, operator;
from itertools import *;
from random import *;


def getCol(x,n): # get column n from grid x
 r=[];
 r+=[i[n] for i in x];
 return r;


def mkInt(c): # convert all numbers into ints in a list
 r=[];
 for i in c: r+=[[int(j) if j!='x' else j for j in i]];
 return r;


def genConstraints(x):
 # collect the constraints now
 constraints=[i for i in x]; # horizontal constraints
 for i in range(9): constraints+=[getCol(x,i)]; # vertical constraints
 # finally, the block constraints
 b=[[0,1,2],[3,4,5],[6,7,8]];
 for i in product([0,1,2],repeat=2):
  p,q=i;
  #print "p,q => ",p,q;
  c=[];
  for i in b[p]:
   t=[];
   for j in b[q]: t+=[x[i][j]];
   #print t;
   c+=t;
  constraints+=[c];
 # now we have all 27 constraints: 9 horizontal, 9 vertical, 9 block
 return mkInt(constraints);


def mkGrid(s): # convert a puzzle string to a grid
 r=[];
 for i in range(9): r+=[s.split(",")[i*9:(i+1)*9]];
 return r;


def grabDigit(x): # get a random digit from a list of digits
 shuffle(x);
 return x[0];


def genChr(x): # generate a chromosome
 r=[];
 for j in range(9):
  nums=[int(k) for k in x[j] if k!='x'];
  missing_nums=list(set(range(1,10)).difference(set(nums)));
  t=[];
  for i in range(9): 
   if x[j][i]!='x': t+=[int(x[j][i])]
   else:
    m=list(set(fx[(j,i)]).intersection(set(missing_nums)));
    if len(m)==0: m=fx[(j,i)];
    a=grabDigit(m); 
    t+=[a];
    if a in missing_nums: missing_nums.remove(a);
  r+=[t];
 return r;


def validChr(x): # is it a valid chromosome? checks commented out for speed - always returns True
 #for i in range(9): 
 # p=getCol(x,i);
 # if sum(p)!=45: return False;
 # if list(set(p))!=9: return False;
 return True;


def scoreChr(x): # score the chromosome
 t=genConstraints(x);
 score=0;
 for i in t: 
  if len(list(set(i)))==9: score+=1;
  #if sum(i)==45: score+=1;
 return score;
  
def printGrid(x): # print the grid 
 for i in x: 
   print " ".join(map(str,i));
 return;


def xover(a,b): # cross-over two chromosomes, get two descendants
 splice_loc=randint(1,7);
 ta=a[:splice_loc]+b[splice_loc:];
 tb=b[:splice_loc]+a[splice_loc:];
 return (ta,tb);


def xover2(a,b): # another method for cross-over
 speed=randint(1,6);
 ta,tb=a[:],b[:];
 for j in range(speed):
  x_loc=randint(1,7);
  t=ta[x_loc]; 
  ta[x_loc]=tb[x_loc];
  tb[x_loc]=t;
 return (ta,tb);


def mutate(a): # mutation
 speed=randint(1,6); # how many mutations in one go
 for i in range(speed):
  gene_seg=randint(0,8); # which gene segment to change
  x=genChr(g);
  a[gene_seg]=x[gene_seg];
 return a;   


p=open("s4.txt","r"); # open the puzzle file
pc=p.readlines(); # read it
pc=[i.replace("\n","") for i in pc];
pstr=",".join([i for i in pc]);
g=mkGrid(pstr);
mx=genConstraints(g);


# collect the constraint indices for each cell...
# find which numbers might fit in each cell; used when generating chromosomes
b=[[0,1,2],[3,4,5],[6,7,8]];
cx,fx={},{};
for i in range(0,9):
 for j in range(0,9):
  for k in range(0,3):
   if i in b[k]: x=k;
   if j in b[k]: y=k;
   block_id=x*3+y;
   cx[(i,j)]=(i,9+j,18+block_id);
   fx[(i,j)]=list(set(range(1,10)).difference(set(mx[i]).union(set(mx[9+j])).union(set(mx[18+block_id]))));
# now fx is a dict that contains all possible values for each empty cell
t=fx.values();
for i in range(len(t)):
 if len(t[i])==1: 
  posx,posy=fx.keys()[i];
  if g[posx][posy]=='x': g[posx][posy]=t[i][0];


# all constraint strings must have all digits from 1 to 9 just once
# all constraint strings must add up to 9*10/2=45 
poolsz=5000; # size of gene pool
gene_pool=[];
while len(gene_pool)!=poolsz: # generate chromosomes to fill pool
 seed();
 t=genChr(g);
 while not validChr(t): t=genChr(g);
 if t not in gene_pool: gene_pool+=[t];


seen=gene_pool[:]; # keeps track of all seen chromosomes


# set the parameters for new generations
retain_pct=0.05;
mutate_pct=0.35;
xover_pct=0.55;
new_dna=1-retain_pct-mutate_pct-xover_pct;
generation,solved=0,0;
scores=[]; # store best scores in each generation


# solve the puzzle
while solved==0:
 genes=[];
 for i in gene_pool: genes+=[(i,scoreChr(i))];
 genes=sorted(genes,key=operator.itemgetter(1),reverse=True);
 if genes[0][1]==27: 
  print "puzzle solved";
  printGrid(genes[0][0]);
  solved=1;
 else:
  if len(scores)>3 and len(list(set(scores[-3:])))==1: 
   seed(); 
   jumpahead(randint(1,100000));
   poolsz=randint(poolsz,poolsz+500);
   retain_pct=0.1;
   mutate_pct=round(uniform(0.25,0.3),2);
   xover_pct=0.95-mutate_pct;
   new_dna=1-retain_pct-mutate_pct-xover_pct;
  print "generation: ",generation,"pool size: ",len(genes)," max score: ",genes[0][1];
  #printGrid(genes[0][0]);
  scores+=[genes[0][1]];
  generation+=1;
  seed();
  chr_retained=int(poolsz*retain_pct);
  new_pool=[j[0] for j in genes[:chr_retained]]; # retain the fittest chr_retained genes
  for k in range(int(xover_pct*poolsz/2.)): # fill about xover_pct of the pool with cross-over genes, two per iteration
   seed();
   a,b=new_pool[randint(0,len(new_pool)-1)][:],new_pool[randint(0,len(new_pool)-1)][:];
   toss=randint(0,1);
   if toss==0: ta,tb=xover(a,b);
   else: ta,tb=xover2(a,b);
   if ta not in seen: new_pool+=[ta];
   if tb not in seen: new_pool+=[tb];
  for k in range(int(mutate_pct*poolsz)): # fill about mutate_pct of the pool with mutated genes
   seed();
   a=new_pool[randint(0,len(new_pool)-1)][:];
   ta=mutate(a);
   if ta not in seen: new_pool+=[ta];
  while len(new_pool)<poolsz: # new genetic material
   seed();
   ta=genChr(g);
   if ta not in seen: new_pool+=[ta];
  gene_pool=new_pool[:];
  seen+=gene_pool[:];


sys.exit(0);


  
Input File: 
7,9,x,x,x,x,3,x,x
x,x,x,x,x,6,9,x,x
8,x,x,x,3,x,x,7,6
x,x,x,x,x,5,x,x,2
x,x,5,4,1,8,7,x,x
4,x,x,7,x,x,x,x,x
6,1,x,x,9,x,x,x,8
x,x,2,3,x,x,x,x,x
x,x,9,x,x,x,x,5,4




Output File:
generation:  0 pool size:  5000  max score:  12
generation:  1 pool size:  5000  max score:  15
generation:  2 pool size:  5000  max score:  17
generation:  3 pool size:  5000  max score:  19
generation:  4 pool size:  5000  max score:  22
generation:  5 pool size:  5000  max score:  25
generation:  6 pool size:  5000  max score:  25
generation:  7 pool size:  5000  max score:  25


puzzle solved
7 9 6 8 5 4 3 2 1
2 4 3 1 7 6 9 8 5
8 5 1 2 3 9 4 7 6
1 3 7 9 6 5 8 4 2
9 2 5 4 1 8 7 6 3
4 6 8 7 2 3 5 1 9
6 1 4 5 9 7 2 3 8
5 8 2 3 4 1 6 9 7
3 7 9 6 8 2 1 5 4









Sunday, January 15, 2012

Twitter Sentiment Analysis

In this day and age of instant communication, both the currency and the correctness of data exchanged or published have increased in importance. In the old days, clocks had just the hour hand... now even the second hand appears to some to move too slowly. We have advanced to a state where every second - and sometimes every millisecond, even every microsecond - matters.

In such a world, where newer generations are growing up with ubiquitous communication and are less inhibited about sharing personal details on the web, commercial consumers of such information want the ability to parse and analyze this content in aggregate in order to profit from it.

Take Twitter - the popular micro-blogging site - as an example: people spread across the world can report on local happenings almost instantaneously, well before news channels pick up the stories. Since large numbers of people tend to micro-blog about any event (and the number of people doing this will only increase over time), consumers can rely on "the wisdom of the crowds" to aggregate correct information (such tweets "go viral"), while false stories dissipate without really catching on.

We use Twitter merely as an example of a dynamic and evolving medium that quickly tracks public sentiment. We could just as easily have used Facebook or Google+ or any such similar medium for that matter, to similar effect.

So what are some applications of twitter sentiment analysis? Some firms use sentiment analysis as a signal for investing in emerging market or frontier market economies. Others use this as a means of active marketing or even to determine the effectiveness or efficacy of a particular marketing strategy (think micro- as well as macro- targeting). Political pundits might use this as a basis for predicting popular sentiment going into an election. Strategists might use it to determine what issues really touch the masses and focus on positioning their candidates better in those areas. If geo-location data was attached to each tweet (currently it is an optional field and only very few tweets have this data available), one can even determine how geographical factors play into people's opinions. In addition, one can form an opinion as to the "distance" between a tweeter on the ground and the incident s/he is reporting about, lending more credibility to eye-witnesses. Several other applications of tweets exist, and these will only become more obvious over time as the micro-blogging medium matures.

An interface into Twitter that can collect tweets in a timely manner and analyze them in aggregate to gauge the sentiment of people as a whole on any particular subject therefore becomes useful. This is what is popularly referred to as "sentiment analysis". We implement a very simple Twitter feed sentiment analyzer in what follows. As with all our other examples, we build this in Python 2.x. We employ a very simple algorithm for sentiment analysis, as described below. A more detailed discussion of what people are doing in this field is here, here, here, and here. And of course, while you're at it, it is not a bad idea to look at the Wikipedia article on the same.

Let us now consider what an implementation involves.
  1. First, we need to connect to, and read a twitter field, and filter it for any particular topic of interest. To do this, we utilize the excellent Open Source Twython API available for free download on the Internet. Twython has extensive documentation with examples that enables us to easily connect to a twitter feed and search on particular terms.
  2. Next, we generate a list of sentiment indicators. To do this, we may look at a sample of tweets, decide which words indicate positive and negative sentiment, and collect them for use in checks. Tweets that have more negative words than positive words may be considered negative, and vice-versa (our analysis ignores satirical or sarcastic comments; these are mis-classified in this first implementation).
  3. We read the sentiment indicators and use them to parse and classify the tweets from (1) above. Next, we compute the number of negative and positive tweets as a percentage of the total. In the current implementation we only have a set of negative keywords and treat all non-negative tweets - i.e. tweets without at least one negative word in them - as positive tweets. In later versions we will expand on this to classify our tweets into three buckets: positive, negative, and neutral.
  4. Finally, we write our computed percentages of positive and negative tweets to file so we can study the temporal evolution of sentiment (i.e. evolution of sentiment regarding that topic over time).
Code follows below, with output underneath:
# twitter sentiment analysis 

# (c) Phantom Phoenix, The Medusa's Cave Blog 2012
# Description: This file performs simple sentiment analysis on global twitter feeds
#              Later versions may improve on the sentiment analysis algorithm


import os, sys, time; # import standard libraries
from twython import Twython; # import the Twitter Twython module
twitter = Twython(); # initialize it


f=open("neg.txt","r"); # open the file with keywords for sentiment analysis
fc=f.readlines(); # read it
wrds=[i.replace("\n","") for i in fc]; # save it for use
g=open("twit.txt","w"); # open the output file
g.write("@time, #tweets, sentiment index: % positive, % negative\n"); # write header to it
while True: # forever...
 search_results = twitter.searchTwitter(q=sys.argv[1], rpp="500");
 # search twitter feeds for the specified topic of interest with 
 # max results per page of 500.
 x=search_results["results"]; # grab results from search
 r=[]; # create placeholder to hold tweets
 for i in x: 
  t=i["text"];
  txt="".join([j for j in t if ord(j)<128]);
  # parse tweets and gather those not directed to particular users
  if txt.find("@")==-1: r+=[txt]; 


 neg=0; # set counter for negative tweets
 for i in r:
  matched=False; # does this tweet contain any negative keyword?
  for j in wrds: # check against word list
   if j!="" and i.lower().find(j)>-1:
    matched=True;
    break;
  if not matched:
   #print "pos,",i; # treating non-negatives as positives in this iteration
   pass;            # this may change later as we evolve to support three categories
  else:
   #print "neg,",i;
   neg+=1;


 #print "number of negatives: ",neg;
 #print "sentiment index: positive: %5.2f%% negative: %5.2f%%" %((len(r)-neg)/float(len(r))*100,neg/float(len(r))*100);
 #print ",".join([time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),str(len(r)),str((len(r)-neg)/float(len(r))*100)[:5]+"%",str(neg/float(len(r))*100)[:5]+"%"]);
 g.write(",".join([time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),str(len(r)),str((len(r)-neg)/float(len(r))*100)[:5]+"%",str(neg/float(len(r))*100)[:5]+"%"])+"\n");
 g.flush(); # write output to file, flush it, then sync it to disk immediately.
 os.fsync(g.fileno());
 time.sleep(180); # sleep for 3 mins then try again

g.close(); # close file after forever and exit program
sys.exit(0); # these two lines never reached but kept for completeness

Output:

@time, #tweets, sentiment index: % positive, % negative
2012-01-16 01:40:38,69,55.07%,44.92%
2012-01-16 01:43:38,58,60.34%,39.65%
2012-01-16 01:46:39,59,57.62%,42.37%
2012-01-16 01:49:40,48,64.58%,35.41%
2012-01-16 01:52:40,52,59.61%,40.38%
2012-01-16 01:55:41,52,67.30%,32.69%
2012-01-16 01:58:42,50,72.0%,28.0%