Saturday, January 21, 2012

On the design of Recommender Systems

Recommender Systems or Recommendation Systems (henceforth RS) are systems used by many popular websites today to drive up traffic, sales, and overall customer satisfaction. Internet services like Amazon and Netflix use them to great effect. When implemented and used correctly, these systems can substantially improve business metrics, driving up growth, revenue, and yes, even profitability (though this last metric also depends on other business parameters). In this blog post, we study some design factors involved in the construction of such systems.


Disclaimer: we do not claim any of these ideas below are original. We will try and add references later for some of the more important ideas, as time permits, to help the interested reader along a voyage of discovery.


1. Frequent Itemsets
This idea comes from the field of marketing analytics. Did you know that beer and baby diaper sales in the US are highly correlated? Apparently when dads go (or are sent) to the nearby store to pick up baby diapers, they take it upon themselves to restock on beer to make better use of the trip, help save the environment (drive to the store once, buy more things...), and so on. Stores with customer loyalty programs can pick up on these kinds of trends more easily, since they know the contents of most "shopping baskets" and can determine which sets of items (called frequent itemsets) go together most often and are purchased most frequently. With that knowledge they can decide upon appropriate marketing strategies to maximize revenue - e.g. run sales on both beer and diapers at the same time, or offer larger discounts on diapers than neighboring stores while raising the price of beer to compensate.


Finding and tracking frequent itemsets is something of an art form, and requires data mining and statistical analysis. Different firms do this differently with different levels of sophistication, and more complex methods are not necessarily better all the time...


Online stores can capitalize on this idea in other interesting ways. An RS for Amazon or another online store can, for example, direct people who have purchased certain items to other items that people with similar purchasing histories bought or considered buying (e.g. as inferred from their page views). Here, the website is using information about customers and prospective customers to guide others along the same path.


The logic goes: "Buying a toy? Perhaps batteries are needed. 95% of people buying this toy also bought batteries at the same time. And oh, 40% of those buying this toy wanted this cool rubber casing with it too. And did you consider how much nicer this expensive toy would be with this mod so many others like to buy? ... " and so on. 
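To make the "frequently bought together" idea concrete, here is a minimal sketch (in Python 2, like the rest of the code on this blog) that counts how often pairs of items show up in the same basket. The baskets, item names, and the min_support threshold are all invented for illustration; nothing here comes from a real data set.

from itertools import combinations
from collections import defaultdict

baskets = [                       # hypothetical transaction history
    ["toy", "batteries", "rubber casing"],
    ["toy", "batteries"],
    ["toy", "batteries", "gift wrap"],
    ["diapers", "beer"],
    ["diapers", "beer", "batteries"],
]

pair_count = defaultdict(int)     # how often two items appear in the same basket
item_count = defaultdict(int)     # how often each item appears at all
for b in baskets:
    for item in b:
        item_count[item] += 1
    for x, y in combinations(sorted(set(b)), 2):
        pair_count[(x, y)] += 1

def also_bought(item, min_support=2):
    # items co-purchased with `item`, with the share of `item` baskets they appear in
    recs = []
    for (x, y), c in pair_count.items():
        if item in (x, y) and c >= min_support:
            other = y if x == item else x
            recs.append((other, c / float(item_count[item])))
    return sorted(recs, key=lambda t: t[1], reverse=True)

print also_bought("toy")          # [('batteries', 1.0)] on this toy data

Real systems work over millions of baskets with smarter support and confidence thresholds (this is where the Apriori family of algorithms comes in), but the bookkeeping is essentially the same.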


Or... like "Cool Hand Luke"? Maybe you will like other movies that people who liked this movie will like? Perhaps it is more likely you will like movies that others liked if they also liked most movies that you say you liked, and less of the movies you say you do not like? (this is where the user ratings come in handy at Netflix, so your "basket" is the set of all movies you have rated, and can be compared with other baskets to find those that have similar characteristics to locate those that have other items your's does not have, and recommend these as movies to watch).


Not too intrusive (not overtly anyway), but helpfully plugging just the right products for you to find. And of course, there is a search facility that lets you find not just what you need, but multiple options for similar things, all neatly indexed in decreasing order of potential relevance, with the option for you to sort things along other parameters you might like.


2. The Wisdom of Crowds
Ever wonder what makes Amazon.com so much more successful than other book stores with an online presence? A key factor is the facility that enables customers to share their experiences about a product or service sold through the store-front. For instance, if on site A, I can read all about a book I want to buy, see other customers' reviews of the same, see customer posted product pictures, also see other vendors' prices with the ability to comparison shop, and share my feedback either positive or negative with the larger body of consumers, why would I not pick this site over competitors', especially if I can also get better deals there?


MBAs are usually taught that some platforms are "multi-sided markets" (think credit cards, or development APIs for example). Developing such platforms from the ground up can be a challenge, because you need to simultaneously grow all the communities around the platform. But once there is sufficient interest and a core ecosystem becomes functional, a virtuous cycle kicks in and the business can grow very rapidly. Of course, in an environment with increasing "clock-speed" (reduced cycle times), successes and failures typically become evident very quickly as well. Agile companies can just as quickly increase investment in ideas that seem to work, and kill those that are doing badly. But we digress...


With the advent of media like Facebook and Twitter, many websites now permit users to share their purchases with friends from those and other communities. This creates a following... "I am more likely to like things my friends like, so perhaps I should check out the things they are buying..." and so on. This is kind of a subliminal recommendation system. Works on the sly. See model (5) below for more.


Wikipedia is, to date, perhaps the best positive example of the wisdom of crowds. But even there, the truly wise are very few... and there are far, far more readers than writers. The same applies to product ratings. Most people will, at best, only rate a product. People who write reviews are a much smaller group. And those who do so without a hidden motive (there have been several reported instances of restaurants reviewing themselves on Yelp, or of authors glowingly reviewing their own books on Amazon) are an even smaller subset.


3. Free Samples for All
This builds on ideas from (1) above. So much better to give people the option to "try before you buy". Consuming books, movies, songs, etc. requires an investment of time, money, and effort on the part of the consumer. Can you convince them that the 2 hours it takes to watch this movie, and the $12.95 it costs to buy it, are actually worth it, as opposed to, say, doing something else with that money and time? Give them a free preview. Tell them: "No pressure. We think this is something you'd like, and as you can see, it is really not that expensive. Check out how much other people are selling it for. And look how much other people like you seem to like it! Just buy it now from us, and don't worry about it till it is time to pay your credit card bill." That's what free samples do. Of course, occasionally you will have free-loaders who take your samples and never buy, but that's a chance you have to take.


4. A Thematic Match
Sometimes, it is also convenient and useful for systems to generate a thematic match. If you like certain movies, for example, rather than look for movies that others who liked your favorites have watched (which requires gobs of data), why not recommend other movies with similar thematic elements that will in all likelihood interest you? Or perhaps you liked the movies you liked because of the actor or actors in them? So other movies with the same "elements" - actors, story lines, and so on - might interest you too? Of course, finding exactly what makes someone like a certain movie requires some deep statistical analysis of the elements shared across their viewing history, or (here's a novel idea!) simply asking them what kinds of movies they like. "Do you watch mysteries frequently? Sometimes? Never? What about dark, gritty thrillers?"
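A rough sketch of such a thematic match, under the assumption that each title has already been tagged with a handful of attributes; the catalog and the tags below are entirely made up.

catalog = {                       # invented titles and hand-assigned attributes
    "Chinatown":         set(["mystery", "gritty", "1970s"]),
    "The Big Sleep":     set(["mystery", "noir", "1940s"]),
    "The Hangover":      set(["comedy", "ensemble"]),
    "L.A. Confidential": set(["mystery", "gritty", "noir"]),
}

def thematic_matches(liked_titles, top_n=3):
    profile = set()               # pool the attributes of everything the user says they like
    for t in liked_titles:
        profile |= catalog.get(t, set())
    scored = [(len(catalog[t] & profile), t)
              for t in catalog if t not in liked_titles]
    scored.sort(reverse=True)     # rank unseen titles by how many attributes they share
    return scored[:top_n]

print thematic_matches(["Chinatown"])
# [(2, 'L.A. Confidential'), (1, 'The Big Sleep'), (0, 'The Hangover')] on this toy catalog

The nice thing about this scheme is that it needs no purchase history at all - just metadata about the items and a few stated preferences from the user - which is exactly why it is a good way to bootstrap.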


Thematic matching is an interesting scheme that can get results quickly using information from the public domain and some patient customer prodding. It is easier to get started this way: build a community of loyal users, then exploit the data they leave behind as they engage more actively with your website. That gives you more data to mine with models (1) and (2) above, which gives you more sales, which encourages your content suppliers to give you more content for free samplers - model (3) - now that your ecosystem is humming along, and so on...


5. Link to Social Media
Think also of social media used to create a buzz. This is a key new development in our age. "OMG! I just read Girl With The Dragon Tattoo and loved it!" someone tweets or posts on Facebook. Immediately her network of 132 people sees this, and some of them will buy or borrow the book and read it. Things go viral fast. It used to be said that it takes years to build a reputation and only a moment to destroy it. These days, it seems, you can build and destroy reputations (ok, marketing buzz) in minutes... and you can amplify this effect if there are groups set up around certain themes, because the uptake for anything posted into such a forum will be very high, and you can get a large spike in volume (sales or whatever else, depending on the medium) very quickly.


6. The Human Unchained
Most important of all, unleash the human within your consumer. Exploit their need to feel heard, to talk, to feel good about what they have found that makes them smarter, wiser, more beautiful, special in their own way when they compare themselves with those around them. Encourage them to "like" products. Get them to describe their experiences using these products. Then mine these descriptions for statistically improbable phrases (SIPs - see the post on tf-idf for more) that associate strongly and distinctively with particular products. Use these words or phrases to help you with your thematic match (model 4 above).


Captchas were a great invention to prevent bots from harvesting Internet information for unscrupulous use. However, crafty websites can leverage human intelligence by requiring humans to solve a "flow through" captcha to unlock information from the original website for bot consumption. These and related ideas can be used both for good and bad... so caution is warranted. 


Good implementations will mine data in the aggregate, or, if they use data tied to particular individuals, will irrevocably destroy the link between the identifying characteristics of the individual in question and the data set that pertains to them. Either of these conditions must be met for safe implementations of data mining for user-centric analytics.


Other examples of the use of human factors in similar systems:


In India, the police have taken to using Facebook to permit people on the road to report traffic violations involving other people's vehicles. Photographs or video of offenders can be posted online via Facebook or other social media sites, and the law enforcement machinery will take action as and where deemed necessary. Be careful, Big Brother is watching you, and your big brothers are your fellow citizens - in a more than figurative fraternal sense.


Want to hear a weather report for free? Call us, listen to an ad, and tell us what you think - and the service is "free".


7. Putting it all together
As we have remarked over the course of this discussion, some elements require less data and less user participation than others, and can be built more easily. Then, as the business grows, one can leverage the larger data set gleaned from the user base to build more sophisticated analytics: provide better service to increase loyalty, lock in customers by raising the barriers to switching, or offer deals tailored to consumer segments that make it attractive for them to return time and again.


Sadly, building these kinds of systems, even rudimentary ones, requires a rather substantial amount of data, and this data is not easily available to people like us who are interested in exploring the leading edges of technology. So for now, we close this post with only the toy sketches above rather than a full sample implementation, while expressing the fond hope that when such data becomes available more freely (after all, it is only aggregate data, and doesn't really violate privacy), we might revisit the discussion and build out at least some of the above in earnest.

Monday, January 16, 2012

Genetic Algorithms

Conway's Game of Life simulates the behavior of a colony of living beings, represented by pixels in a plane, using a set of simple rules. It is amazing what complex interactions result from such simple implementations. Langton's Ant is another such creation (for those interested, there is a not-too-simple Project Euler problem involving it). These examples of what we might loosely call "artificial life" are interesting, but can ideas like this be used to solve real problems?


Genetic Algorithms (henceforth GA) emulate the propagation of life, leveraging randomness and simple rules of reproduction and mutation to solve difficult problems - in some cases more easily than traditional techniques can. They belong to the class of what are commonly referred to as Randomized Algorithms, which also includes mathematical analogues of other physical phenomena, such as simulated annealing ("annealing" being the process of gradually cooling glass or metal from a molten state to room temperature without fracturing it).


Genetic Algorithms have various applications in solving complex dynamic optimization problems, but to illustrate the principles and have some fun in the process, we use them to solve simple Sudoku puzzles in our implementation below. As almost everyone knows, Sudoku is a mathematical puzzle in which one is presented with a 9x9 grid of cells, some with numbers filled in. The grid is also divided into nine non-overlapping 3x3 blocks. The goal is to fill in the entire grid subject to the condition that every 3x3 block uses each of the numbers 1 through 9 exactly once, with the same condition applying to each row and each column of the 9x9 grid as well. There are thus a total of 9+9+9=27 constraints, and the puzzle can be solved either through constrained optimization or through creative use of GAs.


One issue with GAs is that, because they are a randomized process, the time to convergence to a complete solution cannot be guaranteed. We see this with our implementation: simple Sudoku puzzles are solved easily by our program, but on the more devious or "evil" level ones it thrashes around and converges very slowly. Performance can be improved a lot with a hybrid method - i.e. employing mathematical or other logical heuristics to fill in some cells of the harder problems before handing the puzzle off to the GA - but we do not do that here because our focus is on building a sample implementation of GAs, not boiling the ocean. So let's stick to the idea at hand and get right to solving the problem.


How to solve Sudoku with GAs? Here is the approach we will use:

  1. generate a "board" in string form. this is a string that is 81 characters long, with numbers where there are numbers, and "x" elsewhere to denote empty spaces. a correctly solved board should have all digits from 1 to 9 exactly 9 times, so the sum of the digits in the string in a solved board must always equal (1+2+...+9)*9 =45*9=405.
  2. "unfold" the string to gather all constraints into a sequential string - this string now has 81+81+81 characters (the first 81 indicates the row constraints, the second 81, the column constraints, and the last 81, the block constraints) for a total of 243 characters.
  3. generate random "chromosomes", let's say 1000 at a time for our base gene pool. these chromosomes fill in the "x" locations from our base gene in (2) with numbers from 1 through 9, randomly. collect these chromosomes. collect only those chromosomes that meet the requirements of a proper solution - let's say the sum of all digits within should be 405. more stringent conditions here will lead to better chromosomes that take longer to compute. lenient conditions here will generate worse chromosomes that are computed quickly.
  4. rank the chromosomes in the gene pool based on how well they approximate what might be a correct solution to the Sudoku problem - are all digits used 9 times each, none more, none less? is the grid sum close to 405? what conditions we use here in the "fitness" constraint depend on what conditions we used in (3) above in the gene pool generation process.
  5. after the gene pool is created and chromosomes ranked for fitness, use Darwin's idea of "survival of the fittest" to keep only the top few percent of the pool (5% in our implementation), discarding the rest. then, use mutation and crossover to generate the next version of the pool.
  6. at each iteration of the pool, check to see if any chromosome meets all the criteria of the correct solution. if it does, then that chromosome is the correct solution to the puzzle and we stop there. if none does, keep iterating with more generations of the gene pool till a complete solution is found.
  7. print the final solution and exit.


Before going too far, though, we first have to explain what mutation and crossover are. Mutation is a small change in a chromosome: some small subset of its positions is changed arbitrarily, i.e. some digits are randomly changed to other digits in one or more locations. Crossover is a simulation of sexual reproduction: we take two chromosomes (call them "parents") and generate two offspring by slicing the parents into parts and splicing these parts together, keeping in mind the required length of the generated offspring. Recall from (5) above that we only allow the top 5% of the gene pool to reproduce. What fraction of the gene pool is newly randomly generated versus generated by mutation or by crossover is a design choice, and some tweaking here, with a view to optimizing run time, could greatly enhance the performance of the GA implementation.


Code follows, along with a sample input and output file. A couple of optimizations can be made to the code below to make it even more efficient. For instance, each new chromosome can be generated by first filling in the empty cells with the fewest candidate values, then moving on to the harder empty cells, just as humans solve the puzzle. This is left as an exercise to the reader :-)



import os, sys, operator;
from itertools import *;
from random import *;


def getCol(x,n): # get column n from grid x
 r=[];
 r+=[i[n] for i in x];
 return r;


def mkInt(c): # convert all numbers into ints in a list
 r=[];
 for i in c: r+=[[int(j) if j!='x' else j for j in i]];
 return r;


def genConstraints(x):
 # collect the constraints now
 constraints=[i for i in x]; # horizontal constraints
 for i in range(9): constraints+=[getCol(x,i)]; # vertical constraints
 # finally, the block constraints
 b=[[0,1,2],[3,4,5],[6,7,8]];
 for i in product([0,1,2],repeat=2):
  p,q=i;
  #print "p,q => ",p,q;
  c=[];
  for i in b[p]:
   t=[];
   for j in b[q]: t+=[x[i][j]];
   #print t;
   c+=t;
  constraints+=[c];
 # now we have all 27 constraints: 9 horizontal, 9 vertical, 9 block
 return mkInt(constraints);


def mkGrid(s): # convert a puzzle string to a grid
 r=[];
 for i in range(9): r+=[s.split(",")[i*9:(i+1)*9]];
 return r;


def grabDigit(x): # get a random digit from a list of digits
 shuffle(x);
 return x[0];


def genChr(x): # generate a chromosome
 r=[];
 for j in range(9):
  nums=[int(k) for k in x[j] if k!='x'];
  missing_nums=list(set(range(1,10)).difference(set(nums)));
  t=[];
  for i in range(9): 
   if x[j][i]!='x': t+=[int(x[j][i])]
   else:
    m=list(set(fx[(j,i)]).intersection(set(missing_nums)));
    if len(m)==0: m=fx[(j,i)];
    a=grabDigit(m); 
    t+=[a];
    if a in missing_nums: missing_nums.remove(a);
  r+=[t];
 return r;


def validChr(x): # is it a valid chromosome? not used for efficiency
 #for i in range(9): 
 # p=getCol(x,i);
 # if sum(p)!=45: return False;
 # if list(set(p))!=9: return False;
 return True;


def scoreChr(x): # score the chromosome
 t=genConstraints(x);
 score=0;
 for i in t: 
  if len(list(set(i)))==9: score+=1;
  #if sum(i)==45: score+=1;
 return score;
  
def printGrid(x): # print the grid 
 for i in x: 
   print " ".join(map(str,i));
 return;


def xover(a,b): # cross-over two chromosomes, get two descendants
 splice_loc=randint(1,7);
 ta=a[:splice_loc]+b[splice_loc:];
 tb=b[:splice_loc]+a[splice_loc:];
 return (ta,tb);


def xover2(a,b): # another method for cross-over
 speed=randint(1,6);
 ta,tb=a[:],b[:];
 for j in range(speed):
  x_loc=randint(1,7);
  t=ta[x_loc]; 
  ta[x_loc]=tb[x_loc];
  tb[x_loc]=t;
 return (ta,tb);


def mutate(a): # mutation
 speed=randint(1,6); # how many mutations in one go
 for i in range(speed):
  gene_seg=randint(0,8); # which gene segment to change
  x=genChr(g);
  a[gene_seg]=x[gene_seg];
 return a;   


p=open("s4.txt","r"); # open the puzzle file
pc=p.readlines(); # read it
pc=[i.replace("\n","") for i in pc];
pstr=",".join([i for i in pc]);
g=mkGrid(pstr);
mx=genConstraints(g);


# collect the constraint indices for each cell, and
# find which numbers might fit in each cell; used below when generating chromosomes
b=[[0,1,2],[3,4,5],[6,7,8]];
cx,fx={},{};
for i in range(0,9):
 for j in range(0,9):
  for k in range(0,3):
   if i in b[k]: x=k;
   if j in b[k]: y=k;
   block_id=x*3+y;
   cx[(i,j)]=(i,9+j,18+block_id);
   fx[(i,j)]=list(set(range(1,10)).difference(set(mx[i]).union(set(mx[9+j])).union(set(mx[18+block_id]))));
# now fx is a dict that contains all possible values for each empty cell
t=fx.values();
for i in range(len(t)):
 if len(t[i])==1: 
  posx,posy=fx.keys()[i];
  if g[posx][posy]=='x': g[posx][posy]=t[i][0];


# all constraint strings must have all digits from 1 to 9 just once
# all constraint strings must add up to 9*10/2=45 
poolsz=5000; # size of gene pool
gene_pool=[];
while len(gene_pool)!=poolsz: # generate chromosomes to fill pool
 seed();
 t=genChr(g);
 while not validChr(t): t=genChr(g);
 if t not in gene_pool: gene_pool+=[t];


seen=gene_pool[:]; # keeps track of all seen chromosomes


# set the parameters for new generations
retain_pct=0.05;
mutate_pct=0.35;
xover_pct=0.55;
new_dna=1-retain_pct-mutate_pct-xover_pct;
generation,solved=0,0;
scores=[]; # store best scores in each generation


# solve the puzzle
while solved==0:
 genes=[];
 for i in gene_pool: genes+=[(i,scoreChr(i))];
 genes=sorted(genes,key=operator.itemgetter(1),reverse=True);
 if genes[0][1]==27: 
  print "puzzle solved";
  printGrid(genes[0][0]);
  solved=1;
 else:
  if len(scores)>3 and len(list(set(scores[-3:])))==1: 
   seed(); 
   jumpahead(randint(1,100000));
   poolsz=randint(poolsz,poolsz+500);
   retain_pct=0.1;
   mutate_pct=round(uniform(0.25,0.3),2);
   xover_pct=0.95-mutate_pct;
   new_dna=1-retain_pct-mutate_pct-xover_pct;
  print "generation: ",generation,"pool size: ",len(genes)," max score: ",genes[0][1];
  #printGrid(genes[0][0]);
  scores+=[genes[0][1]];
  generation+=1;
  seed();
  chr_retained=int(poolsz*retain_pct);
  new_pool=[j[0] for j in genes[:chr_retained]]; # retain the fittest retain_pct fraction of the pool
  for k in range(int(xover_pct*poolsz/2.)): # generate cross-over offspring, two per pass
   seed();
   a,b=new_pool[randint(0,len(new_pool)-1)][:],new_pool[randint(0,len(new_pool)-1)][:];
   toss=randint(0,1);
   if toss==0: ta,tb=xover(a,b);
   else: ta,tb=xover2(a,b);
   if ta not in seen: new_pool+=[ta];
   if tb not in seen: new_pool+=[tb];
  for k in range(int(mutate_pct*poolsz)): # generate mutated genes
   seed();
   a=new_pool[randint(0,len(new_pool)-1)][:];
   ta=mutate(a);
   if ta not in seen: new_pool+=[ta];
  while len(new_pool)<poolsz: # new genetic material
   seed();
   ta=genChr(g);
   if ta not in seen: new_pool+=[ta];
  gene_pool=new_pool[:];
  seen+=gene_pool[:];


sys.exit(0);


  
Input File: 
7,9,x,x,x,x,3,x,x
x,x,x,x,x,6,9,x,x
8,x,x,x,3,x,x,7,6
x,x,x,x,x,5,x,x,2
x,x,5,4,1,8,7,x,x
4,x,x,7,x,x,x,x,x
6,1,x,x,9,x,x,x,8
x,x,2,3,x,x,x,x,x
x,x,9,x,x,x,x,5,4




Output File:
generation:  0 pool size:  5000  max score:  12
generation:  1 pool size:  5000  max score:  15
generation:  2 pool size:  5000  max score:  17
generation:  3 pool size:  5000  max score:  19
generation:  4 pool size:  5000  max score:  22
generation:  5 pool size:  5000  max score:  25
generation:  6 pool size:  5000  max score:  25
generation:  7 pool size:  5000  max score:  25


puzzle solved
7 9 6 8 5 4 3 2 1
2 4 3 1 7 6 9 8 5
8 5 1 2 3 9 4 7 6
1 3 7 9 6 5 8 4 2
9 2 5 4 1 8 7 6 3
4 6 8 7 2 3 5 1 9
6 1 4 5 9 7 2 3 8
5 8 2 3 4 1 6 9 7
3 7 9 6 8 2 1 5 4









Sunday, January 15, 2012

Twitter Sentiment Analysis

In this day and age of instant communication, both the currency and the correctness of data exchanged or published have increased in importance. In the old days, clocks had only an hour hand... now, to some, even the second hand appears to move too slowly. We have reached a state where every second, and sometimes every millisecond or even microsecond, matters.

As newer generations grow up in a world where communication is ubiquitous and people are less inhibited about sharing personal details on the web, commercial consumers of such information increasingly want the ability to parse and analyze this content in aggregate in order to profit from it.

For example, on a medium like Twitter - the popular micro-blogging site - people spread across the world can report on local happenings almost instantaneously, well before news channels pick up the stories. Since large numbers of people tend to micro-blog about any event (and that number will only increase over time), consumers can rely on "the wisdom of the crowds" to aggregate correct information (these tweets "go viral") while false stories dissipate without really catching on.

We use Twitter merely as an example of a dynamic and evolving medium that quickly tracks public sentiment. We could just as easily have used Facebook or Google+ or any such similar medium for that matter, to similar effect.

So what are some applications of Twitter sentiment analysis? Some firms use sentiment as a signal for investing in emerging-market or frontier-market economies. Others use it for active marketing, or to determine the effectiveness of a particular marketing strategy (think micro- as well as macro-targeting). Political pundits might use it as a basis for predicting popular sentiment going into an election. Strategists might use it to determine which issues really touch the masses and focus on positioning their candidates better in those areas. If geo-location data were attached to each tweet (currently it is an optional field and only very few tweets carry it), one could even determine how geographical factors play into people's opinions. In addition, one can form an opinion as to the "distance" between a tweeter on the ground and the incident s/he is reporting on, lending more credibility to eye-witnesses. Several other applications exist, and these will only become more obvious over time as the micro-blogging medium matures.

An interface into Twitter that can collect tweets in a timely manner and analyze them to gauge, in aggregate, the sentiment of people as a whole toward a particular subject therefore becomes useful. This is what is popularly referred to as "sentiment analysis". We implement a very simple Twitter feed sentiment analyzer in what follows. As with all our other examples, we build this in Python 2.x, and we employ a very simple algorithm as described below. More detailed discussions of what people are doing in this field are easy to find online, and of course it is not a bad idea to look at the Wikipedia article on sentiment analysis as well.

Let us now consider what an implementation involves.
  1. First, we need to connect to and read a Twitter feed, and filter it for a particular topic of interest. To do this, we utilize the excellent open-source Twython library, available for free download on the Internet. Twython has extensive documentation with examples that lets us easily connect to a Twitter feed and search on particular terms.
  2. Next, we generate a list of sentiment indicators. To do this, we may look at a sample of tweets, decide which words indicate positive and negative sentiment, and collect them for use in our checks. Tweets that have more negative words than positive words may be considered negative tweets, and vice-versa (our analysis ignores satire and sarcasm; such tweets are mis-classified in this first implementation).
  3. We read the sentiment indicators and use them to parse and classify the tweets from (1) above. Next, we compute the number of negative and positive tweets as a percentage of the total. In the current implementation we only have a set of negative keywords and treat all non-negative tweets - i.e. tweets without at least one negative word in them - as positive tweets. In later versions we will expand on this to classify our tweets into three buckets: positive, negative, and neutral.
  4. Finally, we write our computed percentages of positive and negative tweets to file so we can study the temporal evolution of sentiment (i.e. evolution of sentiment regarding that topic over time).
Code follows below, with output underneath:
# twitter sentiment analysis 

# (c) Phantom Phoenix, The Medusa's Cave Blog 2012
# Description: This file performs simple sentiment analysis on global twitter feeds
#              Later versions may improve on the sentiment analysis algorithm


import os, sys, time; # import standard libraries
from twython import Twython; # import the Twitter Twython module
twitter = Twython(); # initialize it


f=open("neg.txt","r"); # open the file with keywords for sentiment analysis
fc=f.readlines(); # read it
wrds=[i.replace("\n","") for i in fc]; # save it for use
g=open("twit.txt","w"); # open the output file
g.write("@time, #tweets, sentiment index: % positive, % negative\n"); # write header to it
while True: # forever...
 search_results = twitter.searchTwitter(q=sys.argv[1], rpp="500");
 # search twitter feeds for the specified topic of interest with 
 # max results per page of 500.
 x=search_results["results"]; # grab results from search
 r=[]; # create placeholder to hold tweets
 for i in x: 
  t=i["text"];
  txt="".join([j for j in t if ord(j)<128]);
  # parse tweets and gather those not directed to particular users
  if txt.find("@")==-1: r+=[txt]; 


 neg=0; # counter for negative tweets
 for i in r:
  is_neg=False;
  for j in wrds: # check the tweet against the negative word list
   if j!="" and i.lower().find(j)>-1:
    is_neg=True;
    break;
  # non-negative tweets are treated as positives in this iteration;
  # this may change later as we evolve to support three categories.
  if is_neg: neg+=1;


 #print "number of negatives: ",neg;
 #print "sentiment index: positive: %5.2f%% negative: %5.2f%%" %((len(r)-neg)/float(len(r))*100,neg/float(len(r))*100);
 #print ",".join([time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),str(len(r)),str((len(r)-neg)/float(len(r))*100)[:5]+"%",str(neg/float(len(r))*100)[:5]+"%"]);
 g.write(",".join([time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),str(len(r)),str((len(r)-neg)/float(len(r))*100)[:5]+"%",str(neg/float(len(r))*100)[:5]+"%"])+"\n");
 g.flush(); # write output to file, flush it, then sync it to disk immediately.
 os.fsync(g.fileno());
 time.sleep(180); # sleep for 3 mins then try again

g.close(); # close file after forever and exit program
sys.exit(0); # these two lines never reached but kept for completeness

Output:

@time, #tweets, sentiment index: % positive, % negative
2012-01-16 01:40:38,69,55.07%,44.92%
2012-01-16 01:43:38,58,60.34%,39.65%
2012-01-16 01:46:39,59,57.62%,42.37%
2012-01-16 01:49:40,48,64.58%,35.41%
2012-01-16 01:52:40,52,59.61%,40.38%
2012-01-16 01:55:41,52,67.30%,32.69%
2012-01-16 01:58:42,50,72.0%,28.0%





Wednesday, December 14, 2011

Document Auto-Summarizer

In these busy times, it is often convenient, when reading the news, to get just a quick gist of a story and read the details later only if it turns out to be really interesting. Some news sites encourage this behavior by providing the first paragraph or the first few lines of a story with a convenient "read more..." link underneath for those who want the rest.

However, there is potentially a more useful way of engaging one's audience. Instead of just posting the first paragraph of the story, wouldn't it be nice if the story were summarized? That way the reader can absorb the most important points with minimal loss of content, while details that do not matter are left out. The time saved may more than offset the loss of detail in the summary.

What we need to build an auto-summarizer is already available to us. In this post we study how to build such a system: a description of how it works follows, along with the actual source code.

In another post we have already seen how TF-IDF works. Now, given a corpus full of news stories (I use some public domain English novels downloaded from Project Gutenberg along with the news stories to be summarized that I scraped from some news websites as my corpus), we can construct the TF-IDF values for every unique word in each file. This is easily done with a few tweaks to the TF-IDF implementation from the other post. (We include the tweaked versions in full below as part of the source code for this project.)

Next, we use a simple heuristic to aid us in auto-summarizing documents with the following steps:
  1. We compute the sum of the TF-IDF scores of the words in each sentence of the document, and index each sentence by this sum divided by the number of words in it. Normalizing by length helps us treat all sentences fairly, without giving preferential treatment to longer sentences.
  2. Next, we sort the sentences in descending order of the metric computed above, and cut off this list at about a third of the total number of sentences (this fraction may be changed as appropriate).
  3. Finally, we generate output. To do this, we always include the first sentence of any story to give context, then we list those sentences from the story, in order, that appear in the sorted list generated in step 2. These are the only sentences that made the cut. Statistically speaking, these are the most interesting or relevant sentences from the story.
  4. Print out these sentences under the story header as a summary of the story in question.
If we want to limit the summary size to a pre-defined word-threshold, we can easily do this by scaling the number of selected sentences up or down in step 3. With this, we have a document auto-summarizer. While it works in a useful and usable sense, the implementation does not have any idea of context. Thus, the summary might sometimes appear "jumpy" (may not flow as well as a precis written by a human for example), and might not be completely contextually accurate.

Code follows below, along with a sample full story and the corresponding auto-generated summary:

tf_summarizer.py: (tf code tweaked for the auto-summarizer implementation)

import os, sys, math;
import pickle;
from os.path import join;


def delchrs(x,chrs):
 x=x.lower().replace("'s","");
 for i in chrs: x=x.replace(i,"");
 return x;


def tf(f): # reads file f and returns dict of words from f with count
 wrds,wc=[],{};
 fh=open(f,"r");
 fc=fh.readlines();
 for i in fc: wrds+=i.replace("\n","").split(" ");
 wrds=[delchrs(i,".?,;:*\"\'()!_") for i in wrds];
 for i in wrds:
  if i not in wc: wc[i]=wrds.count(i)/float(len(wrds));
 return wc;


g=open("txtfiles.txt","w");
f=[];
for root, dirs, files in os.walk("c:\\"):
 for name in files:
  if name[-4:]==".txt" and ((root.find("summarizer")>-1 and len(name)==36) or root.find("gutenberg")>-1): 
   f+=[join(root,name)];
   g.write(f[-1]+"\n");
   print f[-1];
g.close();


# build tf database
tfdict={};
for i in f:
 print "processing "+i+" ...";
 tfdict[i.split("\\")[-1].split("_")[0]]=tf(i);


tfpck=open("tfpickle.pkl","wb");
pickle.dump(tfdict,tfpck);
tfpck.close();


# collect all dict key lists
all_wrds,all_dicts=[],[];
for i in tfdict: 
 all_wrds+=tfdict[i].keys();
 all_dicts+=[tfdict[i].keys()];
all_wrds=list(set(all_wrds));


idf={};
g=open("idf.txt","w");
all_docs=float(len(all_dicts));
for i in all_wrds:
 idf[i]=math.log(all_docs/sum([1 for j in all_dicts if i in j]));
 g.write(i+","+str(idf[i])+"\n");
g.close(); 


idfpck=open("idfpickle.pkl","wb");
pickle.dump(idf,idfpck);
idfpck.close();


tfpck=open("tfpickle.pkl","rb");
x=pickle.load(tfpck);
tfpck.close();


idfpck=open("idfpickle.pkl","rb");
y=pickle.load(idfpck);
idfpck.close();


tfidf_summarizer.py:
(tfidf code tweaked for the auto-summarizer implementation; also includes the summarizer logic)

# 2nd step of the process - constructing the index
import os, sys, math;
import pickle, glob, re;
from operator import itemgetter;
from os.path import join;


# read the tf and idf objects
# tff contains the tf dictionaries for each file as index
tfpck=open("tfpickle.pkl","rb");
tff=pickle.load(tfpck);
tfpck.close();


# idfs contains the idf dictionary for each unique word in the corpus
idfpck=open("idfpickle.pkl","rb");
idfs=pickle.load(idfpck);
idfpck.close();


tfidf={};
for i in tff: 
 r={};
 for j in tff[i]: 
  r[j]=tff[i][j]*idfs[j];
 tfidf[i]=r;


def clnSentence(x):
 # keep only characters 32 to 126 after splitting
 x2=[i for i in x if ord(i)>=32 and ord(i)<=126];
 return "".join(x2).strip();


def getSentences(x): # split the story into constituent sentences
 r=[];
 f=open(x,"r");
 fc=f.read();
 fc1=fc.split("\n");
 for i in fc1: r+=[clnSentence(i)];
 return r;


x=glob.glob("*.txt");
x=[i for i in x if len(i)==36]; 
# we save stories to summarize with filenames composed of their MD5 hash
# makes it easy to locate documents to be processed, and explains "36" in check above.
g=open("summaries.txt","w");
dcts={};
for i in x:
 g.write("original story follows: \n");
 p=getSentences(i);
 for j in p: g.write(j);
 g.write("\n\n summary follows: \n");
 svals=[];
 # for each sentence, 
 #  break into words, 
 #  then compute total tfidf for each sentence... 
 #  divide each sentence tfidf by the number of words in the sentence
 #  collect (sentence, average tfidf) pairs.
 #  sort sentences on value, pick top 30% of sentences in summary
 for j in p: 
  wrds=j.split(" ");
  line_tfidf=sum([tfidf[i][k] for k in wrds if k in tfidf[i]]); # use the tf-idf weights, as described above
  avg_line_tfidf=line_tfidf/float(len(wrds));
  svals+=[(j,avg_line_tfidf)];
 svals=sorted(svals, key=itemgetter(1), reverse=True); # descending sort
 svals=[j[0] for j in svals[:len(p)/3]][:-1];
 g.write(p[0]+" "); # always include first sentence for context
 for j in p[1:]: 
  if j in svals: g.write(j+" ");
 g.write("\n\n\n");


g.close();


Sample full story and generated summary:

original story follows: 
NEW YORK (CNNMoney) -- Household wealth took its biggest hit since the height of the 2008 financial meltdown during the third quarter, weakened by a downturn in stocks, according to a report issued Thursday.The Federal Reserve said the net worth of households fell by $2.2 trillion, or 4.1%, to end at $57.4 trillion. The decline comes to about $7,800 for every U.S. resident.1134PrintCommentIt's the biggest decline since the $5.6 trillion loss suffered in the fourth quarter of 2008.The drop in stocks in the quarter more than explained the overall decline in net worth, as the value of stocks held directly or indirectly fell by $3.2 trillion, or 17%.That's a bit worse than the 14% decline in the blue-chip Standard & Poor's 500 index during the quarter. The period included the downgrade of the U.S. credit rating by S&P, along with rising worries about the risk of a new U.S. recession and a new meltdown in the financial sector due to the European debt crisis.But U.S. stocks have performed well since the end of the third quarter, with the S&P 500 rebounding by 10.1% so far. So household net worth might be poised for a nice rebound.10 best stocks for 2012The report also showed that the value of real estate held by households fell by $98 billion, but that was a drop of only 0.6%. Homeowner equity in homes fell by an even smaller amount, only $44.6 billion, as homeowners cut the amount of mortgage debt they owed.Gregory Daco, principal U.S. economist for IHS Global Insight, said the report shows that a solid 2% gain in consumer spending in the quarter was being done by Americans dipping into their savings, not by any growth in income or net worth."IHS expects a strong fourth quarter rebound in household net worth resulting mostly from a rally in stock prices," he wrote in a note Thursday. "This should provide some support to consumer spending as employment growth continues to make progress and consumers cheer up ahead of the holidays."Household wealth plunged $16.3 trillion in the two years from early 2007 to the first quarter of 2009, and has slowly been climbing since then. But with the drop in the third quarter of this year, households find their net worth still $9.4 trillion, or 14%, below the high they hit in early 2007, before the bursting of the housing bubble.

summary follows: 
NEW YORK (CNNMoney) -- Household wealth took its biggest hit since the height of the 2008 financial meltdown during the third quarter, weakened by a downturn in stocks, according to a report issued Thursday. It's the biggest decline since the $5.6 trillion loss suffered in the fourth quarter of 2008. The drop in stocks in the quarter more than explained the overall decline in net worth, as the value of stocks held directly or indirectly fell by $3.2 trillion, or 17%. That's a bit worse than the 14% decline in the blue-chip Standard & Poor's 500 index during the quarter. The period included the downgrade of the U.S. credit rating by S&P, along with rising worries about the risk of a new U.S. recession and a new meltdown in the financial sector due to the European debt crisis. Household wealth plunged $16.3 trillion in the two years from early 2007 to the first quarter of 2009, and has slowly been climbing since then. But with the drop in the third quarter of this year, households find their net worth still $9.4 trillion, or 14%, below the high they hit in early 2007, before the bursting of the housing bubble.




Wednesday, December 7, 2011

What I think about when I think about Search

We make no claims that anything on this blog is innovative in any real sense - these ideas have all been developed and implemented already, and many companies have made lots of money off of them. What we do here, though, is study some of these ideas with a view to developing a deeper understanding of how they might work, which might in turn help us design really innovative services over time...

When we run the tf-idf program from my other post, we get a dictionary (in the Python sense) for each file, whose keys are the unique words in the file and whose values are the ratio of each word's frequency of occurrence to the total number of words in the file. If we consider a corpus with a large set of documents (we assume the search engine's crawlers have already built - and are continuing to add to - this corpus in the background), and a search engine GUI that takes in a set of words from a user, the search process may, in its simplest form, proceed as follows:

  1. look at all words, delete "stop words" (these are words like "is", "the" etc., that occur very frequently and don't really add much value to processing)
  2. process the rest of the words to extract a "workable" form from them using mechanisms like "Porter's Stemming Algorithm". This reduces word derivations to their simplest form (e.g. "search", "searching", "searched" etc., are all strings that mean more or less the same thing, and while searching for documents with one, we shouldn't preclude matches for documents with other variants).
  3. next, find all documents that have all (in the simplest implementation) or at least one (slightly more involved implementation) of the words input by the user.
  4. sort the documents selected in step (3) above by some function of their tf-idf weights for the query terms (say, the sum of those tf-idfs).
  5. return the list from (4) as the result of the search query.
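A minimal sketch of steps (1) through (5) is below, assuming the {document: {word: tf-idf}} dictionaries produced by the tf-idf code in the next post. The tiny stop-word list is only illustrative, and a real implementation would also stem both the query and the corpus (e.g. with Porter's algorithm); that step is omitted here.

STOP_WORDS = set(["is", "the", "a", "an", "of", "and", "to", "in"])  # tiny illustrative list

def search(query, tfidf):
    # step 1: drop stop words (step 2, stemming, is omitted in this sketch)
    terms = [w.lower() for w in query.split() if w.lower() not in STOP_WORDS]
    results = []
    for doc, weights in tfidf.items():
        if any(t in weights for t in terms):                # step 3: keep docs containing at least one term
            score = sum(weights.get(t, 0.0) for t in terms) # step 4: rank by summed tf-idf weight
            results.append((score, doc))
    results.sort(reverse=True)
    return [doc for score, doc in results]                  # step 5: ranked list of documents
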
Search engines differentiate themselves primarily in how they perform (3) and (4) above. Google distinguished itself from other search engines in the early days through creative use of the PageRank algorithm (designed by Larry Page - nice pun!) and has maintained its lead with advances to the algorithm over time.

I tend to view each document in tf-idf space (or rather, its generated dictionary, to be more precise) as a vector in an n-dimensional space, where the number of dimensions equals the number of unique words in it. The search operation then squashes every document into a lower-dimensional space (LDS) whose dimensions are defined by the words in the "cleaned up" input search string (post step 2 above). Then, we effectively compute the angle between each document in the LDS and the vector defined by the search string. The smaller the angle, and the greater the magnitude of the document vector, the higher the document scores in our search results... a geometric way of thinking about how search works.
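Written out, that geometric view is just cosine similarity restricted to the query's dimensions. A small sketch, again assuming the per-document tf-idf dictionaries from the tf-idf post, and treating the cleaned-up query as a vector of ones:

import math

def cosine_rank(terms, tfidf):
    ranked = []
    for doc, weights in tfidf.items():
        dvec = [weights.get(t, 0.0) for t in terms]   # squash the document onto the query's dimensions
        norm = math.sqrt(sum(v * v for v in dvec))
        if norm == 0.0:
            continue                                  # no query term appears in this document
        # the query vector is all ones, so the dot product is just sum(dvec)
        cosine = sum(dvec) / (norm * math.sqrt(len(terms)))
        ranked.append((cosine, norm, doc))            # smaller angle first, then larger magnitude
    ranked.sort(reverse=True)
    return ranked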

References: 
"Math Will Rock Your World" article from Business Week from some time ago.
"Anatomy of a Search Engine" by Sergei Brin & Larry Page

tf-idf implementation

To build most (if not all) of the services we described in our introductory post, we need an implementation of a TF-IDF (term frequency-inverse document frequency) library. The technique is described very well elsewhere on the web, and there are several existing implementations in various languages including C, C++, and C# (Niniane has contributed one, for example). However, for us to truly understand how this works, we build our own in Python below.

The idea is simple. First we compute the term frequency of each word in a document: the number of occurrences of the word normalized by the document length (so that words occurring in documents of different lengths are not (dis)advantaged in any way). Next, we compute the inverse document frequency: a (mathematically transformed) reciprocal of the fraction of documents in the set under consideration (the "corpus") that contain the word. Then we multiply the two metrics for each word, so words that occur relatively frequently in a document but infrequently in the corpus get the highest scores. I believe Amazon.com used to list "Statistically Improbable Phrases" (SIPs) for each book sold on its website some time back; this is akin to what TF-IDF gives us.
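Before diving into the full code, here is a toy walk-through on a three-"document" corpus, using the same definitions as the implementation below (term count normalized by document length, and a natural-log idf). The documents are obviously made up.

import math

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog sat on the log".split(),
    "d3": "cats and dogs".split(),
}

def tf(words):  # term frequency, normalized by document length
    return dict((w, words.count(w) / float(len(words))) for w in set(words))

all_words = set(w for words in docs.values() for w in words)
idf = dict((w, math.log(len(docs) / float(sum(1 for words in docs.values() if w in words))))
           for w in all_words)

tfidf = dict((d, dict((w, f * idf[w]) for w, f in tf(words).items()))
             for d, words in docs.items())

print round(tfidf["d1"]["cat"], 3)   # 1/6 * ln(3/1) ~ 0.183: rare in the corpus, so it scores high
print round(tfidf["d1"]["sat"], 3)   # 1/6 * ln(3/2) ~ 0.068: appears in two of three documents, so it scores lower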

The source code for our simple implementation is attached below... and I will edit this post to make additions (more functionality, explanatory comments, etc.) and modifications over time. No, this is not the most efficient implementation - C implementations may be 1000x faster. But what it lacks in performance it makes up for in simplicity, clarity, and transparency.

For projects we do later, like the desktop search prototype we build atop this, we need to wrap this tf-idf computation in logic that handles additional events. For instance, a new file may be added to the corpus (e.g. if we are processing all .txt files on a computer, the user may have created a new .txt file since the last indexing operation on this machine), a file may be deleted (also removing it from the corpus), or a file may be modified (which could change the tf values for words in the file, and the idf values for words across the corpus). These details will be addressed in the desktop search implementation; we gloss over them for now.
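One way the bookkeeping for those events could look is sketched below. It assumes we keep each file's last-modified time next to its tf dictionary and simply rebuild the idf table after every sweep; this is only a sketch, not taken from the desktop search implementation, and the tf function it calls is the one defined in tf.py further down.

import os, math

def refresh_index(paths, index, tf):
    # `index` maps path -> (mtime, tf dictionary); `tf` is the function defined in tf.py below
    for p in list(index):
        if p not in paths:
            del index[p]                        # file deleted since the last sweep
    for p in paths:
        mtime = os.path.getmtime(p)
        if p not in index or index[p][0] != mtime:
            index[p] = (mtime, tf(p))           # new or modified file: recompute its tf dictionary
    n = float(len(index))
    all_words = set(w for _, wc in index.values() for w in wc)
    return dict((w, math.log(n / sum(1 for _, wc in index.values() if w in wc)))
                for w in all_words)             # rebuilt idf table for the current corpus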

Please see disclaimer and copyright notice from the main post for all code.


tf.py:

import os, sys, math;
import pickle;
from os.path import join;


def delchrs(x,chrs):
 x=x.lower().replace("'s","");
 for i in chrs: x=x.replace(i,"");
 return x;


def tf(f): # reads file f and returns dict of words from f with count
 wrds,wc=[],{};
 fh=open(f,"r");
 fc=fh.readlines();
 for i in fc: wrds+=i.replace("\n","").split(" ");
 wrds=[delchrs(i,".?,;:*\"\'()!_") for i in wrds];
 for i in wrds:
  if i not in wc: wc[i]=wrds.count(i)/float(len(wrds));
 return wc;


g=open("txtfiles.txt","w");
f=[];
for root, dirs, files in os.walk("c:\\"):
 for name in files:
  if name[-4:]==".txt" and root.find("gutenberg")>-1:  # used a corpus of project gutenberg novels to test.
   f+=[join(root,name)];
   g.write(f[-1]+"\n");
   print f[-1];
g.close();


# build tf database
tfdict={};
for i in f:
 print "processing "+i+" ...";
 tfdict[i.split("\\")[-1].split("_")[0]]=tf(i);


tfpck=open("tfpickle.pkl","wb");
pickle.dump(tfdict,tfpck);
tfpck.close();


# collect all dict key lists
all_wrds,all_dicts=[],[];
for i in tfdict: 
 all_wrds+=tfdict[i].keys();
 all_dicts+=[tfdict[i].keys()];
all_wrds=list(set(all_wrds));


idf={};
g=open("idf.txt","w");
all_docs=float(len(all_dicts));
for i in all_wrds:
 idf[i]=math.log(all_docs/sum([1 for j in all_dicts if i in j]));
 g.write(i+","+str(idf[i])+"\n");
g.close(); 


idfpck=open("idfpickle.pkl","wb");
pickle.dump(idf,idfpck);
idfpck.close();


tfpck=open("tfpickle.pkl","rb");
x=pickle.load(tfpck);
tfpck.close();


idfpck=open("idfpickle.pkl","rb");
y=pickle.load(idfpck);
idfpck.close();





idf.py

# 2nd step of the process - constructing the index
import os, sys, math;
import pickle;
from os.path import join;


# read the tf and idf objects
# tff contains the tf dictionaries for each file as index
tfpck=open("tfpickle.pkl","rb");
tff=pickle.load(tfpck);
tfpck.close();


# idfs contains the idf dictionary for each unique word in the corpus
idfpck=open("idfpickle.pkl","rb");
idfs=pickle.load(idfpck);
idfpck.close();


tfidf={};
for i in tff: 
 r={};
 for j in tff[i]: 
  r[j]=tff[i][j]*idfs[j];
 tfidf[i]=r;


g=open("imp-words.txt","w");
# print 
dcts={};
for i in tff:
 dct={};
 x=tfidf[i].values();
 y=tfidf[i].keys();
 for j in zip(x,y): 
  if j[0] in dct: dct[j[0]]+=[j[1]];
  else: dct[j[0]]=[j[1]];
 x=list(set(x));
 x.sort();
 x.reverse();
 g.write("=> "+i+"\n");
 #print "=> ",i;
 for j in x[:25]:
  #print "   ",j,dct[j];
  g.write("    "+str(round(j,3))+" "+str(dct[j])+"\n");
 dcts[i]=dct;
g.close();