Wednesday, December 14, 2011

Document Auto-Summarizer

In these busy times, it is often enough to get a quick gist of a news story and read the details later, only if the story turns out to be really interesting. Some news sites encourage this behavior by showing the first paragraph or the first few lines of a story, with a convenient "read more..." link underneath for those who want the rest.

However, there is potentially a more useful way of engaging one's audience: instead of posting just the first paragraph of the story, wouldn't it be nice if the story were summarized? This way, the most important points can be read and assimilated with minimal loss of content, while details that may not matter can be skipped. The time saved may more than offset the detail lost in the summary.

Everything we need to build an auto-summarizer is already available to us. In this post we look at how such a system can be built: first a description of how it works, followed by the actual source code.

In another post we have already seen how TF-IDF works. Now, given a corpus of documents (as my corpus I use some public domain English novels downloaded from Project Gutenberg, together with the news stories to be summarized, which I scraped from some news websites), we can compute the TF-IDF value of every unique word in each file. This is easily done with a few tweaks to the TF-IDF implementation from that post. (The tweaked versions are included in full below as part of the source code for this project.)
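
As a quick refresher, here is a minimal sketch of the basic computation; the names tf_idf and docs are illustrative and are not used in the project code below:

import math

def tf_idf(docs): # docs: a list of word lists, one list per document
 n = float(len(docs))
 # document frequency: in how many documents does each word appear?
 df = {}
 for d in docs:
  for w in set(d): df[w] = df.get(w, 0) + 1
 # tf-idf per word per document: term frequency times log(n / document frequency)
 scores = []
 for d in docs:
  s = {}
  for w in set(d):
   s[w] = (d.count(w) / float(len(d))) * math.log(n / df[w])
  scores += [s]
 return scores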

Next, we use a simple heuristic to auto-summarize documents, in the following steps (a minimal sketch of the whole procedure appears after this list):
  1. We compute the sum of the TF-IDF scores of the words in each sentence of the document, and score each sentence by this sum divided by the number of words in it. Normalizing by length treats all sentences fairly, without giving preferential treatment to longer sentences.
  2. Next, we sort the sentences in descending order of this score, and cut the sorted list off at about a third of the total number of sentences (this fraction can be changed as appropriate).
  3. Then we generate the output. We always include the first sentence of the story to set the context, followed by those sentences of the story, in story order, that survived the cut in step 2. Statistically speaking, these are the most interesting or relevant sentences of the story.
  4. Finally, we print these sentences under the story header as the summary of the story in question.
If we want to limit the summary to a pre-defined word threshold, we can easily do so by scaling the step-2 cutoff up or down. With this, we have a document auto-summarizer. While it works in a useful and usable sense, the implementation has no notion of context, so a summary can sometimes appear "jumpy" (it may not flow as well as a precis written by a human, for example) and may not be completely contextually accurate.
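
Putting the steps together, here is a minimal sketch of the selection heuristic; the names summarize, sentences and tfidf_scores are illustrative, and the actual implementation used in this project appears further below:

def summarize(sentences, tfidf_scores, keep=1.0/3):
 # step 1: score each sentence by the average tf-idf of its words
 scored = []
 for s in sentences:
  wrds = s.split(" ")
  scored += [(s, sum([tfidf_scores.get(w, 0) for w in wrds]) / float(len(wrds)))]
 # step 2: sort descending by score and keep roughly the top third
 scored = sorted(scored, key=lambda t: t[1], reverse=True)
 top = set([s for s, v in scored[:int(len(sentences) * keep)]])
 # steps 3 and 4: the first sentence for context, then the survivors in story order
 return " ".join([sentences[0]] + [s for s in sentences[1:] if s in top])

Capping the summary at a word threshold instead is then just a matter of nudging keep up or down until the selected sentences fit the budget.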

Code follows below, along with a sample full story and the corresponding auto-generated summary:

tf_summarizer.py: (tf code tweaked for the auto-summarizer implementation)

import os, math
import pickle
from os.path import join


def delchrs(x, chrs):
 # lower-case the word, drop possessives and the listed punctuation characters
 x = x.lower().replace("'s", "")
 for i in chrs: x = x.replace(i, "")
 return x


def tf(f): # reads file f and returns a dict of words from f with normalized counts
 wrds, wc = [], {}
 fh = open(f, "r")
 fc = fh.readlines()
 fh.close()
 for i in fc: wrds += i.replace("\n", "").split(" ")
 wrds = [delchrs(i, ".?,;:*\"\'()!_") for i in wrds]
 for i in wrds:
  if i not in wc: wc[i] = wrds.count(i) / float(len(wrds)) # term frequency
 return wc


g=open("txtfiles.txt","w");
f=[];
for root, dirs, files in os.walk("c:\\"):
 for name in files:
  if name[-4:]==".txt" and ((root.find("summarizer")>-1 and len(name)==36) or root.find("gutenberg")>-1): 
   f+=[join(root,name)];
   g.write(f[-1]+"\n");
   print f[-1];
g.close();


# build the tf database, keyed by each file's name prefix
tfdict = {}
for i in f:
 print "processing " + i + " ..."
 tfdict[i.split("\\")[-1].split("_")[0]] = tf(i)


tfpck = open("tfpickle.pkl", "wb")
pickle.dump(tfdict, tfpck)
tfpck.close()


# collect the word lists of all documents
all_wrds, all_dicts = [], []
for i in tfdict:
 all_wrds += tfdict[i].keys()
 all_dicts += [tfdict[i].keys()]
all_wrds = list(set(all_wrds))


# idf of each unique word: log(number of documents / number of documents containing the word)
idf = {}
g = open("idf.txt", "w")
all_docs = float(len(all_dicts))
for i in all_wrds:
 idf[i] = math.log(all_docs / sum([1 for j in all_dicts if i in j]))
 g.write(i + "," + str(idf[i]) + "\n")
g.close()


idfpck = open("idfpickle.pkl", "wb")
pickle.dump(idf, idfpck)
idfpck.close()


# reload both pickles as a quick sanity check that they round-trip cleanly
tfpck = open("tfpickle.pkl", "rb")
x = pickle.load(tfpck)
tfpck.close()


idfpck = open("idfpickle.pkl", "rb")
y = pickle.load(idfpck)
idfpck.close()


tfidf_summarizer.py: (tfidf code tweaked for the auto-summarizer implementation; also includes the summarizer logic)

# 2nd step of the process - constructing the index
import pickle, glob, re
from operator import itemgetter


def delchrs(x, chrs):
 # same cleaning helper as in tf_summarizer.py: lower-case, drop possessives and punctuation
 x = x.lower().replace("'s", "")
 for i in chrs: x = x.replace(i, "")
 return x


# read the tf and idf objects
# tff contains the tf dictionaries for each file as index
tfpck = open("tfpickle.pkl", "rb")
tff = pickle.load(tfpck)
tfpck.close()


# idfs contains the idf dictionary for each unique word in the corpus
idfpck = open("idfpickle.pkl", "rb")
idfs = pickle.load(idfpck)
idfpck.close()


# tf-idf score of each word in each document
tfidf = {}
for i in tff:
 r = {}
 for j in tff[i]:
  r[j] = tff[i][j] * idfs[j]
 tfidf[i] = r


def clnSentence(x):
 # keep only printable ASCII characters (codes 32 to 126)
 x2 = [i for i in x if ord(i) >= 32 and ord(i) <= 126]
 return "".join(x2).strip()


def getSentences(x): # split the story into constituent sentences
 # assumes the scraper saved one sentence per line; blank lines are dropped
 r = []
 f = open(x, "r")
 fc = f.read()
 f.close()
 for i in fc.split("\n"):
  s = clnSentence(i)
  if s: r += [s]
 return r
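

# alternative splitter (an illustrative sketch, not used below): if a story
# arrives as one long paragraph rather than one sentence per line, a crude
# regex split on sentence-ending punctuation can be substituted
def getSentencesRegex(x):
 f = open(x, "r")
 fc = f.read()
 f.close()
 return [s.strip() for s in re.split(r"(?<=[.!?])\s+", clnSentence(fc)) if s.strip()]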


x = glob.glob("*.txt")
x = [i for i in x if len(i) == 36]
# we save the stories to be summarized under filenames composed of their MD5 hash;
# this makes it easy to locate documents to be processed, and explains the "36" in the check above
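# (illustrative) such a name can be produced with hashlib:
#   hashlib.md5(story_text).hexdigest() + ".txt"  ->  32 hex chars + ".txt" = 36 chars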
g=open("summaries.txt","w");
dcts={};
for i in x:
 g.write("original story follows: \n");
 p=getSentences(i);
 for j in p: g.write(j);
 g.write("\n\n summary follows: \n");
 svals=[];
 # for each sentence, 
 #  break into words, 
 #  then compute total tfidf for each sentence... 
 #  divide each sentence tfidf by the number of words in the sentence
 #  create a dictionary with the sentence as key and average tfidf as value.
 #  sort sentences on value, pick top 30% of sentences in summary
 for j in p: 
  wrds=j.split(" ");
  line_tfidf=sum([tff[i][k] for k in wrds if k in tff[i]]);
  avg_line_tfidf=line_tfidf/float(len(wrds));
  svals+=[(j,avg_line_tfidf)];
 svals=sorted(svals, key=itemgetter(1), reverse=True); # descending sort
 svals=[j[0] for j in svals[:len(p)/3]][:-1];
 g.write(p[0]+" "); # always include first sentence for context
 for j in p[1:]: 
  if j in svals: g.write(j+" ");
 g.write("\n\n\n");


g.close();
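

To produce summaries, one would first run tf_summarizer.py to build and pickle the tf and idf tables, then run tfidf_summarizer.py from the directory holding the hash-named story files; the output lands in summaries.txt.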


Sample full story and generated summary:

original story follows: 
NEW YORK (CNNMoney) -- Household wealth took its biggest hit since the height of the 2008 financial meltdown during the third quarter, weakened by a downturn in stocks, according to a report issued Thursday. The Federal Reserve said the net worth of households fell by $2.2 trillion, or 4.1%, to end at $57.4 trillion. The decline comes to about $7,800 for every U.S. resident. It's the biggest decline since the $5.6 trillion loss suffered in the fourth quarter of 2008. The drop in stocks in the quarter more than explained the overall decline in net worth, as the value of stocks held directly or indirectly fell by $3.2 trillion, or 17%. That's a bit worse than the 14% decline in the blue-chip Standard & Poor's 500 index during the quarter. The period included the downgrade of the U.S. credit rating by S&P, along with rising worries about the risk of a new U.S. recession and a new meltdown in the financial sector due to the European debt crisis. But U.S. stocks have performed well since the end of the third quarter, with the S&P 500 rebounding by 10.1% so far. So household net worth might be poised for a nice rebound. The report also showed that the value of real estate held by households fell by $98 billion, but that was a drop of only 0.6%. Homeowner equity in homes fell by an even smaller amount, only $44.6 billion, as homeowners cut the amount of mortgage debt they owed. Gregory Daco, principal U.S. economist for IHS Global Insight, said the report shows that a solid 2% gain in consumer spending in the quarter was being done by Americans dipping into their savings, not by any growth in income or net worth. "IHS expects a strong fourth quarter rebound in household net worth resulting mostly from a rally in stock prices," he wrote in a note Thursday. "This should provide some support to consumer spending as employment growth continues to make progress and consumers cheer up ahead of the holidays." Household wealth plunged $16.3 trillion in the two years from early 2007 to the first quarter of 2009, and has slowly been climbing since then. But with the drop in the third quarter of this year, households find their net worth still $9.4 trillion, or 14%, below the high they hit in early 2007, before the bursting of the housing bubble.

summary follows: 
NEW YORK (CNNMoney) -- Household wealth took its biggest hit since the height of the 2008 financial meltdown during the third quarter, weakened by a downturn in stocks, according to a report issued Thursday. It's the biggest decline since the $5.6 trillion loss suffered in the fourth quarter of 2008. The drop in stocks in the quarter more than explained the overall decline in net worth, as the value of stocks held directly or indirectly fell by $3.2 trillion, or 17%. That's a bit worse than the 14% decline in the blue-chip Standard & Poor's 500 index during the quarter. The period included the downgrade of the U.S. credit rating by S&P, along with rising worries about the risk of a new U.S. recession and a new meltdown in the financial sector due to the European debt crisis. Household wealth plunged $16.3 trillion in the two years from early 2007 to the first quarter of 2009, and has slowly been climbing since then. But with the drop in the third quarter of this year, households find their net worth still $9.4 trillion, or 14%, below the high they hit in early 2007, before the bursting of the housing bubble.