Wednesday, December 7, 2011

tf-idf implementation

To build most (if not all) of the services we described in our introductory post, we need an implementation of a TF-IDF (term frequency-inverse document frequency) library. This is described very well here. There are several implementations in various languages, including C, C++, and C# (e.g. Niniane has also contributed an implementation here). However, to truly understand how this works, we built our own in Python, below.

The idea is simple. First we compute the term frequency of each word in a document: the number of occurrences of the word, normalized by the document's length (so words occurring in documents of different lengths are not (dis)advantaged in any way). Next, we compute the inverse document frequency: the log-transformed reciprocal of the fraction of documents in the set under consideration (the "corpus") that contain the word. Then we multiply the two metrics for each word, so words that occur relatively frequently in a document but infrequently in the corpus get the highest scores. Amazon.com used to list "Statistically Improbable Phrases" (SIPs) for each book sold on its website some time back; that is akin to what TF-IDF gives us.
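To make the arithmetic concrete, here is a toy example (the three-document corpus and all names in it are made up purely for illustration):

import math

# toy corpus: three tiny "documents", each already reduced to a word list
docs = {
    "doc1": ["the", "cat", "sat", "on", "the", "mat"],
    "doc2": ["the", "dog", "sat"],
    "doc3": ["the", "mat"],
}

# tf("cat", doc1) = 1 occurrence / 6 words ~ 0.167
tf_cat = docs["doc1"].count("cat") / float(len(docs["doc1"]))

# idf("cat") = log(3 docs / 1 doc containing "cat") ~ 1.099;
# note idf("the") = log(3/3) = 0, so a word in every document scores zero
idf_cat = math.log(3.0 / 1)

print(tf_cat * idf_cat)  # tf-idf of "cat" in doc1 ~ 0.183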

The source code for our simple implementation is attached below ... and I will edit this post over time to make additions (more functionality, explanatory comments, etc.) and modifications. No, this is not the most efficient implementation; a C implementation may be 1000x faster. But what it lacks in performance it makes up for in simplicity, clarity and transparency.

For projects we do later, like the desktop search prototype we build atop this, we need to wrap this tf-idf computation in logic that handles additional events. For instance, a new file may be added to the corpus (e.g. if we are processing all .txt files on a computer, the user may have created a new .txt file since the last indexing operation on this machine), a file may be deleted from the corpus, or a file may be modified (which could change the tf values for words in that file, and the idf values for words across the corpus). These details will be addressed in the desktop search implementation; we gloss over them for now, though the sketch below shows the shape of the bookkeeping involved.
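As a rough sketch of that bookkeeping (the handler names below are hypothetical and the dicts are keyed by file path for simplicity; tf() is the function defined in tf.py below):

import math

def recompute_idf(tfdict):
    # same formula as in tf.py: log(number of docs / docs containing the word)
    n = float(len(tfdict))
    all_wrds = set(w for d in tfdict.values() for w in d)
    return dict((w, math.log(n / sum(1 for d in tfdict.values() if w in d)))
                for w in all_wrds)

def on_file_added(path, tfdict):
    tfdict[path] = tf(path)       # only the new file's tf is computed ...
    return recompute_idf(tfdict)  # ... but idf changes across the whole corpus

def on_file_deleted(path, tfdict):
    del tfdict[path]
    return recompute_idf(tfdict)

def on_file_modified(path, tfdict):
    tfdict[path] = tf(path)       # tf values for this file may change,
    return recompute_idf(tfdict)  # and idf values corpus-wide

The point of the sketch is simply that any change to the corpus invalidates the idf values, so only the per-file tf computation can be reused as-is.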

Please see disclaimer and copyright notice from the main post for all code.


tf.py:

import os, sys, math
import pickle
from os.path import join


def delchrs(x, chrs):
    # lowercase a word and strip possessives and punctuation
    x = x.lower().replace("'s", "")
    for i in chrs:
        x = x.replace(i, "")
    return x


def tf(f):
    # read file f and return a dict mapping each word to its frequency,
    # normalized by the total word count of the file
    wrds, wc = [], {}
    fh = open(f, "r")
    fc = fh.readlines()
    fh.close()
    for i in fc:
        wrds += i.replace("\n", "").split(" ")
    wrds = [delchrs(i, ".?,;:*\"'()!_") for i in wrds]
    for i in wrds:
        if i not in wc:
            wc[i] = wrds.count(i) / float(len(wrds))
    return wc


# walk the drive and collect the paths of all .txt files to index;
# a corpus of Project Gutenberg novels was used for testing
g = open("txtfiles.txt", "w")
f = []
for root, dirs, files in os.walk("c:\\"):
    for name in files:
        if name[-4:] == ".txt" and root.find("gutenberg") > -1:
            f += [join(root, name)]
            g.write(f[-1] + "\n")
            print(f[-1])
g.close()


# build the tf database: one word-frequency dict per document,
# keyed by a short name derived from the file name
tfdict = {}
for i in f:
    print("processing " + i + " ...")
    tfdict[i.split("\\")[-1].split("_")[0]] = tf(i)


tfpck = open("tfpickle.pkl", "wb")
pickle.dump(tfdict, tfpck)
tfpck.close()


# collect the word lists of all documents
all_wrds, all_dicts = [], []
for i in tfdict:
    all_wrds += tfdict[i].keys()
    all_dicts += [list(tfdict[i].keys())]
all_wrds = list(set(all_wrds))


# compute the idf of each unique word: the log of the total number of
# documents divided by the number of documents containing the word
idf = {}
g = open("idf.txt", "w")
all_docs = float(len(all_dicts))
for i in all_wrds:
    idf[i] = math.log(all_docs / sum([1 for j in all_dicts if i in j]))
    g.write(i + "," + str(idf[i]) + "\n")
g.close()


idfpck = open("idfpickle.pkl", "wb")
pickle.dump(idf, idfpck)
idfpck.close()


# quick sanity check: reload both pickles to make sure they read back
tfpck = open("tfpickle.pkl", "rb")
x = pickle.load(tfpck)
tfpck.close()


idfpck = open("idfpickle.pkl", "rb")
y = pickle.load(idfpck)
idfpck.close()





idf.py:

# 2nd step of the process - constructing the index
import os, sys, math
import pickle
from os.path import join


# read the tf and idf objects back from disk
# tff maps each document name to its tf dictionary
tfpck = open("tfpickle.pkl", "rb")
tff = pickle.load(tfpck)
tfpck.close()


# idfs maps each unique word in the corpus to its idf value
idfpck = open("idfpickle.pkl", "rb")
idfs = pickle.load(idfpck)
idfpck.close()


# multiply tf by idf for every word in every document
tfidf = {}
for i in tff:
    r = {}
    for j in tff[i]:
        r[j] = tff[i][j] * idfs[j]
    tfidf[i] = r


# for each document, write out the 25 highest-scoring words -
# roughly the document's "statistically improbable" vocabulary
g = open("imp-words.txt", "w")
dcts = {}
for i in tff:
    # invert the word->score mapping so that tied scores group their words
    dct = {}
    x = list(tfidf[i].values())
    y = list(tfidf[i].keys())
    for j in zip(x, y):
        if j[0] in dct:
            dct[j[0]] += [j[1]]
        else:
            dct[j[0]] = [j[1]]
    x = list(set(x))
    x.sort()
    x.reverse()
    g.write("=> " + i + "\n")
    for j in x[:25]:
        g.write("    " + str(round(j, 3)) + " " + str(dct[j]) + "\n")
    dcts[i] = dct
g.close()
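
For completeness, here is one way the resulting tfidf index could be queried; this rank() helper is a hypothetical illustration, not part of the scripts above:

def rank(query, tfidf):
    # score each document by summing the tf-idf values of the query words
    # it contains; query words would need the same delchrs() normalization
    # that was applied when the index was built
    q = [w.lower() for w in query.split()]
    scores = {}
    for doc in tfidf:
        scores[doc] = sum(tfidf[doc].get(w, 0.0) for w in q)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# e.g. rank("whale ship", tfidf)[:5] lists the five best-matching documents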