Wednesday, December 7, 2011

What I think about when I think about Search

We make no claims that anything on this blog is innovative in any real sense - these ideas have all been developed and implemented already, and many companies have made lots of money from them. What we do here, though, is study some of these ideas with a view to developing a deeper understanding of how they might work, which might in turn help us design really innovative services over time...

When we run the tf-idf program from my other post, we get a dictionary (in the Python sense) for each file: the keys are the unique words in the file, and the values are the ratios of each word's frequency of occurrence to the total number of words in the file. Now consider a corpus with a large set of documents (we assume the search engine's crawlers have already built this corpus in the background, and are continuing to add to it) and a search engine GUI that takes in a set of words from a user. The search process may, in its simplest form, proceed as follows:

  1. look at all the words and delete "stop words" - words like "is", "the", etc., that occur very frequently and don't really add much value to processing.
  2. process the remaining words to extract a "workable" form from them using a mechanism like the Porter stemming algorithm. This reduces word derivations to a common root (e.g. "search", "searching", "searched", etc., all mean more or less the same thing, and while searching for documents containing one variant, we shouldn't preclude matches for documents containing the others).
  3. next, find all documents that contain all (in the simplest implementation) or at least one (in a slightly more involved implementation) of the words input by the user.
  4. sort the documents selected in step (3) by some function of their tf-idf weights for the query words (say, the sum of the tf-idfs).
  5. return the sorted list from (4) as the result of the search query.
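The steps above can be sketched in a few dozen lines of Python. This is a toy illustration over a made-up three-document corpus: it uses a crude suffix-stripper as a stand-in for the real Porter stemmer, the "all query words must match" variant of step (3), and the sum of tf-idf weights as the ranking function in step (4):

```python
import math

# A tiny made-up corpus; in practice the crawlers build this in the background.
corpus = {
    "doc1": "searching the web is fun",
    "doc2": "the web search engine ranks pages",
    "doc3": "cooking recipes for the holidays",
}

STOP_WORDS = {"is", "the", "for", "a", "an", "and"}

def stem(word):
    # Crude suffix stripping - a stand-in for the Porter stemming algorithm.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    # Steps (1) and (2): drop stop words, then stem what remains.
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf_scores(corpus):
    # Per-document dict: word -> tf * idf, where tf is the frequency ratio
    # described above and idf = log(N / number of documents containing the word).
    docs = {name: tokenize(text) for name, text in corpus.items()}
    n = len(docs)
    df = {}
    for words in docs.values():
        for w in set(words):
            df[w] = df.get(w, 0) + 1
    return {
        name: {
            w: (words.count(w) / len(words)) * math.log(n / df[w])
            for w in set(words)
        }
        for name, words in docs.items()
    }

def search(query, corpus):
    terms = tokenize(query)
    scores = tf_idf_scores(corpus)
    # Step (3): keep documents containing all query terms (simplest form).
    matches = [name for name, s in scores.items() if all(t in s for t in terms)]
    # Steps (4) and (5): rank by the sum of tf-idf weights, highest first.
    return sorted(matches, key=lambda d: -sum(scores[d][t] for t in terms))

print(search("searching the web", corpus))  # → ['doc1', 'doc2']
```

Both doc1 and doc2 contain "search" and "web" after stemming; doc1 ranks higher because those two words make up a larger fraction of it.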
Search engines differentiate themselves primarily in how they perform steps (3) and (4) above. Google distinguished itself from other search engines in the early days through creative use of the PageRank algorithm (designed by Larry Page - nice pun! - together with Sergey Brin) and has maintained its lead with advances to the algorithm over time.
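The core idea behind PageRank - a page is important if important pages link to it - can be sketched with a few lines of power iteration. This is a simplified illustration on a made-up three-page link graph; it ignores real-world complications like dangling pages and the sheer scale of the web:

```python
# Made-up link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Power iteration: each page repeatedly distributes its current rank
    equally among the pages it links to, damped by a random-jump factor."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for p, outgoing in links.items():
            share = rank[p] / len(outgoing)
            for q in outgoing:
                new_rank[q] += damping * share
        rank = new_rank
    return rank

ranks = pagerank(links)
# C accumulates rank from both A and B, so it ends up scoring highest.
```

A real implementation would use sparse matrices and run until the ranks converge, but the fixed point is the same: the rank vector is the stationary distribution of a random surfer who mostly follows links and occasionally jumps to a random page.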

I tend to view each document in tf-idf space (or rather, the dictionary generated for each document, to be more precise) as a vector in n-dimensional space, where the number of dimensions equals the number of unique words in the document. The search operation then squashes every document into a lower-dimensional space (LDS) whose dimensions are defined by the words in the "cleaned up" input search string (post step 2 above). We then effectively compute the angle between each document vector in the LDS and the vector defined by the search string. The smaller the angle, and the greater the magnitude of the document vector, the higher the document scores in our search results... a geometric way of thinking about how search works.
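The angle part of this picture is what's usually called cosine similarity. A minimal sketch, using made-up tf-idf dictionaries and a query vector whose components are all 1 - the cosine captures the angle alone; multiplying the score by the projected document norm would fold in the magnitude factor as well:

```python
import math

def cosine_score(doc_weights, query_terms):
    """Project a document's tf-idf dictionary onto the query's dimensions
    (the LDS) and return the cosine of the angle with the query vector,
    where each query term is weighted 1."""
    projected = [doc_weights.get(t, 0.0) for t in query_terms]
    doc_norm = math.sqrt(sum(v * v for v in projected))
    if doc_norm == 0:
        return 0.0  # the document is orthogonal to the query
    query_norm = math.sqrt(len(query_terms))
    dot = sum(projected)  # dot product; query components are all 1
    return dot / (doc_norm * query_norm)

# Hypothetical tf-idf dictionaries for two documents:
doc_a = {"search": 0.30, "web": 0.20, "engine": 0.10}
doc_b = {"search": 0.05, "cooking": 0.40}
query = ["search", "web"]
# doc_a points much closer to the query direction than doc_b does.
```

Note how the projection discards doc_b's large "cooking" component entirely: only the dimensions named by the query survive the squashing into the LDS.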

References: 
"Math Will Rock Your World", BusinessWeek article from some time ago.
"The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin & Larry Page.