Saturday, June 2, 2012

The Itsy Bitsy Spider

In nature, spiders are beautiful things. Yes, particular specimens can come across as creepy, with their long spindly legs and abrupt, quick movements. But viewed in the abstract, they do some remarkable things - for one, the silk they spin to build their webs is considered stronger and more resilient than a steel fiber of the same diameter. In this post, however, we talk about a different kind of spider altogether.

The WWW is a mesh or web of hyperlinks. What better creature to crawl this abstract space and extract information from this structure of data and knowledge than a spider - and indeed, programs that perform this function are called spiders. In this post, we present a simple Python spider program that crawls the web. (a snake and a spider in one sentence ...)

Let us look at some applications of this kind of program.

  1. Say you are trying to scrape all the content that hangs off a single web-page through its hyperlinks. One way of gathering this content is to write a program that iteratively processes each link on the page and collects the data it reads, organizing it into separate files. Freeware programs like HTTrack, the website copier, do just this. But such a spider is severely limited, since it is restricted either to the links that hang off the specified web-page, or to links hosted by a particular domain or sub-domain of the internet, e.g. www.my-company.com - a restriction that boils down to a simple host check on each candidate link, as sketched right after this list.
  2. The more exotic application - document retrieval for web search. For Google to be able to respond to search queries, it must first build an indexed data set of documents from across the web. To do this, Google unleashes multiple multi-threaded spiders onto the web; they collect every document that makes sense to collect (using some intelligent criteria), following links from one document to the next, until a large part of the web has been gathered. Google then indexes these documents to construct a data set that can be queried rapidly in response to user searches.
  3. Avinash Kaushik, in his excellent books on web analytics, distinguishes between internal search (as in, within a company's intranet, for employees and other internal users) and external search applications. The two perhaps merit separate mention even in this simple classification, which is why we include this bullet. From a web-crawler perspective, however, the mechanics for (3) are very similar to those for (2), with some of the limitations from (1) imposed on the spiders in question.
  4. Desktop search - here the spider crawls the directory tree of one's file system instead of the Internet, but the operational principle is the same.
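
As an aside, the domain restriction mentioned in (1) boils down to a simple host comparison on every candidate link. Below is a minimal sketch using Python 2's urlparse module; the host www.my-company.com is only a placeholder, and this check is not part of the spider presented later.

from urlparse import urlparse  # urllib.parse in Python 3

ALLOWED_HOST = "www.my-company.com"  # placeholder host, for illustration only

def within_domain(url):
    # keep a link only if its host matches the domain we are allowed to crawl
    return urlparse(url).netloc.lower() == ALLOWED_HOST

links = ["http://www.my-company.com/about.html", "http://www.elsewhere.com/index.html"]
allowed = [u for u in links if within_domain(u)]  # only the first link survives
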
We focus our implementation on a very simple instance of (2) above. There is also etiquette one needs to follow when building spiders (e.g. the robots exclusion protocol, whereby a site's robots.txt file tells crawlers which paths they may visit). Since we are building not an industrial-strength implementation but one just for fun, we gloss over these details and "hobble" our implementation, constraining it to explore no more than a pre-set maximum number of links before it halts.
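
For the curious, honoring that etiquette is straightforward; a minimal sketch using the Python 2 standard-library robotparser module (not part of our hobbled spider; www.example.com is just a placeholder site) might look like this:

import robotparser  # urllib.robotparser in Python 3

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # placeholder site
rp.read()  # download and parse the site's robots.txt

# before fetching a page, check whether our user agent is allowed to
if rp.can_fetch("*", "http://www.example.com/some/page.html"):
    pass  # safe to download this page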

Also, for the layperson reading this: while it is somewhat glamorous to visualize this process as the spider crawling the WWW, what really happens, of course, is that the spider program stays resident on the host computer running it. It simply downloads html documents starting from the specified site, follows hyperlinks wherever they may lead, downloads those other documents, and so on.

A comparison of how many web-pages the various web search engines have indexed shows that, up to a point, increasing the index size improves search quality. We say "up to a point" because a web search engine called Cuil (2008-2010) at one time indexed slightly over 120B web-pages, yet its inability to produce search results of comparable quality to other leading search engines led to its early demise. To be a popular web search destination, it is not enough to index as many meaningful, non-repeating, non-machine-generated, non-interstitial pages from the web as possible; it is also important to return results that provide the kinds of information the user is looking for.

Code is below. Sample output follows the code.


import sys, urllib2, re

u = sys.argv[1]              # the url to start with
lynx = [u]                   # the crawl frontier, seeded with the start url
for u in lynx:               # process each link in turn; new links are appended as we go
    r = urllib2.urlopen(u)   # fetch the requested document
    x = r.read()             # read the raw html from the website
    tokens = x.split(" ")    # split the document into whitespace-separated tokens
    c = [i for i in tokens if i.lower().find("http://") > -1]  # keep tokens that contain hyperlinks
    for i in c:                                # process each candidate hyperlink, ignoring some tokens
        if i.find("CALLOUT|") > -1: continue   # skip known non-content markup
        if i.find(".html") == -1: continue     # only follow links to .html pages
        if i.find("\n") > -1: continue         # skip tokens that span multiple lines
        stpos = i.lower().index("http://")     # start of the url within the token
        enpos = [k.start() for k in re.finditer(r"\.html", i)]  # candidate end positions of the url
        enpos = [j for j in enpos if j > stpos + 13]            # discard matches too close to the start
        if len(enpos) == 0: continue
        enpos = min(enpos)                     # take the earliest plausible end position
        candidate = i[stpos:enpos] + ".html"   # the extracted url
        if candidate not in lynx: lynx += [candidate]  # add unseen urls to the frontier
    if len(lynx) > 100: break   # hobble the spider so it doesn't go haywire

g = open("t.txt", "w")          # write the crawled urls to file
for i in lynx: g.write(i + "\n")
g.close()
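
To run the spider, a hypothetical invocation (assuming the script above is saved as spider.py; both the script name and the starting url are placeholders) would be:

python spider.py http://www.example.com/index.html

When it halts, the crawled urls are listed one per line in t.txt.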



