Wednesday, December 14, 2011

Document Auto-Summarizer

In these busy times, it is often enough to get a quick gist of a news story and read the details later, only if the story turns out to be interesting. Some news sites encourage this behavior by posting the first paragraph or the first few lines of a story, with a convenient "read more..." link underneath to encourage those interested to read the rest.

However, there is potentially a more useful way of engaging one's audience. Instead of just posting the first paragraph of the story, wouldn't it be nice if the story were summarized? This way, the reader can read and assimilate the most important points from the story with minimal loss of content, while details that might not matter are skipped. The time savings may more than offset the loss of detail in the summary.

What we need to build an auto-summarizer is already available to us, and in this post we study how to build such a system. What follows is a description of how such a system could work; in a later edit to the post we share the actual source code.

In another post we have already seen how TF-IDF works. Now, given a corpus full of news stories (I use some public domain English novels downloaded from Project Gutenberg along with the news stories to be summarized that I scraped from some news websites as my corpus), we can construct the TF-IDF values for every unique word in each file. This is easily done with a few tweaks to the TF-IDF implementation from the other post. (We include the tweaked versions in full below as part of the source code for this project.)

Next, we use a simple heuristic to aid us in auto-summarizing documents with the following steps:
  1. We compute the sum of TF-IDF scores for each word in each sentence in the document, and index each sentence by this score divided by the number of words in it. This metric helps us treat all sentences fairly without giving preferential treatment to longer sentences.
  2. Next, we sort the sentences in descending order of the metric computed above, and cut off this list at about a third of the total number of sentences (this fraction may be changed as appropriate).
  3. Finally, we generate output. To do this, we always include the first sentence of any story to give context, then we list those sentences from the story, in order, that appear in the sorted list generated in step 2. These are the only sentences that made the cut. Statistically speaking, these are the most interesting or relevant sentences from the story.
  4. Print out these sentences under the story header as a summary of the story in question.
If we want to limit the summary size to a pre-defined word threshold, we can easily do this by scaling the number of selected sentences up or down in step 2. With this, we have a document auto-summarizer. While it works in a useful and usable sense, the implementation has no notion of context. Thus, the summary might sometimes appear "jumpy" (it may not flow as well as a précis written by a human, for example), and might not be completely contextually accurate.
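The heuristic above can be sketched compactly as follows (a Python 3 sketch; the names summarize and keep_ratio are illustrative and not taken from the code below):

```python
def summarize(sentences, tfidf, keep_ratio=1.0 / 3):
    # Step 1: score each sentence by its average per-word TF-IDF.
    # tfidf maps each word to its precomputed TF-IDF score for this document.
    scored = []
    for s in sentences:
        words = s.split()
        if not words:
            continue
        avg = sum(tfidf.get(w, 0.0) for w in words) / len(words)
        scored.append((s, avg))
    # Step 2: sort descending and keep roughly the top third.
    scored.sort(key=lambda t: t[1], reverse=True)
    cutoff = max(1, round(len(scored) * keep_ratio))
    keep = {s for s, _ in scored[:cutoff]}
    # Step 3: always lead with the first sentence for context, then emit
    # the surviving sentences in original document order.
    return [sentences[0]] + [s for s in sentences[1:] if s in keep]
```

Scaling keep_ratio up or down is exactly the word-threshold knob mentioned above.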

Code follows below, along with a sample full story and the corresponding auto-generated summary:

tf_summarizer.py: (tf code tweaked for the auto-summarizer implementation)

import os, sys, math;
import pickle;
from os.path import join;


def delchrs(x,chrs):
 x=x.lower().replace("'s","");
 for i in chrs: x=x.replace(i,"");
 return x;


def tf(f): # reads file f and returns a dict mapping each word in f to its frequency normalized by document length
 wrds,wc=[],{};
 fh=open(f,"r");
 fc=fh.readlines();
 for i in fc: wrds+=i.replace("\n","").split(" ");
 wrds=[delchrs(i,".?,;:*\"\'()!_") for i in wrds];
 for i in wrds:
  if i not in wc: wc[i]=wrds.count(i)/float(len(wrds));
 return wc;


g=open("txtfiles.txt","w");
f=[];
for root, dirs, files in os.walk("c:\\"):
 for name in files:
  if name[-4:]==".txt" and ((root.find("summarizer")>-1 and len(name)==36) or root.find("gutenberg")>-1): 
   f+=[join(root,name)];
   g.write(f[-1]+"\n");
   print f[-1];
g.close();


# build tf database
tfdict={};
for i in f:
 print "processing "+i+" ...";
 tfdict[i.split("\\")[-1].split("_")[0]]=tf(i);


tfpck=open("tfpickle.pkl","wb");
pickle.dump(tfdict,tfpck);
tfpck.close();


# collect all dict key lists
all_wrds,all_dicts=[],[];
for i in tfdict: 
 all_wrds+=tfdict[i].keys();
 all_dicts+=[tfdict[i].keys()];
all_wrds=list(set(all_wrds));


idf={};
g=open("idf.txt","w");
all_docs=float(len(all_dicts));
for i in all_wrds:
 idf[i]=math.log(all_docs/sum([1 for j in all_dicts if i in j]));
 g.write(i+","+str(idf[i])+"\n");
g.close(); 


idfpck=open("idfpickle.pkl","wb");
pickle.dump(idf,idfpck);
idfpck.close();


tfpck=open("tfpickle.pkl","rb");
x=pickle.load(tfpck);
tfpck.close();


idfpck=open("idfpickle.pkl","rb");
y=pickle.load(idfpck);
idfpck.close();
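Incidentally, the counting loop in tf() above is quadratic, since wrds.count() rescans the whole list for every word. A linear-time equivalent using collections.Counter might look like this (a Python 3 sketch; tf_counts is a hypothetical name, not part of the code above):

```python
from collections import Counter

def tf_counts(words):
    # Normalized term frequency: occurrences of each word divided by the
    # total number of words. Counter makes a single pass over the list.
    total = float(len(words))
    return {w: c / total for w, c in Counter(words).items()}
```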


tfidf_summarizer.py:
(tfidf code tweaked for the auto-summarizer implementation; also includes the summarizer logic)

# 2nd step of the process - constructing the index
import os, sys, math;
import pickle, glob, re;
from operator import itemgetter;
from os.path import join;


# read the tf and idf objects
# tff contains the tf dictionaries for each file as index
tfpck=open("tfpickle.pkl","rb");
tff=pickle.load(tfpck);
tfpck.close();


# idfs contains the idf dictionary for each unique word in the corpus
idfpck=open("idfpickle.pkl","rb");
idfs=pickle.load(idfpck);
idfpck.close();


tfidf={};
for i in tff: 
 r={};
 for j in tff[i]: 
  r[j]=tff[i][j]*idfs[j];
 tfidf[i]=r;


def clnSentence(x):
 # keep only printable ASCII characters (codes 32 to 126) in each line
 x2=[i for i in x if ord(i)>=32 and ord(i)<=126];
 return "".join(x2).strip();


def getSentences(x): # split the story into constituent sentences
 r=[];
 f=open(x,"r");
 fc=f.read();
 fc1=fc.split("\n");
 for i in fc1: r+=[clnSentence(i)];
 return r;


x=glob.glob("*.txt");
x=[i for i in x if len(i)==36]; 
# we save stories to summarize with filenames composed of their MD5 hash;
# makes it easy to locate documents to be processed (32 hex characters + ".txt" explains the "36" in the check above).
g=open("summaries.txt","w");
dcts={};
for i in x:
 g.write("original story follows: \n");
 p=getSentences(i);
 for j in p: g.write(j);
 g.write("\n\n summary follows: \n");
 svals=[];
 # for each sentence, 
 #  break into words, 
 #  then compute total tfidf for each sentence... 
 #  divide each sentence tfidf by the number of words in the sentence
 #  create a dictionary with the sentence as key and average tfidf as value.
 #  sort sentences on value, pick top 30% of sentences in summary
 for j in p: 
  wrds=j.split(" ");
  line_tfidf=sum([tfidf[i][k] for k in wrds if k in tfidf[i]]); # tf-idf scores, per the comments above
  avg_line_tfidf=line_tfidf/float(len(wrds));
  svals+=[(j,avg_line_tfidf)];
 svals=sorted(svals, key=itemgetter(1), reverse=True); # descending sort
 svals=[j[0] for j in svals[:len(p)/3]][:-1];
 g.write(p[0]+" "); # always include first sentence for context
 for j in p[1:]: 
  if j in svals: g.write(j+" ");
 g.write("\n\n\n");


g.close();


Sample full story and generated summary:

original story follows: 
NEW YORK (CNNMoney) -- Household wealth took its biggest hit since the height of the 2008 financial meltdown during the third quarter, weakened by a downturn in stocks, according to a report issued Thursday. The Federal Reserve said the net worth of households fell by $2.2 trillion, or 4.1%, to end at $57.4 trillion. The decline comes to about $7,800 for every U.S. resident. It's the biggest decline since the $5.6 trillion loss suffered in the fourth quarter of 2008. The drop in stocks in the quarter more than explained the overall decline in net worth, as the value of stocks held directly or indirectly fell by $3.2 trillion, or 17%. That's a bit worse than the 14% decline in the blue-chip Standard & Poor's 500 index during the quarter. The period included the downgrade of the U.S. credit rating by S&P, along with rising worries about the risk of a new U.S. recession and a new meltdown in the financial sector due to the European debt crisis. But U.S. stocks have performed well since the end of the third quarter, with the S&P 500 rebounding by 10.1% so far. So household net worth might be poised for a nice rebound. The report also showed that the value of real estate held by households fell by $98 billion, but that was a drop of only 0.6%. Homeowner equity in homes fell by an even smaller amount, only $44.6 billion, as homeowners cut the amount of mortgage debt they owed. Gregory Daco, principal U.S. economist for IHS Global Insight, said the report shows that a solid 2% gain in consumer spending in the quarter was being done by Americans dipping into their savings, not by any growth in income or net worth. "IHS expects a strong fourth quarter rebound in household net worth resulting mostly from a rally in stock prices," he wrote in a note Thursday.
"This should provide some support to consumer spending as employment growth continues to make progress and consumers cheer up ahead of the holidays." Household wealth plunged $16.3 trillion in the two years from early 2007 to the first quarter of 2009, and has slowly been climbing since then. But with the drop in the third quarter of this year, households find their net worth still $9.4 trillion, or 14%, below the high they hit in early 2007, before the bursting of the housing bubble.

summary follows: 
NEW YORK (CNNMoney) -- Household wealth took its biggest hit since the height of the 2008 financial meltdown during the third quarter, weakened by a downturn in stocks, according to a report issued Thursday. It's the biggest decline since the $5.6 trillion loss suffered in the fourth quarter of 2008. The drop in stocks in the quarter more than explained the overall decline in net worth, as the value of stocks held directly or indirectly fell by $3.2 trillion, or 17%. That's a bit worse than the 14% decline in the blue-chip Standard & Poor's 500 index during the quarter. The period included the downgrade of the U.S. credit rating by S&P, along with rising worries about the risk of a new U.S. recession and a new meltdown in the financial sector due to the European debt crisis. Household wealth plunged $16.3 trillion in the two years from early 2007 to the first quarter of 2009, and has slowly been climbing since then. But with the drop in the third quarter of this year, households find their net worth still $9.4 trillion, or 14%, below the high they hit in early 2007, before the bursting of the housing bubble.




Wednesday, December 7, 2011

What I think about when I think about Search

We make no claims that anything on this blog is innovative in any real sense - these ideas have all been developed and implemented already, and many companies have made lots of money off of them. What we do here though is study some of these ideas with a view to developing a deeper understanding of how some of these things might work, and this might potentially help us design really innovative services over time...

When we run the tf-idf program from my other post, we get a dictionary (in the Python sense) for each file, with every unique word in the file as a key and the ratio of its frequency of occurrence to the total number of words in the file as the associated value. Now consider a corpus with a large set of documents (we assume the search engine's crawlers have already built this corpus in the background, and are continuing to add to it) and a search engine GUI that takes in a set of words from a user. The search process may, in its simplest form, proceed as follows:

  1. look at all words, delete "stop words" (these are words like "is", "the" etc., that occur very frequently and don't really add much value to processing)
  2. process the rest of the words to extract a "workable" form from them using mechanisms like "Porter's Stemming Algorithm". This reduces word derivations to their simplest form (e.g. "search", "searching", "searched" etc., are all strings that mean more or less the same thing, and while searching for documents with one, we shouldn't preclude matches for documents with other variants).
  3. next, find all documents that have all (in the simplest implementation) or at least one (slightly more involved implementation) of the words input by the user.
  4. sort these documents by some function of the weights of their tf-idfs (say the sum of the tf-idfs) for all documents selected in step (3) above. 
  5. return the list from (4) as the result of the search query.
Search engines differentiate themselves primarily in how they perform (3) and (4) above. Google distinguished itself from other search engines in the early days through creative use of the PageRank algorithm (designed by Larry Page - nice pun!) and has maintained its lead with advances to the algorithm over time.
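The five steps above can be sketched in a few lines (a Python 3 sketch; the stop-word list and the crude suffix-stripper standing in for Porter's algorithm are illustrative only, and the index is assumed to store normalized word keys):

```python
STOP = {"the", "is", "a", "an", "of", "and", "to", "in"}

def normalize(word):
    # Crude stand-in for Porter stemming: strip a few common suffixes so
    # "search", "searching" and "searched" all map to "search".
    w = word.lower()
    for suf in ("ing", "ed", "es", "s"):
        if w.endswith(suf) and len(w) > len(suf) + 2:
            return w[: -len(suf)]
    return w

def search(query, tfidf_by_doc):
    # tfidf_by_doc: {doc_name: {normalized_word: tfidf}}
    # Steps 1-2: drop stop words, normalize the rest.
    terms = [normalize(w) for w in query.split() if w.lower() not in STOP]
    hits = []
    for doc, weights in tfidf_by_doc.items():
        matched = [t for t in terms if t in weights]
        if matched:  # step 3, the "at least one word" variant
            hits.append((doc, sum(weights[t] for t in matched)))
    # Steps 4-5: rank by summed TF-IDF, best first, and return the list.
    return [d for d, _ in sorted(hits, key=lambda t: t[1], reverse=True)]
```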

I tend to view each document in tf-idf space (or rather, the generated dictionary for each document to be more precise), as a vector in n-dimensional space where the number of dimensions is the same as the number of unique words in it. The search operation then squashes every document into a lower dimensional space (LDS) whose dimensions are defined by the words in the "cleaned up" input search string (post step 2 above). Then, we effectively compute the angle between each document in the LDS with the vector defined by the search string. The smaller the angle, and the greater the magnitude of the document vector, the higher the document scores in our search results... A geometric way of thinking about how Search works.
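The angle comparison at the heart of this geometric view is just the cosine similarity between sparse vectors; a minimal sketch (Python 3, with documents represented as word-to-weight dictionaries):

```python
import math

def cosine(u, v):
    # u, v: {word: tfidf} dictionaries, i.e. sparse document vectors.
    # Returns the cosine of the angle between them: 1.0 for parallel
    # vectors, 0.0 for orthogonal ones (no words in common).
    common = set(u) & set(v)
    dot = sum(u[w] * v[w] for w in common)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)
```

Restricting u and v to the query's words before calling this is the "squashing into the lower dimensional space" described above.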

References: 
"Math Will Rock Your World" article from Business Week from some time ago.
"Anatomy of a Search Engine" by Sergei Brin & Larry Page

tf-idf implementation

To build most (if not all) of the services we described in our introductory post, we need an implementation of a TF-IDF (term frequency-inverse document frequency) library. This is described very well here. There are several implementations in various languages including C, C++, C# etc. (e.g. Niniane has also contributed an implementation here). However, for us to truly understand how this works, we build our own in Python as below.

The idea is simple. First we compute the term frequency of each word in a document: the number of occurrences of each word normalized by the document length (so words occurring in documents of different lengths are not (dis)advantaged in any way). Next, we compute the inverse document frequency, or rather the (mathematically transformed) reciprocal of the frequency of occurrence of each word within the set of documents under consideration (called the "corpus"). Then we multiply the two metrics for each word, so words that occur relatively frequently in a document but infrequently in the corpus get the highest scores. I believe Amazon.com used to have a listing of "Statistically Improbable Phrases" (or SIPs) for each book they sold on their website some time back. This is akin to what TF-IDF gives us.
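To make the arithmetic concrete, here is the computation worked end to end on a toy three-document corpus (illustrative data only; Python 3):

```python
import math

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog sat".split(),
    "d3": "cats and dogs".split(),
}

# Term frequency: occurrences normalized by document length.
tf = {name: {w: words.count(w) / float(len(words)) for w in set(words)}
      for name, words in docs.items()}

# Inverse document frequency: log of (total docs / docs containing word).
vocab = {w for words in docs.values() for w in words}
idf = {w: math.log(len(docs) / sum(1 for words in docs.values() if w in words))
       for w in vocab}

# TF-IDF for d1: "sat" appears in two of the three docs, so it scores
# lower than "cat", which is unique to d1.
tfidf_d1 = {w: tf["d1"][w] * idf[w] for w in tf["d1"]}
```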

The source code for our simple implementation is attached below ... and I will edit this post to make additions (more functionality, explanatory comments etc.) and modifications over time. No, this is not the most efficient implementation - C implementations may be 1000x faster. But what it lacks in performance it makes up for in simplicity, clarity and transparency.

For projects we do later, like the desktop search prototype we build atop this, we need to wrap this tf-idf computation in logic that handles additional events. For instance, a new file may be added to the corpus (e.g. if we are processing all .txt files on a computer, the user may have created a new .txt file since the last indexing operation on the machine), a file may be deleted (also removing it from the corpus), or a file may be modified (this could change the tf values for words in the file, and idf values for words across the corpus). These details will be addressed in the desktop search implementation; we gloss over them for now.
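One way to handle these events is to keep, alongside the per-file tf dictionaries, a global document-frequency count per word, so idf values stay cheap to recompute after each add, delete or modify. A toy sketch of the bookkeeping (Python 3; the class and method names are hypothetical, not from the desktop search code):

```python
import math

class Index:
    # Toy incremental index: per-doc normalized term frequencies plus a
    # global document-frequency count, updated on each corpus event.
    def __init__(self):
        self.tf = {}  # doc -> {word: normalized frequency}
        self.df = {}  # word -> number of docs containing it

    def add(self, doc, words):
        if doc in self.tf:
            self.remove(doc)  # a modified file is a remove + re-add
        counts = {}
        for w in words:
            counts[w] = counts.get(w, 0) + 1
        n = float(len(words))
        self.tf[doc] = {w: c / n for w, c in counts.items()}
        for w in counts:
            self.df[w] = self.df.get(w, 0) + 1

    def remove(self, doc):
        for w in self.tf.pop(doc, {}):
            self.df[w] -= 1
            if self.df[w] == 0:
                del self.df[w]

    def idf(self, word):
        return math.log(len(self.tf) / float(self.df[word]))
```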

Please see disclaimer and copyright notice from the main post for all code.


tf.py:

import os, sys, math;
import pickle;
from os.path import join;


def delchrs(x,chrs):
 x=x.lower().replace("'s","");
 for i in chrs: x=x.replace(i,"");
 return x;


def tf(f): # reads file f and returns a dict mapping each word in f to its frequency normalized by document length
 wrds,wc=[],{};
 fh=open(f,"r");
 fc=fh.readlines();
 for i in fc: wrds+=i.replace("\n","").split(" ");
 wrds=[delchrs(i,".?,;:*\"\'()!_") for i in wrds];
 for i in wrds:
  if i not in wc: wc[i]=wrds.count(i)/float(len(wrds));
 return wc;


g=open("txtfiles.txt","w");
f=[];
for root, dirs, files in os.walk("c:\\"):
 for name in files:
  if name[-4:]==".txt" and root.find("gutenberg")>-1:  # used a corpus of project gutenberg novels to test.
   f+=[join(root,name)];
   g.write(f[-1]+"\n");
   print f[-1];
g.close();


# build tf database
tfdict={};
for i in f:
 print "processing "+i+" ...";
 tfdict[i.split("\\")[-1].split("_")[0]]=tf(i);


tfpck=open("tfpickle.pkl","wb");
pickle.dump(tfdict,tfpck);
tfpck.close();


# collect all dict key lists
all_wrds,all_dicts=[],[];
for i in tfdict: 
 all_wrds+=tfdict[i].keys();
 all_dicts+=[tfdict[i].keys()];
all_wrds=list(set(all_wrds));


idf={};
g=open("idf.txt","w");
all_docs=float(len(all_dicts));
for i in all_wrds:
 idf[i]=math.log(all_docs/sum([1 for j in all_dicts if i in j]));
 g.write(i+","+str(idf[i])+"\n");
g.close(); 


idfpck=open("idfpickle.pkl","wb");
pickle.dump(idf,idfpck);
idfpck.close();


tfpck=open("tfpickle.pkl","rb");
x=pickle.load(tfpck);
tfpck.close();


idfpck=open("idfpickle.pkl","rb");
y=pickle.load(idfpck);
idfpck.close();





idf.py

# 2nd step of the process - constructing the index
import os, sys, math;
import pickle;
from os.path import join;


# read the tf and idf objects
# tff contains the tf dictionaries for each file as index
tfpck=open("tfpickle.pkl","rb");
tff=pickle.load(tfpck);
tfpck.close();


# idfs contains the idf dictionary for each unique word in the corpus
idfpck=open("idfpickle.pkl","rb");
idfs=pickle.load(idfpck);
idfpck.close();


tfidf={};
for i in tff: 
 r={};
 for j in tff[i]: 
  r[j]=tff[i][j]*idfs[j];
 tfidf[i]=r;


g=open("imp-words.txt","w");
# print 
dcts={};
for i in tff:
 dct={};
 x=tfidf[i].values();
 y=tfidf[i].keys();
 for j in zip(x,y): 
  if j[0] in dct: dct[j[0]]+=[j[1]];
  else: dct[j[0]]=[j[1]];
 x=list(set(x));
 x.sort();
 x.reverse();
 g.write("=> "+i+"\n");
 #print "=> ",i;
 for j in x[:25]:
  #print "   ",j,dct[j];
  g.write("    "+str(round(j,3))+" "+str(dct[j])+"\n");
 dcts[i]=dct;
g.close();

Saturday, December 3, 2011

The Grand Vision

Medusa's Cave is a blog where we discuss some Internet services and attempt to build very simple prototypes of some of them. In most cases our implementation will be in python 2.x. Knowledge of python is assumed, though we will try to walk through most important details of every implementation. The current goal (we reserve the right to change this at any time) is to cover the following:
  1. hyper-link checker, robust hyperlinks ... to web-crawler
  2. desktop search, discussion of web-search engines
  3. google news
  4. genetic algorithms to solve puzzles
  5. document auto-summarizer
  6. twitter sentiment analysis
  7. ... and other ideas that might occur to us along the way.
Disclaimers:
  1. All code on this website is provided as-is (copyright the author of the Medusa's Cave blog), with no warranty whatsoever - implied or stated. Use strictly at your own risk. If you do use it, please cite the source (the Medusa's Cave blog).
  2. This blog is something the author does for fun. No claims are made that the implementations discussed are the most efficient or the best way of doing things. The goal here is to learn, develop some ideas, and have some fun doing it.

A simple Google News - Lite Implementation

We describe here how a service like Google News might be built, and provide a simple prototype implementation of the same. Google News was apparently invented by Krishna Bharat at Google.


At its simplest, it appears to work by collecting stories from various news sources, then categorizing them by content similarity (first broadly, based on a set of standard categories like Business, Health, Politics etc., and next by collating links for the same story from different sources under a single story heading), then presenting these categories in an HTML interface for the user. In the attached code snippet, we perform the collection of news stories indexed by date, and discuss how the notion of TF-IDF could be used for content categorization.


While building this, my first inclination was to perform the categorization (which may be viewed as simply computing the "closeness" of two news articles based on keywords within them) using the content of the URLs themselves, but URLs do not provide enough information to perform this operation effectively. This is perhaps best done by "deep filtering", i.e. fetching the articles behind the various collected URLs, then putting two URLs into the same story category if they are relatively close. We do not implement the TF-IDF clustering mechanism in the code below, nor do we implement the details of an HTML interface. The former is implemented as an example elsewhere (in the desktop search prototype project), while the latter is a lot of work not directly relevant to the mechanics of the current project and is therefore left out for now (we might revisit this later).
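For the record, the "deep filtering" grouping could be sketched as follows: vectorize each fetched article (say with tf-idf weights), then greedily place two URLs in the same story group when their vectors are similar enough. This is a hypothetical sketch (Python 3) and not part of the code below:

```python
def group_stories(vectors, threshold=0.5):
    # vectors: {url: {word: tfidf}}. Greedily assign each story to the
    # first cluster whose seed story is close enough, else start a new one.
    def sim(u, v):
        # Cosine similarity between two sparse vectors.
        common = set(u) & set(v)
        dot = sum(u[w] * v[w] for w in common)
        nu = sum(x * x for x in u.values()) ** 0.5
        nv = sum(x * x for x in v.values()) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0

    clusters = []  # list of (seed_vector, [urls])
    for url, vec in vectors.items():
        for seed, members in clusters:
            if sim(seed, vec) >= threshold:
                members.append(url)
                break
        else:
            clusters.append((vec, [url]))
    return [members for _, members in clusters]
```

Each resulting group would become one story heading with links from the different sources beneath it.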


The code we use for the application, and the generated news.html file are attached to this post. These will be updated to include more comments and bells and whistles over time...


One quibble with the attached code might be the relatively small number of news stories mined. That is partly due to the stringency of the filtering criteria used (because this is a working prototype, we take some liberties with the implementation). Code follows (with sample output below)... enjoy!

import os, sys, urllib;
from datetime import *;


def locStr(x,fstr):
 r=[];
 n=x.count(fstr);
 st=0;
 while len(r)<n:
  v=x[st:].index(fstr);
  r+=[st+v];
  st+=v+len(fstr);
 return r;


def incStr(x,L,M):
 r=[];
 for i in L:
  if x.find(i)==-1: r+=[0];
  else: r+=[1];
 if M=="ANY" and sum(r)>0: return True;
 if M=="ANY" and sum(r)==0: return False;
 if M=="ALL" and sum(r)==len(L): return True;
 if M=="ALL" and sum(r)<len(L): return False;


def str0(s): 
 if s<10: return '0'+str(s);
 return str(s);


nsrcs=["nytimes","washingtonpost","latimes","cnn"];
inc=["a href=","http://"];
exc=["()","ad.","advert","facebook","twitter","javascript","click","blog","video","photo","img"];
cats=["business","sports","nation","world","politics","local","opinion","obituaries","arts","dining","health","entertainment","lifestyle"];


dctx,cx={},{};
dt=date.today();
dtstr1=str(dt.year)+"/"+str0(dt.month)+"/"+str0(dt.day);
dtstr2=str(dt.year)+str0(dt.month)+str0(dt.day);
for j in cats: cx[j]=[];


def clnUrls2(x):
 r=[];
 for i in x: 
  if incStr(i,inc,"ALL"): 
   t1=locStr(i,"a href=");
   t2=locStr(i,"</a>");
   for j in zip(t1,t2):
    s=i[j[0]+len("a href=")+1:j[1]];
    if len(s)>0 and incStr(s,["http://www."],"ALL") and incStr(s,nsrcs,"ANY") and not incStr(s,exc,"ANY") and (s.find(dtstr1)>-1 or s.find(dtstr2)>-1): 
     s1,s2=s.index("\""),s.index(">");
     str1,str2=s[:s1],s[s2+1:];
     dctx[str2]=str1;
     r+=[s];


 for i in dctx:  # collect stories by category
  for j in cats:
   if dctx[i].find(j)>-1: cx[j]+=[i];
 return r;




fc,us=[],[];


for i in nsrcs: 
 nlnk="http://www."+i+".com";
 f=urllib.urlopen(nlnk);
 fc+=f.readlines();
 f.close();


us=clnUrls2(fc);
# in us, move forward on each URL until you hit a quote - this is hyperlink
# move backwards in URL until you hit a ">" - this is the note
# save in dctx[note]=hyperlink, or even better, dctx[category]=hyperlink


g=open("urls.txt","w");
g.write("Collected Story URLs:\n")
for i in us: g.write(i+"\n");
g.write("\n\n");


g.write("Stories with URLs by Category\n");
for i in cx: 
 g.write(i+"\n");
 for j in cx[i]: 
  g.write(j+"\n");
  g.write("=> "+dctx[j]+"\n");
 g.write("\n");
g.close();


h=open("news.html","w");
h.write("Today's News:<br><br>\n\n");
for i in cx: 
 h.write(i+"<br><br>\n");
 for j in cx[i]: 
  h.write("<a href='"+dctx[j]+"'>"+j+"</a><br>\n");
  h.write("\n");
 h.write("\n<br>");
h.close();

----------Sample URLs.txt file contents:-----------------------------------------------------------------
Collected Story URLs:
http://www.nytimes.com/slideshow/2011/12/01/us/20111202-canine-ss.html"><span class="icon slideshow">Slide Show</span>: The Dogs of War
http://www.nytimes.com/2011/12/01/opinion/kristof-a-banker-speaks-with-regret.html?hp">Kristof: A Banker Speaks
http://www.nytimes.com/2011/12/01/opinion/gail-collins-mitt-romney-pardon.html?hp">Collins: Romney Pardon
http://www.nytimes.com/2011/12/01/opinion/high-stakes-little-time.html">Editorial: Extend Benefits
http://www.nytimes.com/2011/12/01/opinion/to-understand-china-look-behind-its-laws.html">Op-Ed: China’s Laws
http://www.nytimes.com/2011/12/01/garden/the-holiday-gimme-guide.html">The Holiday Gimme Guide
http://www.nytimes.com/2011/12/01/us/florida-am-university-students-death-turns-spotlight-on-hazing.html">Student&rsquo;s Death Turns Spotlight on Hazing
http://www.nytimes.com/2011/12/01/arts/music/a-review-of-the-metropolitan-operas-faust.html?ref=arts">This Faust Builds Atom Bombs (He Still Sings)
http://www.nytimes.com/2011/12/01/opinion/a-decade-of-progress-on-aids.html">Bono: A Decade of Progress on AIDS
http://www.nytimes.com/2011/12/01/fashion/at-art-basel-miami-beach-insiders-meet-newcomers-scene-city.html">Insiders Meet Art Basel Miami Beach Newcomers
http://www.nytimes.com/2011/12/01/arts/design/san-francisco-museum-of-modern-art-expansion-aims-for-friendly.html?ref=arts">An Imposing Museum Turns Warm and Fuzzy
http://www.nytimes.com/2011/12/01/greathomesanddestinations/in-tasmania-a-place-to-watch-nature-tv.html">In Tasmania, a Place to Watch &lsquo;Nature TV&rsquo;
http://www.nytimes.com/2011/12/01/garden/sprucing-up-your-front-door-for-the-season.html">The Pragmatist: Spruced Up for the Season
http://www.nytimes.com/2011/12/01/fashion/jeremy-scott-fashions-last-rebel.html">Jeremy Scott, Fashion&rsquo;s Last Rebel
http://www.washingtonpost.com/business/economy/boeing-union-reach-tentative-deal-to-end-labor-dispute/2011/12/01/gIQAc7HUHO_story.html">Boeing
http://www.washingtonpost.com/local/national-christmas-tree-lights-up/2011/12/01/gIQAJSI5HO_gallery.html">National Christmas Tree lights up
http://www.washingtonpost.com/lifestyle/wellness/arsenic-fears-aside-apple-juice-can-pose-a-health-threat-_-from-calories-nutritionists-say/2011/12/01/gIQAelLpHO_story.html?tid=pm_pop">Arsenic fears aside, apple juice can pose a health threat _ from calories, nutriti
http://www.washingtonpost.com/local/obituaries/judy-lewis-daughter-of-loretta-young-and-clark-gable-dies-at-76/2011/12/01/gIQAe85sHO_story.html?tid=pm_pop">Judy Lewis, daughter of Loretta Young and Clark Gable, dies at 76
http://www.latimes.com/news/politics/la-pn-cain-interview-20111201,0,5753490.story" target="_top" title="Cain says wife didn't know about payments to White">Cain says wife didn't know about payments to White
http://www.latimes.com/entertainment/news/la-et-dane-cook-20111201,0,2575521.story">Dane Cook heads off road
http://www.latimes.com/news/politics/la-pn-bono-congress-20111201,0,7706900.story" target="_top" title="Bono charms lawmakers in push for AIDS funding">Bono charms lawmakers in push for AIDS funding
http://www.latimes.com/news/politics/la-pn-perry-leno-ad-20111201,0,1625791.story" target="_top" title="Rick Perry ad: Self-mockery, Iowans, and what was the 3rd thing?">Rick Perry ad: Self-mockery, Iowans, and what was the 3rd thing?
http://www.latimes.com/news/politics/la-pn-cain-interview-20111201,0,5753490.story" target="_top" title="Herman Cain says wife didn't know about payments to Ginger White">Herman Cain says wife didn't know about payments to Ginger White
http://www.latimes.com/news/politics/la-pn-campaign-finance-20111201,0,1953462.story" target="_top" title="House votes to end public funding of presidential campaigns">House votes to end public funding of presidential campaigns
http://www.latimes.com/news/politics/la-pn-gingrich-campaign-20111201,0,6923880.story" target="_top" title="Newt Gingrich comeback surprises even the candidate himself">Newt Gingrich comeback surprises even the candidate himself
http://www.latimes.com/news/nationworld/world/la-fg-iran-britain-embassy-20111201,0,2833016.story?track=rss" target="_top">Britain shuts its Tehran embassy, expels Iran's diplomats
http://www.latimes.com/news/nationworld/world/la-fg-myanmar-clinton-20111201,0,7597719.story?track=rss" target="_top">Landmark Clinton visit to Myanmar includes a weapons concern
http://www.latimes.com/health/boostershots/la-heb-artificial-pancreas-fda-20111201,0,810564.story?track=rss" target="_top">Diabetes: FDA provides guidance on artificial pancreas
http://www.latimes.com/health/boostershots/la-heb-world-aids-day-roundup-20111201,0,7630943.story?track=rss" target="_top">On World AIDS Day, groups cheer progress and call for more focus
http://www.latimes.com/entertainment/news/books/la-et-book-night-eternal-box-20111201,0,1165802.story?track=rss" target="_top">'The Night Eternal' info
http://www.latimes.com/news/local/la-me-winds-20111201,0,3189154.story?track=rss" target="_top">Powerful Santa Ana winds sweep across Southland
http://www.latimes.com/news/nationworld/nation/la-na-airports-20111201,0,4817588.story?track=rss" target="_top">FAA promises changes to prevent tarmac delays
http://www.latimes.com/news/nationworld/nation/la-na-nlrb-board-20111201,0,4182874.story?track=rss" target="_top">NLRB, split along party lines, may be put out of work
http://www.latimes.com/news/nationworld/nation/la-na-blagojevich-20111201,0,7679329.story?track=rss" target="_top">Prosecutors seek stiff sentence for Blagojevich
http://www.latimes.com/news/politics/la-pn-bono-congress-20111201,0,7706900.story?track=rss" target="_top">Bono charms lawmakers in push for AIDS programs funding
http://www.latimes.com/news/politics/la-pn-cain-interview-20111201,0,5753490.story?track=rss" target="_top">Herman Cain says wife didn't know about payments to Ginger White
http://www.latimes.com/news/obituaries/la-me-passings-20111201,0,2646776.story?track=rss" target="_top">PASSINGS: Chester McGlockton, Ray Elder, Ante Markovic
http://www.latimes.com/news/obituaries/la-me-judy-lewis-20111201,0,213916.story?track=rss" target="_top">Judy Lewis dies at 76; daughter of stars Loretta Young and Clark Gable
http://www.cnn.com/2011/12/01/world/meast/egypt-muslim-brotherhood/index.html?hpt=hp_t2">Is Muslim Brotherhood's time here? 
http://www.cnn.com/2011/12/01/election/2012/cain-accusation-affair/index.html">Cain: Wife didn't know about latest accuser
http://www.cnn.com/2011/12/01/us/tennessee-crashes/index.html">1 dead in Tenn. crashes involving 176 cars
http://www.cnn.com/2011/12/01/travel/american-airlines-meal-lawsuit/index.html">Family: In-flight meal killed flier
http://www.cnn.com/2011/12/01/us/florida-suspected-hazing/index.html">911 tape reveals efforts to save drum major


Stories with URLs by Category
arts
An Imposing Museum Turns Warm and Fuzzy
=> http://www.nytimes.com/2011/12/01/arts/design/san-francisco-museum-of-modern-art-expansion-aims-for-friendly.html?ref=arts
This Faust Builds Atom Bombs (He Still Sings)
=> http://www.nytimes.com/2011/12/01/arts/music/a-review-of-the-metropolitan-operas-faust.html?ref=arts

lifestyle
Arsenic fears aside, apple juice can pose a health threat _ from calories, nutriti
=> http://www.washingtonpost.com/lifestyle/wellness/arsenic-fears-aside-apple-juice-can-pose-a-health-threat-_-from-calories-nutritionists-say/2011/12/01/gIQAelLpHO_story.html?tid=pm_pop

business
Boeing
=> http://www.washingtonpost.com/business/economy/boeing-union-reach-tentative-deal-to-end-labor-dispute/2011/12/01/gIQAc7HUHO_story.html

entertainment
Dane Cook heads off road
=> http://www.latimes.com/entertainment/news/la-et-dane-cook-20111201,0,2575521.story
'The Night Eternal' info
=> http://www.latimes.com/entertainment/news/books/la-et-book-night-eternal-box-20111201,0,1165802.story?track=rss

opinion
Kristof: A Banker Speaks
=> http://www.nytimes.com/2011/12/01/opinion/kristof-a-banker-speaks-with-regret.html?hp
Collins: Romney Pardon
=> http://www.nytimes.com/2011/12/01/opinion/gail-collins-mitt-romney-pardon.html?hp
Bono: A Decade of Progress on AIDS
=> http://www.nytimes.com/2011/12/01/opinion/a-decade-of-progress-on-aids.html
Editorial: Extend Benefits
=> http://www.nytimes.com/2011/12/01/opinion/high-stakes-little-time.html
Op-Ed: China’s Laws
=> http://www.nytimes.com/2011/12/01/opinion/to-understand-china-look-behind-its-laws.html

nation
Prosecutors seek stiff sentence for Blagojevich
=> http://www.latimes.com/news/nationworld/nation/la-na-blagojevich-20111201,0,7679329.story?track=rss
FAA promises changes to prevent tarmac delays
=> http://www.latimes.com/news/nationworld/nation/la-na-airports-20111201,0,4817588.story?track=rss
NLRB, split along party lines, may be put out of work
=> http://www.latimes.com/news/nationworld/nation/la-na-nlrb-board-20111201,0,4182874.story?track=rss
Britain shuts its Tehran embassy, expels Iran's diplomats
=> http://www.latimes.com/news/nationworld/world/la-fg-iran-britain-embassy-20111201,0,2833016.story?track=rss
In Tasmania, a Place to Watch ‘Nature TV’
=> http://www.nytimes.com/2011/12/01/greathomesanddestinations/in-tasmania-a-place-to-watch-nature-tv.html
National Christmas Tree lights up
=> http://www.washingtonpost.com/local/national-christmas-tree-lights-up/2011/12/01/gIQAJSI5HO_gallery.html
Landmark Clinton visit to Myanmar includes a weapons concern
=> http://www.latimes.com/news/nationworld/world/la-fg-myanmar-clinton-20111201,0,7597719.story?track=rss

dining

obituaries
Judy Lewis dies at 76; daughter of stars Loretta Young and Clark Gable
=> http://www.latimes.com/news/obituaries/la-me-judy-lewis-20111201,0,213916.story?track=rss
Judy Lewis, daughter of Loretta Young and Clark Gable, dies at 76
=> http://www.washingtonpost.com/local/obituaries/judy-lewis-daughter-of-loretta-young-and-clark-gable-dies-at-76/2011/12/01/gIQAe85sHO_story.html?tid=pm_pop
PASSINGS: Chester McGlockton, Ray Elder, Ante Markovic
=> http://www.latimes.com/news/obituaries/la-me-passings-20111201,0,2646776.story?track=rss

health
Arsenic fears aside, apple juice can pose a health threat _ from calories, nutriti
=> http://www.washingtonpost.com/lifestyle/wellness/arsenic-fears-aside-apple-juice-can-pose-a-health-threat-_-from-calories-nutritionists-say/2011/12/01/gIQAelLpHO_story.html?tid=pm_pop
On World AIDS Day, groups cheer progress and call for more focus
=> http://www.latimes.com/health/boostershots/la-heb-world-aids-day-roundup-20111201,0,7630943.story?track=rss
Diabetes: FDA provides guidance on artificial pancreas
=> http://www.latimes.com/health/boostershots/la-heb-artificial-pancreas-fda-20111201,0,810564.story?track=rss

world
Prosecutors seek stiff sentence for Blagojevich
=> http://www.latimes.com/news/nationworld/nation/la-na-blagojevich-20111201,0,7679329.story?track=rss
On World AIDS Day, groups cheer progress and call for more focus
=> http://www.latimes.com/health/boostershots/la-heb-world-aids-day-roundup-20111201,0,7630943.story?track=rss
FAA promises changes to prevent tarmac delays
=> http://www.latimes.com/news/nationworld/nation/la-na-airports-20111201,0,4817588.story?track=rss
NLRB, split along party lines, may be put out of work
=> http://www.latimes.com/news/nationworld/nation/la-na-nlrb-board-20111201,0,4182874.story?track=rss
Britain shuts its Tehran embassy, expels Iran's diplomats
=> http://www.latimes.com/news/nationworld/world/la-fg-iran-britain-embassy-20111201,0,2833016.story?track=rss
Landmark Clinton visit to Myanmar includes a weapons concern
=> http://www.latimes.com/news/nationworld/world/la-fg-myanmar-clinton-20111201,0,7597719.story?track=rss
Is Muslim Brotherhood's time here? 
=> http://www.cnn.com/2011/12/01/world/meast/egypt-muslim-brotherhood/index.html?hpt=hp_t2

politics
Rick Perry ad: Self-mockery, Iowans, and what was the 3rd thing?
=> http://www.latimes.com/news/politics/la-pn-perry-leno-ad-20111201,0,1625791.story
Newt Gingrich comeback surprises even the candidate himself
=> http://www.latimes.com/news/politics/la-pn-gingrich-campaign-20111201,0,6923880.story
Cain says wife didn't know about payments to White
=> http://www.latimes.com/news/politics/la-pn-cain-interview-20111201,0,5753490.story
Bono charms lawmakers in push for AIDS programs funding
=> http://www.latimes.com/news/politics/la-pn-bono-congress-20111201,0,7706900.story?track=rss
Bono charms lawmakers in push for AIDS funding
=> http://www.latimes.com/news/politics/la-pn-bono-congress-20111201,0,7706900.story
Herman Cain says wife didn't know about payments to Ginger White
=> http://www.latimes.com/news/politics/la-pn-cain-interview-20111201,0,5753490.story?track=rss
House votes to end public funding of presidential campaigns
=> http://www.latimes.com/news/politics/la-pn-campaign-finance-20111201,0,1953462.story

sports

local
Powerful Santa Ana winds sweep across Southland
=> http://www.latimes.com/news/local/la-me-winds-20111201,0,3189154.story?track=rss
Judy Lewis, daughter of Loretta Young and Clark Gable, dies at 76
=> http://www.washingtonpost.com/local/obituaries/judy-lewis-daughter-of-loretta-young-and-clark-gable-dies-at-76/2011/12/01/gIQAe85sHO_story.html?tid=pm_pop
National Christmas Tree lights up
=> http://www.washingtonpost.com/local/national-christmas-tree-lights-up/2011/12/01/gIQAJSI5HO_gallery.html


----------End of Sample URLs.txt file contents-----------------------------------------------------------

Image of the generated sample news summary html page:


Books I like

This list is a work in progress...  Might add brief reviews later...

Interview Questions:
Heard On The Street (Timothy Falcon Crack)
A Practical Guide to Quant Interviews (Xinfeng Zhou)


Technical:
The Art of Computer Programming by Donald E. Knuth



Non-Fiction:
The Black Swan by Nassim Taleb
The Man Who Knew Infinity by Robert Kanigel
The Innovator's Dilemma by Clayton Christensen

A Brief History of Time by Stephen Hawking (very difficult, though interesting, read)
Gödel, Escher, Bach and Metamagical Themas by Douglas Hofstadter
When Genius Failed by Roger Lowenstein
Five Equations That Changed The World by Michael Guillen
A Demon of Our Own Design by Richard Bookstaber

Fiction: 
Siddhartha, Magister Ludi by Hermann Hesse

The Dune Series by Frank Herbert
Lord of The Rings by J.R.R. Tolkien
Ender's Game by Orson Scott Card
Hyperion Cantos by Dan Simmons
Sacred Games by Vikram Chandra
The Great Indian Novel by Shashi Tharoor
The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle
The Poirot Series by Agatha Christie


Other favorite authors include (books too many to list above): Alastair Reynolds, P.G. Wodehouse, Alistair MacLean, Frederick Forsyth, Sidney Sheldon, ...

Quant Interview Questions

If you're looking for more questions of this kind to build up your quantitative muscle before tackling books like "Heard on the Street", here are a couple of good recommendations:
1. The basic book of quant questions (100 questions - mostly for basic review)
2. The big book of quant questions (1000 challenging questions - this is more advanced)

Have you seen this free app from ChiPrime that tests your basic math even before you try the puzzlers below? Or you can try this web-page (but the app puts all questions in one place at your finger-tips). If you find the below too easy, then try this other ChiPrime app.

New questions posted periodically with answers here.

[click here for 2021 new questions]

NEW!!!! ChiPrime now on Reddit!

NEW!!!! Any of the questions in the medium to hard category from the ChiPrime adaptive math quiz should be fair game at a technical interview - I would expect a competent candidate to solve it in 2 minutes or less.

(c) ChiPrime 2014

Here, I capture some puzzlers I ask/was asked at interviews or by friends:
  1. (warm up) a two digit number ab (a and b are the digits) when multiplied by 6 gives bbb. What is ab?
  2. (warm up) when 17 and 30 are divided by a number N, they leave the same remainder. What is the largest possible value of N?
  3. (warm up) what is the value of sqrt(7+sqrt(48)) - sqrt(3)?
  4. (warm up) what is the value of ln(ln(i))?
  5. What are the values of sqrt(2+sqrt(2+sqrt(2+...))) ad inf? and 2+1/(2+1/(2+...)) ad inf?
  6. Lattice points are points in space all of whose coordinates are integers. How many lattice points are on the surface of a sphere of radius 1? 2? 45? Can you determine how many lattice points exist on the surface of a sphere of radius 10^10?
  7. What is the angle at the center of a circle for which the triangle formed by the two radii and the chord connecting their endpoints on the circle has the largest area?
  8. Given a chess board with a corner square removed and a number of L-shaped pieces composed of 3 squares each, show how you could tile the rest of the board with them. How many such pieces would you need?
  9. *** (cont'd) Given a full chess-board now, can you say which other square you can remove so that tiling the rest of the board is still possible? Can you prove a full tiling is not possible if any of the other squares were to be removed instead?
  10. Write a computer program that, given a number e.g. 2391, and a set of binary operations e.g. (+,-,*,/), returns the set of all unique numbers that can be generated by applying these operations to the digits of the number. Extra points if each number in the returned data structure also points to one or more calculations that result in that number.
  11. (from MIT Technology Review) what is the smallest number such that if you take the digit in the units place and move it to the front moving all the digits back one, you are left with double the original number? i.e. if you start with 132, you get 213, you want the smallest number so this transformation doubles the original number. What is it?
  12. What four points on the surface of a sphere are the furthest apart from each other? eight? three?
  13. Which is greater e**pi or pi**e where e is the base of natural logarithms? You should be able to prove this without pen and paper, using simple calculus.
  14. Given a random number generator that generates random integer values between 1 and 5 (both inclusive), how would you construct a random number generator that can generate integer values between 1 and 7 both inclusive?
  15. Generate all combinations of 3 letters in the English alphabet. Randomly delete 90% of these "words". Words are friends if two of them differ in only one letter. A Social Network of a word x is the list of all its friends, their friends etc. so all connected nodes emanating from x are part of its Social Network. Write a program to compute the complete social network of any random word in the set. What is the size of the social network of the most popular words? the loneliest words?
  16. Given six points in three dimensional space, what is the maximum number of triangles you can form with them as vertices? the minimum?
  17. (from "Heard on the Street") what is the smallest number such that it leaves a remainder of 1 when divided by 2, a remainder of 2 when divided by 3, ... a remainder of 9 when divided by 10?
  18. given a fair coin, how many tosses do you need on average before you get your first tail?
  19. what is the probability that given six throws of a fair die, every face shows up once?
  20. Show how to solve using Monte Carlo simulation: given a stick of length L, what is the probability that if I break it into three pieces, the three pieces can form a triangle?
  21. Walmart sells a simple calendar in which there are two small wooden cubes used to display the date - there is one number on each face. Given two cubes, decide how you would code the digits on the surfaces of the cubes such that using them together you can display all dates in any given month (essentially all numbers 1-31).
  22. given a ruler (straight edge) and a compass with which you can draw circles of any radius you choose, devise a means of dividing a segment of length "a" also given to you, into 7 equal parts.
  23. two trains 60 mi apart, and traveling 40 mph and 60 mph respectively, are headed towards each other on a track. a bird that flies 100 mph leaves the first train for the second, touches the second, then comes back to the first, and repeats this trip till it is crushed between the trains when they crash. how far does the bird travel before it dies?
  24. given an equilateral triangle ABC, and any point P inside it, what is the sum of the lengths of the perpendiculars from P to each of the three sides?
  25. with a throw of a normal pair of dice, the probability of getting numbers between 1 and 12 is not equal. design a pair of dice (i.e. tell what positive integers on each face) so the probability of getting all numbers from 1 to 12 is the same. (Sicherman dice problem)
  26. given a square ABCD and equilateral triangle ABP with P inside the square, what is the measure of angle ACP?
  27. if you regress Y on X (i.e. you end up with Y=alpha1+beta1*X+epsilon1), and then you regress X on Y (yielding X=alpha2+beta2*Y+epsilon2), is there a mathematical relationship between beta1 and beta2? what is it?
  28. what is the angle at the center of a regular tetrahedron?
  29. imagine a Rubik's cube 10x10x10 floating in the air. if the outer layer of cubelets falls off, how many cubelets are on the ground?
  30. how many squares on a chess board? (hint: the answer is not 64)
  31. imagine a large cube that represents a room. At two opposite corners of the cube are a pot of honey and a spider. The spider can only move along the edges of the room-cube. How many moves on average must the spider make before it reaches the honey?
  32. (same setup as last problem). what is the shortest distance the spider needs to travel, if it can move along the surfaces of the cube (i.e. not just along the edges) if it wants to reach the honey?
  33. the interviewer shows you a wall with two holes: one a perfect circle, the second a square (assume whatever dimensions you want for the two). you are asked to present the simplest solid object that can plug both holes perfectly. what is it?
  34. what is special about 1729? how do e, i and pi relate? what are quaternions?
  35. compute the cube root of 8.01 to (say) 3 decimal places. calculate the cube of 33 verbally.
  36. (***) given a compass (to draw circles of any radius you wish), and a straight edge (ruler) with only 3 marks on it for 0, 1 unit, and a units, devise a means of constructing a segment of length a^(3/2). 
  37. (***) four circles each with radius R and each centered on a vertex of a square ABCD of side R intersect inside it to form a curvilinear quadrilateral PQRS - this is the area common to all four circles. what is the area PQRS?
  38. (***) "google bill-board puzzle". what is the first 10-digit prime number you can find in the digits of e?
  39. (***) write a program to generate the first million digits in the square root of 2.
  40. how would you solve the simultaneous equations: x^2+y=7, x+y^2=11?
  41. prove that for any positive real numbers a, b, c:  (a+b+c)*(1/a+1/b+1/c) >= 9.
  42. let x and y be real numbers such that x.sqrt(1-y^2) + y.sqrt(1-x^2)=1. prove that x^2+y^2=1.
  43. given log(2)=0.6931, compute the approximate value of (2.1)^(2.1) verbally.
  44. given a square matrix A, devise a way of computing e^A where e is the transcendental number that forms the base of natural logarithms. Under what conditions is this computation possible?
  45. (goldman sachs) two friends agree to meet at the train station between 1 and 2 pm tomorrow. They also agree that if one of them shows up and the other doesn't, they will wait 15 mins and then leave. What is the probability that the meeting takes place?
  46. (goldman sachs) given a linked list and two pointers, how will you determine if there is a loop in the linked list?
  47. what is the sum of the first 500 digits after the decimal point in the square root of 2? how would you go about computing this value?
  48. given two coins, one which is fair and a second where the probability of heads is 3/4, which coin was the more likely one flipped if you are told you got 6 heads out of 10 flips? 7 heads?
  49. What is a partial fraction? a partial function? a lambda expression? a monad?
  50. (C puzzle) assume the program includes stdio.h. What is the output of the following code? Error? Why?  int main() { char* p="this is a test"; printf("%c",3[p]); return 0;}
  51. Prove the formula for the area of a circle using the fact that the circle is a limiting case of a polygon with infinite vertices.
  52. (python puzzle)
      x=4*[[]] # x is now [[],[],[],[]]
      for i in range(len(x)): x[i]+=[i]
      print x
      # this prints [[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3]]

      now if I do instead:
      x=[[],[],[],[]]
      for i in range(len(x)): x[i]+=[i]
      print x
      # this prints [[0],[1],[2],[3]]
      The question is, why?
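For the last puzzle, the answer comes down to object identity: multiplying a list copies references, not the lists themselves. A minimal demonstration (written in Python 3 syntax, unlike the Python 2 print statements above) makes this visible with id():

```python
# Sketch illustrating why 4*[[]] behaves differently from [[],[],[],[]]:
# list multiplication copies references, not the inner lists themselves.
a = 4 * [[]]
b = [[], [], [], []]

# In `a`, all four slots point at the same underlying list object...
print(len({id(e) for e in a}))  # 1 distinct object
# ...while in `b` each slot is an independent list.
print(len({id(e) for e in b}))  # 4 distinct objects

for i in range(len(a)):
    a[i] += [i]   # += mutates the single shared list in place, four times
for i in range(len(b)):
    b[i] += [i]   # each += mutates a different list

print(a)  # [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
print(b)  # [[0], [1], [2], [3]]
```

Since `+=` on a list is an in-place mutation (unlike `x[i] = x[i] + [i]`, which would rebind the slot to a new list), every iteration over `a` appends to the one shared list, and all four slots show the accumulated result.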
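Question 20 above asks for a Monte Carlo solution to the broken-stick problem; here is a minimal sketch (the function name triangle_probability is my own, and it assumes the two break points are chosen independently and uniformly along the stick):

```python
import random

def triangle_probability(trials=100_000, length=1.0):
    """Estimate the probability that two uniform random cuts
    split a stick into three pieces that can form a triangle."""
    hits = 0
    for _ in range(trials):
        # Two independent uniform cut points, sorted so x <= y.
        x, y = sorted((random.uniform(0, length), random.uniform(0, length)))
        a, b, c = x, y - x, length - y
        # Triangle inequality: each piece must be shorter than the sum of
        # the other two, i.e. shorter than half the stick.
        if a < length / 2 and b < length / 2 and c < length / 2:
            hits += 1
    return hits / trials

print(triangle_probability())  # converges to about 0.25
```

The half-stick test is just the triangle inequality restated: a < b + c is equivalent to a < length/2 since a + b + c = length. The simulation agrees with the analytic answer of 1/4.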