Wednesday, June 20, 2012

How (not) to conduct a Case Interview

There are several books that advise students and busy professionals on how to prepare for a Case Interview; an excellent one is Case In Point. However, having both conducted Case Interviews and sat through them as a candidate many times over the years, I note that many interviewers lack skill in conducting them. This is unfortunate, often embarrassing, and frequently a disservice to the organization the interviewer works for. A poorly conducted case interview can hurt an otherwise worthy candidate's career prospects and cost the company an attractive hire. If a Case Interviewer does a bad job, it is unlikely she will let the candidate advance even if he performed well given the interview's constraints; the interviewer's ego will simply not permit it. And a candidate's protests are worthless, since he will be viewed as incompetent and disgruntled for not having advanced.

Of course, the most typical Case Interview setting is still one where a prepared interviewer meets a not-as-well-prepared candidate; both sides arriving prepared for the encounter is the exception rather than the rule.


What happens when an unprepared Case Interviewer meets an unprepared Case Interviewee? This falls into the realm of questions like, "when a tree falls in a forest with no one there to hear it, does it make a sound?" And we ignore it in this post.


So what are some key "rules of thumb" for Case Interviewers? We present a brief, non-exhaustive list below:



  1. Do your homework. Know the case question you are asking. Know it well. Know the context/setting, the problem, possible solutions, and paths to get there. Most importantly, know what "a correct answer" looks like.
  2. Know what direction(s) you are willing to let the candidate take, and guide him in those directions. Do not guide haphazardly, pushing one way and then another, simply because you yourself do not know how one should proceed from start to finish.
  3. Scope the problem beforehand. You can have multiple objectives, but you must decouple them and ask the candidate to solve each piece before going on to the next. Do not cut back and forth between multiple questions; this confuses both you and the candidate.
  4. Know what questions you expect the candidate to ask, what data you want to present them when they do, and how you want to respond if they ask questions you did not anticipate.
  5. Be reasonable with the case problems you make up. There has to be a way to get from your question or premise to an acceptable answer. Well-prepared candidates can crack reasonable questions in half the time you think needs to be allotted. By the same token, if the question is poorly defined, or based on some vague idea in your mind, no candidate will be able to finish it in time.
  6. Know which assumptions make sense and which ones you are willing to accept. If a prepared candidate asks, "I am assuming there are 300M people in the US and that the average life-span is around 80 years; is that reasonable?", feel free to say no and ask for clarifications. But once you say yes and the candidate does all the math based on these assumptions, do not backtrack 15 minutes into the problem and tell him the base assumptions are incorrect.
  7. Last, but most important, do not penalize the candidate for your lack of preparation. If the "answer" the candidate comes up with, based on assumptions you approved earlier, does not look right, gently ask him to redo the calculations with more reasonable assumptions, and acknowledge that the assumptions you approved were off.
The above are only some ideas on how prepared interviewers might conduct a Case Interview. Note, however, that there are exceptions to every rule. An interviewer is justified in breaking one or more of the above rules if she is conducting a deliberately "confrontational" or "hostile" interview to test the candidate's mettle in a final round; such interviews are designed to see how the candidate handles pressure, uncertainty and difficult situations. But when such a scenario arises accidentally, out of lack of preparation rather than by design, it is unforgivable. Good interviewees spend many weeks of effort mastering the Case Interview technique. Hurting someone's career potential because you are unprepared is criminal. Do not conduct Case Interviews unprepared.



Reverse Inboxes

In this post, the author examines his social network by analyzing his email. Email headers, when both the sent and inbox folders are considered together, give a strong indication of who sends email in and to whom one sends email out. Taken together with the number of times a particular unique identifier appears in the To:, CC: or BCC: fields of the messages (entries in these three fields could be weighted differently), this gives a pretty good indication of the strength of one's professional and personal relationships with other people.

We can fine-tune the mechanical classification described in the previous paragraph a little further by performing sentiment analysis or other text mining on the message bodies to determine (a) whether the content is of a personal or a professional nature, and (b) whether the message is positive, negative or neutral in terms of expressed sentiment. These signals together can be used to construct a network, or graph, of the relationships between the various people in one's communications sphere.
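As a toy illustration of point (b), one could score each message against small hand-built word lists; the lists below are invented purely for this sketch, and anything serious would call for a proper sentiment model or a much larger lexicon.

# A deliberately tiny, illustrative sentiment scorer. The word lists are
# made up for this sketch; a real system would use a proper sentiment model.
POSITIVE = set(["thanks", "great", "congratulations", "appreciate", "happy"])
NEGATIVE = set(["sorry", "problem", "unfortunately", "delay", "complaint"])

def sentiment(text):
    words = text.lower().split()
    score = sum(1 for w in words if w in POSITIVE) - sum(1 for w in words if w in NEGATIVE)
    if score > 0: return "positive"
    if score < 0: return "negative"
    return "neutral"

print(sentiment("Thanks so much, this is great news"))   # positive
print(sentiment("Unfortunately we hit a problem"))        # negative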

One can, if one performs this exercise with various people's email boxes, also determine if one person's network is more dense and more completely connected than another's and whether the span or expanse of one person's social network is greater than another's. This might also permit us to differentiate between people's personal and professional networks, and the set of people (or nodes) that straddle the two domains.
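As a rough sketch of such a comparison (not part of the analysis described above), suppose weighted edge lists of the form (person, contact, weight) have already been extracted from two mailboxes; a graph library such as networkx can then be used to compare the density and reach of the two networks. The edge lists below are invented for illustration.

import networkx as nx

def build_network(edges):
    # edges: iterable of (person, contact, weight) tuples extracted from a mailbox
    g = nx.Graph()
    for person, contact, weight in edges:
        g.add_edge(person, contact, weight=weight)
    return g

# hypothetical, hand-made edge lists for two people, purely for illustration
edges_a = [("a", "x", 5), ("a", "y", 2), ("a", "z", 1), ("x", "y", 1)]
edges_b = [("b", "p", 7), ("b", "q", 1), ("b", "r", 1)]

g_a, g_b = build_network(edges_a), build_network(edges_b)
print("density: %.2f vs %.2f" % (nx.density(g_a), nx.density(g_b)))
print("reach:   %d nodes vs %d nodes" % (g_a.number_of_nodes(), g_b.number_of_nodes()))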

We posit that it is possible to have a nodding acquaintance with lots of people while knowing only some people really well (it is hard to know many people really well), and this is something we can check by performing this "reverse inbox analysis" on various people's inboxes.

To perform this analysis while maintaining the author's privacy, rather than present the email addresses of individual users from the author's mailbox, we use an MD5 hash of each email address concatenated with a fixed random string (a salt). The hash is deterministic, so it serves as a consistent per-address marker, while the identity embedded in the email address itself is not visible.
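A minimal sketch of this anonymization step might look like the following; the salt value shown is only a placeholder and would be replaced by a fixed random string kept out of any published code.

import hashlib

SALT = "replace-with-a-fixed-random-string"  # placeholder; keep the real value fixed and private

def anonymize(address):
    # deterministic, opaque marker for an email address
    digest_input = (address.strip().lower() + SALT).encode("utf-8")
    return hashlib.md5(digest_input).hexdigest()

print(anonymize("someone@example.com"))  # the same address always maps to the same hash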

If one were to compute the reverse-inbox-based social networks for different users and then connect them together, a fairly complete view of the organization's or community's social network as a whole would emerge. We discuss this in a related post on "Mining your Social Network". We content ourselves with providing a simple reverse inbox implementation in this post.

[code to follow shortly]
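In the meantime, here is a minimal sketch of what such an implementation might look like, assuming the mail is available locally as a Unix mbox file; the file path and the per-field weights below are arbitrary choices made for this sketch, not part of the original post.

import mailbox
from email.utils import getaddresses

MBOX_PATH = "all_mail.mbox"              # hypothetical path to a local mbox export
WEIGHTS = {"to": 3, "cc": 2, "bcc": 1}   # arbitrary weights for the three header fields

scores = {}                              # address -> weighted count across all messages
for msg in mailbox.mbox(MBOX_PATH):
    for field, weight in WEIGHTS.items():
        # get_all returns the raw header values; getaddresses parses out (name, address) pairs
        for _, addr in getaddresses(msg.get_all(field, [])):
            if addr:
                addr = addr.lower()
                scores[addr] = scores.get(addr, 0) + weight

# strongest relationships first; addresses could be passed through the hash above before printing
for addr, count in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:20]:
    print("%6d  %s" % (count, addr))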

Saturday, June 2, 2012

The Itsy Bitsy Spider

In nature, spiders are beautiful things. Yes, particular specimens, with their long spindly legs and abrupt quick movements, can seem creepy. But viewed in the abstract, they do some remarkable things; for one, the silk they spin for their webs is often said to be, weight for weight, stronger and tougher than steel. In this blog post, however, we talk about a different kind of spider altogether.

The WWW is a mesh, or web, of hyperlinks. What better creature to crawl this abstract space and extract information from this structure of data and knowledge than a spider? Indeed, programs that perform this function are called spiders. In this post, we present a simple Python spider program that crawls the web (a snake and a spider in one sentence ...).

Let us look at some applications of this kind of a program.

  1. Let us say you are trying to scrape all the content that hangs off a single web-page across a set of hyperlinks. One way of gathering this content would be to write a program that iteratively processes each link on the page and collects the data it reads, organizing it into separate files. Freeware programs like HTTrack, the website copier, do just this. But this is a severely limited spider, since it is restricted to links that either (a) hang directly off the specified web-page, or (b) are hosted by a particular domain or sub-domain of the internet, e.g. www.my-company.com.
  2. The more exotic application: document retrieval for web search. For Google to be able to respond to search queries, it must first build an indexed data set of documents from the web. To do this, Google unleashes multiple multi-threaded spiders onto the web, which collect all documents it makes sense to collect (using some intelligent criteria), following links from one document to the next, until a large part of the web has been gathered. Google then indexes these documents to construct a data set that can be rapidly queried to answer user searches.
  3. Avinash Kaushik, in his excellent books, distinguishes between internal search (as in, within a company's intranet, for employees and other internal users) and external search applications, with a focus on web analytics. These two perhaps do merit separate discussion even in this simple classification, which is why we include this bullet here. However, the mechanics for (3) are very similar to those for (2) from a web-crawler perspective, with some of the limitations from (1) imposed on the spiders in question.
  4. Desktop search: here the spider crawls through the directory tree of one's file system instead of the Internet, but the operational principle is the same (a small sketch of this appears right after this list).
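As a quick, illustrative sketch of (4), the snippet below "crawls" a directory tree starting from a path given on the command line and simply records the files it finds; a real desktop search tool would of course also read and index the contents of each file.

import os, sys

root = sys.argv[1]            # the directory to start crawling from
found = []                    # the "crawled" set of file paths
for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        found.append(os.path.join(dirpath, name))

print("crawled %d files under %s" % (len(found), root))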
We focus our implementation on a very simple instance of (2) above. There is etiquette one needs to follow when building spiders (e.g. the robots exclusion protocol), but since we are not building an industrial-strength implementation, just one for fun, we gloss over these details and "hobble" our implementation, constraining it not to explore more than a pre-set maximum number of links before it halts.
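For completeness, here is a small sketch of how a better-behaved spider might consult a site's robots exclusion file before fetching a url, using the robotparser module that ships with Python 2 (the same class lives in urllib.robotparser in Python 3); the site shown is just an example.

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # the site's robots exclusion file
rp.read()                                        # fetch and parse it

url = "http://www.example.com/some/page.html"
if rp.can_fetch("*", url):                       # "*" stands for any user agent
    print("allowed to fetch " + url)
else:
    print("robots.txt asks us not to fetch " + url)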

Also, for the layperson reading this: while it is somewhat glamorous to visualize this process as the spider crawling the WWW, what really happens, of course, is that the spider program sits on the host computer running it; it simply downloads HTML documents from the specified starting site, follows the hyperlinks it finds wherever they may lead, downloads those documents in turn, and so on.

A comparison of how many web-pages were indexed by the various web search engines shows that, up to a point, increasing the index size improves search quality. We say "up to a point" because a web search engine called Cuil (2008-2010) at one time indexed slightly over 120B web-pages, but its inability to produce search results of comparable quality to the other leading search engines led to its early demise. To be a popular web search destination, it is not enough to index as many meaningful, non-repeating, non-machine-generated, non-interstitial pages from the web as possible; it is also important to return results that provide the kinds of information the user is looking for.

Code is below. Sample output follows the code.


import sys, urllib2, re

u = sys.argv[1]     # the url to start with
lynx = [u]          # the list of urls to crawl, seeded with the start url


for u in lynx:      # process each link in turn; newly found links are appended as we go
    try:
        r = urllib2.urlopen(u)          # fetch the requested document
    except Exception:
        continue                        # skip urls that cannot be retrieved
    x = r.read()                        # read the document from the website
    tokens = x.split(" ")               # split the document into space-separated tokens
    c = [i for i in tokens if i.lower().find("http://") > -1]  # keep tokens that contain hyperlinks
    for i in c:                         # process each hyperlink-bearing token
        if i.find("CALLOUT|") > -1: continue   # skip some markup tokens
        if i.find(".html") == -1: continue     # only follow .html links
        if i.find("\n") > -1: continue         # skip tokens that span multiple lines
        stpos = i.lower().index("http://")     # start of the url within the token
        enpos = [k.start() for k in re.finditer(r"\.html", i)]  # candidate end positions
        enpos = [j for j in enpos if j > stpos + 13]            # ".html" must come well after "http://"
        if len(enpos) == 0: continue
        enpos = min(enpos)
        candidate = i[stpos:enpos] + ".html"   # the extracted url
        if candidate not in lynx: lynx += [candidate]
    if len(lynx) > 100: break   # hobble the spider so it doesn't go haywire


g = open("t.txt", "w")          # write the crawled urls to a file
for i in lynx: g.write(i + "\n")
g.close()
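To try the spider out, one might save the code as, say, spider.py (a file name chosen here purely for illustration) and run python spider.py http://www.example.com/index.html; the list of crawled urls ends up in t.txt.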