Saturday, January 31, 2015

ML applications: news aggregation, MBA admissions

When studying machine learning, we naturally want to know what kinds of applications a particular method can target. In this post we look at a couple.


  1. News aggregators - sites like Google News need a way to group links to the same news story from different news sites. In previous posts we looked at how to measure the closeness of news stories by building inverted indices over the documents and then finding the distance between them in an N-dimensional space defined by the statistically improbable words in each document. In vector terms, that means computing the angle between the vectors representing any two articles and treating the pairs with the smallest angle as belonging to the same story. We also looked at Robust Hyperlinks as a related method: intercept a dead link to a topic of interest (maybe a news topic), use the statistically improbable terms from the search or from the URL to find other links to the same story, perhaps on other sites, and forward that content instead. A different way of achieving comparable results might be to cluster the documents in an N-dimensional space whose dimensions are drawn from the top 5 or 10 terms by tf-idf score in each article. This has similar characteristics, uses unsupervised learning, and can bucket a large number of articles on each pass as news happens and new articles are published whose top tf-idf terms match those of other articles on the same story.
  2. MBA admissions and hiring in large corporations follow an interesting pattern. The two are similar in many ways, but to keep things non-controversial let's just use the B-school example. Say you are targeting a particular B-school, and you have a fantastic GMAT score, a great GPA, excellent extra-curriculars, terrific letters of recommendation, and you meet all the right criteria for age, academic pedigree of previous degrees, etc. How will the school decide whether or not to admit you? The school of course wants to admit the best students, but what constitutes "the best" might be an illusion. Let's look at it this way. If a school says it wants "as diverse a class as possible", and by that it factors in nationalities, competencies, pre-MBA careers, ages of students, GMAT scores, pre-MBA schools, earlier degrees, etc., then what it really means is this: rather than starting at the global maximum and working down the hill, handing out seats to qualifying students until the seats run out, it probably clusters all applicants along various criteria in N-dimensional space and then picks "the best" students to admit from each cluster. This would explain why, if you are a male software engineer from Asia with a 780 GMAT, a background from one of the top schools in that nation, and stellar letters of recommendation, you may still lose out to a female lawyer from some other country in the developed world that is under-represented in the program/school, with a very respectable 720 GMAT and stellar metrics in all other ways. The more applicants comparable to you there are, the harder it is for you to stand out in that pool.
Your class of applicant may destroy the GMAT percentile curve for all other classes of applicant, but applicants compete for the available seats mainly within their own class. Competition across classes might not count for as much as competition within a class, where things like age, academic pedigree, past successes, etc. may count for more in whether or not you get offered a seat. So how does this help you, you ask? Well, if, from the list of all the factors that determine admission into a top MBA program, you can work out for your target school which features it clusters on and which features it uses to differentiate candidates within a cluster, you can prepare your application more effectively. Of course, you cannot change who you are when you apply (so your cluster is likely pre-defined by the inflexible characteristics that make you, you), but you can present your story differently to better position yourself within your cluster. Also, it is more likely that the "qualitative" factors like your essays become relevant only after the cut-off has been applied to your cluster, so if you know the dimensionality of the clustering algorithm each school uses, and which features actually go into it, this could help you.
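
To make both ideas a bit more concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not how Google News or any admissions office actually works. The headlines, the applicant records, the `cluster` labels (standing in for the output of a real clustering step), and the helper names `tfidf_vectors`, `cosine` and `top_per_cluster` are all made up for this example. The first part scores article pairs by the cosine of the angle between their tf-idf vectors (a larger cosine means a smaller angle); the second contrasts a global cut-off with picking the top scorer inside each cluster.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_per_cluster(applicants, seats_per_cluster):
    """Pick the highest-GMAT applicants inside each cluster separately."""
    clusters = defaultdict(list)
    for applicant in applicants:
        clusters[applicant["cluster"]].append(applicant)
    admitted = []
    for members in clusters.values():
        members.sort(key=lambda a: a["gmat"], reverse=True)
        admitted.extend(members[:seats_per_cluster])
    return admitted


# Hypothetical headlines: the first two cover the same story.
headlines = [
    "earthquake strikes coastal city overnight",
    "coastal city hit by strong earthquake",
    "central bank raises interest rates again",
]
vectors = tfidf_vectors(headlines)
sim_same = cosine(vectors[0], vectors[1])   # the two articles share terms
sim_other = cosine(vectors[0], vectors[2])  # no shared terms at all

# Hypothetical applicants; "cluster" stands in for a real clustering result.
applicants = [
    {"name": "A", "cluster": "engineer", "gmat": 780},
    {"name": "B", "cluster": "engineer", "gmat": 770},
    {"name": "C", "cluster": "lawyer", "gmat": 720},
    {"name": "D", "cluster": "lawyer", "gmat": 700},
]
global_top = sorted(applicants, key=lambda a: a["gmat"], reverse=True)[:2]
per_cluster = top_per_cluster(applicants, seats_per_cluster=1)

print("same story:", sim_same > sim_other)
print("global cut-off admits:", [a["name"] for a in global_top])
print("per-cluster admits:", sorted(a["name"] for a in per_cluster))
```

With two seats in total, the global cut-off takes the two engineers, while one seat per cluster takes the top engineer and the top lawyer, so the 770 loses out to the 720. A real system would cluster on many features at once and would need proper tokenization and stop-word handling; this sketch skips all of that.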