Saturday, January 31, 2015

ML applications: news aggregation, MBA admissions

In studying Machine Learning, we naturally want to know what kinds of applications each method can target. In this post we look at a couple.


  1. News aggregators - sites like Google News need a means of grouping links to the same news story from different news broadcasting sites. In previous posts we looked at how we can measure the closeness of different news stories by constructing inverted indices over the documents and then computing the distance between them in an N-dimensional space defined by the statistically improbable words in each document - in vector terms, we compute the angle between any two vectors, each representing a unique news article, and treat the articles whose vectors make the smallest angles with each other as belonging to the same story. We also looked at Robust Hyperlinks as a related method: intercept a dead link to a topic of interest (perhaps a news topic), use the statistically improbable terms from the search or from the URL itself to find other links to the same story, perhaps on other sites, and serve that content in place of the dead link. A different way of achieving comparable results is to cluster the documents in an N-dimensional space where the dimensions are defined by each article's top 5 or 10 terms by tf-idf score. This has similar characteristics, uses unsupervised learning, and can bucket a large number of articles in each pass as news happens and new articles are published with tf-idf terms similar to those of other articles covering the same story.
  2. MBA admissions and hiring in large corporations follow an interesting pattern. These are similar in many ways, but to keep things non-controversial let's just use the B-school example. Say, for instance, you are targeting a particular B-school. Let's say you have a fantastic GMAT score, a great GPA, excellent extra-curriculars, terrific letters of recommendation, and meet all the right criteria for age, academic pedigree of previous degrees, etc. How will the school decide whether or not to admit you? Well, the school of course wants to admit the best students, but what constitutes "the best" might be an illusion. Let's look at it this way. If a school says they want to have "as diverse a class as possible," and by that they factor in nationalities, competencies, pre-MBA careers, ages of students, GMAT scores, pre-MBA schools, earlier degrees, etc., then what they really mean is that rather than taking their cutoffs from a global maximum and working their way down the hill until they run out of seats, they probably cluster all students along various criteria in N-dimensional space and then pick "the best" students to admit from each cluster. This explains perfectly why, if you are a male software engineer from Asia with a 780 GMAT, a background from one of the top schools in that nation, and stellar letters of recommendation, you may still lose out to a female lawyer from some other country in the developed world that is under-represented in the program, with a very respectable 720 GMAT and stellar metrics in all other ways. The more applicants comparable to you there are in the pool, the harder it is for you to stand out in it.
Your class of applicant may destroy the GMAT percentile curve for all other classes of applicant, but since the classes of applicant compete amongst themselves for the seats available, competition across classes might not count for as much as competition within a class - there, things like age, academic pedigree, past successes, etc. may count for more in whether or not you get offered a seat. So how does this help you, you ask? Well, if, from the list of all the factors that determine admission into a top MBA program, you can determine for your target school which features they cluster on, and which features they use to differentiate candidates within a cluster, you can prepare your application more effectively. Of course, you cannot change who you are when you apply (so your cluster is likely pre-defined by the inflexible characteristics that make you, you), but you can present your story differently to better position yourself within your cluster. Also, it is likely that the "qualitative" factors like your essays become relevant only after the cut-off has already been applied to your cluster, so if you know the dimensionality of the clustering algorithm each school uses, and which features actually feed into it, that knowledge can help you.
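The tf-idf-and-angles approach described above can be sketched in a few lines of Python. The toy "articles" and function names here are my own illustration, but the mechanics - tf-idf weighting followed by the cosine of the angle between document vectors - are what the text describes:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse tf-idf weight vectors for a list of tokenized documents."""
    n = len(docs)
    # document frequency: in how many docs each term appears
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: (count / len(doc)) * math.log(n / df[term])
                        for term, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

articles = [
    "fed raises interest rates markets react".split(),
    "interest rates raised by fed stocks fall".split(),
    "local team wins championship game".split(),
]
vecs = tfidf_vectors(articles)

# Articles about the same story share improbable terms, so the angle
# between their vectors is smaller (cosine closer to 1).
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))
```

A production aggregator would tokenize and normalize much more carefully and keep only the top few terms per article, but the bucketing decision reduces to exactly this comparison.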
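The cluster-then-rank intuition in the admissions example can likewise be sketched with a plain k-means implementation. The applicant features and numbers below are entirely made up for illustration; the point is only that an unsupervised algorithm partitions applicants into pools before any within-pool ranking happens:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: partition points (tuples of floats) into k clusters."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid (squared distance)
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # recompute each centroid as the mean of its cluster
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return clusters

# Hypothetical applicants: (normalized GMAT, years of work experience)
applicants = [(0.95, 4), (0.97, 5), (0.93, 4),   # a dense, comparable pool
              (0.85, 9), (0.83, 10)]             # a sparser pool
pools = kmeans(applicants, k=2)
```

The seat cutoff would then be applied within each pool, which is why a crowded pool is harder to stand out in regardless of raw scores.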

Friday, January 30, 2015

Learning Machine Learning

Lately I have spent a lot of time re-learning Machine Learning from scratch (both to reinforce what I once knew but forgot, and to build the more extensive data analytics and model building muscle that is ever so useful at work these days). This ties in well with my interest in biologically inspired algorithms (genetic algorithms etc.), variants of which are widely used in fields like Finance - for example, if you build a Stochastic Volatility model for Option Pricing like the Heston or SABR, or the Bates model for Stochastic Volatility with Jump Diffusion, typically you will have to calibrate it with an algorithm like differential evolution, which is a variant (or maybe a special kind) of genetic algorithm. You will also have to use techniques like partial function application, related to currying (also called Schönfinkelization).
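As a small aside on that last point: in Python, `functools.partial` does exactly this fixing-of-arguments. The pricing function below is a stand-in toy objective, not a real Heston or SABR pricer, but the calibration pattern - bake the fixed market data in, then hand the optimizer a function of the model parameters alone - is the same:

```python
from functools import partial

def pricing_error(params, strikes, market_prices):
    """Toy calibration objective: squared error of a linear 'model' in params."""
    a, b = params
    return sum((a + b * k - p) ** 2 for k, p in zip(strikes, market_prices))

# Fixed market data (illustrative numbers only)
strikes = [90.0, 100.0, 110.0]
quotes = [12.0, 7.0, 3.5]

# Partial application bakes the market data in; the result has the
# one-argument signature a generic optimizer expects.
objective = partial(pricing_error, strikes=strikes, market_prices=quotes)
print(objective((50.0, -0.45)))
```

A real calibration would pass `objective` to something like a differential evolution routine, which only ever sees the parameter vector.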

Anyway, here is what I have done so far. This is but one of many possible paths through this material. Each line here takes 30+ hours to get through, but you will improve both your understanding of the underlying mathematics and your ability to actually implement the ideas in a work setting if you spend the time here. Here's the path that worked for me. Yes, the material in some of the lectures below overlaps, but I learn best when I see the same material presented by different people who come at it from different angles.

In terms of prerequisites, a liking for mathematics and at least minimal knowledge of linear algebra, multivariate calculus (at least basic vector calculus, including the notions of div and grad), and some multivariate statistics would be useful. Completing MIT OCW courses like 6.004 and 6.042J, or equivalently the algorithms classes offered by Tim Roughgarden (Stanford) or Robert Sedgewick (Princeton), would be good. Those, and of course the desire to work hard through the material when things get a little difficult.

A (*) marks what I felt were the best-quality courses. Of course, your mileage may vary. Some of these are difficult and require serious work.
  1. (*) Trevor Hastie and Robert Tibshirani's Lectures at Stanford - very accessible even without too much of a math background. This is truly phenomenal. And what great guys, their textbooks are legally free to download. Hats off to them!
  2. (*) Yaser Abu-Mostafa's extremely well-done lectures at Caltech on Machine Learning also cover lots of theory with mathematics (the learning theory sections are a bit challenging, but very necessary). Very clear explanations.
  3. (*) Professor Andrew Ng's lectures again at Stanford have more of a practical feel to them. Again extremely well done. I took the actual Stanford class (slightly more difficult, but totally worth it), not the somewhat diluted Coursera one.
  4. (*) I plan on re-taking Professor Koller's Stanford course on Probabilistic Graphical Models - it demands a lot of work and close attention, and is very challenging at times, but definitely worth the effort. I am re-taking it to ensure I understand everything correctly.
  5. (*) Geoffrey Hinton's lectures from the University of Toronto on Neural Nets. These are on Coursera, the content is excellent and extremely clear.
  6. (*) Coursera lectures on Mining Massive Datasets by Anand Rajaraman and Jeffrey Ullman that follow along the lines of their free book published some time ago. This course is very good but requires quite a bit of work.
  7. (*) Taking MIT 6.006 (Introduction to Algorithms), 6.034 (Introduction to AI), and 6.042J (Mathematics for Computer Science). Love Srini Devadas, Tom Leighton, and the other instructors - very gifted teachers.
  8. A Practical Machine Learning course offered by Johns Hopkins. This was quite a bit easier after all of the above.
  9. Completed the University of Washington Machine Learning on Coursera - interesting difference here is that it is case-study based and application oriented. Also quite a bit easier than the starred courses above.
  10. I plan on working my way through the MIT lectures on probability taught by Professor Tsitsiklis (6.041), and the Harvard CS109 class on Statistics and Analytics, also online.
I have already made significant use of these methods at work - building models using techniques from Neural Networks, Support Vector Machines, and Logistic Regression - and overall the lectures were so clear that I understood exactly what I did in each case, what decisions I made, and why.

I will update the post with more material as I learn more machine learning.

Thursday, January 29, 2015

ChiPrime quant prep video channel!

ChiPrime is the competitive-exam quant prep portal, which I wrote about in earlier posts, that offers free, computer-adaptive, high-quality, targeted content to help you ace the math portions of exams like the GMAT and the GRE. It looks like they have recently made a couple of improvements to their website. (Full disclosure: I do support this website, and help them maintain their high quality standards.)

One, they now offer free lessons on the content of the exams, broken down into several easy-to-understand modules with a logical flow and worked examples. The modules build upon each other quite well.

Two, they also have a video channel on YouTube here. Not too many videos on it just yet, but the focus there is to add more quality content slowly but surely.

If you use it and like it, please like it on Facebook to help spread the word!