Saturday, January 21, 2012

On the design of Recommender Systems

Recommender Systems or Recommendation Systems (henceforth RS) are used by many popular websites today to drive up traffic, sales, and overall customer satisfaction. Internet services such as Amazon and Netflix use them successfully to generate more revenue. When implemented and used correctly, these systems can have a large impact on business metrics, driving up growth, revenues, and yes, even profitability (though this last metric also depends on other business parameters). In this blog post, we study some design factors involved in the construction of such systems.

Disclaimer: we do not claim any of these ideas below are original. We will try and add references later for some of the more important ideas, as time permits, to help the interested reader along a voyage of discovery.

1. Frequent Itemsets
This idea comes from the field of marketing analytics. A widely repeated (and possibly apocryphal) story holds that beer and baby diaper sales in the US are highly correlated. Apparently when dads go (or are sent) to the nearby store to pick up baby diapers, they take it upon themselves to restock on their beer to make better use of the trip, help save the environment (drive to store once, buy more things ...), etc. Stores that have customer loyalty programs can pick up on these kinds of trends more easily, since they know the content of most "shopping baskets", and can determine which sets of items go together most often and are purchased most frequently (these are called frequent itemsets). This way, they can decide upon appropriate marketing strategies to maximize their revenue - e.g. have sales on both beer and diapers at the same time, or give out larger discounts on diapers than neighboring stores but hike up the price of beer to compensate.

Finding and tracking frequent itemsets is something of an art form, and requires data mining and statistical analysis. Different firms do this differently with different levels of sophistication, and more complex methods are not necessarily better all the time...
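As a rough illustration of the basic idea (not of what any particular store does), here is a minimal Python sketch that counts how often single items and pairs of items appear across baskets and keeps those above a support threshold. The function name `frequent_itemsets`, the `min_support` parameter, and the toy baskets are all made up for this example. Real systems use smarter algorithms such as Apriori or FP-Growth, because brute-force counting does not scale to large catalogs.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items
    that appear in at least a min_support fraction of the baskets."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate and fix an item order
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    n = len(baskets)
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}


# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["diapers", "milk"],
    ["beer", "chips"],
]
result = frequent_itemsets(baskets, min_support=0.5)
for itemset, support in sorted(result.items()):
    print(itemset, support)
```

With these four baskets, the pair ("beer", "diapers") shows up in half of them and survives the 0.5 threshold, while "milk" (one basket in four) does not.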

Online stores can capitalize on this idea in other interesting ways. An RS for Amazon or another online store can, for example, direct people who have purchased certain items to other items that people with similar purchasing histories have bought or have considered buying (the latter inferred, e.g., by tracking their page views). Here, the website is using information about customers and prospective customers to guide others along the same path.

The logic goes: "Buying a toy? Perhaps batteries are needed. 95% of people buying this toy also bought batteries at the same time. And oh, 40% of those buying this toy wanted this cool rubber casing with it too. And did you consider how much nicer this expensive toy would be with this mod so many others like to buy? ... " and so on. 
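The percentages in this kind of pitch correspond to what the data mining literature calls the confidence of an association rule: of the orders that contain the toy, what fraction also contain the batteries? A minimal Python sketch follows, with a hypothetical `confidence` helper and invented order data.

```python
def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [set(b) for b in baskets if antecedent <= set(b)]
    if not with_antecedent:
        return 0.0  # the rule never fires, so report zero rather than divide by zero
    both = sum(1 for b in with_antecedent if consequent <= b)
    return both / len(with_antecedent)


# Hypothetical orders, invented for illustration only.
orders = [
    {"toy", "batteries"},
    {"toy", "batteries", "casing"},
    {"toy"},
    {"toy", "batteries"},
    {"book"},
]
toy_to_batteries = confidence(orders, {"toy"}, {"batteries"})
toy_to_casing = confidence(orders, {"toy"}, {"casing"})
print(f"{toy_to_batteries:.0%} of toy buyers also bought batteries")
print(f"{toy_to_casing:.0%} of toy buyers also bought the casing")
```

A high confidence on its own can mislead when the consequent is something nearly everyone buys anyway, so in practice it is usually read alongside support and lift.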

Or... like "Cool Hand Luke"? Maybe you will like other movies that people who liked this movie also liked. And it is more likely still that you will like a movie if the people who liked it also liked most of the movies you say you liked, and few of the movies you say you do not like. (This is where the user ratings come in handy at Netflix: your "basket" is the set of all movies you have rated. It can be compared with other baskets to find those with similar characteristics, and the items in those baskets that yours does not have can be recommended as movies to watch.)
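To make the basket comparison concrete, here is a minimal Python sketch of user-based filtering. It assumes ratings are simply +1 (liked) or -1 (disliked). The `similarity` and `recommend` functions, and every movie title other than "Cool Hand Luke", are hypothetical. This is one simple scheme among many and is not meant to describe how Netflix actually does it.

```python
def similarity(a, b):
    """Average agreement on the movies both users rated (+1 liked, -1 disliked)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    return sum(a[m] * b[m] for m in shared) / len(shared)


def recommend(target, others, top_n=3):
    """Score movies the target has not rated, weighting each like-minded
    user's rating by how similar that user is to the target."""
    scores = {}
    for other in others:
        sim = similarity(target, other)
        if sim <= 0:
            continue  # ignore users whose taste does not match
        for movie, rating in other.items():
            if movie not in target:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [movie for movie, score in ranked if score > 0][:top_n]


# Hypothetical ratings, invented for illustration only.
me = {"Cool Hand Luke": 1, "Movie A": 1, "Movie B": -1}
others = [
    {"Cool Hand Luke": 1, "Movie A": 1, "Movie C": 1, "Movie D": -1},
    {"Cool Hand Luke": -1, "Movie B": 1, "Movie E": 1},
    {"Cool Hand Luke": 1, "Movie B": -1, "Movie C": 1, "Movie F": 1},
]
picks = recommend(me, others)
print(picks)
```

The second user disagrees with the target on every shared movie, so that basket is ignored; the other two both liked "Movie C", which therefore ranks first.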

Not too intrusive (not overtly anyway), but helpfully plugging just the right products for you to find. And of course, there is a search facility that lets you find not just what you need, but multiple options for similar things, all neatly indexed in decreasing order of potential relevance, with the option for you to sort things along other parameters you might like.

2. The Wisdom of Crowds
Ever wonder what makes some online stores so much more successful than other book stores with an online presence? A key factor is the facility that enables customers to share their experiences about a product or service sold through the store-front. For instance, if on site A I can read all about a book I want to buy, see other customers' reviews of it, see customer-posted product pictures, compare other vendors' prices, and share my feedback, positive or negative, with the larger body of consumers, why would I not pick this site over its competitors, especially if I can also get better deals there?

MBAs are usually taught that some platforms are "multi-sided markets" (think credit cards, or development APIs for example). Developing such platforms from the ground up can be a challenge, because you need to simultaneously grow all communities around the platform. But once there is sufficient interest and a core ecosystem becomes functional, a virtuous cycle sets in and the business can grow very rapidly. Of course, in an environment with increasing "clock-speed" (reduced cycle times), successes and failures typically become evident very quickly as well. Agile companies can just as quickly increase investment in ideas that seem to work, and kill those that seem to be doing badly. But we digress...

With the advent of media like Facebook and Twitter, many websites now permit users to share their purchases with friends from those and other communities. This creates a following... "I am more likely to like things my friends like, so perhaps I should check out the things they are buying..." and so on. This is kind of a subliminal recommendation system. Works on the sly. See model (5) below for more.

Wikipedia is, to date, perhaps the best positive example of the wisdom of crowds. But even there, the truly wise are very few... and there are far, far more readers than writers. The same applies to product ratings. Most people will, at best, only rate a product. People who write reviews are far fewer. And those who do so without a hidden motive (there have been several reported instances of restaurants reviewing themselves on Yelp, of authors reviewing their own books glowingly on Amazon, etc.) would be an even smaller subset.

3. Free Samples for All
This builds on ideas from (1) above. It is so much better to give people the option to "try before you buy". Consumption of books, movies, songs etc. requires an investment of time, money, and effort on the part of the consumer. Can you make them understand that the 2 hrs it takes to watch this movie, and the $12.95 it takes to buy it, are actually worth it, as opposed to, say, doing something else with that money and time? Give them a free preview. Tell them "no pressure, we think this is something you'd like, and as you can see, it is really not that expensive. Check out how much other people are selling it for. And look how much other people like you seem to like it! Just buy it now from us, and don't worry about it till it is time to pay your credit card bill". That's what free samples do. Of course, occasionally you will have free-loaders who take your samples and never buy, but that's a chance you have to take.

4. A Thematic Match
Sometimes, it is also convenient and useful for systems to generate a thematic match. If you like certain movies for example, rather than look for movies that others who liked your favorites have watched (which requires gobs of data), why not recommend other movies with similar thematic elements that will in all likelihood interest you? Or perhaps you liked the movies you liked because of the actor or actors in them? So other movies with the same "elements" - defined as actors, story lines etc that are similar might interest you? Of course to find exactly what makes someone like a certain movie requires some deep statistical analysis of what similar elements exist across their viewing history, or (here's a novel idea!) asking them to tell you what kinds of movies they like. "Do you like to watch mysteries frequently? sometimes? never? What about dark, gritty, thrillers?" 
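One simple way to sketch a thematic match is to describe each movie by a set of "elements" (genres, moods, actors) and rank other titles by how much their element sets overlap with a movie the user liked. The Python sketch below uses Jaccard similarity for the overlap. That choice, the `thematic_matches` function, and the catalog entries are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def thematic_matches(liked_title, catalog, top_n=2):
    """Rank other titles in the catalog by element overlap with liked_title."""
    liked = catalog[liked_title]
    scored = [
        (jaccard(liked, elements), title)
        for title, elements in catalog.items()
        if title != liked_title
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for score, title in scored if score > 0][:top_n]


# Hypothetical catalog, invented for illustration only.
catalog = {
    "Film A": {"mystery", "dark", "actor:X"},
    "Film B": {"mystery", "dark", "thriller"},
    "Film C": {"comedy", "actor:X"},
    "Film D": {"romance", "musical"},
}
matches = thematic_matches("Film A", catalog)
print(matches)
```

Note that this needs no data about other users at all, only descriptions of the items themselves, which is why it is easier to get started with.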

Thematic match is an interesting scheme that can get results quickly using information from the public domain and some patient customer prodding. It is easier to get started with this and build a community of loyal users. As they engage more actively with your website, they leave behind more data that you can mine with models (1) and (2) above, giving you more sales. That in turn encourages your content suppliers to give you more content with free samplers - model (3) - now that your ecosystem is humming along, and so on...

5. Link to Social Media
Think also of social media used to create a buzz. This is a key new development in our age. "OMG! I just read Girl With The Dragon Tattoo and loved it!" someone tweets or posts on Facebook. And immediately her network of 132 people sees this, some of whom will either buy or borrow this book and read it. Things go viral fast. It used to be said that it takes years to build a reputation and only a moment to destroy it. It seems these days you can build and destroy reputations (ok, marketing buzz) in minutes... and you can amplify this effect if there are groups set up around certain themes, because then the uptake for anything posted into such a forum would be very high, and you could get a large spike in volumes (either sales or whatever else, depending on media) very quickly.

6. The Human Unchained
Most importantly of all, unleash the human within your consumer. Exploit their need to feel heard, to talk, to feel good about what they have found that makes them smarter, wiser, more beautiful, special in their own way when they compare themselves with those around them. Encourage them to "like" products. Use them to describe their experiences using these products. Then mine these descriptions for statistically improbable phrases (SIPs - see the post on tf-idf for more) that more likely and more readily associate themselves with particular products. Use these words or phrases to help you with your thematic match (model 4 above).
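As a rough sketch of the mining step, the Python below scores each word in a product's reviews with plain tf-idf and keeps the top few. Plain tf-idf is used here only as a simple stand-in; it is not claimed to be how any real site computes its SIPs. The `distinctive_terms` function and the review snippets are invented for the example, and the tokenization (lowercase, split on whitespace, single words rather than phrases) is deliberately naive.

```python
import math
from collections import Counter


def distinctive_terms(reviews_by_product, top_n=2):
    """For each product, return the top_n words with the highest tf-idf score,
    treating all of a product's review text as one document."""
    docs = {p: Counter(text.lower().split()) for p, text in reviews_by_product.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in docs.values():
        doc_freq.update(counts.keys())  # number of products whose reviews use each word
    result = {}
    for product, counts in docs.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[product] = ranked[:top_n]
    return result


# Hypothetical review text, invented for illustration only.
reviews = {
    "kettle": "boils fast great kettle boils quietly",
    "novel": "great plot gripping plot great read",
    "headphones": "great sound comfortable fit great bass",
}
terms = distinctive_terms(reviews)
for product, words in terms.items():
    print(product, words)
```

A word like "great" appears in every product's reviews, so its score is zero and it never surfaces, while words concentrated in one product's reviews rise to the top.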

Captchas were a great invention to prevent bots from harvesting Internet information for unscrupulous use. However, crafty websites can leverage human intelligence by requiring humans to solve a "flow through" captcha to unlock information from the original website for bot consumption. These and related ideas can be used both for good and bad... so caution is warranted. 

Good implementations will mine data in the aggregate, or, if they use data tied to particular individuals, will irrevocably destroy the link between the identifying characteristics of the individual in question and the data set that pertains to them. At least one of these conditions must be met for safe implementations of data mining for user-centric analytics.

Other examples of human factors use in similar systems: 

In India, the police have taken to using Facebook to permit people on the road to report traffic violations involving other people's vehicles. Photographs or video of offenders can be posted online via Facebook or other social media sites, and the law enforcement machinery will take action as and where deemed necessary. Be careful, Big Brother is watching you, and your big brothers are your fellow citizens - in a more than figurative fraternal sense.

Want to hear a weather report for free? Call us, then hear an ad and tell us what you think, service is "free".

7. Putting it all together
As we have remarked over the course of this discussion, some elements require less data and less user participation than others, and can be built more easily. Then, as one's business grows, one can leverage the larger data set gleaned from the user base to build more sophisticated analytics. These can be used to provide better service and increase loyalty, to lock in customers by increasing barriers to switching, or to provide deals tailored to consumer segments that make it more attractive for them to return time and again.

Sadly, building these kinds of systems for real, even rudimentary ones, requires a rather substantial amount of data, and this data is not easily available to people like us who are interested in exploring the leading edges of technology. So for now, we close this post without a sample implementation built on real data. We express the fond hope that when such data becomes available more freely (after all, it is only aggregate data and doesn't really violate privacy), we might revisit the discussion to provide a sample implementation of at least some of the above.