Thursday, October 6, 2016

building a smarter search


AI is all the rage these days. It stands to reason that large companies are trying to incorporate it wherever possible to improve the quality of their products. This also holds in the search space. There are two ways a user's search experience can be improved: one, the user becomes a better searcher, picking better keywords to search. The second, more interesting way is for the search engine to become more attuned to the user through frequent interactions with her, delivering better results as a direct consequence of those interactions - this requires some intelligence or adaptiveness on the part of the search engine. It requires artificial intelligence... at least in the more advanced use cases.

Sure, this means the search giants need to spend more to deliver better results. But this is a battle to capture users and eyeballs. And the company that gets there first increases "stickiness" - a search more finely attuned to your tastes builds a huge moat around itself. Unless that search context can be transferred from one search engine to another, once you've "captured" a user, she is likely to stay with you for good. And the more users you have, the better you'll be able to monetize them. Search goes social at some point, and well, there's no looking back from there.

Gone are the days when search used to be driven simply by the terms in the input. Recall those simpler times? One would use techniques like tf-idf to identify words that occur with regularity in the documents of interest yet are statistically improbable across the corpus, and rank documents by goodness of fit with the search terms. (OK, this is an oversimplification. Techniques such as stemming, synonym detection, etc. could be used to give the search query more power.)
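
To make the idea concrete, here is a minimal tf-idf sketch in plain Python; the toy corpus, the whitespace tokenizer, and the particular weighting formula are illustrative assumptions, not any engine's actual implementation.

```python
# A minimal tf-idf sketch: weight terms that are frequent in a document
# but rare across the corpus. Corpus and tokenizer are made up.
import math
from collections import Counter

corpus = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog chased the cat",
    "doc3": "quarterly earnings report for the logistics company",
}

def tokenize(text):
    return text.lower().split()

N = len(corpus)
doc_tokens = {doc_id: tokenize(text) for doc_id, text in corpus.items()}

# document frequency: in how many documents does each term appear?
df = Counter()
for tokens in doc_tokens.values():
    df.update(set(tokens))

def tf_idf(doc_id):
    counts = Counter(doc_tokens[doc_id])
    total = len(doc_tokens[doc_id])
    return {
        term: (count / total) * math.log(N / df[term])
        for term, count in counts.items()
    }

print(tf_idf("doc3"))  # 'earnings', 'logistics' etc. score high; 'the' scores 0
```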

There are different ways of determining goodness of fit, of course - cosine similarity, for instance, is relatively easy to implement. Then large search engines implemented ideas such as using the context of the text on particular web pages - its color, size, and other cues - to determine if a page was worth including in the search results.
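
A hedged sketch of cosine-similarity ranking, here leaning on scikit-learn's TfidfVectorizer and cosine_similarity; the documents and the query are invented for illustration.

```python
# Rank documents by the cosine similarity between the query vector and
# each document vector in a shared tf-idf space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "shipping and package tracking services",
    "track your parcel online",
    "basketball season highlights",
]
query = "track a package"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)   # one tf-idf vector per document
query_vec = vectorizer.transform([query])     # project the query into the same space

scores = cosine_similarity(query_vec, doc_matrix)[0]
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```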

PageRank was a further improvement, where computing the reputation of the pages linking to a certain target web page gave the target a score. The better the reputation of a page's in-links, the more credibility the page had in the search results.
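
A toy version of the computation via power iteration might look like the following; the link graph, damping factor, and iteration count are illustrative, and a real engine runs this over billions of pages.

```python
# Toy PageRank: repeatedly redistribute each page's rank to the pages it
# links to, with a damping factor to model random jumps.
damping = 0.85
links = {                      # page -> pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):            # iterate until the ranks stabilize
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

print(rank)                    # "c" ends up highest: it has the most reputable in-links
```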

Next came the groundwork for the filter bubble. You log into the service suite provided by the company running the search function. Google, Microsoft, etc. all offer much more than just search - a full suite of services including email, drive storage in the cloud, a blogging portal (such as this one), and other applications (e.g. documents) designed to make use of cloud-hosted data. A lot of these services require that you log in. And once you log in... they get to see, and store, your information - the kinds of searches you are doing, the keywords you are looking for, and the links you click on. This is a veritable gold mine of information.

This leads to predictive search - the search toolbar can now predict the kinds of keywords you are looking for. If most of your queries are for PDF documents, with, say, "filetype:pdf" at the end of the query, then the toolbar can suggest that suffix as you type your keywords into the query string.
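
A minimal sketch of that kind of suffix suggestion, assuming a small query history and an arbitrary 50% threshold, might look like this.

```python
# Suggest a query-operator suffix when most of the user's past queries ended with it.
from collections import Counter

query_history = [
    "tax form 1040 filetype:pdf",
    "lease agreement template filetype:pdf",
    "chocolate cake recipe",
    "research paper on search engines filetype:pdf",
]

def suggest_suffix(partial_query, history, threshold=0.5):
    """If most past queries ended with the same operator, propose appending it."""
    suffixes = Counter(q.split()[-1] for q in history if ":" in q.split()[-1])
    if not suffixes:
        return partial_query
    suffix, count = suffixes.most_common(1)[0]
    if count / len(history) >= threshold and not partial_query.endswith(suffix):
        return f"{partial_query} {suffix}"
    return partial_query

print(suggest_suffix("invoice template", query_history))
# -> "invoice template filetype:pdf"
```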

But that is not all. Now that they have your query set and the set of links you clicked, the service is able to use this data to (a) better tune the results for future queries by using the context of keywords typed earlier, and (b) use the content of the clicked links and the time spent on each page (e.g. the user clicks on a link and quickly goes back to the search results) to determine how best to serve your search queries going forward.
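
As a rough illustration of (b), here is one way click dwell time could feed back into scoring; the thresholds and multipliers are made-up assumptions, not a description of any engine's actual ranking.

```python
# Demote results the user bounced straight back from; boost long visits.
base_scores = {"url_a": 0.90, "url_b": 0.85, "url_c": 0.80}
dwell_seconds = {"url_a": 4, "url_b": 180}     # url_c was never clicked

def adjust(score, dwell):
    if dwell is None:
        return score                # no signal, leave the score as-is
    if dwell < 10:
        return score * 0.7          # quick bounce back to results: likely a bad match
    return min(1.0, score * 1.2)    # a long stay suggests the page satisfied the query

adjusted = {url: adjust(s, dwell_seconds.get(url)) for url, s in base_scores.items()}
print(sorted(adjusted.items(), key=lambda kv: kv[1], reverse=True))
# url_b now outranks url_a despite a lower base score
```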

Complaints from users were factored in to negatively affect the reputation of links offered up in some search results. Some sites were "banned" or sent to "Google hell", which meant their reputation was so badly affected they would not show in results anymore. A manual negative feedback of sorts.

This leads to the filter bubble. If you log into the search engine's suite of services and run a query, you get different results from what you would have gotten if you ran the same query while logged out. Similarly, different people with different search histories would almost certainly get different results, possibly even on the very first page, for the very same keywords. In other words, the filter bubble sets the context for the search results that are delivered.

This is where things start to get interesting.
  • Using services like Amazon's Mechanical Turk, which harnesses human intelligence to solve small problems, one could caption large numbers of images, then use those captions to deliver relevant images in search results based on the search keywords (a small caption-index sketch follows this list). An early example of a large labeled image dataset for training supervised machine learning algorithms comes from Stanford's Fei-Fei Li.
  • Google has a video service - YouTube videos have captions, which can even be auto-generated. Alternatively, videos have descriptive text, and some have comments. Filtering statistically improbable phrases from either, and matching these against the search keywords, can deliver relevant search results.
  • As speech processing improves - Siri, Cortana, Google Voice Services, etc. are getting better every day thanks again to AI and NLP technology - audio files can be auto-captioned. And links to relevant audio can also be returned in search results.
  • Other media such as Maps can also factor into search results in interesting ways. This gives a local (as in geography) flavor to search.
  • Microsoft now has (or soon will have) the entire LinkedIn database at its disposal. Search results could now contain links to people... and to jobs where the specified keywords apply. More and more we see the world turning into a huge graph of interconnected data elements.
  • We are already at a point where a generic search engine can deliver search results within a certain domain (like, say, UPS) better than the hosted search on that domain's own website - at least, this was true until more and more websites started hosting Google or Bing searches as their within-domain defaults.
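
As promised above, a tiny inverted index over human-written captions illustrates the first bullet; the filenames, captions, and bare keyword matching are assumptions for illustration.

```python
# Map caption words to the images they describe, then look images up by keyword.
from collections import defaultdict

captions = {
    "img_001.jpg": "a tabby cat sleeping on a windowsill",
    "img_002.jpg": "golden retriever playing in the park",
    "img_003.jpg": "a cat and a dog sharing a couch",
}

index = defaultdict(set)
for filename, caption in captions.items():
    for word in caption.lower().split():
        index[word].add(filename)

def image_search(keyword):
    """Return images whose caption mentions the keyword."""
    return sorted(index.get(keyword.lower(), set()))

print(image_search("cat"))   # ['img_001.jpg', 'img_003.jpg']
```
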
Similar ideas can be applied to other kinds of media as well. But this is all so... yesterday. So what more can we do? With deep learning, i.e. neural networks with neurons layered deep, one is able to automatically sort images - cluster them together on common themes. This works well in many, but not all, cases. Recent work reported deep networks with particular output neurons that could recognize cats and dogs, but there were also neurons whose outputs could not easily be discerned to correspond to any one major idea. And for some reason, composite objects like cars were not always easy to recognize... go figure. But this means a way forward might be to use AI and deep learning to automagically classify images for use in search results - with a diminishing (to zero) human component.
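
A sketch of that clustering idea, assuming image embeddings have already been extracted by some deep network (the random vectors below merely stand in for real embeddings), could use off-the-shelf k-means.

```python
# Cluster images by theme in embedding space; synthetic vectors stand in
# for the activations a deep network would actually produce.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 2048))      # pretend these came from a CNN
image_ids = [f"img_{i:04d}.jpg" for i in range(len(embeddings))]

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(embeddings)

# group images by the theme cluster they fell into
clusters = {}
for image_id, label in zip(image_ids, kmeans.labels_):
    clusters.setdefault(label, []).append(image_id)

for label, members in sorted(clusters.items()):
    print(label, members[:3], "...")
```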

Are we done? What more can we do? Well, take the results of similar queries and look at the clicked links, especially if you have access to metrics like the time a user spent on particular web pages. The search engine can learn from clicks - similar to the query by example (QBE) paradigm from early relational databases - learning what particular users liked, then using those clicks to dynamically refine the search results further.
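
A rough query-by-example flavor of this: build a profile vector from the pages a user clicked, then rerank new candidates by similarity to that profile. The documents, clicks, and tf-idf representation are all illustrative assumptions.

```python
# Rerank candidate results by similarity to a user profile built from clicked pages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

clicked_pages = [
    "python pandas dataframe tutorial",
    "numpy array broadcasting explained",
]
candidates = [
    "pandas groupby examples",
    "monty python sketch transcripts",
    "scipy sparse matrix guide",
]

vectorizer = TfidfVectorizer().fit(clicked_pages + candidates)
profile = np.asarray(vectorizer.transform(clicked_pages).mean(axis=0))  # the user's "example"
scores = cosine_similarity(profile, vectorizer.transform(candidates))[0]

for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {doc}")
```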

Statistical text analysis uses probability theory to classify words and phrases as different components of a sentence (e.g. noun, verb, adjective), and potentially also as having different meanings (semantics) based on how the sentences are structured. The same word can have different meanings in different contexts - earlier search would sometimes get confused by this. For instance, not so long ago, Google News had, under one grouping, stories about Jordan the country and Michael Jordan the basketball star. Statistical text analysis, applied appropriately, would minimize, and potentially even eliminate, errors of this nature.
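
A toy disambiguation sketch for the Jordan example follows; real systems learn these co-occurrence statistics from large corpora, whereas the tiny hand-built sense profiles below are purely illustrative.

```python
# Score the words surrounding "Jordan" against small sense profiles and
# pick the sense with the most overlap.
sense_profiles = {
    "jordan_country": {"amman", "river", "king", "middle", "east", "border"},
    "jordan_person":  {"basketball", "bulls", "nba", "chicago", "dunk", "michael"},
}

def disambiguate(sentence):
    words = set(sentence.lower().replace(",", "").split())
    scores = {sense: len(words & profile) for sense, profile in sense_profiles.items()}
    return max(scores, key=scores.get), scores

print(disambiguate("Jordan signed a border agreement along the river"))
# -> ('jordan_country', ...)
print(disambiguate("Jordan led the Bulls to another NBA title"))
# -> ('jordan_person', ...)
```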

Taking a page from recommender systems, if two users have similar search contexts, it is likely that pages one user clicked on in a search for a given set of keywords would also carry higher importance in the other user's search for the same keywords. (Users with similar search tastes cluster together in n-dimensional search space - just like birds of a feather.)
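
A small sketch of that "birds of a feather" idea in click space, with fabricated users and clicks; representing each user as a vector over clicked pages and comparing with cosine similarity is one simple (assumed) way to do it.

```python
# Compare users by their click vectors; pages a similar user clicked can be
# boosted for a user who has not seen them yet.
import numpy as np

pages = ["p1", "p2", "p3", "p4"]
clicks = {                       # 1 = the user clicked this page for similar queries
    "alice": np.array([1, 1, 0, 0], dtype=float),
    "bob":   np.array([1, 1, 1, 0], dtype=float),
    "carol": np.array([0, 0, 0, 1], dtype=float),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

similarity = cosine(clicks["alice"], clicks["bob"])
boosted = [p for p, a, b in zip(pages, clicks["alice"], clicks["bob"]) if b and not a]
print(f"similarity(alice, bob) = {similarity:.2f}; boost {boosted} for alice")
```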

In fact, one could utilize advanced methods to determine the occurrence of non-keyword, statistically improbable words or phrases in the highest-ranking search results for a given set of keywords, and then use these as "phantom keywords" to improve the quality of the delivered results further... results that keep getting better with time, since user clicks serve as a learning input to the users' filter bubbles with greater use.
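
This is essentially what the information retrieval literature calls pseudo-relevance feedback; a minimal sketch, with invented documents, a hand-picked stopword list, and an arbitrary cutoff of three expansion terms, might look like this.

```python
# Pull prominent non-query terms out of the top results and fold them back
# into the query as "phantom keywords".
from collections import Counter

query = ["jaguar", "speed"]
top_results = [
    "the jaguar is a big cat found in the americas capable of great speed",
    "as a cat the jaguar relies on short bursts of speed when hunting",
]
stopwords = {"the", "is", "a", "in", "of", "as", "on", "when"}

term_counts = Counter(
    word
    for doc in top_results
    for word in doc.lower().split()
    if word not in stopwords and word not in query
)

phantom_keywords = [term for term, _ in term_counts.most_common(3)]
expanded_query = query + phantom_keywords
print(expanded_query)
# -> ['jaguar', 'speed', 'cat', ...]; the expansion biases the next round of
# retrieval toward the animal sense rather than the car
```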

It would be nice if users could share filter bubbles without having to share their logins. If this were supported, users joining a project late (say) would be able to execute the same high-quality queries as those who had been involved with that particular research area for some time. Search now becomes a social experience. There are already implementations of local search - a search engine, guruji.com, was built to cater to Indian audiences with search results more focused on local phenomena - news, locations, etc. It went bust once Google implemented similar features.

Perhaps in the not-too-distant future, there might be a marketplace for search filter bubbles - you buy one that gives you the best, most tailored results for your domain, or you build one for a particular domain and then sell it to others interested in the same. If these filter bubbles can transcend barriers between search engines, late entrants could slowly wear down the first mover's advantage. If not, it is likely to be winner take all (or at the very least, a LOT)... at least for a while. Playing catch-up is no fun.

Lots more areas for advancement still open.... All in all, AI has huge potential to advance the state of search.... and to enable first movers to milk a huge cash cow. After all, as search results get better targeted, so too will micro-targeted ads... generating even more revenue. It pays to be on the bleeding edge... or get run over.
