Sunday, May 20, 2012

On Robust Hyperlinks

As the Internet grows and the WWW expands with it, users are becoming more savvy; young people today have no memory of a time when it was not prevalent. Savvy users tend to be more demanding of the services they use. Also, as web servers become fronts for more critical business applications (e.g. store-fronts), there is the potential for revenue leakage when errors like a 404 Not Found message are returned in response to user requests. In addition, one often comes across dangling hyperlinks (links that point to nothing), which make the browsing experience less pleasant than it could otherwise be. These kinds of scenarios make the case for Robust Hyperlinks [1].

What are they?
Robust hyperlinks are hyperlinks or URLs designed with a few extra words relating to the underlying web-page's content within them, say separated by dashes. (There are lots of FAT URLs floating around encoding all kinds of parameters in all kinds of applications, so why frown on the use of a few words to make hyperlinks more robust?)
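To make this concrete, here is a minimal sketch of attaching such keywords to a URL. The query-parameter name `lexical-signature` and the `+`-joined encoding are assumptions for illustration; the scheme allows other encodings, such as appending the words to the path.

```python
# Sketch: turn a plain URL into a "robust" one by appending
# content-derived keywords. The parameter name "lexical-signature"
# is an assumed convention, not a standard.
from urllib.parse import urlencode

def make_robust(url, keywords):
    """Append content-derived keywords to a plain URL."""
    signature = urlencode({"lexical-signature": "+".join(keywords)})
    separator = "&" if "?" in url else "?"
    return url + separator + signature

robust = make_robust(
    "http://example.com/articles/tfidf.html",
    ["inverse", "document", "frequency", "corpus", "weighting"],
)
print(robust)
```

The page still loads exactly as before; the extra words ride along in the URL and are ignored by the server in the normal case.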

Why use them?
Sunny Day Scenarios
Now, when the client (a user sitting at a browser) requests a page by clicking on the text label of a robust hyperlink, things work as usual: the HTML document associated with the hyperlink is downloaded and rendered in the user's browser. So far so good, and it appears all that extra work was for naught.

Rainy Day Scenarios
What happens if Bob, the web-master for the site, carelessly updated the web-page and left that hyperlink dangling (e.g. by moving the underlying document to a different directory, or somehow renaming it)? The user at the browser receives an ugly "404 Not Found" or similar message from the server, and potentially has to start looking for other sources of information, such as, for example, by Googling for it.

Robust hyperlinks raise the question: is this second user step really necessary? If the client, say Google Chrome, Internet Explorer, Mozilla Firefox, Opera, or whatever other browser the user uses, were able to:
1. read the server-returned error message,
2. parse it,
3. extract the five "keywords" from the request tied to that interaction,
4. query a Web Search engine for similar documents, and
5. finally present the results of such search queries back to the user seamlessly, with a note indicating the original request was not fulfilled,
would that not make for a much nicer, more seamless user experience?
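The steps above can be sketched in a few lines of client-side logic. This assumes the keywords travel in a `lexical-signature` query parameter (an illustrative convention, not part of any standard), and hands them to an ordinary web search engine on failure:

```python
# Sketch of the browser-side fallback: on a 404, recover the
# keywords from the failed URL and build a search-engine query.
from urllib.parse import urlparse, parse_qs, quote_plus

def fallback_search_url(failed_url, status_code):
    """On a 404, turn the link's keywords into a search query URL."""
    if status_code != 404:
        return None  # page loaded normally; nothing to do
    params = parse_qs(urlparse(failed_url).query)
    signature = params.get("lexical-signature")
    if not signature:
        return None  # an ordinary, non-robust link
    keywords = signature[0].split("+")
    # Any web search engine would do here; Google is used as an example.
    return "https://www.google.com/search?q=" + quote_plus(" ".join(keywords))
```

The browser (or a browser add-on) would fetch this search URL and present the results in place of the bare error page.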

Alice, who made the original HTML document request, now has multiple links to choose from, all similar in content to the page behind the original hyperlink she clicked on.

How to get them to work?
In previous posts, we already explored the notion of TF-IDF, or term frequency / inverse document frequency, a mechanism that enables us to determine statistically improbable keywords or phrases in any particular document, given a corpus or body of documents. Suppose the keywords for a robust hyperlink were constructed from the document's top five statistically improbable phrases, ranked by TF-IDF against a corpus of general web-pages. (Yes, this takes some work on the part of the organization hosting the web-pages, but the cost is mostly amortized across the large number of hyperlinks they host, so the incremental cost per link is likely minimal.) This set of keywords is then a statistically relevant expression of what makes this web-page more unique, or less general, than most of the other web-pages out there.
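A toy sketch of that selection step follows. The corpus statistics here are invented numbers purely for illustration; a real deployment would use document frequencies measured over an actual web corpus.

```python
# Sketch: pick the five highest-TF-IDF terms for a page, given
# the page's term list and per-term document frequencies from a
# reference corpus. All corpus numbers below are made up.
import math
from collections import Counter

def top_keywords(page_terms, doc_freq, n_docs, k=5):
    """Rank the page's terms by TF-IDF and keep the top k."""
    tf = Counter(page_terms)
    def tfidf(term):
        # Terms that are rare in the corpus get a large IDF boost.
        idf = math.log(n_docs / (1 + doc_freq.get(term, 0)))
        return tf[term] * idf
    return sorted(tf, key=tfidf, reverse=True)[:k]

page = ["inverse", "document", "frequency", "inverse", "corpus",
        "the", "the", "the", "web", "page"]
df = {"the": 9_000_000, "web": 5_000_000, "page": 4_000_000,
      "document": 800_000, "frequency": 500_000,
      "inverse": 60_000, "corpus": 40_000}
print(top_keywords(page, df, n_docs=10_000_000))
```

Common words like "the" score near zero despite their high count, while corpus-rare terms float to the top, which is exactly the behavior the keyword signature needs.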

What this also means is that if the client were to reformulate a query using these five keywords as search terms in a Web Search engine, the documents returned are likely to be good matches for the content Alice wants to consume.

Why five keywords? It appears the authors who proposed the scheme in the paper listed in the references performed empirical tests, and five proved a sweet spot between URLs that are too long and cumbersome and URLs too sparse to reliably anchor the search operation that follows.

What's in it for me?
The organization hosting the original URL might argue against doing all this extra work at the cost of an additional fraction of a cent per hyperlink if this only serves to drive web-traffic away, potentially to competitor sites. A small incremental investment on the server side could be made to have the server conduct an internal search on the hosting organization's web-space to locate other documents that might serve as substitutes for the now broken link. This way, the robust hyperlink implementation serves to bolster the business case for the hosting organization's added spending as well.
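The server-side fallback described above might look something like the following sketch, where the site index and the overlap scoring are illustrative stand-ins for whatever internal search facility the hosting organization actually runs:

```python
# Sketch of the server-side option: before returning a bare 404,
# look for substitute pages within the site's own index. The index
# structure and overlap scoring here are illustrative only.
def internal_fallback(signature_keywords, site_index):
    """Rank local pages by keyword overlap; best matches first."""
    wanted = set(signature_keywords)
    def overlap(entry):
        path, words = entry
        return len(wanted & set(words))
    ranked = sorted(site_index.items(), key=overlap, reverse=True)
    # Keep only pages that share at least one keyword.
    return [path for path, words in ranked if wanted & set(words)]
```

The server could then respond with a "did you mean" page built from these local matches, keeping the visitor on its own site instead of surrendering the traffic to an external search engine.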

Issues with Trusting Trust (with apologies to Ken Thompson)
For this scheme to work, one must necessarily trust the people who build the hyperlink to encode it only with words that are truly reflective of the content. This is also true today, but since links today are not automatically parsed by browser add-ons to query for additional content, there is a lower likelihood that an innocent click on a dead link might lead the user to a page of links that are irrelevant, offensive, or plainly undesirable. This is mostly because users do not typically read the structure of the links they click on. Perhaps over time, if robust hyperlinks really take off in a big way, intermediation services that verify the keywords and their relevance to the content of the underlying delivered documents will become more prevalent. Such services might be able to safeguard users' browsing experiences.

A second issue might be one of a hijacked search. If Alice starts with a certain set of search keywords that led her to the original hyperlink, she may be led off track by the keyword searches that result from the set of hyperlinks associated with the first failed document retrieval. To ensure usability considerations like this are addressed, it might make sense for the browser based implementation to give Alice a clearer sense of when a search query was done on her behalf using robust hyperlink keywords, so she can manage her Internet searches more effectively.

The authors provide several complete examples, with nice, readable descriptions, in the paper listed in the references.

[1] Thomas A. Phelps and Robert Wilensky, "Robust Hyperlinks Cost Just Five Words Each", UC Berkeley, 2000.
