Sunday, May 20, 2012

On Robust Hyperlinks

As the Internet grows and the WWW expands with it, users are becoming more savvy - young people today have no memory of a time when it wasn't prevalent. Savvy users tend to be more demanding of the services they use. And as web servers front ever more critical business applications (e.g. store-fronts), there is potential for revenue leakage when errors like a 404 Not Found are returned in response to user requests. In addition, one often comes across dangling hyperlinks (pointers to the void) that make the browsing experience less pleasant than it could otherwise be. These scenarios make the case for Robust Hyperlinks [1].

What are they?
Robust hyperlinks are hyperlinks or URLs designed to carry a few extra words describing the underlying web-page's content, say separated by dashes. (There are plenty of fat URLs floating around encoding all kinds of parameters in all kinds of applications, so why frown on a few extra words that make hyperlinks more robust?)

Why use them?
Sunny Day Scenarios
Now, when the client (a user sitting at a browser) requests a page by clicking on the text label of a robust hyperlink, things work as usual: the HTML document associated with the hyperlink is downloaded and rendered in the user's browser. So far so good, and it appears all that extra work was for naught.

Rainy Day Scenarios
What happens if Bob, the web-master for the site, carelessly updates the web-page and leaves that hyperlink dangling (say by moving the underlying document to a different directory, or renaming it)? The user at the browser receives an ugly "404 Not Found" or similar message from the server, and potentially has to go looking for other sources of information, for example by Googling for it.

Robust hyperlinks raise the question: is this second user step really necessary? If the client - say Google Chrome, Internet Explorer, Mozilla Firefox, Opera, or whatever other browser the user prefers - were able to:
1. read the server-returned error message,
2. parse it,
3. extract the five "keywords" from the request tied to that interaction,
4. query a Web Search engine for similar documents, and
5. finally present the results of that search back to the user, seamlessly, with a note indicating the original request was not fulfilled,
would that not make for a much nicer, more seamless user experience?
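
To make the five steps concrete, here is a minimal sketch of what such a client-side fallback might look like, assuming the keywords ride along in a hypothetical "lexical-signature" query parameter, dash-separated as described above. The parameter name and the search step are illustrative placeholders, not any browser's actual API.

```python
# Minimal sketch of the client-side fallback - not the paper's implementation.
# Assumes a hypothetical dash-separated "lexical-signature" URL parameter.
import urllib.error
import urllib.parse
import urllib.request

def fetch_with_fallback(url):
    try:
        return urllib.request.urlopen(url).read()
    except urllib.error.HTTPError as err:
        if err.code != 404:
            raise
        # Steps 1-4: the request failed, so recover the keywords and search.
        return search_fallback(extract_keywords(url))

def extract_keywords(url):
    # e.g. http://example.com/page?lexical-signature=alpha-beta-gamma-delta-epsilon
    params = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)
    return [w for w in params.get("lexical-signature", [""])[0].split("-") if w]

def search_fallback(keywords):
    # Step 5 placeholder: hand the keywords to any web search engine and
    # present the results to the user, flagged as substitutes.
    raise NotImplementedError("wire up a search engine for: " + " ".join(keywords))
```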

Alice, who made the original request for the HTML document, now has multiple links to choose from, all similar in content to the original hyperlink she clicked on.

How to get them to work?
In previous posts, we explored the notion of TF-IDF (term frequency/inverse document frequency), a mechanism that lets us determine statistically improbable keywords or phrases in any particular document given a corpus, or body, of documents. Suppose the keywords for a robust hyperlink were the document's top five statistically improbable phrases, ranked by TF-IDF against a corpus of general web-pages. (Yes, this takes some work on the part of the organization hosting the web-pages, but the cost is amortized across the large number of hyperlinks they host, so the incremental cost per link is likely minimal.) This set of keywords is then a statistically relevant expression of what makes the web-page more unique, or less general, than most of the other web-pages out there.
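
As a self-contained sketch (not the authors' code), the following scores terms with a plain TF-IDF against a reference corpus and assembles the dash-separated signature. The naive tokenizer and the "lexical-signature" parameter name are illustrative choices.

```python
# Sketch: pick the five highest-TF-IDF terms and build a robust URL from them.
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def top_keywords(document, corpus, k=5):
    tf = Counter(tokenize(document))
    docs = [set(tokenize(d)) for d in corpus]
    def idf(term):
        df = sum(1 for d in docs if term in d)
        return math.log((1 + len(docs)) / (1 + df))  # smoothed IDF
    scored = {term: count * idf(term) for term, count in tf.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

def robust_url(base_url, document, corpus):
    # Append the signature, dash-separated, as suggested above.
    return base_url + "?lexical-signature=" + "-".join(top_keywords(document, corpus))
```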

What this also means is that if the client were to reformulate a query using these five keywords as search terms in a Web Search engine, the documents returned are likely to be good matches for the content Alice wants to consume.

Why five keywords? The authors who proposed the scheme in the paper listed in the references ran empirical tests, and five hit the sweet spot between URLs that are too long and cumbersome and signatures too sparse to anchor the follow-up search.

What's in it for me?
The organization hosting the original URL might balk at doing all this extra work, at a fraction of a cent per hyperlink, if it only serves to drive web-traffic away, potentially to competitor sites. But a small incremental investment on the server side would let the server conduct an internal search of the hosting organization's own web-space and locate documents that might substitute for the now-broken link. This way, a robust hyperlink implementation bolsters the business case for the hosting organization's added spending as well.
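
One way that server-side variant might look, sketched here with Flask; the internal search function and inline template are hypothetical stand-ins for whatever site search the organization already runs.

```python
# Sketch: on a 404, search the site's own web-space for substitutes
# instead of returning a bare error page. search_site() is hypothetical.
from flask import Flask, request, render_template_string

app = Flask(__name__)

def search_site(keywords):
    # Hypothetical internal search over the organization's own pages.
    return []

@app.errorhandler(404)
def suggest_substitutes(error):
    signature = request.args.get("lexical-signature", "")
    keywords = [w for w in signature.split("-") if w]
    matches = search_site(keywords) if keywords else []
    body = render_template_string(
        "<p>The page you requested has moved. Closest matches on this site:</p>"
        "<ul>{% for m in matches %}<li><a href='{{ m }}'>{{ m }}</a></li>"
        "{% endfor %}</ul>",
        matches=matches)
    return body, 404
```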

Issues with Trusting Trust (with apologies to Ken Thompson)
For this scheme to work, one must necessarily trust the people who built the hyperlink to encode it only with words truly reflective of the content. This is true today as well, but since links are not currently parsed by browser add-ons to query for additional content, there is a lower likelihood that an innocent click on a dead link leads the user to a page of links that are irrelevant, offensive, or plain undesirable - mostly because users do not typically read the structure of the links they click on. Perhaps over time, if robust hyperlinks really take off in a big way, intermediation services that verify the keywords against the content of the delivered documents will become more prevalent. Such services could safeguard users' browsing experiences.

A second issue is that of a hijacked search. If Alice started from a certain set of search keywords that led her to the original hyperlink, the keyword searches triggered by the first failed document retrieval may lead her off track. To address usability considerations like this, a browser-based implementation should give Alice a clear signal whenever a search query is issued on her behalf using robust hyperlink keywords, so she can manage her Internet searches more effectively.

Examples:
The authors of the paper in the references provide several complete examples, with nice, readable descriptions, in the report itself [1].

References:
[1] Thomas A. Phelps and Robert Wilensky, "Robust Hyperlinks Cost Just Five Words Each", UC Berkeley Technical Report CSD-00-1091. http://www.eecs.berkeley.edu/Pubs/TechRpts/2000/CSD-00-1091.pdf

Friday, May 11, 2012

Mining your Social Network

Stream of Consciousness Ramblings on Social Media...

On Being Human
No sentient being exists in isolation. Beings coexist with others of their kind, drawing on a sense of community, with the potential to work together to build things bigger than any one among them could conceive of alone. And perhaps this characteristic isn't restricted to sentient beings. Are ants sentient? Bees? Both have fairly well-developed colonies.

Some leading technology companies [citation needed] work on the notion of Hive Intelligence - reconfigurable sub-components of a system that can combine in different ways to best adapt themselves to problems in fairly different settings. Artificial intelligence has developed to the point where small adaptive robots can collaborate and configure themselves into a larger automaton to solve new and difficult problems. Adaptive machines are fascinating - the Borg were arguably the most interesting race in Star Trek: The Next Generation. But we digress... What makes societies and civilizations interesting is mostly the collected knowledge of the species - knowledge accumulated and grown over millennia - an idea nicely discussed in the H.G. Wells novel "Christina Alberta's Father".

Humans are among the most complex beings ever to exist. Environmental adaptation is almost second nature to us. Unless we are completely asocial, as some of our species sometimes are, we build networks of relationships wherever we go, and these networks evolve over time. In this post, we examine the power of social networks from the standpoint of network establishment, evolution, degradation, and regeneration, with a view to understanding this medium and why it is perhaps unlikely that a single social networking platform will continue to dominate the global conversation.

The Times, ... They Are A Changin'....
"Know yourself and your enemy. You need not then fear the result of a thousand battles."
                                                                 -- Sun Tzu, "The Art of War"

People born into the digital age grow up in a world where email, mobile phones, and texting are commonplace. This world also serves as an interesting laboratory for studying (social) network evolution. How do relationships change over time? Mining historical email data can provide valuable clues. Which bonds were stronger at what age? How does the relative strength of these bonds change? How do the individual's age, time (which generation the individual belongs to), and place (geography, and the associated cultural context) determine which bonds are "important", which media are used for which aspects of social interaction, and what, and how much, data is shared?

The Lonely Network
In social network terms, every human being is a DG (a Directed Graph from Graph Theory, not necessarily Acyclic in this case), with a node representing themselves at the center and other nodes representing their "friends". Social media connect these DGs together, "decentralizing" or democratizing them. But we can, of course, learn a lot from even a single individual's graph, and studying one is much easier anyway.

In related posts, we start our exploration of this idea. In the post titled "The Reverse Inbox", we present a simple prototype that de-constructs a sample gmail mailbox, uniquely anonymizes every contact to protect the innocent, and then presents a view of the mailbox owner's social network. An advantage of experiments of this nature is that, given the large amounts of storage now available on free Internet email services, no one ever has to delete their email, so histories going back several years can be extracted from even such a simple exercise.

We can see which links are important and in which directions, mine the network to infer the relative strengths of people's relationships, and construct a snail-trail graph showing their evolution - the strengthening and weakening of different links over time. We hypothesize that links tied to family, close friends, and co-workers grow stronger, with the last fading as one moves from one job or career to the next.
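
In the spirit of that prototype, here is a sketch that reads a local mbox export, anonymizes each contact with a one-way hash, and counts messages per contact per year - enough to watch links strengthen and fade. The header parsing is deliberately naive, and the file name is a placeholder.

```python
# Sketch: anonymized, per-year link strengths from a local mbox export.
import hashlib
import mailbox
from collections import Counter
from email.utils import parseaddr, parsedate_tz

def anonymize(address):
    # Stable pseudonym: the same contact always maps to the same 8-hex label.
    return hashlib.sha256(address.lower().encode()).hexdigest()[:8]

def link_strengths(mbox_path):
    strengths = Counter()  # (pseudonym, year) -> message count
    for message in mailbox.mbox(mbox_path):
        _, sender = parseaddr(message.get("From", ""))
        date = parsedate_tz(message.get("Date", "") or "")
        if sender and date:
            strengths[(anonymize(sender), date[0])] += 1
    return strengths

# for (contact, year), count in sorted(link_strengths("inbox.mbox").items()):
#     print(contact, year, count)  # watch each link wax and wane by year
```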

When these lonely networks are joined, however, the decentralizing process adds a new capability. Nodes that were hitherto unconnected in each individual's network now have paths between them. This is a concrete example of Metcalfe's Law in action: "The value of a network is proportional to the square of the number of nodes within it." You are node A, you want to get to node C, but don't know the way. The network tells you how to reach C via B, D, ...
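
A toy illustration of "the network tells you how to reach C": breadth-first search over a merged friend graph. The graph here is made up for the example.

```python
# Sketch: find the shortest friend-path from A to C in a merged graph.
from collections import deque

def shortest_path(graph, start, goal):
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for friend in graph.get(path[-1], ()):
            if friend not in seen:
                seen.add(friend)
                queue.append(path + [friend])
    return None  # no path: the networks were never joined

graph = {"A": ["B", "D"], "B": ["A", "C"], "D": ["A", "C"], "C": ["B", "D"]}
print(shortest_path(graph, "A", "C"))  # ['A', 'B', 'C']
```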

How Social Are Social Media?
This brings us to another idea. On Facebook and LinkedIn, all links between nodes (i.e. people) are bi-directional by default: friends are only friends if both people acknowledge the relationship, even if one does so somewhat reluctantly in some cases. Social networking seems to have misplaced the idea of "polite blocking" that was widely supported and prevalent in the Instant Messaging space. Twitter is somewhat different, with people collecting followers like a non-rolling stone gathers moss, without necessarily pro-actively accepting each new follower into the fold. Arguably, these choices lead to different social dynamics.

Even more interesting is the fact that on Facebook it is relatively easy to "unfriend" someone - break the link. Not so on LinkedIn - why, you might even need to call the company to get someone off your network. So connections on LinkedIn are perhaps worth more than those on Facebook - harder to break implies a higher standard to forge - or people will eventually migrate toward treating them that way. For Twitter, we have the notion of "Social Capital": the number of one's followers. Links are cheap for individuals, but have greater value to the people they follow, since they contribute to social capital. Perhaps there is a business model here: celebrities, products, or groups being followed could cap the number of links they accept, based on certain criteria, and monetize the power of social networking per link accepted. If nothing else, perceived scarcity increases perceived notional value. (See the other blog post on "Monetizing Internet Services".)

A Heaven for Dead Networks?
What happens when networks die? What is death exactly? Without waxing philosophical, we can learn some lessons from examples of "death" in the social space. Two examples that leap to mind here are Friendster (now seeing a revival in Asia?) and MySpace. But to die, one has to be alive first.

Networks gain a measure of life through their expanse (number of nodes) and connectedness (number of links). Smaller, strongly linked networks are more cohesive and perhaps provide better "value" to their constituent nodes than larger, more sparsely connected ones. Some combined measure of these two metrics would capture the value individuals derive from their association with the network. It also stands to reason, given the discussion in the previous section, that each link in the network DG should be weighted according to its importance. A sketch of one such combined measure follows.
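
The following is one hypothetical way to fold expanse and weighted connectedness into a single "vitality" score; the normalization and the log term are illustrative choices, not an established metric.

```python
# Sketch: a made-up vitality score combining expanse and weighted cohesion.
import math

def vitality(nodes, weighted_edges):
    # weighted_edges: iterable of (u, v, weight) triples, weight ~ importance
    expanse = len(nodes)
    if expanse == 0:
        return 0.0
    cohesion = sum(w for _, _, w in weighted_edges)
    # Reward per-node weighted connectedness, with a mild bonus for size.
    return cohesion / expanse * math.log(1 + expanse)

edges = [("A", "B", 3.0), ("B", "C", 1.0), ("A", "C", 0.5)]
print(vitality({"A", "B", "C"}, edges))  # tracks growth, atrophy over time
```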

To complete the anthropomorphic comparison: As the total metric for a network grows, it springs to life. As the rate of change of these metrics increases, the network transitions into adolescence and adulthood. As growth slows, it transitions to maturity and old age, and as nodes leave and links dangle or get deleted, the total metric decreases, and the network slowly atrophies and then dies.

Dr. Jure Leskovec at Stanford University does interesting work with social networks, including the study of the diffusion of information (e.g. viral marketing) and inferring social connectedness from the spread of disease within a population. His talks on these topics are well worth seeking out.

The Fountain of Eternal Youth
So how does a network stay younger longer? Since value builds from network effects and positive externalities, the best ways to retain users are to:
a. grow value through "sticky" network services,
b. evolve each user's view of their immediate "vicinity" in the social graph to meet their social needs, and
c. make it harder for them to leave (i.e. increase switching costs).

As an example, Facebook does (or tries to do) all three: (a) by building its network as a social platform on which other companies, like gaming firm Zynga, can build and host interactive multi-player games, by issuing Facebook Credits - its own currency - and so on; (b) by supporting various kinds of interactions (professional vs. college vs. ...); and (c) by holding on to users' data and encouraging them to share more through features like "timeline".

As more users sign on, unless they can leave taking their networks with them, it gets harder and harder for them to go elsewhere, especially since they have invested a lot of time building up their "presence" in the social graph and all the connections they have established. This ensures the network's survival for a while - at least until the majority tire of it.

Forward the Federation
By no means is the battle for supremacy in this arena over. There are features Facebook lacks. Google+, Path, and others are building these into their core DNA, rather than grafting adaptations or mutations onto an existing animal, and it will be interesting to see what the future brings.

At 900M users and counting, Facebook is unlikely to add too many new users, given that roughly 1 in every 8 people on the planet is already on it, and, to a large extent, geography dictates social network choice (other social networks are growing faster in Asia). Perhaps the next step in the evolution of social network technology is "federated social networking", but we leave that for discussion on another day.

If federated social networking becomes a reality, however, the impact of "stickiness" will be reduced, the walls between networks will become more porous, and a new dynamic will come into play.