Wednesday, July 5, 2017

more quant questions.(was ... chiprime is now on udemy)

ChiPrime Math Foundations for Career Success

The presenter there does a quick review of math formulas in video lectures, followed by a session solving related problems. (coupon code: MATHXSUCCESS)

Regardless of whether you take the course, you can always work through the free quiz (answers provided) if you are looking for challenging practice questions; it is also attached at the link below.

Good luck with the tests and interviews!

ChiPrime Math Foundations Pre-Course Quiz


Wednesday, June 7, 2017

blockchain in a nutshell

Cryptocurrencies are all the rage today, and if recent price movements are any indication, will be in the news for some years to come yet. Whether they can mount a true challenge to fiat currencies remains to be seen, though it is heartening to see the gradual uptake in countries like Japan.

In this post, we examine the technologies underlying cryptocurrencies - things like blockchain and bitcoin - by building the ideas up from the ground up. So let's get to it...

There are ten simple ideas we need to understand before we get into any of these; they will serve as building blocks for what is to come. These ideas are:

  1. hashes or cryptographic message digests - these are one-way functions like SHA that, given an input string of any length, generate a fixed-length output (for example, 160 bits for SHA-1 or 256 bits for SHA-256). The output string may be regarded as a digital fingerprint of the document. Changing even one bit of the input document will, on average, flip about half the bits of the output, and it is computationally infeasible for anyone to find another document that has the same hash as a given document (for a 160-bit hash, they would on average have to look through about 2^159 documents before finding a match), though it is trivial for anyone to compute the hash of a document they are given (hence the term "one-way function"; a "trapdoor function", strictly speaking, is a related object that can be inverted easily only with a secret).
  2. public key cryptography - symmetric key encryption is widely understood. A and B have a shared secret key S. A encrypts plaintext into ciphertext using S. B uses S to decrypt whatever ciphertext A sends her. But asymmetric or public key encryption is much more interesting. This uses ideas from the difficulty of factoring the products of large prime numbers, or from elliptic curve functions (more efficient to implement on smaller devices, and more resistant to attack), to support the notion of a key-pair in encryption. Both A and B have their own pair of keys called their public and private keys respectively - they both publish their public keys, and guard their private keys as secrets, with their lives. Anything that is encrypted with the public key can only be decrypted with the corresponding private key (only from the same key pair) and vice-versa. Now if A wants to send B a secret message, A can optionally sign the message with his private key, then encrypt the signed message with B's public key, and send it to B. Then only B can read the message (needs B's private key to decrypt it), and B can check A's signature on the message - anyone with A's public key can check that A applied his private key to the message, so B can know with confidence that A generated that message, and that only B can read it.
  3. Digital wallet - every user in the system is identified by a wallet ID which is a unique, long string of alphanumeric characters. All transactions are carried out using this ID. This ID is called the user's digital wallet. A digital wallet holds digital cash.
  4. Pseudonymity - Only the user knows that the digital wallet ID links to her. Network nodes only know the value this ID carries. Thus transactions are not completely anonymous as with cash. Here the network knows how much each ID carries. But the network does not know who owns the ID. If Alice ever loses her wallet ID, she loses all her money - it cannot be recovered.
  5. Transaction idempotence - imagine you are talking to an ATM over the network. You want to send John $10. You tell the ATM "Debit self, credit John, $10". Let's say after you send the message, but before you receive confirmation, your network link dies. Many possibilities arise - a. your message was never received by the ATM, b. your message was received by the ATM, was processed successfully, and you never got the confirmation, c. your message was received by the ATM, there was a problem, and the ATM's reply never reached you. If you re-send the message, your account may see a net debit of $20, which is not what you want. But if you don't resend the message, John will be unhappy if (a) or (c) happened. So what to do? Let's say your account has $100 at the start. If you send the ATM a message saying "<timestamp>,my account, $100, $90, John's account, $10", even if you resend this to the network 1000 times, the message leaves your account in a predictable state. This is called transaction idempotence.
  6. No double spending - the number of bitcoins available is capped at 21 million. But how to ensure Alice cannot copy her money (since it is digital after all), and spend the same digital cash with Bob and then with Charlie? cryptocurrencies do this by using the idea of blockchain. Every transaction is recorded in blockchain. So somewhere in blockchain there is a record of Alice getting (say) $10 when she opened her account. Thereafter, all transactions into her account also show up in blockchain. So tracing her wallet ID backwards through processed blocks gives network nodes a clear view of how much money Alice's wallet ID has at the current time - even though they might not be able to associate that wallet ID with Alice's real world identity given pseudonymity.
  7. Merkle trees - if you have (say) 32 hashes, you can arrange them into a tree structure - hash them two at a time to start with (giving 16 hashes from the original 32), then hash those 16 two at a time to get 8 hashes, then 4 from the 8, 2 from the 4, and finally 1 from the 2. In other words, you can take 32 independent items, hash them, then combine the hashes pairwise up the tree to get a single fixed-length root hash which changes if even one bit of any of the original documents changes. This construct is called a Merkle tree (a small code sketch of hashing and Merkle roots follows after this list).
  8. Proof of work - we said earlier that it is effectively impossible to find a document that hashes to a chosen value. What if a large number of nodes in a network all worked on a related, deliberately easier problem? Blockchain relies on exactly this idea. All nodes are given some data (how this data comes about is covered later) and asked to find a secret value of fixed length (a nonce) which, when concatenated with the given data and hashed, produces a hash with a pre-specified number of leading zeros. Since one-way functions are easy to compute forward but hard to invert, the only known strategy is trial and error; yet once any node stumbles upon a solution and shares it with the other nodes, they can verify its correctness with a single hash computation. In publishing a solution, the node provides a "proof of work" - evidence that it expended a large amount of computation - and because there is no shortcut, no node is at an advantage when the problem is posed to the network.
  9. Consensus algorithm - in a network with 2N nodes, when there are conflicting opinions or pieces of data, what N+1 or more nodes believe to be true, goes. Simple majority wins. Always.
  10. Trustless entities - every element connected to the network believes only in maximizing their own gains and trusts no one else.... at all.
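As a concrete illustration of ideas 1 and 7, here is a minimal Python sketch (using only the standard-library hashlib module) showing the avalanche effect of a hash and computing a Merkle root over a handful of made-up transaction strings - the data and function names are purely illustrative.

import hashlib

def sha256_hex(data):
    """Digital fingerprint: fixed-length digest of arbitrary-length input."""
    return hashlib.sha256(data).hexdigest()

# Idea 1: changing one character of the input changes roughly half the output bits.
print(sha256_hex(b"pay Bob 10"))
print(sha256_hex(b"pay Bob 11"))   # tiny change, completely different digest

# Idea 7: a Merkle tree combines many hashes into one fixed-length fingerprint.
def merkle_root(items):
    level = [sha256_hex(x) for x in items]
    if not level:
        return sha256_hex(b"")
    while len(level) > 1:
        if len(level) % 2 == 1:            # duplicate the last hash if the count is odd
            level.append(level[-1])
        level = [sha256_hex((a + b).encode())
                 for a, b in zip(level[::2], level[1::2])]
    return level[0]

txs = [b"alice->coffee:1.50", b"bob->alice:3.00", b"carol->dave:0.25"]
print(merkle_root(txs))   # changes completely if any single transaction changes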


With the above background, we can now look at how cryptocurrencies, and blockchain technology more broadly, work.

Transactions are organized into blocks, each of which has a long alphanumeric string as an ID. Each block also has a number that every node in the network knows. Every (say) ten minutes or so, nodes in the network (we will see later which nodes) gather up all as-yet-unprocessed transactions, verify that they are legitimate (e.g. signed by the spending party authorizing payment, and not double-spending cash), and then build a Merkle tree over them. They then concatenate the previous block's ID with the Merkle root of the collected transactions and a secret they generate, compute the hash of these things together, and publish it to everyone on the network.

Let's say Alice, who has $10 in bitcoin, buys some coffee with it for $1.50. Alice's transaction then shows up in the network with a timestamp, her wallet ID (a pseudonymous identifier for her wallet), the fact that she pays $1.50 to the coffee vendor's wallet ID, and that she pays $8.45 back to herself (the $0.05 difference is the fee she pays the blockchain network for processing her transaction). This transaction is idempotent: even if it were reissued to the network many times, it references the full balance of Alice's wallet at the given timestamp, so it can only be applied once.

Nodes can verify that Alice signed the transaction by checking the signature with Alice's public key, which is widely known since it is published. Nodes can verify that Alice has $10 in bitcoin to spend by looking back through the old accepted blocks in the blockchain until they find enough evidence that Alice (rather, her pseudonymous wallet ID) has $10 to her name.

Once verification is complete, her transaction is bundled with the other transactions that have happened since the last block was published to the blockchain. A Merkle root is computed over this bundle, concatenated with the block ID of the last block in the chain and a "salt" (a secret the creator of the current block generates), and the hash of the whole thing is computed and sent through the network along with the block of transactions used in this computation.

All nodes in the network now scramble to find a "proof of work" - a secret that, combined with the block's contents, produces a hash with a required number of leading zeros. This required difficulty is recalibrated periodically by the network, so that as more nodes join, the technology matures, and competition to mine the search space intensifies, the expected time to solve the proof of work stays roughly constant. The first node (X, say) to find a secret that meets the leading-zeros criterion publishes it to all other nodes in the network. (A toy mining loop in code appears just below.)
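Here is a toy sketch of that mining race in Python with hashlib; the difficulty (number of leading zero hex digits), the field values, and the block format are made-up illustrations, not Bitcoin's actual block structure.

import hashlib
from itertools import count

def sha256_hex(data):
    return hashlib.sha256(data.encode()).hexdigest()

def mine(prev_block_id, merkle_root, difficulty=4):
    """Search for a nonce so the block hash starts with `difficulty` zero hex digits."""
    header = prev_block_id + merkle_root
    for nonce in count():                     # brute force: there is no shortcut
        block_hash = sha256_hex(header + str(nonce))
        if block_hash.startswith("0" * difficulty):
            return nonce, block_hash          # the nonce is the proof of work

prev_id, root = "e3b0c44298fc1c14", "9f2e77aa01bc5d42"   # made-up hex strings
nonce, block_hash = mine(prev_id, root)

# Any other node can verify the claimed solution with a single hash computation.
assert sha256_hex(prev_id + root + str(nonce)) == block_hash
print(nonce, block_hash)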

Every node that hears of this first checks the hash to ensure the proof of work is satisfied. The node that solved the problem earns the block reward (12.5 BTC at the time of writing; it started at 50 BTC and halves roughly every four years), as well as, in this simplified picture, the right to build the next block in the chain, for its trouble. All nodes that hear this solution first add the disseminated new block to their view of the chain. X then gathers up the as-yet-unprocessed transactions, constructs a new block with a new proof-of-work challenge, and sends it out to the network. And the race starts again - every node competes to solve the challenge to further its own interests and make more money. The process of trying to solve the challenge is called mining. Bitcoin miners solve the proof-of-work challenge to collect the block reward for each block they solve, and to collect the fee on each transaction processed (5 cents for Alice's transaction above, for example).

If two nodes solve the challenge at nearly the same time, the solution that propagates faster through the network tends to win, because the sooner a block is seen, the sooner the nodes that see it start working on the challenge from the next block - and the longer chain is the one the network accepts as true. This yields a consensus protocol, and it also means any attacker who wants to subvert fair processing of transactions has to control more than half of the network (in practice, more than half of its total mining power) to exert any lasting influence.

This also means as the chain gets longer, any records of ownership in the blockchain get more ingrained into the system. This ownership cannot be easily disputed, unless a node can show a longer chain of blocks from that block in the past, whose hashes are all correct looking forward - a task that gets incrementally more monumental given a new block is being produced by the network of nodes every ten minutes. This nearly eliminates the possibility of cheating and fraud.

That, in a nutshell, is how blockchain technology works. Of course, there are constructs like Smart Contracts that can be layered atop it to make it even more useful, but we leave those for another day.

Wednesday, May 3, 2017

disruptive technologies

In this post we take a brief look at various disruptive technologies today. This is meant to be a brief brain-dump, as much for me as it is for you - it helps me remember all the things happening that are pushing the limits of what can be beyond what already is. Wasn't it William Gibson (author of the famed Neuromancer) who said "The future is already here - it's just not evenly distributed"? So here we go, looking into the future as it stands today.

  1. Drone Technology - sounds cool when Amazon talks about using drones for faster deliveries. And yes, we read about the use of drones in warfare. Drones are also used for working in hostile environments, remote photography, etc. As with all technology, this can be used for bad as well as good. As these get stealthier, are fitted with night-vision cameras, and can fly whisper-quiet, things start getting scary. And then of course, at the limits of miniaturization, coupled with some artificial intelligence, Crichton ideas from books like "Prey" seem quite plausible. Already people are talking about how a multitude of miniaturized drones working together kamikaze-style can take down big targets.
  2. 3D Printing - the first wave of 3D printing came and went without too much "in your face" impact. But wait, it is quietly making a resurgence. Companies like Nike are printing the soles of some of their shoes because this process is cost-effective, gives the right kind of texture, and is quick compared to alternatives. Earlier, feedback I heard from users was that this technology lacked finesse in finer details, so it could not be used as effectively in some use cases. However, that seems to be changing with remarkable hardware improvements being made.
  3. Self Driving Cars - I sometimes wonder if people will even remember fifty years from now that there was a time when human beings used to drive their cars to different places. This is a massive disruptor of the entire transportation ecosystem. 
  4. FinTech/BlockChain/Smart Contracts - this has huge potential to disrupt the financial services sector - things like smart contracts and roboadvisors should keep investment banks awake at night - if they aren't already. Central clearing houses will be replaced by a more democratized model of distributed databases. Fraud will likely be more easily exposed. Eric Raymond (author of the famous paper "The Cathedral and the Bazaar"), says "Given enough eyeballs, all bugs are shallow". This holds also for patterns hidden in data, when data (even anonymized) is made public. If the needle swings further in that direction, HFT might come under attack as well. And the impact of smart contracts to disrupt instruments like CDS etc? This could lead to a sea change in how business gets done!
  5. IoT - so we slowly ramped up the devices that could generate data, added more sensors to systems we deploy, and now we have the technology to collect large amounts of data and the algorithms to process it in near real time. What next? Communicating devices or "things"? Things controllable remotely from the Internet? The rise of hacking as a more powerful threat? Cyber-warfare?
  6. Machine Learning+Big Data+AI - with applications in everything from Internet search to self-driving vehicles (see above), machine learning and artificial intelligence threaten to pervade and disrupt every sphere of our daily lives. This need not be a bad thing - if things get better, faster, more robust, and offer a more personalized, more interactive experience factoring in user tastes (so as to not appear too creepy), what's not to like? We are now able to collect more data from more "things" (IoT?) and then do more "smart" things with it.
Impacts: There are several, but the one we focus on here relates to the "Beveridge curve" from economics, which maps the relationship between the unemployment rate and the job vacancy rate. Why is this curve important? Because an outward shift in the curve - unemployment staying high even while vacancies rise - signals a mismatch between the skills workers have and the skills jobs demand. Entire fields of study, entire professions, will become obsolete as some of these technologies proliferate. Unless people who are displaced by the machines learn new skills quickly, they will find themselves among the perennially unemployed. This makes a strong case for two things. First, children growing up today should pick up two or three different skills/professions through which they can make a living, since jobs are getting more specialized and becoming obsolete at a faster rate. Second, one needs to keep learning at the leading edge of technology, because this tide will keep moving ever forward, and slipping off means falling by the wayside, never to catch up again. It makes for a scary situation for some, or a very exciting landscape for others - depending on your point of view. The most prized skill going forward will be the ability to learn quickly, so you remain relevant no matter what happens.

Tuesday, April 25, 2017

Simple Financial Modeling examples

Someone recently asked me what financial modeling entailed. This post collects problems that involve some simple model building to answer questions one may come across in day-to-day life, so people can better relate to what model building in finance might involve. I will add to this post over time, as more interesting problems occur to me.


  1. You want to buy a car, say a new Toyota Sienna priced at $33,500. The dealer gives you two options - you can purchase the car all-cash with $1,500 off, or at the stated price with $10,000 down and 4% financing over 10 years (an amortized loan for the remainder, just like a mortgage). You manage to negotiate the price down to $31,500 and $32,000 for the two options respectively. Which option should you pick if you have $31,500 to spend at the moment and know you can, on average, generate 8% annual returns on your capital in the stock market? (A sketch of one way to model this appears after this list.)
  2. You are considering getting an MBA. Going full-time involves losing one or two years of pay (depending on the length of the program), and top programs cost a lot of money (a typical M7 MBA costs around ~$150,000). Of course, a high quality degree gives you qualifications that open doors to various industries and jobs with a potentially much higher pay. And then there is the part-time MBA option where you can keep your paycheck coming, but where the course takes longer to complete. How do you decide whether or not to get the degree, the comparative costs vs benefits of an MBA vs. for example a CFA if you are looking at finance careers, and then, which kind of MBA to get if you decide to go that way?
  3. You want to invest in a stock. You have to determine whether the stock is trading cheap or rich to its valuation, and at what point to buy (since buying things rich means you will likely see lower returns on your investment). You also want to ensure your money is best spent invested in this stock vs in other potential investments also open to you. And you want to ensure you are OK with the volatility this stock's price action is likely to see so you will not be a panic seller.
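Here is a minimal sketch of how one might model problem 1, assuming monthly compounding for both the loan and the 8% opportunity cost; the dollar figures come from the problem statement, while the function names and discounting convention are just one reasonable modeling choice.

def monthly_payment(principal, annual_rate, years):
    """Standard amortized-loan payment: P * r(1+r)^n / ((1+r)^n - 1)."""
    r, n = annual_rate / 12, years * 12
    return principal * r * (1 + r) ** n / ((1 + r) ** n - 1)

def present_value(payment, annual_discount_rate, years):
    """Present value of a level monthly payment stream at the opportunity cost."""
    r, n = annual_discount_rate / 12, years * 12
    return payment * (1 - (1 + r) ** -n) / r

cash_price = 31_500                                 # option A: pay everything today
down, financed, loan_rate = 10_000, 22_000, 0.04    # option B: $32,000 total price
opportunity_cost = 0.08                             # what your capital earns elsewhere

pmt = monthly_payment(financed, loan_rate, 10)
cost_b = down + present_value(pmt, opportunity_cost, 10)

print(f"monthly payment:       {pmt:9.2f}")
print(f"option A cost (today): {cash_price:9.2f}")
print(f"option B cost (PV):    {cost_b:9.2f}")      # pick the option with the lower present value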




Wednesday, April 19, 2017

People Analytics [On-going Sentiment Analysis within Corporations]

Reflections on Simple Sentiment Analysis within Corporations

Sentiment analysis is often used to mine for people’s feelings or sentiments as they relate to particular topics, ideas, or concepts. This information is typically gathered from their online posts and other behavior, but can also be imputed from users’ “filter-bubble” – the articles they are more predisposed to click on – since many people click on links that support or confirm their own deeply held views or convictions (confirmation bias). There are many sources from which such sentiment can be mined, including the blogosphere, online news, email, instant messages, Twitter… the list goes on. In this paper, we focus on sentiment analysis of people’s email, and the potential applications this might have in improving business.

Techniques and methodology – supervised learning
Supervised learning is a machine learning technique in which a training data set is provided: each example carries values for a set of independent variables along with the associated value of the dependent (to-be-forecasted) variable, and this set is used to train a forecasting model. The model is then tuned with a validation set and used to generate predictions (predictive analytics) for “out-of-sample” data from the test set. A good model will produce accurate forecasts or predictions for test-set data – the better the predictive model, the better the classification rate, and the smaller the number of false positives and false negatives.

In its most basic form, supervised learning can be used for sentiment analysis in one of two ways – a. to classify test-set documents (news stories, reviews, email, etc.) as positive or negative, or b. to elicit “feelings” along a graded scale that indicates how the user feels about the topics in question (this is along the lines of the “Goldstein Scale for WEIS Data” [G], but tailored to our domain of interest, since the Goldstein scale targets a specific problem domain – news of conflict in international events). These feelings, or the sentiment aggregated across the user population, can then be used to make decisions – buy or sell decisions for stocks, talent management decisions at corporations, etc. – given the particulars of the problem domain in question.

Why do we need a Supervised Training Set for this exercise?
A supervised training set indicates clearly what text is associated with positive or negative sentiment. From data elements in the supervised set, one can mine words or phrases that are associated with positive or with negative sentiment as outlined below. Once mined, these phrases can then be used to determine the sentiment of new text, posts or email as appropriate, within the context of the model used to make predictions.

Obtaining the data set
So, where can we get data from, and how might we use it? The website Yelp is an online forum that permits users to review businesses they interact with. Over time, Yelp has collected a large data set of reviews, and it periodically hosts machine learning competitions [YC] to see what insights can be gleaned from the review data available. In a Kaggle-like fashion, these data-sets are provided free of cost to data science professionals, mostly students, for analysis. Each review in the provided data set comes with a date-stamp of when the review was written, the identity of the business being reviewed, the text of the review itself, the particulars of the reviewer, and a star rating for how much (or whether) the user posting the review approved of the business. There may be other fields; we find these to be the most relevant for our current purposes.

But this data is for another domain… will it still be useful?
Recall we are simply looking for data (preferably supervised data) that can somehow tie particular phrases to particular sentiments. If we are able to extract statistically improbable phrases from each review, and then tie these probabilistically to the model we are building for email, instant messages, or other such data extracted from a corporate context, we will be able to draw interesting inferences in corporations just as easily. We explore potential techniques for such machine learning analysis in the sections that follow.

Essentially what we are doing here is extracting learning(s) from one data set, then transferring it to another data set similar in type but different in content (transfer learning across models and data-sets). [TL]

Work email may be more formal than a restaurant review, while work IM may be less formal and even use short-forms and jargon people outside the business context might not understand. The first problem may be addressed by using a thesaurus to provide synonyms for words for the various N-grams constructed for the training sample, so we end up with multiple new N-grams with equivalent meanings derived from the ones mined from the original Yelp data-set. The second might be addressed in one of two ways:
a.    gathering more jargon or short-cut rich texts and manually assigning them a sentiment rating to seed the supervised learning population, and
b.    building a dictionary from manual analysis of IM short-cuts to programmatically “repair” the texts into a form suitable for automated analysis.

Of course, the N-gram generation process only proceeds after we remove commonly used words (“stop words”), identified through a simple statistical analysis of every review in our data-set, and apply industry-standard methods such as stemming and lemmatization [SL].
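A minimal preprocessing sketch along these lines, using NLTK for stop-word removal, stemming, and N-gram generation; the sample review text is invented, and lemmatization is omitted for brevity.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.util import ngrams

# one-time downloads: nltk.download("punkt"); nltk.download("stopwords")

def preprocess(text, n=2):
    """Lowercase, tokenize, drop stop words, stem, and emit N-grams."""
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(t) for t in nltk.word_tokenize(text.lower())
              if t.isalpha() and t not in stop]
    return list(ngrams(tokens, n))

print(preprocess("The service was wonderful and the pasta tasted amazing!", n=2))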

Method 1: N-gram training, thesaurus synonyms in n-grams [NG]
Individual words, as well as pairs, triplets, and quads of consecutive words (called N-grams, where N denotes the number of words in each “gram”), are gathered for all the review text we have, and then analyzed to generate a probabilistic map between their presence in a review and the associated review rating. This process assumes that individual N-grams contribute independently to review sentiment. Once this process is complete, we end up with something similar to the Goldstein data referenced above as the output of a Naive Bayes classifier. The granularity of the output is tied to the number of stars in each review rather than a continuous numeric score, though we can of course treat the star count in each review as a proxy for a numeric score.
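A sketch of Method 1 with scikit-learn, where CountVectorizer builds the N-gram features and MultinomialNB plays the role of the Naive Bayes classifier; the toy (review text, star rating) pairs stand in for the Yelp data and are purely illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy stand-ins for (review text, star rating) pairs from the review data set
texts = ["loved the food, amazing staff", "terrible service and cold food",
         "great value, friendly staff", "awful experience, very slow"]
stars = [5, 1, 5, 1]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3), stop_words="english"),  # words, pairs, triplets
    MultinomialNB(),                      # assumes N-grams contribute independently
)
model.fit(texts, stars)

print(model.predict(["the staff was friendly and the food amazing"]))  # predicted star bucket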



Method 2: TF-IDF-based clustering [TFI]
The frequency of individual N-grams in each review (called term frequency or TF) and the relative inverse frequency of each N-gram across all the review text we have (the totality of review text is called the corpus, and the relative inverse frequency is called the inverse document frequency or IDF) can be used to locate statistically improbable phrases (or SIPs) that characterize particular reviews. If reviews containing given SIPs correlate strongly with particular sentiment buckets, this aids the classification process going forward.
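A sketch of Method 2 using scikit-learn’s TfidfVectorizer, with the highest-weighted N-grams in each review serving as a rough proxy for SIPs; the toy reviews are invented.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["the truffle pasta was sublime", "rude host and a long wait",
           "sublime desserts, friendly host"]

vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vec.fit_transform(reviews)            # rows: reviews, columns: N-grams
terms = np.array(vec.get_feature_names_out())

for i in range(tfidf.shape[0]):
    row = tfidf[i].toarray().ravel()
    top = terms[row.argsort()[::-1][:3]]      # highest TF-IDF N-grams ~ SIP candidates
    print(f"review {i}: {list(top)}")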

Method 3: K-Means Clustering [KMC]
Consider each review represented as a vector of N-gram features (one dimension per N-gram in the vocabulary). If K-means clustering is performed on this data with K equal to the number of “star” classes, then the N-grams most strongly associated with each cluster are indicators of sentiment for that bucket of the input data. This analysis can then be used to tie the results to a sentiment scale. Synonyms can then be generated using a thesaurus, and the knowledge exported to the analysis of work-email text.
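A sketch of Method 3: K-means over TF-IDF vectors of the reviews, with K set to the number of star buckets in play; the terms closest to each cluster centre hint at the sentiment of that bucket. As before, the reviews and the value of K are illustrative.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

reviews = ["loved it, amazing food", "awful, rude and slow",
           "great food, great value", "never coming back, terrible"]

vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(reviews)

k = 2                                          # e.g. number of star buckets in play
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
terms = np.array(vec.get_feature_names_out())

for c in range(k):
    top = terms[km.cluster_centers_[c].argsort()[::-1][:3]]
    print(f"cluster {c}: {list(top)}")         # candidate sentiment indicators per bucket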

Interesting… but is this analysis complete?
If we were analyzing Yelp data for deriving insights within the Yelp context, the above analysis is incomplete because we have so far looked only at the reviews, not at the reviewers. Reviewer A may be a hard person to please, with reviews always averaging 1-3 stars. Reviewer B may gush about all businesses he visits, giving out a large number of 5 star ratings. To perform a complete and credible analysis, we need to be able to normalize reviews from different reviewers prior to performing classifications.

Secondly, people may appear as fake reviewers for only their own, and their competitors’ businesses (obviously talking up their own business while slamming the competition). Given enough reviewer data, we can determine whether reviewers are credible from the set of all reviews they each have posted, and use that as a basis for either including, or excluding, the reviews they have posted to the Yelp website, as training, validation, and test sets are built.

Please note, however, that we are interested here not in analyzing data within the Yelp context, but in transferring what we learn about sentiment indicators to a different context (work email) for further analysis. Thus the above two problems do not affect the quality of our model construction for analyzing work email/IM text.

Applications in People Analytics [PA]:
Sentiment spreads as a “ripple” in space-time - where the “pond” is the corporate context. Email and IM text may be the medium through which sentiment propagates, but the network of people (different from the top-down enforced organization chart) forms the structure which “conducts” sentiment.

Email might express frustration with processes in place, people in management, work load, among other things. Similarly, some written communication might express joy at being able to make a positive contribution to a project, happiness with team leadership, positive feelings at being able to learn and apply new skills, with internal mobility within the company etc. Knowing these things can help the HR department of a company perform their roles more effectively.

Does rampant absenteeism correlate with particular managers? Are people quitting from particular teams more frequently? Are promotions so few and far between and bonuses so low people are being forced to look elsewhere for more viable careers? Is a department so overloaded that staffing them up might reduce attrition despite the announced firm-wide hiring freeze? Are particular employees (particularly in the financial services industry) engaging in nefarious behavior either within the firm or across firms with outside collaborators?

These and more questions can be answered from analyzing email and related data such as Instant Messaging and Internet Chat (... and in some cases, text renderings of phone conversations especially where regulations require that these be recorded e.g. trader lines in financial services). We explore a few scenarios of interest in a little more detail in what follows:

Sentiment-Announcement correlation
Senior management makes an announcement reshuffling the organization or otherwise changing the organizational design. People of course talk about these things. Analyzing email sentiment on an ongoing basis gives us a means to correlate any changes in sentiment with the announcements that are made, and to gauge their impact on morale. Certain geographies may be more impacted by lay-off announcements than others - this will show when we analyze sentiment by geography. Some announcements might cause immediate negative sentiment that dissipates quickly, while others might cause negativity that persists over time. All this can be factored into not only what is announced, but also how it is announced (e.g. by a CEO in a town-hall, to appear more humane, vs. via email), and what words are used to convey the information.

Ongoing Employee Engagement Sentiment
How do employees feel about the company? Today, most firms have an annual survey where every employee fills out a questionnaire regarding their work environment. However, perhaps a better measure of engagement can be obtained through periodic email analysis - this will tell us what people are thinking on an ongoing basis, not just once at the end of the year. Besides, while people might think one way and say something else during a survey, sentiment in ongoing communications like email is likely to be much more honest and a more relevant indicator of how management is performing.

Sentiment for Incentive Design
What makes each employee tick? Some like being praised for their work. Others want to get paid more. Some others might want a better title. Yet others might want more of a certain kind of work and less of something else. Not all employees have engaged managers they can speak with about these things. All of this can be mined from email and chat data, and HR can use this to structure incentives to promote retention and reduce attrition of strong performers.



Sentiment for Right-Sizing Organizations
“Right-Sizing” is a convenient euphemism senior management uses for laying people off. While these decisions are painful, the choice of which people to let go is often made poorly. A recent article (need reference) explored how people viewed by senior managers as being in the bottom 10% of performers are sometimes the glue that holds a group together. They spend their time ensuring all their team-mates are successful, so letting those people go actually hurts the firm’s performance more than the savings on their salary helps the bottom-line. Mining email and text within the context of the firm’s social network (informal connections superimposed on the org-charts) can help organizations make these painful decisions more appropriately.

Shades of 1984? Big Brother Watching?
Of course, privacy advocates will not like any of the above, and there is some validity to contrary viewpoints on the use of work email for the purposes stated above. However, most employers today, particularly those in areas like Financial Services, have very clear policies in place that indicate that “no employee can have any reasonable expectation of privacy for any communication carried out using office equipment or their office email address”… and that they consent for this information to be logged, and records kept for a period of seven years from the date of communication. The retention policy may vary from industry to industry and firm to firm, but most firms today have a data retention policy for legal reasons, so it is not altogether unreasonable that applications such as the above will become practical in time to come…

Besides, Google auto-reads Gmail content to pick ads to display, so it is not entirely far-fetched for work email to be read to improve organizational policy, especially when such determinations are made using data in the aggregate where individual users are not identified (the same thing is done in surveys today).

A valid argument against doing what we discuss in this paper can be made if one asks whether people will still honestly communicate their sentiments in email if they know they are being mined. After all, to give a concrete example, did Goldman employees not take to writing “LDL” (Let’s Discuss Live) when the “Fabulous Fab” and Rajat Gupta scandals broke, when they knew their electronic communications were being tapped by law enforcement agencies? True, but the very fact that these abbreviations show up in written records tells us something about sentiment.

References:
[G] Goldstein Scale for WEIS Data: http://web.pdx.edu/~kinsella/jgscale.html

Friday, March 3, 2017

probability: of balls and urns

This is an exercise (posed without a solution) from "Probability Theory: A Concise Course" by Y. A. Rozanov. Very instructive in how to perform probability calculations.

There are N urns numbered 1, 2, ..., N in a row, each containing r red and b blue balls. A ball is drawn at random from the first urn and transferred to the second, then a ball is sampled from the second and transferred to the third, and so on, until you finally sample a ball from the Nth urn. The question is, what is the probability of drawing a red ball from the Nth urn?

I've heard people give all kinds of answers, but here's the correct way to solve the problem.

Start from urn #1.
What is the probability of sampling a red ball? r/(r+b).
A blue ball? b/(r+b).

So either a red ball or a blue ball is sampled and put into urn #2.
So urn #2 might have an extra red ball (in r/(r+b) cases), or an extra blue ball (in b/(r+b) cases).
So while sampling from urn #2, in r/(r+b) cases, you are sampling from r+1 red balls and b blue balls, and in b/(r+b) cases, you are sampling from r red balls and (b+1) blue balls.
So the probability of sampling a red ball from urn #2 becomes:
                 r/(r+b) * (r+1)/(r+b+1) + b/(r+b) * r/(r+b+1)
Simplifying, we get p(drawing a red ball from urn #2) = [r(r+1) + br] / ((r+b)(r+b+1)) = r(r+b+1) / ((r+b)(r+b+1)) = r/(r+b)
Wow! So regardless of the complexity of the experimental setup, the probability of drawing a red ball from urn #2 remains r/(r+b) JUST AS IT WAS FOR THE DRAW FROM URN #1!

Now, we can simply re-run the experiment thinking of urn #2 as urn #1, and urn #3 as urn #2 as above, and just keep repeating it till we reach urn #N. The probability of drawing a red ball from urn #N is therefore also r/(r+b). Interesting question.
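A quick Monte Carlo check of this result in Python (a sketch; the values of r, b, N, and the trial count are arbitrary choices):

import random

def draw_red_from_last_urn(r=3, b=5, n_urns=10):
    """Run the transfer experiment once; return True if the final draw is red."""
    urns = [["R"] * r + ["B"] * b for _ in range(n_urns)]
    for i in range(n_urns - 1):
        ball = random.choice(urns[i])
        urns[i].remove(ball)           # the sampled ball leaves urn i ...
        urns[i + 1].append(ball)       # ... and is transferred into urn i+1
    return random.choice(urns[-1]) == "R"

trials = 100_000
hits = sum(draw_red_from_last_urn() for _ in range(trials))
print(hits / trials, "vs theoretical", 3 / (3 + 5))   # both should be close to 0.375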

probability: coin flipping problems

Interesting question someone posed to me recently. These appear to be favorites at interviews.

You and I each have 3 coins we toss. If you get the same number of heads as I do, I pay you $2. If we get a different number of heads, you pay me $1. Would you take the bet?

An incorrect way of solving this problem is to deduce that there are only 4 of 16 possible ways where we get the same number of heads (0H-0H, 1H-1H, 2H-2H, and 3H-3H), so the pay-off is 4/16*2 - 12/16*1 = -4/16. Which is negative so you should not take the bet.

The above reasoning is incorrect because the likelihood of each of the 16 outcomes is different, so they should not be equally weighted in the probability calculation. A more correct way of thinking about this is as below:

Consider 3 coins as 3 bits, where a zero denotes tails and a 1 denotes heads.

Then (binary numbers 0 through 7)
000 - zero heads
001 - 1 head
010 - 1 head
011 - 2 heads
100 - 1 head
101 - 2 heads
110 - 2 heads
111 - 3 heads

From this we get:
p(0 heads)=1/8
p(1 head)=3/8
p(2 heads)=3/8
p(3 heads)=1/8
Total probability = (1+3+3+1)/8=1 (check)

Now we imagine a grid - the x axis is my number of heads, the y axis is your number of heads. In each grid slot on the diagonal (same number of heads for both of us), we fill in the probability.

p(0H,0H)=1/8*1/8=1/64
p(1H,1H)=3/8*3/8=9/64
p(2H,2H)=3/8*3/8=9/64
p(3H,3H)=1/8*1/8=1/64
Total probability of equal heads = 20/64
=> Total probability of un-equal heads = 44/64.

Since 20*2<44, again, the expected value of the offered bet is negative, and so no, you should not play.
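A brute-force check of this expected value in Python (a sketch; it enumerates all 64 equally likely outcomes of the two 3-coin tosses):

from itertools import product

payoff = 0
for mine in product([0, 1], repeat=3):           # my three coins (0 = tails, 1 = heads)
    for yours in product([0, 1], repeat=3):      # your three coins
        if sum(mine) == sum(yours):
            payoff += 2                          # you win $2 on a tie in head counts
        else:
            payoff -= 1                          # otherwise you pay me $1
print(payoff / 64)                               # expected value per game: -4/64 = -0.0625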

Another problem:
I have K coins, you have K+1 coins. We both toss our coins. You win if you toss more heads than me. What is the probability you win?

Easy to solve if we think using small models and build up intuition.

Consider K=2. Tabulate the number of heads in my K coins (x axis) against the number of heads in your first K coins (y axis), keeping your extra (K+1)st coin in reserve as a tie-breaker.

             me: 0H   1H    2H
you: 0H         1/2    L     L
you: 1H          W    1/2    L
you: 2H          W     W    1/2

So here, you win for certain in 3 cases, can win with probability 1/2 in 3 cases, and lose in 3 cases. You win the diagonal outcomes with probability 1/2 because you can cast the tie-breaker toss with your K+1st coin, which will come up heads with probability 1/2.

If we extrapolate from this to general K, the grid has (K+1) x (K+1) cells: K+1 diagonal cells that you win with probability 1/2, ((K+1)^2-(K+1))/2 cells below the diagonal that you win outright, and the same number above the diagonal that you lose. But these cells are not equally likely (the same pitfall we flagged in the previous problem), so we cannot simply count them and divide by (K+1)^2. The clean way to finish is by symmetry.

Let t = p(your first K coins show exactly as many heads as my K coins). Since your first K coins and my K coins are identically distributed, p(you are ahead after the first K coins) = p(I am ahead) = (1-t)/2. You win outright when you are ahead, and you win a tie with probability 1/2 thanks to your extra coin. So:

p(you win) = (1-t)/2 + t*(1/2) = 1/2

Which gives us an answer of 1/2. So your probability of winning is exactly 1/2, for every K.
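A quick simulation to sanity-check the 1/2 answer (a sketch; K and the trial count are arbitrary):

import random

def you_win(k):
    """You toss k+1 fair coins, I toss k; you win with strictly more heads."""
    your_heads = sum(random.randint(0, 1) for _ in range(k + 1))
    my_heads = sum(random.randint(0, 1) for _ in range(k))
    return your_heads > my_heads

k, trials = 7, 200_000
wins = sum(you_win(k) for _ in range(trials))
print(wins / trials)   # should hover around 0.5 regardless of k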