Reflections on Simple Sentiment Analysis within Corporations
Sentiment analysis is often used to mine people’s feelings or sentiments as they relate to particular topics, ideas, or concepts. This information is typically gathered from their online posts and other behavior, but can also be imputed from users’ “filter bubble” (the articles they are predisposed to click on), since many people click on links that support or confirm their own deeply held views or convictions (confirmation bias). There are many sources from which such sentiment can be mined, including the blogosphere, online news, email, instant messages, Twitter, and more. In this paper, we focus on sentiment analysis of people’s email, and the potential applications this might have in improving business.
Techniques and methodology – supervised learning
Supervised learning is a machine learning technique in which a model is trained on a data set where each record pairs the values of a set of independent variables with the known value of the dependent variable (the one to be forecast). The model is then tuned with a validation set and used to generate predictions (predictive analytics) for “out-of-sample” data from the test set. A good model will produce accurate predictions for test-set data: the better the predictive model, the higher the classification rate, and the smaller the number of false positives and false negatives.
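As a rough illustration of the train/validation/test workflow and the error measures mentioned above, here is a minimal sketch in Python. The split proportions, the "pos"/"neg" labels, and all function names are assumptions made for this example, not part of any particular library or of the method described in this paper.

```python
import random

def split_dataset(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into train / validation / test lists.

    The 60/20/20 proportions are an illustrative assumption.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def confusion_counts(actual, predicted, positive="pos"):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

def classification_rate(counts):
    """Fraction of test-set items that were classified correctly."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0
```

The counts returned by confusion_counts are computed on the test set only, so they measure out-of-sample performance rather than how well the model memorized its training data.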
In its most basic form, sentiment analysis can use supervised learning in one of two ways: a. to classify test-set documents (news stories, reviews, email, etc.) as positive or negative, or b. to elicit “feelings” along a graded scale that indicates how the user feels about the topics in question (this is along the lines of the “Goldstein Scale for WEIS Data”[G], but tailored to our domain of interest, since the Goldstein scale is for a specific problem domain: news of conflict in international events). These feelings, or the sentiment aggregated across the user population, can be used to make decisions (buy or sell decisions for stocks, talent management decisions at corporations, etc.) given the particulars of the problem domain in question.
Why do we need a Supervised Training Set for this exercise?
A supervised training set indicates clearly what text is associated with positive or negative sentiment. From data elements in the supervised set, one can mine words or phrases that are associated with positive or with negative sentiment as outlined below. Once mined, these phrases can then be used to determine the sentiment of new text, posts or email as appropriate, within the context of the model used to make predictions.
Obtaining the data set
So, where can we get data from, and how might we use it? Yelp is an online forum that permits users to review businesses they interact with. Over time, it has collected a large data set of reviews. Yelp periodically hosts machine learning competitions [YC] to see what insights can be gleaned from the review data available. In a Kaggle-like fashion, these data-sets are provided free of cost to data science professionals, mostly students, for analysis. Each review in the provided data set comes with a date-stamp of when the review was written, the complete identity of the business being reviewed, the text of the review itself, the particulars of the reviewer, and a star rating indicating how much the reviewer approved of the business. There may be other fields, but we find these to be the most relevant for our current purposes.
But this data is for another domain… will it still be useful?
Recall we are simply looking for data (preferably supervised data) that can somehow tie particular phrases to particular sentiments. If we are able to extract statistically improbable phrases from each review, then tie these probabilistically to the model we are building for email, instant messages or other such data extracted from a corporate context, we will be able to draw interesting inferences within corporations just as easily. We explore potential techniques for such machine learning analysis in the sections that follow.
Essentially, what we are doing here is extracting what is learned from one data set, then transferring it to another data set that is similar in type but different in content (transfer learning across models and data-sets). [TL]
Work email may be more formal than a restaurant review, while work IM may be less formal and even use short-forms and jargon people outside the business context might not understand. The first problem may be addressed by using a thesaurus to provide synonyms for words for the various N-grams constructed for the training sample, so we end up with multiple new N-grams with equivalent meanings derived from the ones mined from the original Yelp data-set. The second might be addressed in one of two ways:
a. gathering more jargon or short-cut rich texts and manually assigning them a sentiment rating to seed the supervised learning population, and
b. building a dictionary from manual analysis of IM short-cuts to programmatically “repair” the texts into a form suitable for automated analysis.
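The two adjustments described above (synonym expansion for formal email, and shortcut “repair” for IM text) can be sketched as follows. The thesaurus and shortcut dictionaries here are tiny, hypothetical stand-ins; in practice they would come from a real thesaurus and from manual analysis of the firm's own IM traffic.

```python
import re
from itertools import product

# Hypothetical entries, for illustration only.
THESAURUS = {"great": ["excellent"], "bad": ["poor"]}
IM_SHORTCUTS = {"pls": "please", "thx": "thanks", "eod": "end of day"}

def expand_ngram(gram, thesaurus=THESAURUS):
    """Return the N-gram plus every variant obtained by swapping in synonyms."""
    options = [[word] + thesaurus.get(word, []) for word in gram.split()]
    return [" ".join(choice) for choice in product(*options)]

def repair_text(text, shortcuts=IM_SHORTCUTS):
    """Replace known IM shortcuts with their long forms, leaving other words alone."""
    def expand(match):
        word = match.group(0)
        return shortcuts.get(word.lower(), word)
    return re.sub(r"[A-Za-z']+", expand, text)
```

For example, expand_ngram("great food") yields both "great food" and "excellent food", so a sentiment weight learned for the first phrase can be carried over to the second.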
Of course, the N-gram generation process only proceeds after we remove commonly used words (“stop words”) from a simple statistical analysis of every review in our data-set, and apply industry-standard methods such as stemming and lemmatization [SL].
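A minimal sketch of this preprocessing step is shown below. It assumes that stop words are simply words appearing in at least a given fraction of all reviews (the 0.8 threshold is an arbitrary choice for illustration), and it leaves out stemming and lemmatization, which would normally be applied to the tokens using an established library.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def find_stop_words(reviews, min_doc_fraction=0.8):
    """Words that occur in at least min_doc_fraction of the reviews."""
    doc_counts = Counter()
    for review in reviews:
        doc_counts.update(set(tokenize(review)))
    threshold = min_doc_fraction * len(reviews)
    return {word for word, count in doc_counts.items() if count >= threshold}

def ngrams(tokens, max_n=4):
    """All runs of 1 to max_n consecutive tokens, joined with spaces."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def review_ngrams(text, stop_words, max_n=4):
    """Drop stop words from a review, then build its N-grams."""
    kept = [t for t in tokenize(text) if t not in stop_words]
    return ngrams(kept, max_n)
```

Note that removing stop words before building N-grams means that words which were not adjacent in the original text can end up in the same N-gram; whether that is desirable depends on the corpus.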
Method 1: N-gram training, thesaurus synonyms in n-grams [NG]
Individual words, as well as pairs, triplets, and quads of consecutive words (called N-grams where N denotes the number of words in each “gram”) are gathered for all the review text we have, and then analyzed to generate a probabilistic map between their presence in a review and the associated review rating. This process assumes that individual N-grams are independent of each other in their contributions to review sentiment. Once this process is complete, we end up with something similar to the Goldstein data referenced above as output from a Naive Bayes classifier. The degree of nuance we see in the output ties to the number of stars in each review, as opposed to a numeric score, though we can of course consider the number of stars in a review to be a proxy for a numeric score in each case.
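One way to realize the probabilistic map described above is a multinomial Naive Bayes classifier over N-grams, with star ratings as the classes. The sketch below is a small, self-contained version using add-one (Laplace) smoothing; the class name, the smoothing constant, and the choice to ignore N-grams never seen in training are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

class NgramNaiveBayes:
    """Multinomial Naive Bayes mapping lists of N-grams to star ratings."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing constant
        self.class_counts = Counter()           # reviews seen per star rating
        self.gram_counts = defaultdict(Counter) # N-gram counts per star rating
        self.vocabulary = set()

    def fit(self, documents, stars):
        for grams, star in zip(documents, stars):
            self.class_counts[star] += 1
            self.gram_counts[star].update(grams)
            self.vocabulary.update(grams)
        return self

    def log_scores(self, grams):
        """Log of prior times likelihood for each star rating (unnormalized)."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for star, n_docs in self.class_counts.items():
            counts = self.gram_counts[star]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)
            for gram in grams:
                if gram in self.vocabulary:  # skip N-grams unseen in training
                    score += math.log((counts[gram] + self.alpha) / denom)
            scores[star] = score
        return scores

    def predict(self, grams):
        scores = self.log_scores(grams)
        return max(scores, key=scores.get)
```

Summing log-probabilities per N-gram is exactly where the independence assumption enters: each N-gram contributes to the score without regard to which other N-grams appear alongside it.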
Method 2: TF-IDF-based clustering [TFI]
The frequency of individual N-grams in each review (called term frequency or TF) and a measure of how rare each N-gram is across all the review text we have (the totality of review text is called the corpus, and the rarity measure, usually computed from the number of reviews that contain the N-gram, is called the inverse document frequency or IDF) can be used to locate statistically improbable phrases (or SIPs) that can be used to identify particular reviews. If reviews with given SIPs correlate closely with particular sentiment buckets, this can be used to aid the classification process going forward.
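The TF-IDF score can be sketched in a few lines. This version uses one common formulation (relative term frequency multiplied by the logarithm of the number of reviews divided by the number of reviews containing the N-gram); other weighting variants exist, and the function names are illustrative.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of N-gram lists. Returns one {ngram: score} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for grams in documents:
        doc_freq.update(set(grams))  # count each N-gram once per document
    scores = []
    for grams in documents:
        counts = Counter(grams)
        total = len(grams)
        scores.append({
            gram: (count / total) * math.log(n_docs / doc_freq[gram])
            for gram, count in counts.items()
        })
    return scores

def top_phrases(score, k=3):
    """The k highest-scoring N-grams of one document: candidate SIPs."""
    return sorted(score, key=score.get, reverse=True)[:k]
```

An N-gram that appears in every review gets a score of zero, while one that is frequent in a single review and absent elsewhere scores highest, which is what makes it a candidate SIP.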
Method 3: K-Means Clustering [KMC]
Consider each review as a point in a vector space with one dimension per N-gram in the vocabulary. If K-means clustering is performed on this data with K equal to the number of “star” classes, then the N-grams that dominate each cluster can serve as indicators of sentiment, provided the clusters line up reasonably well with the star ratings (K-means is unsupervised, so this needs to be checked rather than assumed). This analysis can then be used to tie the results to a sentiment scale. Synonyms can then be generated using a thesaurus, and the knowledge exported to the analysis of work-email text.
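A minimal sketch of the clustering step follows, using the plain Lloyd's algorithm. It assumes each review has already been turned into a vector of N-gram counts over a fixed vocabulary (the count_vector helper below is one simple way to do that); the initialization by random sampling and the iteration cap are illustrative choices, not a tuned implementation.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in random.Random(seed).sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

def count_vector(grams, vocabulary):
    """Represent a review as N-gram counts over a fixed, ordered vocabulary."""
    return [grams.count(g) for g in vocabulary]
```

Because the cluster labels carry no inherent meaning, each cluster would still have to be matched to a star class afterwards, for instance by looking at the most common star rating among its member reviews.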
Interesting… but is this analysis complete?
If we were analyzing Yelp data for deriving insights within the Yelp context, the above analysis is incomplete because we have so far looked only at the reviews, not at the reviewers. Reviewer A may be a hard person to please, with reviews always averaging 1-3 stars. Reviewer B may gush about all businesses he visits, giving out a large number of 5 star ratings. To perform a complete and credible analysis, we need to be able to normalize reviews from different reviewers prior to performing classifications.
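One simple way to normalize across reviewers is to convert each star rating into a z-score relative to that reviewer's own history. This is only one possible normalization, chosen here for illustration; the function name and the convention of returning 0.0 for reviewers whose ratings never vary are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalize_by_reviewer(reviews):
    """reviews: list of (reviewer_id, stars) pairs.

    Returns a list of z-scores in the same order, each measured against
    the mean and standard deviation of that reviewer's own ratings.
    """
    by_reviewer = defaultdict(list)
    for reviewer, stars in reviews:
        by_reviewer[reviewer].append(stars)
    stats = {r: (mean(s), pstdev(s)) for r, s in by_reviewer.items()}
    normalized = []
    for reviewer, stars in reviews:
        mu, sigma = stats[reviewer]
        # A reviewer with no variation gives no signal: use 0.0.
        normalized.append((stars - mu) / sigma if sigma > 0 else 0.0)
    return normalized
```

Under this scheme a 3-star review from a habitually harsh reviewer scores above zero, while a 4-star review from someone who usually awards 5 stars scores below zero.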
Secondly, some people may act as fake reviewers, posting only about their own and their competitors’ businesses (obviously talking up their own business while slamming the competition). Given enough reviewer data, we can determine whether reviewers are credible from the set of all reviews they each have posted, and use that as a basis for either including or excluding the reviews they have posted to the Yelp website as training, validation, and test sets are built.
Please note, however, that here we are interested not in analyzing data within the Yelp context, but in transferring learning of sentiment indicators to a different context (work email) for further analysis. Thus the above two problems do not impact the quality of our model construction for analyzing work email/IM text.
Applications in People Analytics [PA]:
Sentiment spreads as a “ripple” in space-time - where the “pond” is the corporate context. Email and IM text may be the medium through which sentiment propagates, but the network of people (different from the top-down enforced organization chart) forms the structure which “conducts” sentiment.
Email might express frustration with processes in place, people in management, or work load, among other things. Similarly, some written communication might express joy at being able to make a positive contribution to a project, happiness with team leadership, positive feelings at being able to learn and apply new skills, or satisfaction with internal mobility within the company. Knowing these things can help the HR department of a company perform its role more effectively.
Does rampant absenteeism correlate with particular managers? Are people quitting from particular teams more frequently? Are promotions so few and far between, and bonuses so low, that people are being forced to look elsewhere for more viable careers? Is a department so overloaded that staffing it up might reduce attrition despite the announced firm-wide hiring freeze? Are particular employees (particularly in the financial services industry) engaging in nefarious behavior either within the firm or across firms with outside collaborators?
These and more questions can be answered by analyzing email and related data such as instant messaging and internet chat (and, in some cases, text renderings of phone conversations, especially where regulations require that these be recorded, e.g. trader lines in financial services). We explore a few scenarios of interest in a little more detail in what follows:
Senior management makes an announcement reshuffling the organization or otherwise changing the organizational design. People of course talk about these things. Analyzing email sentiment on an ongoing basis will give us a means to relate changes in sentiment to the announcements that are made, and to gauge their impact on morale. Certain geographies may be more impacted by lay-off announcements than others; this will show when we analyze sentiment by geography. Some announcements might cause immediate negative sentiment that dissipates quickly, while others might cause negativity that persists over time. All this can be factored into not only what is announced, but also how it is announced (e.g. by a CEO in a town-hall to appear more humane vs. via email) and what words are used to convey the information.
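As a hedged sketch of the by-geography comparison described above, the function below takes messages that have already been scored for sentiment and reports, per geography, how the average score changed after an announcement date. The message format (day, geography, score) and the function name are hypothetical; a fuller analysis would look at the trend over time rather than a single before/after difference.

```python
from collections import defaultdict
from statistics import mean

def sentiment_shift(messages, announcement_day):
    """messages: iterable of (day, geography, score) tuples.

    Returns {geography: mean score on or after the announcement
             minus mean score before it}, for geographies with data
    on both sides of the announcement.
    """
    before = defaultdict(list)
    after = defaultdict(list)
    for day, geography, score in messages:
        bucket = after if day >= announcement_day else before
        bucket[geography].append(score)
    return {
        geo: mean(after[geo]) - mean(before[geo])
        for geo in before
        if geo in after
    }
```

Repeating the calculation over successive windows after the announcement would show whether a negative shift dissipates quickly or persists.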
Ongoing Employee Engagement Sentiment
How do employees feel about the company? Today, most firms have an annual survey where every employee fills out a questionnaire regarding their work environment. However, perhaps a better measure of engagement can be obtained through periodic email analysis - this will tell us what people are thinking on an ongoing basis, not just once at the end of the year. Besides, while people might think one way and say something else during a survey, sentiment in ongoing communications like email is likely to be much more honest and a more relevant indicator of how management is performing.
Sentiment for Incentive Design
What makes each employee tick? Some like being praised for their work. Others want to get paid more. Some others might want a better title. Yet others might want more of a certain kind of work and less of something else. Not all employees have engaged managers they can speak with about these things. All of this can be mined from email and chat data, and HR can use this to structure incentives to promote retention and reduce attrition of strong performers.
Sentiment for Right-Sizing Organizations
“Right-Sizing” is a convenient euphemism senior management uses for laying people off. While these decisions are painful, the choice of which people to let go is often made poorly. A recent article (need reference) explored how sometimes people viewed by senior managers as being in the bottom 10% of performers are actually the glue that holds a group together. They spend their time ensuring all their team-mates are successful, so letting those people go actually hurts the firm’s performance more than the savings in their salary helps its bottom line. Mining email and text within the context of the firm’s social network (informal connections superimposed on the org-charts) can help organizations make these painful decisions more appropriately.
Shades of 1984? Big Brother Watching?
Of course, privacy advocates will not like any of the above, and there is some validity to contrary viewpoints on the use of work email for the purposes stated above. However, most employers today, particularly those in areas like Financial Services, have very clear policies in place that indicate that “no employee can have any reasonable expectation of privacy for any communication carried out using office equipment or their office email address”… and that employees consent to this information being logged, and to records being kept for a period of seven years from the date of communication. The retention policy may vary from industry to industry and firm to firm, but most firms today have a data retention policy for legal reasons, so it is not altogether unreasonable that applications such as the above will become practical in time to come…
Besides, Google auto-reads Gmail content to pick ads to display, so it is not entirely far-fetched for work email to be read to improve organizational policy, especially when such determinations are made using data in the aggregate where individual users are not identified (the same thing is done in surveys today).
A valid argument against doing what we discuss in this paper can be made if one asks whether people will still honestly communicate their sentiments in email if they know they are being mined. After all, to give a concrete example, did Goldman employees not take to writing “LDL” (Let’s Discuss Live) when the “Fabulous Fab” and Rajat Gupta scandals broke, when they knew their electronic communications were being tapped by law enforcement agencies? True, but the very fact that these abbreviations show up in written records tells us something about sentiment.
[G] Goldstein Scale for WEIS Data: http://web.pdx.edu/~kinsella/jgscale.html