In this blog post, we discuss the book, The Nine Pitfalls of Data Science by Gary Smith and Jay Cordes.
- Using Bad Data - you have to have the right data to answer any question, and then you have to use the data right. Stanford's Hastie and Tibshirani, in their amazing introductory statistical learning course, take pains to point out how to structure and run through a data study. It always starts with validating the data set you have, to see a. if it is the right data, b. if the data is right, and c. if it can be fixed for use in answering the question at hand. Incidentally, if you are interested in their work, I cannot recommend their excellent book on Statistical Learning enough - they even made it available for free download via their website!
- Putting Data Before Theory - The authors describe the Texas Sharpshooter Fallacy - either "p-hacking" to find, after the fact, conclusions the data supports, or defining expected experimental outcomes after the experiment is complete. They liken this to a sharpshooter either a. picking the targets he hit as his bulls-eyes after all shots are fired, or b. painting a bulls-eye around a bullet hole from a shot already fired. This is perhaps the most important pitfall described in the book, and all data scientists should take careful note of it. Marcos Lopez de Prado, financial data scientist extraordinaire, who at the time of this writing works at ADIA (the Abu Dhabi Investment Authority, Abu Dhabi's sovereign wealth fund), makes this point quite forcefully with excellent examples in his book as well - not an easy book, but required reading for serious data science professionals.
- Worshipping Math - the most successful data scientists are not those who can build the most sophisticated models, but those who are able to reason from first principles, build explainable models that capture the spirit of the data set, and deliver quality insights on time. In fact, some of the best data scientists use the simplest models that generalize well. Data Science done right is also an art form. To be a competent data scientist, you must know the math. But it is equally important to have an abundance of common sense.
- Worshipping Computers - they say more data beats a better algorithm some of the time. There are books that explore how best to leverage computers for statistical inference, and how inferential statistics has evolved as computing power has grown by leaps and bounds in recent years, with wider use of GPUs and more sophisticated compute infrastructure to solve targeted or niche problems. However, good data scientists understand that compute power is no substitute for really understanding the data and using it to generate insights.
- Torturing Data - Andrew Lo, a highly respected professor of quantitative finance at MIT's Sloan School of Management, is known to have said, "Torture data long enough, and it will confess to anything." This is a common problem in nascent data practices at some firms, where management presses staff to improve result accuracy by a certain percentage on a particular data set. The managers do not know how data science works or what quality of signal the data set affords, yet they press their staff, threatening poor reviews unless their requirements are met. Good data scientists will exit such firms as soon as possible. Part of a data scientist's job is to educate senior managers and executives so that they know what is possible and the firm doesn't become a victim of the data science hype cycle.
- Fooling Yourself: "... and you are the easiest person to fool." One issue is that people doing the analysis really, really want to believe they have generated meaningful results. Some less scrupulous ones might even fudge things at the edges, or might be pressured by their managers to. "After all, what difference can a small change in the analysis make?" It turns out that sometimes it can have far-reaching consequences.
- Confusing Correlation with Causation - this item and the next one are somewhat more technical. It is human nature to construct plausible explanations for patterns we see in data. Some of the patterns may be fleeting, some may be unrelated coincidences, and in other cases two phenomena may have no causal relationship between them at all but both be caused by a third, hidden phenomenon. Assuming a relationship exists where it doesn't isn't just unhelpful; it can also hurt when the expected behavior fails to repeat itself. An excellent example of this kind of situation is illustrated in the book below.
- Being Surprised by Regression Towards the Mean - an oft-told tale is about fighter pilots in training. Those who did poorly in a session and got yelled at did better in the next one, while those who excelled and got praised did less well in the next session. Did the praise or admonition really make the difference? Or was it simply that one session's poor performance or out-performance was a deviation from the norm, and things reverted to the norm in the sessions after? Data Scientists should keep this idea in mind.
- Doing Harm - doctors take the Hippocratic Oath, popularly summarized as "first, do no harm," as they practice medicine. While there isn't an equivalent oath for data scientists, true professionals would do well to keep that in mind as they embark on solving data problems for the Enterprise, particularly where the consequences of errors can be large or devastating for employees, customers, shareholders, or other stakeholders. Some years ago, there was talk of administering a similar oath to MBA students graduating from top schools. I am not sure where that initiative stands today.
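To make the Texas Sharpshooter point from the list above a little more concrete, here is a minimal Python sketch of my own (it is not from the book). It assumes a made-up setting: 200 candidate "signals" of 30 observations each, all pure random noise, compared against an equally random target. The sizes, the seed, and every name in the code are my illustrative choices.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)  # fixed seed so the run is repeatable
n_obs, n_candidates = 30, 200  # illustrative sizes, not taken from the book

# The target and every candidate "signal" are independent noise by construction.
target = [rng.gauss(0, 1) for _ in range(n_obs)]
candidates = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_candidates)]
correlations = [pearson(c, target) for c in candidates]

# Theory first: commit to one candidate before looking at any result.
chosen_in_advance = correlations[0]
# Data first: scan everything, then paint the bulls-eye around the best hit.
best_after_the_fact = max(correlations, key=abs)

print(f"chosen in advance:   r = {chosen_in_advance:+.2f}")
print(f"best after the fact: r = {best_after_the_fact:+.2f}")
```

Since nothing here is related to anything else, any strong-looking correlation in the second line is purely a product of searching after the fact. That is the bulls-eye painted around the bullet hole.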
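The fighter pilot story can also be sketched as a small simulation. This is again my own illustration rather than anything from the book, and it rests on a simplifying assumption: each session score is a fixed skill level plus independent luck, and feedback (praise or yelling) has no effect at all. The number of pilots, the noise levels, and the names are hypothetical.

```python
import random
import statistics


def simulate_sessions(n_pilots=10_000, seed=0):
    """Score = fixed skill + independent luck. No feedback effect is modeled."""
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n_pilots)]
    first = [s + rng.gauss(0, 1) for s in skill]   # session 1 scores
    second = [s + rng.gauss(0, 1) for s in skill]  # session 2 scores
    return first, second


def mean_change(first, second, picked):
    """Average change from session 1 to session 2 for the picked pilots."""
    return statistics.mean(second[i] - first[i] for i in picked)


first, second = simulate_sessions()

# Rank pilots by their session 1 score, lowest first.
order = sorted(range(len(first)), key=lambda i: first[i])
decile = len(order) // 10
worst, best = order[:decile], order[-decile:]

worst_change = mean_change(first, second, worst)  # the pilots who "got yelled at"
best_change = mean_change(first, second, best)    # the pilots who "got praised"

print(f"worst decile, average change: {worst_change:+.2f}")
print(f"best decile, average change:  {best_change:+.2f}")
```

Under these assumptions the worst performers improve on average and the best performers slip, even though nobody was praised or admonished in the simulation. The luck component simply does not repeat.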
What I liked about this book is that the arguments are structured and presented in a clear and compelling way. Being a seasoned data practitioner and senior data scientist myself, I have seen, and can appreciate, the need for a book like this, particularly to alert the unwary or the neophytes to potential issues that lurk in the dark. In addition, reading this book would also serve as a refresher for seasoned data professionals on things they knew and perhaps forgot as they moved through the mists of time.
Having said that, I believe some of the examples they used could have been better chosen, structured or presented to make their points more forcefully. This isn't an academic tome. It is a highly readable treatise, so more examples presented more lucidly would only add to the quality of the presentation and contribute to greater reader impact. In that sense perhaps an improved second edition would add more value. But... Definitely worth the relatively easy read nonetheless.
There are also lists of much more technical aspects of data science that merit careful consideration by practitioners in the field. One such list, from a talk by Mark Landry of H2O.ai, is linked below.