Thursday, October 14, 2021

2021 | Solving the 8 Queens Problem

The Queen (sometimes also called The Minister) is the most versatile piece on the chess board. Queens can move any number of squares diagonally, like bishops (sometimes called camels), and along straight lines, like rooks (sometimes called elephants). The only thing they cannot do is jump over other pieces the way the knight (or horse) can.


The 8-Queens Puzzle asks that we place 8 such queen pieces on a standard 8x8 chessboard so that no two queens attack each other. It sounds easier than it is, so give it a try before reading on. The Wikipedia article presents the problem, its history, and the set of solutions that work. In this post, we study how we might solve this problem on a computer.

One solution, i.e. one configuration that meets the specification, is shown below. The article also tells us that there are a total of 92 possible solutions, each derived from a set of 12 "base solutions" through operations like reflection and rotation. Here's also a video that discusses the problem in more detail.

Representation

We see that any valid solution must have exactly one queen in every row and every column. So, even before we check the diagonal constraints, any candidate configuration can be written as an 8-tuple, or a list of 8 elements, with digits from 1 to 8 (given the 8x8 size of the chessboard) and no repeated digits. The chessboard above would be represented as [5, 3, 1, 7, 2, 8, 6, 4], i.e. column 1 has a queen at the fifth row from the bottom, column 2 at the 3rd row from the bottom, and so on.

Invariants

Rotation and reflection do not change the validity of a valid solution. To see this, imagine the above chess board transformed into one represented as [4, 6, 8, 2, 7, 1, 3, 5]. This is merely a reflection of the above board along a horizontal axis. The solution remains valid. 

Next consider the chessboard represented by turning the above board 90 degrees in the clockwise direction. We end up with [6, 4, 7, 1, 8, 2, 5, 3]. Also a valid configuration of pieces for the 8 queens problem. 

Of course, other transformed configurations of the pictured board are possible, by rotating it 180 degrees, 270 degrees, and by reflecting about the other (i.e. vertical) axis as well. 
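
To make these transforms concrete, here is a small sketch (the variable names are mine) that reproduces the two transformed boards above from [5, 3, 1, 7, 2, 8, 6, 4]:

# board[c] is the row (1-8) of the queen in column c+1
board = [5, 3, 1, 7, 2, 8, 6, 4]

# reflection about the horizontal axis: each row r becomes 9 - r
reflected = [9 - r for r in board]
print(reflected)  # [4, 6, 8, 2, 7, 1, 3, 5]

# 90-degree clockwise rotation: the queen at (col, row) moves to (row, 9 - col)
rotated = [0] * 8
for col, row in enumerate(board, start=1):
    rotated[row - 1] = 9 - col
print(rotated)  # [6, 4, 7, 1, 8, 2, 5, 3]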

We solve this two ways - first using traditional computing, and second using genetic algorithms. In the first case, we find the full set of 92 possible configurations, then "whittle away the shadows" to arrive at the core set of 12 solutions. With genetic algorithms, we outline how the approach works in code, then look at at least one solution the program generates. Let's get started.

Types of Approaches

There are different ways one might solve this problem. One way is to generate all possible configurations with queens placed on different rows and columns, and then check which of these are valid. Given the search space is relatively small, i.e. 8*7*6*5*4*3*2 = 40320 possibilities (convince yourself you understand why this is so before reading further), we can easily examine the full space to find all solutions. After finding the full set of solutions, we can eliminate reflections and rotations to be left with the "core set of configurations of interest".

A second traditional computing approach is to place the first queen in one of the rows of column 1, then progressively place new queens on as-yet-unused rows of the following columns, checking at each step that the queens placed so far do not attack each other. If no row works for the current column, we backtrack to an earlier column, move that queen to a different row, then move forward again trying to fit the next queen, and so on until a configuration that solves the problem is found. This method uses branch-and-bound and backtracking ideas from computer science (a minimal sketch appears below). But since the search space is small enough to examine exhaustively, we simply use the approach from the previous paragraph and set this one aside.
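
For completeness, here is a minimal sketch of what that backtracking search might look like (the function name and recursive structure are mine; this is not the code we use below):

def backtrack(cols=()):
    # cols[i] is the row of the queen already placed in column i+1
    if len(cols) == 8:
        return list(cols)  # all 8 queens placed without conflicts
    col = len(cols) + 1
    for row in range(1, 9):
        if row in cols:  # row already taken by an earlier queen
            continue
        if any(abs(row - r) == abs(col - c) for c, r in enumerate(cols, start=1)):
            continue  # attacked diagonally by an earlier queen
        result = backtrack(cols + (row,))
        if result:
            return result
    return None  # no row works in this column; caller tries another row

print(backtrack())  # prints one valid configuration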

A third approach we can try is one that uses Genetic Algorithms. We explain this in detail in a later section. 

Traditional Search-Based Solution

import itertools;
s=[1,2,3,4,5,6,7,8];

x=itertools.permutations(s,8);
t=[];
for i in x: t+=[i];

First, we use the itertools module and generate all possible unique permutations of the digits 1 through 8. These are the configurations we test.

def pconv(pm):
 u=[];
 for i in range(1,9): u+=[(i,pm[i-1])];
 return u;

def punconv(u): return tuple([i[1] for i in u]);

Then we define two helper functions. pconv converts a board from [6, 4, 7, 1, 8, 2, 5, 3] to [(1,6), (2,4),(3,7), (4,1), (5,8), (6,2), (7,5), (8,3)] - this helps with computations, since each position is reduced to a set of coordinates in the x-y grid. punconv performs the reverse transformation so we can still print things out compactly. 

def dchk(j): #cth col has queen in row n
 c,n=j[0],j[1];
 v=[]
 k1,k2=n,n;
 for i in range(c-1,0,-1):
  k1-=1;
  if k1>0: v+=[(i,k1)];
  k2+=1;
  if k2<9: v+=[(i,k2)];
 k1,k2=n,n;
 for i in range(c+1,9,1):
  k1-=1;
  if k1>0: v+=[(i,k1)];
  k2+=1;
  if k2<9: v+=[(i,k2)];
 return v;

The very important function dchk takes as input a 2-tuple (c,n), indicating a queen placed in column c, row n, and computes all other cells on the board that the queen attacks diagonally (horizontal and vertical attacks are already ruled out by our setup, which places exactly one queen per column and never repeats a row). So, for example, if a queen is placed at (5,4), then dchk returns a list of positions including (4,3), (4,5), (3,2), (3,6), ... to the left, and (6,5), (6,3), (7,6), (7,2), ... to the right. We can then check whether any of these cells are occupied by other queens, and if so, disqualify the candidate.

def rotate(v): # rotated board configurations
 N=8;
 r=[];
 b=pconv(v);
 b1=b[:];
 u=[];
 while u!=b:
  u=[];
  for i in b1: u+=[(i[1],N-i[0]+1)];
  u.sort(key=lambda tup: tup[0]);
  #print(u);
  b1=u[:];
  r+=[punconv(u[:])];
 return r[:-1];

def reflect(v): # reflected board configurations
 N=8;
 b=pconv(v);
 m=[];
 for i in b: m+=[(i[0],8-i[1]+1)];
 b.reverse();
 return [tuple(punconv(m)),tuple(punconv(b))];

These two functions are the rotation and reflection transforms discussed earlier, presented above in code form. Pay particular attention to the rotate function. This is an oft-asked programming question in job interviews. It can be mildly challenging to write correctly under pressure.

Now, with all the pieces in place, the next steps are relatively easy. We check each of the candidates generated from permutations of the set [1, 2, 3, 4, 5, 6, 7, 8] to see if there are collisions (i.e. pieces attack each other), and if not, we have a solution! 

u=[];
for i in t:
 f=0;
 x=pconv(i);
 for j in x:
  v=dchk(j);
  if len(list(set(v).intersection(set(x))))!=0:
   f=1;
   break;
 if f==0: u+=[i];

Here, we take each candidate generated using itertools earlier, then check whether any of the diagonal attack positions returned by dchk for each queen on the board are occupied by other queens. If not, we have a solution! So when this loop completes, u is a list of solutions. We can verify that there are 92 of them, which agrees with what the Wikipedia article says. But wait, before popping the champagne, we need to eliminate duplicates - the reflections and rotations we spoke of earlier!

v=u[:];
rr=[];
dx={};
for i in u:
 r=reflect(i);
 if r[0] in v: v.remove(r[0]);
 if r[1] in v: v.remove(r[1]);
 if i in v:
  dx[i]=[];
  dx[i]+=[r[0],r[1]];
  rr+=[i];
  x=rotate(i);
  dx[i]+=x;
  for k in x:
   if k in v: v.remove(k);
   r=reflect(k);
   if r[0] in v: v.remove(r[0]);
   if r[1] in v: v.remove(r[1]);
   dx[i]+=[r[0],r[1]];
  dx[i]=list(set(dx[i]));

We do that in the code above. For each solution still under consideration, we generate its reflections and rotations and tag them against that solution in the dictionary dx, while removing those transformed versions from the pool of remaining candidates so that only one representative of each family ends up in the unique solution set (variable rr). When the loop completes, each unique solution in rr is a key in dx whose value is the list of reflections and rotations that correspond to it. Cool, huh?

With that, we are done. We have all the 12 unique solutions the Wikipedia article lists. We can verify this easily.
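
A quick check, using the variables from the loops above (this snippet is mine, not part of the original listing):

print(len(u), len(rr))  # expect 92 total solutions and 12 base solutions

# every solution should be a base solution or a transform of one
covered = set(rr) | {t for base in dx for t in dx[base]}
print(all(s in covered for s in u))  # expect True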

The Genetic Algorithms Approach

Now let's look at solving the problem with genetic algorithms (GA). This post is already too long, so I won't explain all the details of genetic algorithms here, but will present the code outlining the main points of the solution. Our goal here is to find at least one solution. To find more solutions, simply run the code for more generations.

Generally speaking the GA code proceeds in the following steps: 

  1. Find a shape for candidate solutions. Generate a few and use them as the "base gene pool". We call these candidate chromosomes. 
  2. Define a fitness function to evaluate how close the candidates are to solving the problem. In each "generation" we keep the best candidates, discard the rest. 
  3. Define reproduction functions for good candidates so you inch closer to a solution. Reproduction is of two types: (a) evolution through mutation - take random positions in the candidate and perturb them; (b) crossover - take two parent chromosomes, snip each at the same point, and cross over the two segments to generate two progeny - this is akin to "sexual reproduction".
  4. Only put valid new chromosomes in the pool. At each generation check if you already have a solution using the fitness function. If you do, stop. Otherwise, propagate further.

Simple! Now let's do this in code.

First we define a set of candidates we call chromosomes in the GA context. These are 8-tuples just like in the previous code-base from earlier in the article e.g. [6, 4, 7, 1, 8, 2, 5, 3] is a chromosome. 

We define a "fitness function". This tells us how close a particular candidate is to being a solution. The fitness function we use below is one that counts the number of "intersections" (i.e. instances when two queens attack each other in a solution candidate). If this value is 0, we have a solution!

def xsection(v): # fitness function
 # returns a measure of quality for a configuration v
 # 0 metric means v solves the problem
 r=0;
 b=pconv(v);
 for i in b:
  ds=dchk(i);
  r+=len([k for k in b if k in ds]); # number of collisions
 return r;

Next, we define the mutate and crossover functions that simulate the evolution of a candidate, and "sexual reproduction" between two candidate solutions, to generate better solutions over time from the gene pool we have. Of course, we pick only the more promising candidates (based on their fitness values) to generate new candidates in future generations. This is akin to Darwin's "survival of the fittest" in action. Along with mutation and crossover, we also add new chromosome material to the gene pool, ensuring that all new chromosomes are valid ones, i.e. use the digits 1-8 once and only once.

import random; # random is used by mutate, regen, and xover below

def mutate(c1): # mutate the chromosome by swapping two pairs of positions
 n=random.sample(range(8),4); # four distinct 0-based positions
 t1,t2=c1[n[0]],c1[n[2]];
 c1[n[0]]=c1[n[1]];
 c1[n[2]]=c1[n[3]];
 c1[n[1]]=t1;
 c1[n[3]]=t2;
 return c1;

def regen(): # generate new chromosome
 return random.sample([1,2,3,4,5,6,7,8],8);

def validc(c1):
 r=[1,2,3,4,5,6,7,8]
 x=[c1.count(i) for i in r];
 if x.count(1)==8: return True;
 return False;

def xover(c1,c2): # crossover c1 and c2 chromosomes
 cpt=random.randint(1,8); # crossover point (cpt=8 simply copies the parents, so the loop below always terminates)
 k1,k2=c1[:cpt]+c2[cpt:],c1[cpt:]+c2[:cpt];
 while validc(k1)==False or validc(k2)==False: # retry until both children are valid permutations
  cpt=random.randint(1,8);
  k1,k2=c1[:cpt]+c2[cpt:],c1[cpt:]+c2[:cpt];
 return (k1,k2);

Now we set up the gene pool defining its size. 

pool=[];
pool_sz=5000;

def create_gene_pool():
 # create gene pool with 5000 chromosome seeds
 global pool,pool_sz;
 while len(pool)<pool_sz:
  candidate=regen();
  if candidate not in pool: pool+=[candidate];
 return pool;

We set up variables to detect if a solution has been found, and gather up the solutions as they are found. The rank_pool_chromosomes function ranks each chromosome in the pool using the fitness function so we can keep only the top x%, throw out the rest, and use the retained ones to "breed" candidates for generations to come.

solved=0;
solutions=[];

def rank_pool_chromosomes():
 global solved, pool, solutions;
 # rank pool by chromosome quality using fitness function
 r=[];
 for i in pool: r+=[(i,xsection(i))];
 r.sort(key=lambda tup: tup[1]);
 if r[0][1]==0:
  solved=1;
  solns=[i for i in r if i[1]==0];
  for s in solns: solutions+=[s];  
 # reduce back to pool format
 pool=[i[0] for i in r];
 return pool;

The nexgen_pool function creates the next generation of the gene pool. The pct input controls what fraction of the best chromosomes of the current generation we retain, while mp sets what fraction of the replacements is produced by mutation. With roughly 10% of the pool reserved for brand-new genetic material, the remainder is filled with chromosomes generated via the crossover method.

def nexgen_pool(pct,mp):
 # repopulate pool, replacing the bottom (1-pct) fraction with new chromosomes
 # mp is the mutate fraction; crossover fills most of the rest, and ~10% is brand-new material
 global pool,pool_sz;
 pool=pool[:int(pct*pool_sz)]; # truncate pool, keeping the best pct fraction
 numm=int((pool_sz-len(pool))*mp); # number of new mutated chromosomes
 numx=int(pool_sz*0.9)-len(pool)-numm; # number of crossover chromosomes
 if numx%2!=0:
  numx-=1;
  numm+=1;
 for i in range(numm):
  c1=random.sample(pool,1)[0][:]; # copy so the pool member is not mutated in place
  t1=mutate(c1);
  if validc(t1) and t1 not in pool: pool+=[t1];
 for i in range(int(numx/2)):
  c1,c2=random.sample(pool,2);
  t1,t2=xover(c1,c2);
  pool+=[t1,t2];
 while len(pool)<pool_sz: # top up with brand-new chromosomes
  t=regen();
  if t not in pool: pool+=[t];
 return;

Finally, we come to the main body of the code. We create the gene pool first, then, until a solution is found, we rank the chromosomes we have, keep the top 20%, generate roughly 60% of the replacements by mutation and roughly 30% by crossover, and introduce about 10% brand-new chromosomes. We keep doing this until we find one or more solutions.

create_gene_pool();
num_gen=0;
while solved==0:
 print('generation:',num_gen);
 rank_pool_chromosomes();
 nexgen_pool(0.2,0.6);
 num_gen+=1;

print('solutions are:');
for i in solutions: print(i[0]); 

When done, the code prints the below 10 solutions after just a couple of generations. Not bad. Given there is some randomness here, your results may be different. You can check these for correctness against the Wikipedia article.

[2, 5, 7, 1, 3, 8, 6, 4]
[3, 6, 8, 1, 4, 7, 5, 2]
[5, 7, 2, 4, 8, 1, 3, 6]
[4, 7, 5, 3, 1, 6, 8, 2]
[4, 2, 7, 5, 1, 8, 6, 3]
[6, 4, 7, 1, 8, 2, 5, 3]
[5, 7, 1, 4, 2, 8, 6, 3]
[4, 2, 7, 3, 6, 8, 1, 5]
[2, 4, 6, 8, 3, 1, 7, 5]
[6, 3, 1, 8, 4, 2, 7, 5]
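
Because the run is randomized, it's worth re-checking whatever your run prints against the fitness function defined earlier, for example:

# each entry in solutions is a (chromosome, fitness) pair; a true solution scores 0
for s in solutions:
    assert xsection(s[0]) == 0
print('all reported solutions verified')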

There you have it - the 8 Queens Problem solved two different ways. Some interesting videos on various approaches:

 

Have you done this in other ways I don't discuss here? Please share in the comments.







Wednesday, October 13, 2021

2021 | Primes!

Prime numbers (integers greater than 1 that are divisible only by 1 and themselves) are fascinating and have many applications in areas including encryption. In this post, we look at different ways to compute primes and at optimizations we can make to speed up the code as we go. Like true geeks, before we begin, we pay homage to Optimus Prime, the Autobot from the Transformers that most people think of first when anyone says "Prime" - our nod to pop culture.

Optimus Prime patent: Hiroyuki Obara, Public domain, via Wikimedia Commons

I recently came across the book below, and was (re)fascinated with primes reading some of the articles in it. The book was published by the people behind Quanta magazine, and James Gleick of Chaos fame wrote the foreword. I like his work a lot. Some references: 

 

Solution Approaches & Python Code

Starting out, the below was a naive approach I considered: factorize every integer and report as primes those that have only themselves and 1 as factors.

def factorize(n): 
 r,f,t=[],2,n; # start with 2 as a factor
 while t>1: 
  while t%f==0: 
   r+=[f];
   t//=f; # integer division keeps t an int
  f+=1;
 return r;

n=100; # search for primes <=100
print([i for i in range(2,n+1) if len(factorize(i))==1]);

As you can probably imagine, this will not work for finding larger primes (say, up to 50 million) in a timely way. The next approach we can consider is the Sieve of Eratosthenes. It works by writing out all the numbers from 2 up to the top end of your range, initially marking them all as prime. You then take the first prime candidate, mark it as a confirmed prime, and mark all of its multiples in the range as non-primes. You continue until there are no multiples left to mark, and what remains are just the primes. The link provided animates this process, so you can see visually how it works. This is truly an ancient method (from ancient Greece, as the name indicates), and it can be sped up with computers. However, you still need to create a large array (e.g. if you are mining for all primes less than 1 billion), so some optimizations are called for to squeeze efficiencies out of the code.
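
As a reference point before the segmented version, a minimal (non-segmented) sieve might look like this sketch:

def simple_sieve(n):
    # classic Sieve of Eratosthenes: returns all primes <= n
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p * p, n + 1, p):  # start marking at p*p
                is_prime[multiple] = False
    return [i for i in range(2, n + 1) if is_prime[i]]

print(simple_sieve(50))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]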


An approach called the "Segmented Sieve of Eratosthenes" is used, which is illustrated in the code below. There are also several additional optimizations in this code, which are explained in what follows. Note, however, that the code below can be further improved by using NumPy arrays (if staying in Python), or by rewriting it entirely in C, a compiled language. I leave the code in its raw form with simple data structures to make the logic easier to follow.

def extend(n):
 v=[2,3,5];
 u=v[:];
 for lc in range(n):
  sv=v[-1];
  tn=sv*sv;
  if sv>250000: tn=round(sv+1000000,-6);
  print(u,len(v),tn);
  #k=input();
  u=[i for i in range(sv,tn,2) if (i+1)/6==int((i+1)/6) or (i-1)/6==int((i-1)/6)];
  ul=round(u[-1]**0.5);
  ul=max([i for i in v if i<=ul])
  for j in v[1:v.index(ul)+1]: u=[i for i in u if i%j!=0];
  if u[0] in v: v+=u[1:]; # avoid repeats across range extensions
  else: v+=u;
 return v;

Let's examine this line by line. I first seed the code with a vector of the first 3 primes called v. Then I use the loop counter lc, which iterates the indicated n number of times through the logic. Each time, the code sieves for primes between the largest prime captured so far (v[-1]) and a top end that is either the square of that prime or that prime plus 1 million, rounded to the nearest million - this keeps the range reasonable. Within this range, we generate all odd numbers that can be expressed in the form (6n+1) or (6n-1) where n is an integer - every prime above 3 MUST satisfy this condition - and these are the prime candidates in that range. Finally, we check whether each candidate is divisible by any of the primes we already found; those that aren't divisible by any are new primes we add to our list. Very intuitive, really.

Optimizations used: 
  1. Let's discuss the (6n+1) or (6n-1) constraint. If a number is of the form 6n, it is divisible by 6 (duh!), also 6n+2 and 6n+4 are divisible by 2, and 6n+3 is divisible by 3. This leaves 6n+1 and 6n+5 (or 6n-1) as the only numbers in modulo 6 arithmetic that are possible prime candidates. We use that idea here as an optimization.
  2. If we consider the Sieve, each prime p is first needed to eliminate multiples starting at p**2. We use this idea in the two lines that start with ul in the code above. Let's illustrate this by means of an example. Consider 5. Multiples of 5 are 10, 15, 20, 25, 30, 35, .... Now, all multiples below 25 (5**2) are divisible by primes smaller than 5, e.g. 10=2*5, 15=3*5, 20=2*2*5. But 25 can only be "disqualified" using 5 as a divisor. What this means is that when checking for divisors of a number N, we only need to check the found primes from 2 up to sqrt(N). A small standalone sketch combining both ideas follows this list.
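
Here is that sketch (separate from the extend code) - a tiny primality check that uses the 6n±1 form for candidate divisors and never goes past sqrt(N):

def is_prime(n):
    # trial division with divisors of the form 6k-1 and 6k+1 only
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    f = 5
    while f * f <= n:  # no need to test divisors beyond sqrt(n)
        if n % f == 0 or n % (f + 2) == 0:
            return False
        f += 6
    return True

print([i for i in range(2, 50) if is_prime(i)])
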
Can we do better with the same data structures, before we convert to using NumPy arrays? Certainly. We can pickle the prime number list into a file, load it later to restart the search for new range extensions whenever we like, then pickle and save the data structure again. This is done with the code below: the first block pickles the data structure (the prime list, held in a variable p here) into a file, and the second loads the data structure back into the program.

import pickle; # pickle to save the data structure
p=extend(5); # p is the prime list returned by extend (the argument here is arbitrary)
g=open('primes_20211013.pkl','wb');
pickle.dump(p,g);
g.close();

import pickle; # pickle to load the data structure back
f=open('primes_20211013.pkl','rb');
x=pickle.load(f);
f.close();

With this, we update our extend function from above into extend2 below, so it loads the data from the pickled object before the next set of extensions. You can mine one range one day and continue mining the next. Of course, things will go blazingly fast(er) if you convert the data structures to NumPy arrays (if the code is to remain in Python), or better yet if you rewrite it in C.

def extend2(n): # grows prime set read from pickled data
 f=open('primes_20211013.pkl','rb'); 
 v=pickle.load(f);
 f.close();
 u=v[:];
 for lc in range(n):
  sv=v[-1];
  tn=sv*sv;
  if sv>250000: tn=round(sv+1000000,-6);
  print(u,len(v),tn);
  #k=input();
  u=[i for i in range(sv,tn,2) if (i+1)/6==int((i+1)/6) or (i-1)/6==int((i-1)/6)];
  ul=round(u[-1]**0.5);
  ul=max([i for i in v if i<=ul])
  for j in v[1:v.index(ul)+1]: u=[i for i in u if i%j!=0];
  if u[0] in v: v+=u[1:];
  else: v+=u;
 return v;
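
As mentioned above, NumPy boolean arrays speed up the marking step considerably. A minimal sketch of that idea (not a drop-in replacement for extend2) might look like:

import numpy as np

def numpy_sieve(n):
    # Sieve of Eratosthenes over a NumPy boolean array; returns all primes <= n
    is_prime = np.ones(n + 1, dtype=bool)
    is_prime[:2] = False
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            is_prime[p * p::p] = False  # vectorised marking of multiples
    return np.flatnonzero(is_prime)

print(numpy_sieve(100))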

With this, let's look at a couple of videos where people explore some of these concepts in code.

    

In addition to deterministic methods, there are also probabilistic methods for prime number detection. A popular one is called the Hawkins Sieve. There are others too. Some interesting papers [1], [2], and StackOverflow questions [3], [4]. As indicated there, for larger ranges, the Sieve of Atkin may be more efficient.

Have you tried other techniques to generate primes that are more efficient? Please share in the comments.

Friday, October 8, 2021

2021 | The 9 Pitfalls of Data Science

In this blog post, we discuss the book The 9 Pitfalls of Data Science by Gary Smith and Jay Cordes.

 
In short, the 9 pitfalls covered in the book are:

  1. Using Bad Data - you have to have the right data to answer any question, and then you have to use the data right. Stanford's Hastie and Tibshirani, in their amazing introductory statistical learning course, take pains to point out how to structure and run through a data study. It always starts with validating the data set you have to see (a) if it is the right data, (b) if the data is right, and (c) if it can be fixed for use to answer the question at hand. Incidentally, if you are interested in their work, I cannot recommend their excellent book on Statistical Learning enough - they even made it available for free download via their website!

  2. Putting Data Before Theory - The authors describe the Texas Sharpshooter Fallacy - either "p-hacking" to ex-post find conclusions the data supports, or defining expected experimental outcomes after the experiment is complete. They liken this to a sharpshooter either a. picking bulls-eye targets he shot after all shots are fired, or b. painting a bulls-eye around a bullet hole for a shot already fired. This is perhaps the most important pitfall described in the book, and all data scientists should take careful note of this. Marcos Lopez de Prado, financial data scientist extraordinaire who at the time of this writing works at ADIA (Abu Dhabi Sovereign Wealth Fund) makes this point quite forcefully with excellent examples in his book as well - not an easy book, but required reading for serious data science professionals.

  3. Worshipping Math - the most successful data scientists are not those that can build the most sophisticated models, but those that are able to reason from first principles and build explainable models that capture the spirit of the data-set, and deliver quality insights on time. In fact, some of the best data scientists use the simplest models that can generalize well. Data Science done right is also an art form. To be a competent data scientist, you must know the math. But it is equally important to have an abundance of common sense.
  4. Worshipping Computers - they say more data beats a better algorithm some of the time. There are books that explore how best to leverage computers for statistical inference, and how the field of inferential statistics has evolved as computer power has grown by leaps and bounds in recent years with the wider deployment and use of GPUs and more sophisticated compute infrastructure to solve targeted or niche problems. However, good data scientists understand that compute power doesn't substitute for being able to really understand and leverage data to generate insights.

  5. Torturing Data - Andrew Lo, highly respected professor of quantitative finance at MIT's Sloan School of Management is known to have said "Torture data long enough, and it will confess to anything". This is a common problem one sees in nascent or budding data practices in some firms, where management presses their staff to improve result accuracy by a certain % given a particular data set. The managers do not know how data science works, do not know what quality of signal the data set affords, but press their staff threatening poor reviews unless the managers' requirements are met. Good data scientists will exit such firms as soon as possible. Part of a data scientist's job is to educate senior managers and executives, so they know what is possible, and the firm doesn't become a victim of the data science hype cycle.

  6. Fooling Yourself: "... and you are the easiest person to fool." One issue is that the people doing the analysis really, really want to believe they have generated meaningful results. Some less scrupulous ones might even fudge things at the edges, or be pressured by their managers to. After all, what difference can a small change in the analysis make? Turns out, sometimes it can have far-reaching consequences.
  7. Confusing Correlation with Causation - this item and the next one are somewhat more technical. Human nature likes to believe it can construct plausible explanations when it sees patterns in data. Some of the patterns may be fleeting, some may be unrelated coincidental occurrences, while other phenomena may not have a causal relationship between them at all but instead be driven by a third, hidden phenomenon. Assuming a relationship exists where it doesn't isn't just of limited use; it can actively hurt when the expected behavior does not repeat itself. An excellent example of this kind of situation is illustrated in the book below.

  8. Being Surprised by Regression Towards the Mean - an oft-told tale is about fighter pilots in training. Those that did poorly in a session and got yelled at did better in the next one, while those that excelled and got praised did less well in the next session. Did the praise or admonition really make the difference? Or was it simply that the sessions' poor performance and out-performance were deviations from the norm, and things reverted to the norm quickly in the sessions after? Data Scientists should keep this idea in mind.
  9. Doing Harm - doctors take the Hippocratic Oath which states "first, do no harm" as they practice medicine. While there isn't an equivalent oath for data scientists, true professionals would do well to keep that in mind as they embark on solving data problems for the Enterprise, particularly where the consequences of errors can be large or devastating for employees, customers, shareholders, or other stakeholders. Some years ago, there was talk of having a similar oath being administered to MBA students graduating from top schools. Not sure where that initiative stands today.

What I liked about this book is that the arguments are structured and presented in a clear and compelling way. Being a seasoned data practitioner and senior data scientist myself, I have seen, and can appreciate, the need for a book like this, particularly to alert the unwary or the neophyte to potential issues that lurk in the dark. In addition, reading this book serves as a refresher for seasoned data professionals on things they knew and perhaps forgot as they moved through the mists of time.

Having said that, I believe some of the examples they used could have been better chosen, structured or presented to make their points more forcefully. This isn't an academic tome. It is a highly readable treatise, so more examples presented more lucidly would only add to the quality of the presentation and contribute to greater reader impact. In that sense perhaps an improved second edition would add more value. But... Definitely worth the relatively easy read nonetheless. 

There are also lists of much more technical aspects of data science that merit careful consideration by practitioners in the field. One such talk, by Mark Landry of H2O.ai, is linked below.