In this post we explore a couple more simple coding interview questions. These were recently asked of 4th grade math students to be solved with paper and pencil, so they should be fair game for job interviews. My approach below is not optimal for any of the questions, but each solution can be coded in 3-5 minutes. You should of course try to improve on these. Solving them analytically with paper and pencil is "left as an exercise to the reader".
Question 1: using the digits from 1 through 9 only once each, find two numbers such that the second is twice the first.
My approach here starts with common sense that I then codify into the program. Given the digits 1 to 9 used once each, we can split them into a 4-digit number (say x) and a 5-digit number (say y), with the 5-digit number being double the 4-digit number. Before reading further, convince yourself that the leftmost digit of y has to be a 1 for the math to work. The code below works regardless of whether you figure this out or not, but setting the parameters with this understanding simplifies the calculations and makes your code more efficient. Here it is:
import itertools  # useful in solving combinatorics problems like this one

def nrep(y):  # returns True if no digit repeats in y
    t = str(y)
    for i in t:
        if t.count(i) > 1:
            return False
    return True

def xdig(x, y):  # returns True if there is no digit overlap between x and y
    return len(set(str(x)).intersection(set(str(y)))) == 0

def dchk(x, n):  # returns True if n is a digit of x
    return str(n) in str(x)

def nstr(a, b):  # returns a string of int a followed by int b
    return str(a) + str(b)

def alldigs(N):  # returns True if N has all digits from 1 to 9 inclusive
    N = str(N)
    for i in range(1, 10):
        if N.count(str(i)) != 1:
            return False
    return True

digs = [2, 3, 4, 5, 6, 7, 8, 9]  # can try this with 1 included, will yield the same answers

x = itertools.permutations(digs, 4)  # generate all 4-digit permutations
r = []  # list will hold answers when done
for i in x:
    n = i[0]*1000 + i[1]*100 + i[2]*10 + i[3]  # construct 4-digit number from the digit permutation
    y = 2*n  # y is 2 times n
    if nrep(y) and xdig(n, y) and not dchk(y, 0) and alldigs(nstr(n, y)):
        r += [(n, y)]
There are 12 answers to this problem. The number pairs that work for x and y are: [(6729, 13458), (6792, 13584), (6927, 13854), (7269, 14538), (7293, 14586), (7329, 14658), (7692, 15384), (7923, 15846), (7932, 15864), (9267, 18534), (9273, 18546), (9327, 18654)]
Question 2: The digits from 0 through 9 inclusive should show up in the product below only once. What are the numbers that satisfy this equation? (a, b, c, d, e, f, g, h) are all digits.
abc x de = fg0h8
To solve this problem, we reuse some of the functions already defined above, and add a couple more to help with our work. We call the number with digits abc x, and the number with digits de y. Code and answers follow below.
def dig(N, n):  # returns the digit of N at index n, counting from the left starting at 0; n=-1 gives the rightmost digit
    return int(str(N)[n])
x = [i for i in range(100, 1000) if nrep(i)]
y = [i for i in range(10, 100) if nrep(i)]
z = []  # collect answers here
for i in x:
    for j in y:
        t = i*j
        if len(str(t)) == 5 and dig(t, 2) == 0 and dig(t, -1) == 8 and xdig(t, i) and xdig(t, j) and xdig(i, j) and nrep(t):
            z += [(i, j, t)]  # z holds answers!
z is: [(297, 54, 16038), (594, 27, 16038)] - there are 2 possible solutions.
Why is it important and useful to learn to solve these kinds of problems in python?
With the pandemic, interviewing and hiring increasingly draw from a global talent pool, and I regularly interview candidates remotely. How, then, can I ask candidates to code something in a coding test and get a quick sense of whether they can code correctly, even if I only see their code at the end rather than as they write it? Problems like these, where there is a definite correct answer that can be verified, and where the answer cannot be easily googled (or "binged", if that is also a verb), are useful in such situations. I use these kinds of questions regularly these days.
Another couple of interesting questions, left to the reader:
True or False: the 30th leftmost digit of the 3000th Fibonacci number is prime. [The Fibonacci series starts 0, 1, 1, 2, 3, ... and each successive number is the sum of the previous two series elements.] This one is trivial for python enthusiasts - a small sketch appears after this list.
(***) What positive number doubles when its last digit becomes its first digit? Very challenging problem! See how Presh Talwalkar solves it here. For those that don't know, his "Mind Your Decisions" YouTube channel has some excellent math puzzles with solutions that help sharpen your thinking.
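For the first puzzle above, here is a minimal sketch of how one might check it, relying on Python's arbitrary-precision integers. It assumes the series is counted as listed, so that 0 is the 1st element; adjust the loop count if you prefer a different indexing convention.

a, b = 0, 1
for _ in range(2999):   # after this loop, a holds the 3000th element under the assumed indexing
    a, b = b, a + b
d = int(str(a)[29])     # 30th leftmost digit (0-based index 29)
print(d, d in (2, 3, 5, 7))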
Please share other interesting puzzles you've come across in the comments. Thanks for reading.
The Queen (sometimes also called The Minister) is the most versatile piece on the chess board. Queens can move multiple steps diagonally like camels or bishops and can move along straight lines like rooks or elephants. The only thing they cannot do is jump over obstacles like the knight or horse pieces.
The 8-Queens Puzzle asks that we place 8 such queen pieces on a standard 8x8 chessboard so that no two queens attack each other. It sounds easier than it is, so give this a try. The Wikipedia article presents the problem, its history, and the set of solutions that work. In this post, we study how we might solve this problem on a computer.
One solution, i.e. one configuration that meets the specification, is shown below. The article also tells us that there are a total of 92 possible solutions, each derived from a set of 12 "base solutions" through operations like reflection and rotation. Here's also a video that discusses the problem in more detail.
Representation
We see that any valid solution must have exactly one queen in every row and column. So, even before we check the diagonal constraints, any candidate configuration we construct can be an 8-tuple, or a list of 8 elements, with digits from 1 to 8 (given the 8x8 size of the chessboard) and no repeating digits. The above chessboard would then be represented as [5, 3, 1, 7, 2, 8, 6, 4], i.e. column 1 has a queen at the fifth row from the bottom, column 2 has one at the 3rd row from the bottom, etc.
Invariants
Rotation and reflection do not change the validity of a valid solution. To see this, imagine the above chess board transformed into one represented as [4, 6, 8, 2, 7, 1, 3, 5]. This is merely a reflection of the above board along a horizontal axis. The solution remains valid.
Next consider the chessboard represented by turning the above board 90 degrees in the clockwise direction. We end up with [6, 4, 7, 1, 8, 2, 5, 3]. Also a valid configuration of pieces for the 8 queens problem.
Of course, other transformed configurations of the pictured board are possible, by rotating it 180 degrees, 270 degrees, and by reflecting about the other (i.e. vertical) axis as well.
We solve this in two ways - first using traditional computing, and second using genetic algorithms. In the first case, we solve the problem to find the full set of 92 possible configurations, then "whittle away the shadows" to find the core set of 12 solutions. With genetic algorithms, we outline how the method works with code, then look at at least one solution the program generates. Let's get started.
Types of Approaches
There are different ways one might solve this problem. One way would be to generate all possible configurations with queens placed on different rows and columns, and then check which of these is valid. Given the search space is relatively small i.e. 8*7*6*5*4*3*2 = 40320 possibilities (convince yourself you understand why this is so before reading further), we can easily examine the full space to find possible solutions. After finding the full set of solutions, we can eliminate reflections and rotations to be left with the "core set of configurations of interest".
A second traditional computing approach might be to start by placing the first queen in one of the rows of column 1, then progressively place new queens on as-yet-unused rows in subsequent columns, all while checking that the queens placed so far do not attack each other. If they do, try other possible rows in that column before backtracking to an earlier column, moving that queen to a different row, and then moving forward again to fit the next queen, and so on till you find a configuration that solves the problem. This method uses branch-and-bound and backtracking ideas from computer science. But again, given the search space is so small, we just use the approach from the previous paragraph and ignore this one.
A third approach we can try is one that uses Genetic Algorithms. We explain this in detail in a later section.
Traditional Search-Based Solution
import itertools

s = [1, 2, 3, 4, 5, 6, 7, 8]
x = itertools.permutations(s, 8)
t = []
for i in x:
    t += [i]
First, we use the itertools module and generate all possible unique permutations of the digits 1 through 8. These are the configurations we test.
def pconv(pm):  # convert [r1, r2, ...] into [(col, row), ...] coordinate pairs
    u = []
    for i in range(1, 9):
        u += [(i, pm[i-1])]
    return u

def punconv(u):  # reverse of pconv: back to the compact tuple form
    return tuple([i[1] for i in u])
Then we define two helper functions. pconv converts a board from [6, 4, 7, 1, 8, 2, 5, 3] to [(1,6), (2,4),(3,7), (4,1), (5,8), (6,2), (7,5), (8,3)] - this helps with computations, since each position is reduced to a set of coordinates in the x-y grid. punconv performs the reverse transformation so we can still print things out compactly.
def dchk(j):  # j=(c,n): the cth column has a queen in row n
    c, n = j[0], j[1]
    v = []  # cells attacked diagonally by this queen
    k1, k2 = n, n
    for i in range(c-1, 0, -1):  # walk the diagonals to the left
        k1 -= 1
        if k1 > 0:
            v += [(i, k1)]
        k2 += 1
        if k2 < 9:
            v += [(i, k2)]
    k1, k2 = n, n
    for i in range(c+1, 9, 1):  # walk the diagonals to the right
        k1 -= 1
        if k1 > 0:
            v += [(i, k1)]
        k2 += 1
        if k2 < 9:
            v += [(i, k2)]
    return v
The very important function dchk takes as input a 2-tuple (c,n) which indicates a queen is placed in column c in row n, and computes all other cells on the board the queen attacks diagonally (horizontal and vertical attacks are already accounted for by not allowing any 2 queens from being in the same row across columns in our setup). So, for example, if a queen is placed at (5,4), then dchk would return a list of positions including (4,3), (4, 5), (3, 2), (3, 6), ... to the left, and (6, 5), (6, 3), (7, 6), (7,2), ... to the right. We can then check if these cells are occupied by other queens, and if so, disqualify a solution candidate.
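As a quick sanity check (a small sketch; it assumes the definitions above have already been run), we can print the attack list for the example in the text:

print(sorted(dchk((5, 4))))
# expect: (1,8), (2,1), (2,7), (3,2), (3,6), (4,3), (4,5), (6,3), (6,5), (7,2), (7,6), (8,1), (8,7)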
def rotate(v):  # rotated board configurations
    N = 8
    r = []
    b = pconv(v)
    b1 = b[:]
    u = []
    while u != b:  # keep rotating 90 degrees until we are back at the original board
        u = []
        for i in b1:
            u += [(i[1], N - i[0] + 1)]
        u.sort(key=lambda tup: tup[0])
        b1 = u[:]
        r += [punconv(u[:])]
    return r[:-1]  # drop the final entry, which is the original board again

def reflect(v):  # reflected board configurations
    N = 8
    b = pconv(v)
    m = []
    for i in b:
        m += [(i[0], N - i[1] + 1)]  # reflect about the horizontal axis
    b.reverse()  # reflect about the vertical axis
    return [tuple(punconv(m)), tuple(punconv(b))]
These two functions are the rotation and reflection transforms discussed earlier, presented above in code form. Pay particular attention to the rotate function. This is an oft-asked programming question in job interviews. It can be mildly challenging to write correctly under pressure.
Now, with all the pieces in place, the next steps are relatively easy. We check each of the candidates generated from permutations of the set [1, 2, 3, 4, 5, 6, 7, 8] to see if there are collisions (i.e. pieces attack each other), and if not, we have a solution!
u = []
for i in t:
    f = 0
    x = pconv(i)
    for j in x:
        v = dchk(j)
        if len(list(set(v).intersection(set(x)))) != 0:  # some queen is attacked diagonally
            f = 1
            break
    if f == 0:
        u += [i]
Here, we take each candidate generated using itertools earlier, then check if any of the diagonal attack positions from dchk for each queen on the board are occupied by other queens. If not, we have a solution! So with this loop complete, u is a list of possible solutions. We can verify there are 92 when this part of the code completes, which agrees with what the Wikipedia article says. But wait, before popping the champagne, we need to eliminate duplicates - the reflections and rotations we spoke of earlier!
v = u[:]
rr = []  # unique "base" solutions
dx = {}  # maps each base solution to its reflections and rotations
for i in u:
    r = reflect(i)
    if r[0] in v: v.remove(r[0])
    if r[1] in v: v.remove(r[1])
    if i in v:  # i has not been removed as a transform of an earlier solution
        dx[i] = []
        dx[i] += [r[0], r[1]]
        rr += [i]
        x = rotate(i)
        dx[i] += x
        for k in x:
            if k in v: v.remove(k)
            r = reflect(k)
            if r[0] in v: v.remove(r[0])
            if r[1] in v: v.remove(r[1])
            dx[i] += [r[0], r[1]]
        dx[i] = list(set(dx[i]))
We do that in this above code. For each solution, we generate reflections and tag them against that solution, we do the same with rotations. All the while, we remove these transformed versions from those we capture in the unique solution set (variable rr). When the loop completes, for each unique solution in rr, we have it as a key in dictionary dx with a list of reflections and rotations that correspond to it. Cool, huh?
With that, we are done. We have all the 12 unique solutions the Wikipedia article lists. We can verify this easily.
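A quick way to do that verification (a small sketch, run after the loops above complete):

print(len(u))   # expect 92 total solutions
print(len(rr))  # expect 12 unique base solutions
for k in rr:
    print(k, len(dx[k]))  # each base solution with the count of its distinct transforms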
The Genetic Algorithms Approach
Now let's look at solving the problem with genetic algorithms (GA). This post is already too long, so I won't explain all the details of genetic algorithms here, but will present the code outlining the main points of the solution. Our goal here is to find at least one solution. To find more solutions, simply run the code for "more generations".
Generally speaking the GA code proceeds in the following steps:
Find a shape for candidate solutions. Generate a few and use them as the "base gene pool". We call these candidate chromosomes.
Define a fitness function to evaluate how close the candidates are to solving the problem. In each "generation" we keep the best candidates, discard the rest.
Define reproduction functions for good candidates so you inch closer to a solution. Reproduction is of two types: a. evolution through mutation - take random positions in the candidate and perturb them; b. crossover - take two parent chromosomes, snip each parent at the same point, and cross over the two segments to generate two progeny - this is akin to "sexual reproduction".
Only put valid new chromosomes in the pool. At each generation check if you already have a solution using the fitness function. If you do, stop. Otherwise, propagate further.
Simple! Now let's do this in code.
First we define a set of candidates we call chromosomes in the GA context. These are 8-tuples just like in the previous code-base from earlier in the article e.g. [6, 4, 7, 1, 8, 2, 5, 3] is a chromosome.
We define a "fitness function". This tells us how close a particular candidate is to being a solution. The fitness function we use below is one that counts the number of "intersections" (i.e. instances when two queens attack each other in a solution candidate). If this value is 0, we have a solution!
def xsection(v):  # fitness function
    # returns a measure of quality for a configuration v
    # a value of 0 means v solves the problem
    r = 0
    b = pconv(v)
    for i in b:
        ds = dchk(i)
        r += len([k for k in b if k in ds])  # number of collisions
    return r
Next, we define the mutate and crossover functions that simulate the evolution of a candidate, and "sexual reproduction" between two candidate solutions, to generate better solutions over time from the gene pool we have. Of course, we pick only the more promising solutions (based on their fitness values) to generate new candidates in future generations. This is akin to Darwin's "survival of the fittest" in action. Along with mutation and crossover, we also add new chromosome material to the gene pool, ensuring that all new chromosomes included are valid ones, i.e. use the digits 1-8 once and only once.
import random  # used for sampling, mutation, and crossover below

def regen():  # generate a new chromosome
    return random.sample([1, 2, 3, 4, 5, 6, 7, 8], 8)

def validc(c1):  # True if the chromosome uses each digit 1-8 exactly once
    r = [1, 2, 3, 4, 5, 6, 7, 8]
    x = [c1.count(i) for i in r]
    if x.count(1) == 8:
        return True
    return False

def xover(c1, c2):  # crossover c1 and c2 chromosomes
    cpt = random.randint(1, 8)  # crossover point
    k1, k2 = c1[:cpt] + c2[cpt:], c1[cpt:] + c2[:cpt]
    while validc(k1) == False or validc(k2) == False:  # retry until both children are valid
        cpt = random.randint(1, 8)
        k1, k2 = c1[:cpt] + c2[cpt:], c1[cpt:] + c2[:cpt]
    return (k1, k2)
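One function referenced later, mutate, does not appear in the listing. Here is a minimal sketch of what it might look like (a hypothetical helper, consistent with how nexgen_pool uses it below): swap two randomly chosen positions so the result is still a valid permutation of 1-8.

def mutate(c1):  # hypothetical sketch: swap two random positions in the chromosome
    t = c1[:]
    i, j = random.sample(range(8), 2)
    t[i], t[j] = t[j], t[i]
    return t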
Now we set up the gene pool defining its size.
pool = []
pool_sz = 5000

def create_gene_pool():  # create gene pool with 5000 chromosome seeds
    global pool, pool_sz
    while len(pool) < pool_sz:
        candidate = regen()
        if candidate not in pool:
            pool += [candidate]
    return pool
We set up variables to detect if a solution has been found, and gather up the solutions as they are found. The rank_pool_chromosomes function ranks each chromosome in the pool using the fitness function so we can keep only the top x%, throw out the rest, and use the retained ones to "breed" candidates for generations to come.
solved = 0
solutions = []

def rank_pool_chromosomes():  # rank pool by chromosome quality using the fitness function
    global solved, pool, solutions
    r = []
    for i in pool:
        r += [(i, xsection(i))]
    r.sort(key=lambda tup: tup[1])
    if r[0][1] == 0:  # at least one chromosome has zero collisions: solved!
        solved = 1
        solns = [i for i in r if i[1] == 0]
        for s in solns:
            solutions += [s]
    pool = [i[0] for i in r]  # reduce back to pool format
    return pool
The nexgen_pool function creates the next generation of the gene pool. The pct input controls how much of the best part of the current generation we retain, while the mp input sets the mutation percentage. With 10% of the pool reserved for fresh random chromosomes, the remainder (1 - mp - 10%) is the allocation the code makes for new chromosomes generated via the crossover method.
def nexgen_pool(pct, mp):
    # repopulate the pool: keep the top pct fraction; mp is the mutation fraction,
    # the crossover fraction is 1 - mp - 0.1 (the last 10% is fresh random chromosomes)
    global pool, pool_sz
    pool = pool[:int(pct*pool_sz)]  # truncate pool to the best candidates
    numm = int((pool_sz - len(pool))*mp)  # number of new mutated chromosomes
    numx = int(pool_sz*0.9) - len(pool) - numm  # number of crossover chromosomes
    if numx % 2 != 0:  # crossover produces children in pairs
        numx -= 1
        numm += 1
    for i in range(numm):
        c1 = random.sample(pool, 1)[0]
        t1 = mutate(c1)
        if validc(t1) and t1 not in pool:
            pool += [t1]
    for i in range(int(numx/2)):
        c1, c2 = random.sample(pool, 2)
        t1, t2 = xover(c1, c2)
        pool += [t1, t2]
    while len(pool) < pool_sz:  # top up with fresh random chromosomes
        t = regen()
        if t not in pool:
            pool += [t]
    return
Finally, we come to the main body of the code. We create the gene pool first; then, until a solution is found, we rank the chromosomes we have, keep the top 20%, and rebuild the rest of the pool: roughly 60% of the new chromosomes come from mutation, about 30% from crossover, and about 10% are fresh random chromosomes. We keep doing this until we find (a) solution(s).
create_gene_pool()
num_gen = 0
while solved == 0:
    print('generation:', num_gen)
    rank_pool_chromosomes()
    nexgen_pool(0.2, 0.6)
    num_gen += 1

print('solutions are:')
for i in solutions:
    print(i[0])
When done, the code prints the below 10 solutions after just a couple of generations. Not bad. Given there is some randomness here, your results may be different. You can check these for correctness against the Wikipedia article.
Prime numbers (positive integers that are only divisible by themselves and 1) are fascinating and have many applications in areas including encryption. In this post, we look at different ways to compute primes and at optimizations we can make to speed up the code as we go. Like true geeks, before we begin, we pay homage to Optimus Prime, the AutoBot from the Transformers that people think first of when anyone says "Prime" - our nod to pop-culture.
Optimus Prime patent: Hiroyuki Obara, Public domain, via Wikimedia Commons
I recently came across the book below, and was (re)fascinated with primes reading some of the articles there. The book was published by the same people behind Quanta magazine, and James Gleick of Chaos fame wrote the foreword. I like his work a lot. Some references:
Solution Approaches & Python Code
Starting out, the below was a naive approach I considered. Factorize every integer and print those as primes that have only itself and 1 as factors.
def factorize(n):  # returns the list of prime factors of n
    r, f, t = [], 2, n  # start with 2 as a trial factor
    while t > 1:
        while t % f == 0:
            r += [f]
            t //= f
        f += 1
    return r
n = 100  # search for primes <= 100
print([i for i in range(2, n+1) if len(factorize(i)) == 1])
As you can probably imagine, this will not work for finding larger primes (say up to 50 million) in a timely way. The next approach we can consider is the Sieve of Eratosthenes. The way this works is by writing out all the numbers from 2 to the top end of your range, marking them all as prime to start with. Then you take the first prime candidate, mark it as a confirmed prime, and mark all of its multiples in the range as non-primes. You continue this till there are no multiples left to mark as non-primes, and you're left with just the primes. The link provided animates this process, so you can see visually how it works. This is truly an ancient method (from ancient Greece, as the name indicates), and can be sped up with computers. However, you still need to create a large array (e.g. if you are mining for all primes less than 1 billion), so some optimizations are called for to squeeze out efficiencies in code.
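For reference, here is a minimal sketch of the classic (unsegmented) sieve, before any of the optimizations discussed below:

def simple_sieve(n):  # returns all primes <= n using the Sieve of Eratosthenes
    is_prime = [True]*(n+1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(n**0.5)+1):
        if is_prime[p]:
            for m in range(p*p, n+1, p):  # mark multiples of p, starting at p*p
                is_prime[m] = False
    return [i for i in range(2, n+1) if is_prime[i]]

print(simple_sieve(100))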
An approach called the "Segmented Sieve of Eratosthenes" is used, which is illustrated in the code below. There are also several additional optimizations done in this code, which are explained in what follows. Note however that the code below can be further improved by using NumPy arrays if in python, or by re-writing it entirely in C, a compiled language. I leave the code below in its raw form with simple data structures to make the logic easier to follow.
def extend(n):
v=[2,3,5];
u=v[:];
for lc in range(n):
sv=v[-1];
tn=sv*sv;
if sv>250000: tn=round(sv+1000000,-6);
print(u,len(v),tn);
#k=input();
u=[i for i in range(sv,tn,2) if (i+1)/6==int((i+1)/6) or (i-1)/6==int((i-1)/6)];
ul=round(u[-1]**0.5);
ul=max([i for i in v if i<=ul])
for j in v[1:v.index(ul)+1]: u=[i for i in u if i%j!=0];
if u[0] in v: v+=u[1:]; # avoid repeats across range extensions
else: v+=u;
return v;
Let's examine this line by line. I first seed the code with a vector of the first 3 primes called v. Then, the loop counter lc iterates through the logic the indicated n number of times. Each time, the code sieves for primes between the largest prime captured so far (v[-1]) and a top end tn, which is either the square of that largest prime or, once the primes get large, that prime plus 1 million rounded to the nearest million. We do this to keep the range reasonable. Then, within the range to be examined, we generate all odd numbers that can be expressed in the form (6n+1) or (6n-1) where n is an integer - all primes MUST satisfy this condition - and these are the prime number candidates in that range. Finally, we check whether each candidate is divisible by any of the primes we have already found, and those that aren't divisible by any are primes we add to our list. Very intuitive, really.
Optimizations used:
Let's discuss the (6n+1) or (6n-1) constraint. If a number is of the form 6n, it is divisible by 6 (duh!), also 6n+2 and 6n+4 are divisible by 2, and 6n+3 is divisible by 3. This leaves 6n+1 and 6n+5 (or 6n-1) as the only numbers in modulo 6 arithmetic that are possible prime candidates. We use that idea here as an optimization.
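A quick sanity check of this claim (a small sketch): every prime above 3 leaves a remainder of 1 or 5 when divided by 6.

primes = [i for i in range(2, 200) if all(i % d != 0 for d in range(2, i))]
print(all(p % 6 in (1, 5) for p in primes if p > 3))  # True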
If we consider the Sieve, each prime p is first needed to eliminate multiples starting at p**2; smaller multiples of p have already been eliminated by smaller primes. We use this idea in the two lines that start with ul in the code above. Let's illustrate this by means of an example. Consider 5. Multiples of 5 are 10, 15, 20, 25, 30, 35, .... All multiples of 5 below 25 (5**2) are divisible by primes smaller than 5, e.g. 10=2*5, 15=3*5, 20=2*2*5. But 25 can only be "disqualified" using 5 as a divisor. What this means is that when checking for divisors of a number N, we only need to check the found primes from 2 up to sqrt(N).
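That observation also gives the familiar trial-division primality check, sketched here for illustration:

def is_prime(N):  # trial division: only divisors up to sqrt(N) need to be checked
    if N < 2:
        return False
    d = 2
    while d*d <= N:
        if N % d == 0:
            return False
        d += 1
    return True

print([i for i in range(2, 50) if is_prime(i)])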
Can we do better with the same data structures, before we convert to using NumPy arrays? Certainly. We can pickle the prime number list into a file, then load it and restart the search for new range extensions whenever we like, and pickle and save the data structure again when done. This is done with the code below: the first block pickles the data structure into a file, and the second loads it back into the program.
import pickle; # pickle to save the data structure
g=open('primes_20211013.pkl','wb');
pickle.dump(p,g); # p is the list of primes mined earlier (e.g. the output of extend)
g.close();
import pickle; # pickle to load the saved data structure
f=open('primes_20211013.pkl','rb');
x=pickle.load(f);
f.close();
With this, we have to update our extend function from above to extend2 as below, so that it loads the data from the pickled object for the next set of extensions. So you can mine one range on one day, and continue mining the next. Of course, things will go blazingly fast(er) if you convert the data structures to use NumPy arrays (if the code is to remain in python), or better yet if you use C.
def extend2(n): # grows prime set read from pickled data
f=open('primes_20211013.pkl','rb');
v=pickle.load(f);
f.close();
u=v[:];
for lc in range(n):
sv=v[-1];
tn=sv*sv;
if sv>250000: tn=round(sv+1000000,-6);
print(u,len(v),tn);
#k=input();
u=[i for i in range(sv,tn,2) if (i+1)/6==int((i+1)/6) or (i-1)/6==int((i-1)/6)];
ul=round(u[-1]**0.5);
ul=max([i for i in v if i<=ul])
for j in v[1:v.index(ul)+1]: u=[i for i in u if i%j!=0];
if u[0] in v: v+=u[1:];
else: v+=u;
return v;
With this, let's look at a couple of videos where people explore some of these concepts in code.
In addition to deterministic methods, there are also probabilistic methods for prime number detection. A popular one is called the Hawkins Sieve. There are others too. Some interesting papers [1], [2], and StackOverflow questions [3][4]. As indicated, for larger ranges, the Sieve of Atkin may be more efficient.
Have you tried other techniques to generate primes that are more efficient? Please share in the comments.
In this blog post, we discuss the book, The Nine Pitfalls of Data Science by Gary Smith and Jay Cordes.
In short, the 9 pitfalls covered in the book are:
Using Bad Data - you have to have the right data to answer any question, and then you have to use the data right. Stanford's Hastie and Tibshirani, in their amazing introductory statistical learning course, take pains to point out how to structure and run through a data study. It always starts with validating the data set you have to see a. if it is the right data, b. if the data is right, c. if it can be fixed for use to answer the question at hand. Incidentally, if you are interested in their work, I cannot recommend their excellent book on Statistical Learning enough - they even made it available for free download via their website!
Putting Data Before Theory - The authors describe the Texas Sharpshooter Fallacy - either "p-hacking" to find, ex-post, conclusions the data supports, or defining expected experimental outcomes after the experiment is complete. They liken this to a sharpshooter either a. picking bulls-eye targets he shot after all shots are fired, or b. painting a bulls-eye around a bullet hole for a shot already fired. This is perhaps the most important pitfall described in the book, and all data scientists should take careful note of it. Marcos Lopez de Prado, financial data scientist extraordinaire who at the time of this writing works at ADIA (the Abu Dhabi sovereign wealth fund), makes this point quite forcefully with excellent examples in his book as well - not an easy book, but required reading for serious data science professionals.
Worshipping Math - the most successful data scientists are not those that can build the most sophisticated models, but those that are able to reason from first principles and build explainable models that capture the spirit of the data-set, and deliver quality insights on time. In fact, some of the best data scientists use the simplest models that can generalize well. Data Science done right is also an art form. To be a competent data scientist, you must know the math. But it is equally important to have an abundance of common sense.
Worshipping Computers - they say more data beats a better algorithm some of the time. There are books that explore how best to leverage computers for statistical inference, and how the field of inferential statistics has evolved as computer power has grown by leaps and bounds in recent years with the wider deployment and use of GPUs and more sophisticated compute infrastructure to solve targeted or niche problems. However, good data scientists understand that compute power doesn't substitute for being able to really understand and leverage data to generate insights.
Torturing Data - Andrew Lo, highly respected professor of quantitative finance at MIT's Sloan School of Management is known to have said "Torture data long enough, and it will confess to anything". This is a common problem one sees in nascent or budding data practices in some firms, where management presses their staff to improve result accuracy by a certain % given a particular data set. The managers do not know how data science works, do not know what quality of signal the data set affords, but press their staff threatening poor reviews unless the managers' requirements are met. Good data scientists will exit such firms as soon as possible. Part of a data scientist's job is to educate senior managers and executives, so they know what is possible, and the firm doesn't become a victim of the data science hype cycle.
Fooling Yourself - "... and you are the easiest person to fool." One issue is that the people doing the analysis really, really want to believe they have generated meaningful results. Some less scrupulous ones might even fudge things at the edges, or might be pressured by their managers to. After all, what difference can a small change in the analysis make? Turns out, sometimes it can have far-reaching consequences.
Confusing Correlation with Causation - this item and the next one are somewhat more technical. Human nature likes to believe it can construct plausible explanations when it sees patterns in data. Some of the patterns may be fleeting, some may be unrelated coincidental occurrences, while other phenomena may have no causal relationship between them at all and instead be driven by a third, hidden phenomenon. Assuming a relationship exists where it doesn't isn't just less useful; it can also hurt when the expected behavior does not repeat itself. An excellent example of this kind of situation is illustrated in the book below.
Being Surprised by Regression Towards the Mean - an oft-told tale is about fighter pilots in training. Those that did poorly in a session and got yelled at did better in the next one, while those that excelled and got praised did less well in the next session. Did the praise or admonition really make the difference? Or was it simply that the sessions' poor performance and out-performance were deviations from the norm, and things reverted to the norm quickly in the sessions after? Data Scientists should keep this idea in mind.
Doing Harm - doctors take the Hippocratic Oath which states "first, do no harm" as they practice medicine. While there isn't an equivalent oath for data scientists, true professionals would do well to keep that in mind as they embark on solving data problems for the Enterprise, particularly where the consequences of errors can be large or devastating for employees, customers, shareholders, or other stakeholders. Some years ago, there was talk of having a similar oath being administered to MBA students graduating from top schools. Not sure where that initiative stands today.
What I liked about this book is that the arguments are structured and presented in a clear and compelling way. Being a seasoned data practitioner and senior data scientist myself, I have seen, and can appreciate, the need for a book like this, particularly to warn the unwary or the neophytes about potential issues that lurk in the dark. In addition, reading this book would serve as a refresher for seasoned data professionals on things they knew and perhaps forgot as they moved through the mists of time.
Having said that, I believe some of the examples they used could have been better chosen, structured, or presented to make their points more forcefully. This isn't an academic tome - it is a highly readable treatise, so more examples presented more lucidly would only add to the quality of the presentation and contribute to greater reader impact. In that sense, perhaps an improved second edition would add more value. But... definitely worth the relatively easy read nonetheless.
There are also lists of much more technical aspects in data science that merit careful consideration by practitioners in the field. One such talk by Mark Landry from H20.ai is linked below.
Recently I have been reading a few books from Manning Publications - for those that don't know, Manning produces books with artistic renderings of people on their covers, just like O'Reilly publishes books with animals on top.
Two that caught my eye were these:
First off, when reading any book, the reader has to go in with the right expectations. Neither of these is written with a strong technical focus - on AI or on data science. Rather, they are geared towards managers that need to come up to speed to put in place an optimal structure to enable projects in these areas to be successful. (Not that management is easy by any means, it's just that the skillset needed there is different from that needed to be a competent data scientist.)
What I particularly liked about the first book is that it provides a nice framework to evaluate if your AI-based approach towards solving a business problem is going to a. meet the business needs, and b. deliver sufficient value given the costs that will be incurred. Different books discuss different ways of setting up an AI/ML/DS pipeline, so I am not wedded to a particular model there, but the framework for doing a cost/benefit analysis presented here is quite interesting.
The second book above explains at length how people at different levels in the organization can contribute to the success of a data science project or practice - whether they are leading functions, departments, projects, or tasks.
These are very easy reads without too much technical content. Some might argue that portions of these books would apply to other projects in the workplace as well, not just AI/ML/DS ones, and that would be justified.
One quibble I have regarding these books is the cost. For the value they give you, I think ~$40/book is really quite steep. And I don't think these quite rate as "classics" (not yet, anyway) that I'd want to keep handy on my bookshelf. Perhaps make some notes on the models and frameworks, but that's about it. So yes, these are worth reading - will perhaps take a couple of hours each, but I'd get them from the library. Incidentally, Manning books are (sometimes significantly) cheaper when purchased directly from their webstore (Bing it), than from Amazon.
Another Manning book that I read was this one:
This is quite technical, but as you dig into it, there is a lot of knowledge, and even some wisdom you can get from it. It starts by giving examples of microservices from some application contexts, and then delves quite deeply into microservice patterns that are quite well explained with helpful diagrams. It also does a very good job describing layering - where you want to solve business problems, where to address the technical issues, how to minimize technical debt, evolve architectures, the relative strengths and weaknesses of various programming and computing paradigms, etc. While not perfect, I'd say it is extremely well done, and highly recommended reading if you need to learn more about microservices.
One book I am looking forward to is this one:
Luis Serrano, the author, has an excellent YouTube channel where he explains complex machine learning algorithms in very simple terms so anyone can understand them. If he carries forward that ability into his book (I haven't read it, but I hope he has), the book should be a pleasure to read.
Speaking of YouTube channels for machine learning, the below two do a good job explaining algorithms and applications. Perhaps not as flamboyant as the Siraj Raval tutorials, but very good in content nonetheless. See for example RitwikMath's video on Markov Chain Monte Carlo, and Luis Serrano's video on Generative Adversarial Networks. Excellent work!
RitwikMath and Luis Serrano YouTube videos for Machine Learning algorithms
What does it take to be a leader? What principles do you follow? Can leadership be taught? These are all oft-asked questions. I recently had the pleasure of reading what two coaches had to say about this. I read management experts all the time, so this is a bit of a departure from reading what (say) Peter Drucker had to say. Both coaches were legends in their own right - John Wooden who led the UCLA team to several championship wins, and Lou Holtz - renowned college football coach. The books here:
I wasn't really ever great at sports, though I enjoyed and still enjoy playing for recreation. However, the stories these coaches tell, their unrelenting focus on excellence, the impact they had on young impressionable athletes, and their principled approach to success are something we can all learn from.
The journey is better than the inn. -- Miguel de Cervantes
They both come from very humble beginnings. They did not particularly seek fame, and Wooden, at least, expressed a minor dislike at how it cramped his style. The advice they share is timeless: focus on the fundamentals - hone your craft and sweat the details; work hard - the harder you work, the luckier you will get; focus on your team, not yourself; walk your talk; and finally, concentrate on the process and the results will take care of themselves.
They express these in terms of getting team victories, and how you can grow into a leader by practicing these consistently with honesty, dedication, and focus, so people will want you to lead them, will willingly follow, and together you will be able to realize your dream goal. There is of course more to these books (e.g. the "success pyramid", etc.), but I'm not going for a full point-by-point summary here.
copyright-free image from Pixabay.com
Victory here is measured by how close one comes to delivering their peak performance - not whether you have the highest score on the board at the end of the game, or even whether you won or lost. If you do your absolute best when called on, you have won. Great sentiment to live by.
The Vince Lombardi books and quotes are quite well-known. I've read them and they are inspirational. But the above books are also good if in a somewhat more understated way.
The difference between a successful person and others is not a lack of strength, not a lack of knowledge, but rather a lack of will. -- Vince Lombardi
For inspiration and motivation, this is one of my favorite talks - by Arnold Schwarzenegger. It's 12 minutes well spent.
Recently, a friend pointed me to this poem about the need to take some pains in life to grow and develop yourself so you can see success. It's called "Good Timber" by Douglas Malloch. My favorite stanza:
Good timber does not grow with ease:
The stronger wind, the stronger trees;
The further sky, the greater length;
The more the storm, the more the strength.
By sun and cold, by rain and snow,
In trees and men good timbers grow.
-- excerpt from "Good Timber" by Douglas Malloch
What are some other good leadership books written by sports titans that you like? Please share in the comments.
The term "Lean Startup" was popularized by Eric Reis in his popular book of the same name. In it, he talks about ideas to setup, build, and grow a startup while leveraging lean principles - high levels of efficiency with minimal waste. In this post, we explore some of these ideas in some detail, and I also point you to relevant YouTube videos for those of you interested in learning more.
Key Principles
The Build-Measure-Learn feedback loop - everything you do must be with a view to learning about your domain or environment: first formulate your hypotheses and define what you want to measure to test them, then build things out, measure the quantities of interest, and learn from your experiments. If you fail, you fail early and move forward to conduct new and different experiments.
Everything is a grand experiment - build a series of MVPs - Minimum Viable Products - anything that doesn't lead to validated learning is considered a waste.
He also defines different types of MVPs:
Video MVPs - some startups had great success setting up a website with fake videos illustrating not yet available functionality to get people to sign up early. This tests marketplace demand before too much potentially wasteful expenditure is incurred in building out features.
Concierge MVPs - focus on a few customers at a time, build your product to satisfy them, then move on to other, newer, adjacent customer segments, conquering them one at a time.
Wizard of Oz MVPs - sign up customers, and do manual work initially "behind the curtain", delight users, and keep automating as traffic volumes pick up.
Photo by Mikael Blomkvist from Pexels
He explains that there are different types of Growth Engines that target different attributes of the business. You have to pick one (and only one) that represents what is most important to you at a given time in your company's lifecycle, though you may change engines when the situation calls for it:
Stickiness focused - you expect clients to stay with you long term e.g. as subscribers. Here the metrics you track are things like Customer Acquisition rate and Churn rate. Some businesses where this kind of engine makes sense are firms like Gillette, and the HP Printers business.
Virality Engine - you want more and more people to sign up for your company's products or services, measuring things like viral coefficient compounded if possible with Word Of Mouth recommendations. Social Networks apply this paradigm all the time.
Paid Engine - use advertising to get customers. Measure things like cost per acquisition, lifetime customer value, and related metrics. Widely used in the retail ecommerce domain.
Pivot or Persevere? Typically startups begin with an end in mind - a target audience, a target market, a geography they are going after, etc. They build a baseline MVP, try to tune it towards the ideal state, and hope for the best. But sometimes it is prudent to change direction - to refine or fine-tune the product, market, customer segment, etc. - to get more revenue and stay viable. So they have to decide: stay the course, or pivot to try something related to grow better as a firm? He explains different kinds of pivots startups go through:
Customer Segment Pivot - when a change in customer segment is called for, to reach higher revenue gains.
Value Capture Pivot - when you change how you intend to monetize the product you plan to sell. You have to hit revenue targets while protecting your margins after all.
Growth Engine Pivot - this happens when you change your core growth focus, looking at the engine options (stickiness, virality, paid) discussed above.
Here are some additional resources for those interested to explore the topic further. Enjoy!
The first video above is a quick summary of the key takeaways from the book, while the second is a talk the author gave at Google. Both make for interesting watching, and are time well-spent to learn more about the lean startup methodology. Most importantly, in startups as in life, it matters that we focus our energies to do what really counts.
"The worst thing in the world is to do something well that need not have been done at all" -- Peter Drucker (celebrated Management Guru)