Wednesday, November 10, 2021

2021 | A Couple of Coding Interview Questions

In this post we explore a couple more simple coding interview questions. These were recently given to 4th grade math students to solve with paper and pencil, so they should be fair game for job interviews. My approach below is not the most efficient for either question, but each solution can be coded in 3-5 minutes. You should of course try to improve on these. Solving them analytically with paper and pencil is "left as an exercise to the reader".

Question 1: using the digits from 1 through 9 only once each, find two numbers such that the second is twice the first. 

My approach here starts with common sense that I then codify into the program. Given the digits 1 to 9, used once each, the only workable split is a 4-digit number (call it x) and a 5-digit number (call it y), with y being double x - x and 2x together must account for all nine digits. Before reading further, convince yourself that the leftmost digit of y has to be a 1 for the math to work. The code below works regardless of whether you figure this out or not, but setting the parameters with this understanding simplifies the calculations and makes your code more efficient. Here it is:

import itertools; # useful in solving combinatorics problems like this one

def nrep(y): # returns True if no digit repeats in y
 t=str(y);
 for i in t: 
  if t.count(i)>1: return False;
 return True;

def xdig(x,y): # returns True if there is no digit overlap between x and y
 return len(list(set(str(x)).intersection(set(str(y)))))==0;

def dchk(x,n): # returns True if n is a digit of x
 return str(n) in str(x);

def nstr(a,b): # returns a string of int a followed by int b
 return str(a)+str(b);

def alldigs(N): # returns True if N has all digits from 1 to 9 inclusive
 N=str(N);
 for i in range(1,10): 
  if N.count(str(i))!=1: return False;
 return True;

digs=[2,3,4,5,6,7,8,9]; # can try this with 1 included, will yield same answer

x=itertools.permutations(digs,4); # generate all permutations
r=[]; # list will hold answers when done
for i in x: 
 n=i[0]*1000+i[1]*100+i[2]*10+i[3]; # construct 4 digit number from digit permutation
 y=2*n; # y is 2 times x
 if nrep(y) and xdig(n,y) and not dchk(y,0) and alldigs(nstr(n,y)): r+=[(n,y)];

There are 12 answers to this problem. The number pairs that work for x and y are: [(6729, 13458), (6792, 13584), (6927, 13854), (7269, 14538), (7293, 14586), (7329, 14658), (7692, 15384), (7923, 15846), (7932, 15864), (9267, 18534), (9273, 18546), (9327, 18654)]


Question 2: The digits from 0 through 9 inclusive should each show up in the multiplication below exactly once. What are the numbers that satisfy this equation? (a, b, c, d, e, f, g, h are all digits.)

abc x de = fg0h8

To solve this problem, we reuse some of the functions already defined above, and add a couple more to help with our work. We call the number with digits abc x, and the number with digits de y. Code and answers follow below.

def dig(N,n): 
# returns the digit of N at (0-based) index n from the left; n=-1 gives the rightmost digit
 return int(str(N)[n]);

x=[i for i in range(100,1000) if nrep(i)];
y=[i for i in range(10,100) if nrep(i)];
z=[]; # collect answers here
for i in x: 
 for j in y:
  t=i*j;
  if len(str(t))==5 and dig(t,2)==0 and dig(t,-1)==8 and xdig(t,i) and xdig(t,j) and xdig(i,j) and nrep(t): 
   z+=[(i,j,t)]; # z holds answers!

z is: [(297, 54, 16038), (594, 27, 16038)] - there are 2 possible solutions.

Photo by Miguel Á. Padriñán from Pexels

Why is it important and useful to learn to solve these kinds of problems in python? 

With the pandemic, interviewing and hiring draw on a more global talent pool, and I regularly interview candidates remotely. How can I ask a candidate to code something and get a quick sense of whether they code correctly, even if I only see their code at the end rather than as they write it? Problems like these - where there is a definite correct answer that can be verified, and where the answer cannot easily be googled (or "binged", if that is also a verb) - are useful in such situations. I use these kinds of questions regularly these days.

Another couple of interesting questions, left to the reader: 

  1. True or False: the 30th leftmost digit of the 3000th Fibonacci number is prime. [The Fibonacci series starts 0, 1, 1, 2, 3, ... and each successive number is the sum of the previous two.] This one is trivial for python enthusiasts; a short sketch follows after this list.
  2. (***) What positive number doubles when its last digit becomes its first digit? Very challenging problem! See how Presh Talwalkar solves it here. For those that don't know, his "Mind Your Decisions" YouTube channel has some excellent math puzzles with solutions that help sharpen your thinking.
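
Here is a minimal sketch for the first of these, assuming the usual indexing F(0)=0, F(1)=1 (the indexing convention is the only real decision to make); I leave the printed verdict to you:

def fib(n): # returns F(n), with F(0)=0 and F(1)=1
 a,b=0,1;
 for _ in range(n): a,b=b,a+b;
 return a;

d=int(str(fib(3000))[29]); # the 30th leftmost digit
print(d, d in (2,3,5,7)); # True if that digit is prime
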
Please share other interesting puzzles you've come across in the comments. Thanks for reading.



Thursday, October 14, 2021

2021 | Solving the 8 Queens Problem

The Queen (sometimes also called the Minister) is the most versatile piece on the chess board. Queens can move any number of squares diagonally like camels or bishops, and along straight lines like rooks or elephants. The only thing they cannot do is jump over obstacles the way the knight or horse pieces can.


The 8-Queens Puzzle asks that we place 8 such queen pieces on a standard 8x8 chessboard so that no two queens attack each other. It sounds easier than it is, so give it a try. The Wikipedia article presents the problem, its history, and the set of solutions that work. In this post, we study how we might solve this problem on a computer.

One solution, i.e. one configuration that meets the specification, is below. The article also tells us that there are 92 possible solutions in total, each derived from a set of 12 "base solutions" through operations like reflection and rotation. Here's also a video that discusses the problem in more detail.

Representation

We see that any valid solution must have exactly one queen in every row and every column. So, minimally, even before we check the diagonal constraints, candidate configurations can be represented as 8-tuples, or lists of 8 elements, holding the digits 1 to 8 (given the 8x8 size of the chessboard) with no repeats. The above chessboard would thus be represented as [5, 3, 1, 7, 2, 8, 6, 4], i.e. column 1 has a queen at the fifth row from the bottom, column 2 has one at the 3rd row from the bottom, and so on.

Invariants

Rotation and reflection do not change the validity of a valid solution. To see this, imagine the above chess board transformed into one represented as [4, 6, 8, 2, 7, 1, 3, 5]. This is merely a reflection of the above board along a horizontal axis. The solution remains valid. 

Next consider the chessboard represented by turning the above board 90 degrees in the clockwise direction. We end up with [6, 4, 7, 1, 8, 2, 5, 3]. Also a valid configuration of pieces for the 8 queens problem. 

Of course, other transformed configurations of the pictured board are possible, by rotating it 180 degrees, 270 degrees, and by reflecting about the other (i.e. vertical) axis as well. 

We solve this two ways - first using traditional computing, and second using genetic algorithms. In the first case, we solve the problem to find the full set of 92 possible configurations, then "whittle away the shadows" to find the core set of 12 solutions. With genetic algorithms, we outline how the approach works in code, then look at at least one solution the program generates. Let's get started.

Types of Approaches

There are different ways one might solve this problem. One way would be to generate all possible configurations with queens placed on different rows and columns, and then check which of these is valid. Given the search space is relatively small i.e. 8*7*6*5*4*3*2 = 40320 possibilities (convince yourself you understand why this is so before reading further), we can easily examine the full space to find possible solutions. After finding the full set of solutions, we can eliminate reflections and rotations to be left with the "core set of configurations of interest".

A second traditional computing approach might be to place the first queen in one of the rows of column 1, then progressively place new queens on as-yet-unused rows in the columns to come, checking at each step that the queens placed so far do not attack each other. If they do, try the other possible rows in that column; if none works, backtrack to an earlier column, move that queen to a different row, and then move forward again trying to fit the next queen, and so on until you find a configuration that solves the problem. This method uses branch-and-bound and backtracking ideas from computer science. But again, given the search space is so small, we just use the approach from the previous paragraph and ignore this one.
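
For completeness, here is a minimal sketch of what that backtracking idea might look like - it is not the code used in the rest of this post, and the function and variable names are purely illustrative:

def place(cols=[]): # cols[i] holds the row of the queen already placed in column i+1
 if len(cols)==8: return [cols[:]]; # all 8 queens placed - record the solution
 found=[];
 for row in range(1,9):
  if row in cols: continue; # row already taken by an earlier queen
  if any(abs(row-r)==len(cols)-c for c,r in enumerate(cols)): continue; # diagonal clash
  found+=place(cols+[row]); # place the queen and move on to the next column
 return found;

print(len(place())); # should report 92, matching the count discussed below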

A third approach we can try is one that uses Genetic Algorithms. We explain this in detail in a later section. 

Traditional Search-Based Solution

import itertools;
s=[1,2,3,4,5,6,7,8];

x=itertools.permutations(s,8);
t=[];
for i in x: t+=[i];

First, we use the itertools module and generate all possible unique permutations of the digits 1 through 8. These are the configurations we test.

def pconv(pm):
 u=[];
 for i in range(1,9): u+=[(i,pm[i-1])];
 return u;

def punconv(u): return tuple([i[1] for i in u]);

Then we define two helper functions. pconv converts a board from [6, 4, 7, 1, 8, 2, 5, 3] to [(1,6), (2,4), (3,7), (4,1), (5,8), (6,2), (7,5), (8,3)] - this helps with computations, since each position is expressed as a pair of coordinates in the x-y grid. punconv performs the reverse transformation so we can still print things out compactly.

def dchk(j): #cth col has queen in row n
 c,n=j[0],j[1];
 v=[]
 k1,k2=n,n;
 for i in range(c-1,0,-1):
  k1-=1;
  if k1>0: v+=[(i,k1)];
  k2+=1;
  if k2<9: v+=[(i,k2)];
 k1,k2=n,n;
 for i in range(c+1,9,1):
  k1-=1;
  if k1>0: v+=[(i,k1)];
  k2+=1;
  if k2<9: v+=[(i,k2)];
 return v;

The very important function dchk takes as input a 2-tuple (c,n), which indicates a queen is placed in column c in row n, and computes all other cells on the board the queen attacks diagonally (horizontal and vertical attacks are already accounted for by not allowing any two queens to share a row or column in our setup). So, for example, if a queen is placed at (5,4), then dchk would return a list of positions including (4,3), (4,5), (3,2), (3,6), ... to the left, and (6,5), (6,3), (7,6), (7,2), ... to the right. We can then check if these cells are occupied by other queens, and if so, disqualify a solution candidate.
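
A quick check of that example, assuming the functions above have been defined:

print(dchk((5,4))[:4]); # [(4, 3), (4, 5), (3, 2), (3, 6)] - the first few attacked cells to the left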

def rotate(v): # rotated board configurations
 N=8;
 r=[];
 b=pconv(v);
 b1=b[:];
 u=[];
 while u!=b:
  u=[];
  for i in b1: u+=[(i[1],N-i[0]+1)];
  u.sort(key=lambda tup: tup[0]);
  #print(u);
  b1=u[:];
  r+=[punconv(u[:])];
 return r[:-1];

def reflect(v): # reflected board configurations
 N=8;
 b=pconv(v);
 m=[];
 for i in b: m+=[(i[0],8-i[1]+1)];
 b.reverse();
 return [tuple(punconv(m)),tuple(punconv(b))];

These two functions are the rotation and reflection transforms discussed earlier, presented above in code form. Pay particular attention to the rotate function. This is an oft-asked programming question in job interviews. It can be mildly challenging to write correctly under pressure.
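
A quick sanity check against the Invariants section above:

b=[5,3,1,7,2,8,6,4]; # the pictured board
print(rotate(b)[0]); # (6, 4, 7, 1, 8, 2, 5, 3) - the 90 degree clockwise rotation
print(reflect(b)[0]); # (4, 6, 8, 2, 7, 1, 3, 5) - the reflection about the horizontal axis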

Now, with all the pieces in place, the next steps are relatively easy. We check each of the candidates generated from permutations of the set [1, 2, 3, 4, 5, 6, 7, 8] to see if there are collisions (i.e. pieces attack each other), and if not, we have a solution! 

u=[];
for i in t:
 f=0;
 x=pconv(i);
 for j in x:
  v=dchk(j);
  if len(list(set(v).intersection(set(x))))!=0:
   f=1;
   break;
 if f==0: u+=[i];

Here, we take each candidate generated using itertools earlier, then check if any of the diagonal attack positions from dchk for each queen on the board are occupied by other queens. If not, we have a solution! So with this loop complete, u is a list of possible solutions. We can verify that there are 92 of them when this part of the code completes, which agrees with what the Wikipedia article says. But wait, before popping the champagne, we need to eliminate duplicates - the reflections and rotations we spoke of earlier!

v=u[:];
rr=[];
dx={};
for i in u:
 r=reflect(i);
 if r[0] in v: v.remove(r[0]);
 if r[1] in v: v.remove(r[1]);
 if i in v:
  dx[i]=[];
  dx[i]+=[r[0],r[1]];
  rr+=[i];
  x=rotate(i);
  dx[i]+=x;
  for k in x:
   if k in v: v.remove(k);
   r=reflect(k);
   if r[0] in v: v.remove(r[0]);
   if r[1] in v: v.remove(r[1]);
   dx[i]+=[r[0],r[1]];
  dx[i]=list(set(dx[i]));

We do that in the code above. For each solution, we generate its reflections and tag them against that solution, and we do the same with its rotations. All the while, we remove these transformed versions from the candidate list, so only one representative of each family lands in the unique solution set (variable rr). When the loop completes, each unique solution in rr is a key in the dictionary dx, with a list of the reflections and rotations that correspond to it. Cool, huh?

With that, we are done. We have all 12 unique solutions the Wikipedia article lists, and we can verify this easily.

The Genetic Algorithms Approach

Now let's look at solving the problem with genetic algorithms (GA). This post is already too long, so I won't explain all the details of genetic algorithms here, but will present the code outlining the main points of the solution. Our goal here is to find at least one solution. To find more solutions, simply run the code for "more generations".

Generally speaking the GA code proceeds in the following steps: 

  1. Find a shape for candidate solutions. Generate a few and use them as the "base gene pool". We call these candidate chromosomes. 
  2. Define a fitness function to evaluate how close the candidates are to solving the problem. In each "generation" we keep the best candidates, discard the rest. 
  3. Define reproduction functions for good candidates so you inch closer to a solution. Reproduction is of two types: a. evolution through mutation - take random positions in the candidate and perturb them; b. crossover - take two parent chromosomes, snip each parent at the same point, and cross over the two segments to generate two progeny - this is akin to "sexual reproduction".
  4. Only put valid new chromosomes in the pool. At each generation check if you already have a solution using the fitness function. If you do, stop. Otherwise, propagate further.

Simple! Now let's do this in code.

First we define a set of candidates we call chromosomes in the GA context. These are 8-tuples just like in the previous code-base from earlier in the article e.g. [6, 4, 7, 1, 8, 2, 5, 3] is a chromosome. 

We define a "fitness function". This tells us how close a particular candidate is to being a solution. The fitness function we use below is one that counts the number of "intersections" (i.e. instances when two queens attack each other in a solution candidate). If this value is 0, we have a solution!

def xsection(v): # fitness function
 # returns a measure of quality for the configuration v
 # 0 metric means v solves the problem
 r=0;
 b=pconv(v);
 for i in b:
  ds=dchk(i);
  r+=len([k for k in b if k in ds]); # number of collisions
 return r;

Next, we define the mutate and crossover functions that simulate the evolution of a candidate, and "sexual reproduction" between two candidate solutions, to generate better solutions over time from the gene pool we have. Of course, we pick only the more promising solutions (based on their fitness values) to generate new candidates in future generations. This is akin to Darwin's "survival of the fittest" in action. Along with mutation and crossover, we also add new chromosome material to the gene pool, ensuring that all new chromosomes included are valid ones i.e. use the digits 1-8 once and only once.

import random; # used below for mutation, crossover, and pool sampling

def mutate(c1): # mutate the chromosome by swapping two pairs of positions
 n=random.sample(range(8),4); # four distinct positions (0-7) chosen for the two swaps
 t1,t2=c1[n[0]],c1[n[2]];
 c1[n[0]]=c1[n[1]];
 c1[n[2]]=c1[n[3]];
 c1[n[1]]=t1;
 c1[n[3]]=t2;
 return c1;

def regen(): # generate new chromosome
 return random.sample([1,2,3,4,5,6,7,8],8);

def validc(c1):
 r=[1,2,3,4,5,6,7,8]
 x=[c1.count(i) for i in r];
 if x.count(1)==8: return True;
 return False;

def xover(c1,c2): # crossover c1 and c2 chromosomes
 cpt=random.randint(1,8); # crossover point; cpt=8 simply returns copies of the parents
 k1,k2=c1[:cpt]+c2[cpt:],c1[cpt:]+c2[:cpt];
 while validc(k1)==False or validc(k2)==False: # retry until both children are valid permutations
  cpt=random.randint(1,8);
  k1,k2=c1[:cpt]+c2[cpt:],c1[cpt:]+c2[:cpt];
 return (k1,k2);

Now we set up the gene pool defining its size. 

pool=[];
pool_sz=5000;

def create_gene_pool():
 # create gene pool with 5000 chromosome seeds
 global pool,pool_sz;
 while len(pool)<pool_sz:
  candidate=regen();
  if candidate not in pool: pool+=[candidate];
 return pool;

We set up variables to detect if a solution has been found, and gather up the solutions as they are found. The rank_pool_chromosomes function ranks each chromosome in the pool using the fitness function so we can keep only the top x%, throw out the rest, and use the retained ones to "breed" candidates for generations to come.

solved=0;
solutions=[];

def rank_pool_chromosomes():
 global solved, pool, solutions;
 # rank pool by chromosome quality using fitness function
 r=[];
 for i in pool: r+=[(i,xsection(i))];
 r.sort(key=lambda tup: tup[1]);
 if r[0][1]==0:
  solved=1;
  solns=[i for i in r if i[1]==0];
  for s in solns: solutions+=[s];  
 # reduce back to pool format
 pool=[i[0] for i in r];
 return pool;

The nexgen_pool function creates the next generation of the gene pool. The pct input controls how much of the best of the current generation we retain, while mp sets the fraction of replacements generated by mutation. With about 10% of the pool reserved for brand-new genetic material, roughly 1 - mp - 0.1 of the replacements are generated via the crossover method.

def nexgen_pool(pct,mp):
 # repopulate pool, replacing the bottom (1-pct) with new chromosomes
 # mp is the mutate %, crossover % is roughly 1-mp-0.1
 global pool,pool_sz;
 pool=pool[:int(pct*pool_sz)]; # truncate pool, keeping only the best pct
 numm=int((pool_sz-len(pool))*mp); # number of new mutated chromosomes
 numx=int(pool_sz*0.9)-len(pool)-numm; # number of crossover chromosomes
 if numx%2!=0:
  numx-=1;
  numm+=1;
 for i in range(numm):
  c1=random.choice(pool)[:]; # copy, so the pool member itself is not mutated in place
  t1=mutate(c1);
  if validc(t1) and t1 not in pool: pool+=[t1];
 for i in range(int(numx/2)):
  c1,c2=random.sample(pool,2);
  t1,t2=xover(c1,c2);
  pool+=[t1,t2];
 while len(pool)<pool_sz: # top up with fresh random chromosomes
  t=regen();
  if t not in pool: pool+=[t];
 return;

Finally, we come to the main body of the code. We create the gene pool first; then, until a solution is found, we rank the chromosomes we have, keep the top 20%, and refill the pool - roughly 60% of the replacements come from mutating retained chromosomes, about 30% from crossover, and about 10% are brand-new random chromosomes. We keep doing this until we find a solution (or several).

create_gene_pool();
num_gen=0;
while solved==0:
 print('generation:',num_gen);
 rank_pool_chromosomes();
 nexgen_pool(0.2,0.6);
 num_gen+=1;

print('solutions are:');
for i in solutions: print(i[0]); 

When done, the code prints the below 10 solutions after just a couple of generations. Not bad. Given there is some randomness here, your results may be different. You can check these for correctness against the Wikipedia article.

[2, 5, 7, 1, 3, 8, 6, 4]
[3, 6, 8, 1, 4, 7, 5, 2]
[5, 7, 2, 4, 8, 1, 3, 6]
[4, 7, 5, 3, 1, 6, 8, 2]
[4, 2, 7, 5, 1, 8, 6, 3]
[6, 4, 7, 1, 8, 2, 5, 3]
[5, 7, 1, 4, 2, 8, 6, 3]
[4, 2, 7, 3, 6, 8, 1, 5]
[2, 4, 6, 8, 3, 1, 7, 5]
[6, 3, 1, 8, 4, 2, 7, 5]

There you have it - the 8 Queens Problem solved two different ways. Some interesting videos on various approaches:

 

Have you solved this in other ways that I don't discuss here? Please share in the comments.







Wednesday, October 13, 2021

2021 | Primes!

Prime numbers (integers greater than 1 that are divisible only by themselves and 1) are fascinating and have many applications in areas including encryption. In this post, we look at different ways to compute primes and at optimizations we can make to speed up the code as we go. Like true geeks, before we begin, we pay homage to Optimus Prime, the Autobot from the Transformers that people think of first when anyone says "Prime" - our nod to pop culture.

Optimus Prime patent: Hiroyuki Obara, Public domain, via Wikimedia Commons

I recently came across the book below, and was (re)fascinated with primes reading some of the articles in it. The book was published by the same people behind Quanta magazine, and James Gleick of Chaos fame wrote the foreword. I like his work a lot. Some references:

 

Solution Approaches & Python Code

Starting out, the below was a naive approach I considered: factorize every integer and print as primes those that have only themselves and 1 as factors.

def factorize(n): 
 r,f,t=[],2,n; # start with 2 as a factor
 while t>1: 
  while t%f==0: 
   r+=[f];
   t//=f; # integer division keeps t an int
  f+=1;
 return r;

n=100; # search for primes <=100
print([i for i in range(2,n+1) if len(factorize(i))==1]); # primes have exactly one prime factor

As you can probably imagine, this will not work for finding larger primes (say, up to 50 million) in a timely way. The next approach we can consider is the Sieve of Eratosthenes. It works by writing out all the numbers from 2 up to the top end of your range, initially treating them all as candidate primes. Then you take the first unmarked candidate, confirm it as a prime, and mark all of its multiples in the range as non-primes. You continue until there are no multiples left to mark, and you're left with just the primes. The link provided animates this process, so you can see visually how it works. This is truly an ancient method (from ancient Greece, as the name indicates), and it can be sped up with computers. However, you still need to create a large array (if, say, you are mining for all primes less than 1 billion), so some optimizations are called for to squeeze out efficiencies in code.
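
For reference, a minimal version of the plain (unsegmented) sieve might look like the sketch below; the segmented code that follows is what we actually use:

def basic_sieve(n): # primes <= n using the Sieve of Eratosthenes
 is_prime=[True]*(n+1);
 is_prime[0]=is_prime[1]=False;
 for p in range(2,int(n**0.5)+1):
  if is_prime[p]:
   for m in range(p*p,n+1,p): is_prime[m]=False; # cross off multiples, starting at p*p
 return [i for i in range(2,n+1) if is_prime[i]];

print(basic_sieve(50)); # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]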


An approach called the "Segmented Sieve of Eratosthenes" is used, illustrated in the code below. There are also several additional optimizations in this code, explained in what follows. Note however that the code below can be further improved by using NumPy arrays (if staying in python), or by re-writing it entirely in C, a compiled language. I leave the code in its raw form with simple data structures to make the logic easier to follow.

def extend(n):
 v=[2,3,5];
 u=v[:];
 for lc in range(n):
  sv=v[-1];
  tn=sv*sv;
  if sv>250000: tn=round(sv+1000000,-6);
  print(u,len(v),tn);
  #k=input();
  u=[i for i in range(sv,tn,2) if (i+1)/6==int((i+1)/6) or (i-1)/6==int((i-1)/6)];
  ul=round(u[-1]**0.5);
  ul=max([i for i in v if i<=ul])
  for j in v[1:v.index(ul)+1]: u=[i for i in u if i%j!=0];
  if u[0] in v: v+=u[1:]; # avoid repeats across range extensions
  else: v+=u;
 return v;

Let's examine this line by line. I first seed the code with a vector of the first 3 primes called v. Then the loop counter lc iterates the indicated n number of times through the logic. Each time, the code looks for new primes between the largest prime captured so far (v[-1]) and a top end that is either the square of that prime or, once the primes get large, that prime plus 1 million rounded to the nearest million - we do this to keep each range reasonable. Within this range, we generate all odd numbers that can be expressed in the form (6n+1) or (6n-1) where n is an integer - all primes MUST satisfy this condition - and these become the prime number candidates for that range. Finally, we check whether each candidate is divisible by any of the primes we have already found, and those that aren't divisible by any of them are primes we add to our list. Very intuitive, really.

Optimizations used: 
  1. Let's discuss the (6n+1) or (6n-1) constraint. If a number is of the form 6n, it is divisible by 6 (duh!), also 6n+2 and 6n+4 are divisible by 2, and 6n+3 is divisible by 3. This leaves 6n+1 and 6n+5 (or 6n-1) as the only numbers in modulo 6 arithmetic that are possible prime candidates. We use that idea here as an optimization.
  2. If we consider the Sieve, each prime p first becomes essential for eliminating multiples at p**2. We use this idea in the two lines that start with ul in the code above. Let's illustrate this by means of an example. Consider 5. Multiples of 5 are 10, 15, 20, 25, 30, 35, .... Now, all multiples below 25 (5**2) are also divisible by primes smaller than 5, e.g. 10=2*5, 15=3*5, 20=2*2*5. But 25 can only be "disqualified" using 5 as a divisor. What this means is that when checking for divisors of a number N, we only need to check the primes found so far, from 2 up to sqrt(N).

Can we do better with the same data structures, before we convert to using NumPy arrays? Certainly. We can pickle the prime number list into a file, then load it and restart the search for new range extensions whenever we like, and pickle and save the data structure again when done. This is done with the code below: the first block pickles the data structure into a file, and the second loads it back into the program.
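
The dump below assumes a variable p holding the prime list built so far; for example (illustrative):

p=extend(10); # run ten range extensions and keep the resulting list of primes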

import pickle; # pickle to save the data structure
g=open('primes_20211013.pkl','wb');
pickle.dump(p,g);
g.close();

import pickle; # load the pickled data structure back
f=open('primes_20211013.pkl','rb');
x=pickle.load(f);
f.close();

With this, we update our extend function from above to extend2 below, so it loads the data from the pickled object before the next set of extensions. So you can mine one range one day, and continue mining the next. Of course, things will go blazingly fast(er) if you convert the data structures to NumPy arrays (if the code is to stay in python), or better yet, if you use C.

def extend2(n): # grows prime set read from pickled data
 f=open('primes_20211013.pkl','rb'); 
 v=pickle.load(f);
 f.close();
 u=v[:];
 for lc in range(n):
  sv=v[-1];
  tn=sv*sv;
  if sv>250000: tn=round(sv+1000000,-6);
  print(u,len(v),tn);
  #k=input();
  u=[i for i in range(sv,tn,2) if (i+1)/6==int((i+1)/6) or (i-1)/6==int((i-1)/6)];
  ul=round(u[-1]**0.5);
  ul=max([i for i in v if i<=ul])
  for j in v[1:v.index(ul)+1]: u=[i for i in u if i%j!=0];
  if u[0] in v: v+=u[1:];
  else: v+=u;
 return v;

With this, let's look at a couple of videos where people explore some of these concepts in code.

    

In addition to deterministic methods, there are also probabilistic methods for prime number detection. A popular one is called the Hawkins Sieve. There are others too. Some interesting papers [1], [2], and StackOverflow questions [3], [4]. As indicated there, for larger ranges the Sieve of Atkin may be more efficient.
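
To give a flavor of the probabilistic idea, here is a sketch of a simple Fermat primality test (not the Hawkins Sieve itself); it can be fooled by the rare Carmichael numbers, which is why Miller-Rabin is usually preferred in practice:

import random; # random witnesses for the test
def probably_prime(n,k=20): # composite verdicts are certain, prime verdicts are only probable
 if n<4: return n in (2,3);
 for _ in range(k):
  a=random.randrange(2,n-1);
  if pow(a,n-1,n)!=1: return False; # a witnesses that n is composite
 return True;

print([i for i in range(2,60) if probably_prime(i)]); # should match the deterministic list for small n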

Have you tried other techniques to generate primes that are more efficient? Please share in the comments.

Friday, October 8, 2021

2021 | The 9 Pitfalls of Data Science

In this blog post, we discuss the book, The Nine Pitfalls of Data Science by Gary Smith and Jay Cordes.

 
In short, the 9 pitfalls covered in the book are:

  1. Using Bad Data - you have to have the right data to answer any question, and then you have to use the data right. Stanford's Hastie and Tibshirani, in their amazing introductory statistical learning course, take pains to point out how to structure and run a data study. It always starts with validating the data set you have to see a. if it is the right data, b. if the data is right, and c. if it can be fixed for use in answering the question at hand. Incidentally, if you are interested in their work, I cannot recommend their excellent book on Statistical Learning enough - they even made it available for free download via their website!

  2. Putting Data Before Theory - The authors describe the Texas Sharpshooter Fallacy - either "p-hacking" to find, ex-post, conclusions the data supports, or defining expected experimental outcomes after the experiment is complete. They liken this to a sharpshooter either a. picking bulls-eye targets he shot after all shots are fired, or b. painting a bulls-eye around a bullet hole for a shot already fired. This is perhaps the most important pitfall described in the book, and all data scientists should take careful note of it. Marcos Lopez de Prado, financial data scientist extraordinaire who at the time of this writing works at ADIA (the Abu Dhabi Investment Authority, a sovereign wealth fund), makes this point quite forcefully with excellent examples in his book as well - not an easy book, but required reading for serious data science professionals.

  3. Worshipping Math - the most successful data scientists are not those that can build the most sophisticated models, but those that are able to reason from first principles and build explainable models that capture the spirit of the data-set, and deliver quality insights on time. In fact, some of the best data scientists use the simplest models that can generalize well. Data Science done right is also an art form. To be a competent data scientist, you must know the math. But it is equally important to have an abundance of common sense.
  4. Worshipping Computers - they say more data beats a better algorithm some of the time. There are books that explore how best to leverage computers for statistical inference, and how the field of inferential statistics has evolved as computer power has grown by leaps and bounds in recent years with the wider deployment and use of GPUs and more sophisticated compute infrastructure to solve targeted or niche problems. However, good data scientists understand that compute power doesn't substitute for being able to really understand and leverage data to generate insights.

  5. Torturing Data - Andrew Lo, highly respected professor of quantitative finance at MIT's Sloan School of Management is known to have said "Torture data long enough, and it will confess to anything". This is a common problem one sees in nascent or budding data practices in some firms, where management presses their staff to improve result accuracy by a certain % given a particular data set. The managers do not know how data science works, do not know what quality of signal the data set affords, but press their staff threatening poor reviews unless the managers' requirements are met. Good data scientists will exit such firms as soon as possible. Part of a data scientist's job is to educate senior managers and executives, so they know what is possible, and the firm doesn't become a victim of the data science hype cycle.

  6. Fooling Yourself: "... and you are the easiest person to fool." One issue is that the people doing the analysis really, really want to believe they have generated meaningful results. Some less scrupulous ones might even fudge things at the edges, or might be pressured by their managers to. "After all, what difference can a small change in the analysis make?" Turns out, sometimes this might have far-reaching consequences.
  7. Confusing Correlation with Causation - this item and the next one are somewhat more technical. Human nature likes to believe it can construct plausible explanations when it sees patterns in data. Some of the patterns may be fleeting, some may be unrelated coincidental occurrences, while other phenomena may not have a causal relationship between them at all but instead be driven by a third, hidden phenomenon. Assuming a relationship exists where it doesn't isn't just of little use - it can actively hurt when the expected behavior does not repeat itself. An excellent example of this kind of situation is illustrated in the book below.

  8. Being Surprised by Regression Towards the Mean - an oft-told tale is about fighter pilots in training. Those that did poorly in a session and got yelled at did better in the next one, while those that excelled and got praised did less well in the next session. Did the praise or admonition really make the difference? Or was it simply that the poor performance and the out-performance were deviations from the norm, and things reverted to the norm in the sessions that followed? Data scientists should keep this idea in mind.
  9. Doing Harm - doctors take the Hippocratic Oath which states "first, do no harm" as they practice medicine. While there isn't an equivalent oath for data scientists, true professionals would do well to keep that in mind as they embark on solving data problems for the Enterprise, particularly where the consequences of errors can be large or devastating for employees, customers, shareholders, or other stakeholders. Some years ago, there was talk of having a similar oath being administered to MBA students graduating from top schools. Not sure where that initiative stands today.

What I liked about this book is that the arguments are structured and presented in a clear and compelling way. Being a seasoned data practitioner and senior data scientist myself, I have seen, and can appreciate, the need for a book like this, particularly to warn the unwary and the neophyte about potential issues that lurk in the dark. In addition, reading this book would also serve as a refresher for seasoned data professionals on things they knew and perhaps forgot as they moved through the mists of time.

Having said that, I believe some of the examples they used could have been better chosen, structured or presented to make their points more forcefully. This isn't an academic tome. It is a highly readable treatise, so more examples presented more lucidly would only add to the quality of the presentation and contribute to greater reader impact. In that sense perhaps an improved second edition would add more value. But... Definitely worth the relatively easy read nonetheless. 

There are also lists of much more technical aspects of data science that merit careful consideration by practitioners in the field. One such talk, by Mark Landry from H2O.ai, is linked below.



 

Wednesday, September 29, 2021

2021 | AI/ML, Data Science, and Microservices book reviews

Recently I have been reading a few books from Manning Publications - for those that don't know, Manning produces books with artistic renderings of people on their covers, just like O'Reilly publishes books with animals on theirs.

Two that caught my eye were these: 

 

First off, when reading any book, the reader has to go in with the right expectations. Neither of these is written with a strong technical focus - on AI or on data science. Rather, they are geared towards managers that need to come up to speed to put in place an optimal structure to enable projects in these areas to be successful. (Not that management is easy by any means, it's just that the skillset needed there is different from that needed to be a competent data scientist.)

What I particularly liked about the first book is that it provides a nice framework to evaluate if your AI-based approach towards solving a business problem is going to a. meet the business needs, and b. deliver sufficient value given the costs that will be incurred. Different books discuss different ways of setting up an AI/ML/DS pipeline, so I am not wedded to a particular model there, but the framework for doing a cost/benefit analysis presented here is quite interesting.

The second book above explains at length how people at different levels in the organization can contribute to the success of a data science project or practice - whether they are leading functions, departments, projects, or tasks. 

These are very easy reads without too much technical content. Some might argue that portions of these books would apply to other projects in the workplace as well, not just AI/ML/DS ones, and that would be justified. 

One quibble I have regarding these books is the cost. For the value they give you, I think ~$40/book is really quite steep. And I don't think these quite rate as "classics" (not yet, anyway) that I'd want to keep handy on my bookshelf. Perhaps make some notes on the models and frameworks, but that's about it. So yes, these are worth reading - will perhaps take a couple of hours each, but I'd get them from the library. Incidentally, Manning books are (sometimes significantly) cheaper when purchased directly from their webstore (Bing it), than from Amazon.

Another Manning book that I read was this one: 
This is quite technical, but as you dig into it, there is a lot of knowledge, and even some wisdom you can get from it. It starts by giving examples of microservices from some application contexts, and then delves quite deeply into microservice patterns that are quite well explained with helpful diagrams. It also does a very good job describing layering - where you want to solve business problems, where to address the technical issues, how to minimize technical debt, evolve architectures, the relative strengths and weaknesses of various programming and computing paradigms, etc. While not perfect, I'd say  it is extremely well done, and highly recommended reading if you need to learn more about microservices.

One book I am looking forward to is this one: 
Luis Serrano, the author, has an excellent YouTube channel where he explains complex machine learning algorithms in very simple terms so anyone can understand them. If he carries forward that ability into his book (I haven't read it, but I hope he has), the book should be a pleasure to read. 

Speaking of YouTube channels for machine learning, the below two do a good job explaining algorithms and applications. Perhaps not as flamboyant as the Siraj Raval tutorials, but very good in content nonetheless. See for example ritvikmath's video on Markov Chain Monte Carlo, and Luis Serrano's video on Generative Adversarial Networks. Excellent work!

   
ritvikmath and Luis Serrano YouTube videos for Machine Learning algorithms



Monday, September 13, 2021

2021 | Leadership - lessons from Wooden and Holtz

What does it take to be a leader? What principles do you follow? Can leadership be taught? These are all oft-asked questions. I recently had the pleasure of reading what two coaches had to say about this. I read management experts all the time, so this is a bit of a departure from reading what (say) Peter Drucker had to say. Both coaches were legends in their own right - John Wooden who led the UCLA team to several championship wins, and Lou Holtz - renowned college football coach. The books here: 

  

I wasn't really ever great at sports, though I enjoyed and still enjoy playing for recreation. However, the stories these coaches tell, their unrelenting focus on excellence, the impact they had on young impressionable athletes, and their principled approach to success are something we can all learn from.
The journey is better than the inn. -- Miguel de Cervantes
They both come from very humble beginnings. They did not particularly seek fame, and Wooden, at least, expressed a minor dislike of how it cramped his style. The advice they share is timeless: focus on the fundamentals - hone your craft and sweat the details; work hard - the harder you work, the luckier you will get; focus on your team, not yourself; walk your talk; and finally, concentrate on the process and the results will take care of themselves.

They express these in terms of getting team victories, and how you can grow into a leader by practicing these consistently with honesty, dedication, and focus, so people will want you to lead them, will willingly follow, and together you will be able to realize your dream goal. There is of course more to these books (e.g. the "success pyramid", etc.), but I'm not going for a full point-by-point summary here.

copyright-free image from Pixabay.com

Victory here is measured by how close one comes to delivering their peak performance - not whether you have the highest score on the board at the end of the game, or even whether you won or lost. If you do your absolute best when called on, you have won. Great sentiment to live by.

The Vince Lombardi books and quotes are quite well-known. I've read them and they are inspirational. But the above books are also good if in a somewhat more understated way. 
The difference between a successful person and others is not a lack of strength, not a lack of knowledge, but rather a lack of will. -- Vince Lombardi
For inspiration and motivation, this is one of my favorite talks - by Arnold Schwarzenegger. It's 12 minutes well spent.


Recently, a friend pointed me to this poem about the need to take some pains in life to grow and develop yourself so you can see success. It's called "Good Timber" by Douglas Malloch. My favorite stanza: 
Good timber does not grow with ease:
The stronger wind, the stronger trees;
The further sky, the greater length;
The more the storm, the more the strength.
By sun and cold, by rain and snow,
In trees and men good timbers grow.

-- excerpt from "Good Timber" by Douglas Malloch 

What are some other good leadership books written by sports titans that you like? Please share in the comments.





2021 | The Lean Startup

 

Photo by Anastasia Shuraeva from Pexels

The term "Lean Startup" was popularized by Eric Reis in his popular book of the same name. In it, he talks about ideas to setup, build, and grow a startup while leveraging lean principles - high levels of efficiency with minimal waste. In this post, we explore some of these ideas in some detail, and I also point you to relevant YouTube videos for those of you interested in learning more. 

Key Principles

  1. The Build-Measure-Learn feedback loop - everything you do should feed this loop: formulate hypotheses about your domain or environment, define what you want to measure to test them, build things out, measure the quantities of interest, and learn from your experiments. If you fail, you fail early and move forward to conduct new and different experiments.
  2. Everything is a grand experiment - build a series of MVPs - Minimum Viable Products - anything that doesn't lead to validated learning is considered a waste. 
  3. He also defines different types of MVPs: 
    1. Video MVPs - some startups had great success setting up a website with fake videos illustrating not yet available functionality to get people to sign up early. This tests marketplace demand before too much potentially wasteful expenditure is incurred in building out features. 
    2. Concierge MVPs - focus on a few customers at a time, build your product to satisfy them, then move on to other, newer, adjacent customer segments, conquering them one at a time. 
    3. Wizard of Oz MVPs - sign up customers, and do manual work initially "behind the curtain", delight users, and keep automating as traffic volumes pick up. 

  Photo by Mikael Blomkvist from Pexels
  4. He explains that there are different types of Growth Engines that target different attributes of the business - you have to pick one (and only one) that represents what is most important to you at a given time in your company's lifecycle, though you may change Engines when the situation calls for you to do so:
    1. Stickiness focused - you expect clients to stay with you long term e.g. as subscribers. Here the metrics you track are things like Customer Acquisition rate and Churn rate. Some businesses where this kind of engine makes sense are firms like Gillette, and the HP Printers business. 
    2. Virality Engine - you want more and more people to sign up for your company's products or services, measuring things like viral coefficient compounded if possible with Word Of Mouth recommendations. Social Networks apply this paradigm all the time. 
    3. Paid Engine - use advertising to get customers. Measure things like cost per acquisition, lifetime customer value, and related metrics. Widely used in the retail ecommerce domain.
  5. Pivot or Persevere? Typically startups begin with an end in mind - a target audience, a target market, a geography they are going after, etc. They build a baseline MVP, try to tune it towards the ideal state, and hope for the best. But sometimes it is prudent to change direction - to refine or fine-tune the product, market, or customer segment - to get more revenue and stay viable. So they have to decide: stay the course, or pivot to try something related to grow better as a firm? He explains the different kinds of pivots startups go through:
    1. Customer Segment Pivot - when a change in customer segment is called for to reach higher revenue gains.
    2. Value Capture Pivot - when you change how you intend to monetize the product you plan to sell. You have to hit revenue targets while protecting your margins after all. 
    3. Growth Engine Pivot - this happens when you change your core growth focus, looking at the options in 4.1-4.3 above. 
Here are some additional resources for those interested to explore the topic further. Enjoy!

   

The first video above is a quick summary of the key takeaways from the book, while the second is a talk the author gave at Google. Both make for interesting watching, and are time well-spent to learn more about the lean startup methodology. Most importantly, in startups as in life, it matters that we focus our energies to do what really counts. 

"The worst thing in the world is to do something well that need not have been done at all" -- Peter Drucker (celebrated Management Guru)





Sunday, September 12, 2021

2021 | Thinking Design Thinking

Design Thinking is now central to the way various companies build out new capabilities into their business. There is a wonderful free course on this topic offered by the Stanford d.school.

If you're interested in learning design thinking, you should probably also consider taking this other, equally useful course: Introduction to Innovation and Entrepreneurship.

Some good videos (more on these below) - the first explains design thinking in some detail, while the second is Bill Burnett from the famous Stanford d.school explaining how one can use design thinking principles to design oneself a better life. The focus is firmly on the user. For designing a life, the user is yourself.

  

Steps in the Design Thinking Process: 

  1. Accept: "gravity problems" - just like we have to accept gravity (we cannot really argue with it), if there are no solutions to some problems, we just have to accept the situation or move away from them to change the context
  2. Empathize: (life) find the why behind your goals; (projects) find out what the users really want from the product you are developing
  3. Define: Sketch the details of what you see would be in the ideal final state of what you are trying to build (projects, or your life).
  4. Ideate: (projects) make a list of best and worst ideas - you want to know what you want to include and just as importantly, what you want to exclude from your design. Prioritize these ideas and see how you might be able to incorporate them. (life) Design 3 parallel lives with the base case being your current life trajectory, then define aspects of each of these in detail. 
  5. Prototype: (life) Borrow from the parallel lives to improve or enrich your current pathway if you can. (projects) build out the PoCs or MVPs to include the design features from your prioritized list. 
  6. Test - (projects) try out how things work in small steps, then iterate. (life) try out things, but commit to changes you've decided to make, close the door to the past so you focus on the future - you can create more options looking forward if you have to. Iterate on an ongoing basis to design yourself a better life.

Bill Burnett from the Stanford d.school gave an excellent TED talk (above) about "How to Design an Excellent Life" - worth the 20 or so minutes. Some good insights there applying the 5 principles of Design Thinking, which he prefaces with an important idea of "Accepting" - some problems have no solutions, you just have to accept the situation. (These are the items listed with the 'life' qualifier for each item in the list above). He also wrote these books that make for interesting reading: 



Also interesting is this talk with Tom Kelley, Partner at IDEO - a firm that made Design Thinking mainstream. A bit long, but the video is definitely worth a watch.


Tom Kelley has also written the books below. These are important, and give you a structured sense of what innovation really is, and of the art and science of building innovative solutions that will really move the needle on your business.











Tuesday, August 24, 2021

2021 | More Personal Improvement, Up-skilling

 


Jo Boaler, a colleague and friend of Carol Dweck (of Growth Mindset fame), is a professor of mathematics education at Stanford University. In this book she teaches people how to think about math education, and about approaching problem solving in general, positively and without limiting oneself. I found this quite eye-opening, somewhat comforting, and honestly quite inspirational.


The author of this book discusses a framework he has used to improve his life situation. 7 brief but pertinent questions asked, and decisions made helped him move from being a homeless teenager to a better station in life. The prose is honest, and resonates with the reader. I liked this book, learned from it, and think it could benefit others as well.


This is a beautiful and very powerful book with stories on neuroplasticity or the ability of the brain to evolve to 'fix itself'. The stories of people with various brain-related issues and how they manage to recover are all very touching, and give you hope that you too can evolve your way to a brighter future no matter what troubles you might be facing.


Jack Welch, erstwhile CEO of General Electric, the company founded by none other than Thomas Alva Edison (of light-bulb fame, though two others also invented it at about the same time but were less successful at commercializing it), along with his wife Suzy Welch, presents some MBA-style lessons on areas ranging from strategy to finance, coloring these with real-life examples. I really enjoyed this book. There are videos embedded in the digital version which can add to the learning enjoyment for the discerning student. Recommended.


Stuart Diamond knocks this one out of the park. I've taken negotiation courses in B-school, and have read and practiced the skill extensively in real life. But his methodology and the way he teaches it is absolutely phenomenal. The examples are nicely presented, and the website that goes with this book (you can search for it online) showcases the model brilliantly. Like he says, (my paraphrase) like it or not, all of us negotiate for something every day of our lives. If we know how to do this well, we and those around us will have better lives. Must read.


Monday, August 23, 2021

2021 | On Quants and Data Scientists in Finance

Over the years, I have often wondered what the difference is between quants and data scientists. These are both popular roles in Financial Institutions of various kinds.


I've worked in both roles at large global banks, as a quant analyst on a FICC desk on the buy side, and as Head of Data Science doing global securities research on the sell side. I've also spoken with a number of smart people in Finance about the difference, and usually the answer is that there isn't too much to distinguish the two roles. 

Similarities
The similarities are obvious: 
  1. Both roles require a deep understanding of advanced mathematics, statistics, finance, etc.
  2. Both require an ability to code, and code well, potentially in Python, R, C/C++/C#, maybe Julia
  3. Both require that you are able to model things to reflect real world realities with good fidelity
  4. Both require that you are able to articulate and explain things to non-quantitative audiences well
  5. Both have some of the hardest interviews to get through
But what about the differences? 
As a data scientist, I've given talks on data science methods at a bank's Quant Forum (a gathering of all its quants), and it was the greatest fun I have had giving a presentation. I felt right at home there, and received several collaboration requests from that session. So the overlap cannot be overstated. But these two roles are different, and the differences only became clear to me as I read this book:

 
David Epstein discusses the central thesis of his book, Range, in this video

The Concept of Range, and the Different Types of Domains
In this book, the author explains that there are some "kind" domains that allow for deep specialization, where experience helps you make better decisions and more effectively deliver value. The more time you spend in the field, the better you get at it. 

There are also "unkind" or "wild" domains where the opposite holds true. In some of these, having more experience might give you more confidence, but unless you are humble, leveraging old learnings in new problem contexts can lead to more errors i.e. experience doesn't really help in problem solving. 

He illustrates this idea with several examples - from sports using golf and Tiger Woods as an example of a "kind" domain, and with tennis and Roger Federer as an illustration of the "wild" one. He further underscores his point with Chess - grandmasters can reproduce a board with several pieces exactly with only 3 seconds to see it IF the pieces are placed such that the scenario can occur in a real chess game. However, if random pieces are randomly placed, they cannot replicate the board unless given significantly more time to study it, and in the latter case total chess novices can do a better job.

Another example he cites is Music. Did you know that Yo-Yo Ma, the world-renowned cellist tried several other instruments before settling on the cello? This works because music is a "wild" field - you can pick up broad patterns that might work generally, but then again, might not. Then when you specialize in a narrow "kind"-er domain (that of playing the cello in this case), you try those patterns to go deep and excel, keeping what works and discarding the rest.


So how is all this relevant? 
As a data scientist, even if you pick just one field, say finance, you develop and use lateral skills a lot. You may work with consumer spend data and basket analysis for retail, with Fed Beige Book data for global macro, with satellite data for real estate or oil/energy markets, with news data for other verticals, etc. You have a wide and deep knowledge, but you pick up sector knowledge as you go, working closely with specialists.

As a quant, you (usually) work on one desk, and specialize in one or two markets. You learn all you can possibly learn about those markets, and go as deep as you possibly can to identify, isolate (where possible), examine, model, and forecast particular variables and the relationships between them. You are a specialist in your sector.

A data scientist may not need stochastics, but should have the intellectual capacity to learn it. Most quants should know enough stochastics to be able to do their jobs well.

In Conclusion...
Both the quant as well as the data scientist roles in finance offer stimulating, intellectually challenging work. I guess in the end, your ability to make a strong positive contribution comes down to your preparation, expectations, firm culture, and the competence and quality of the team you work with.

How does this post match with your views? Please share in the comments below.