Wednesday, October 13, 2021

2021 | Primes!

Prime numbers (positive integers greater than 1 that are divisible only by themselves and 1) are fascinating and have many applications, including in encryption. In this post, we look at different ways to compute primes and at optimizations we can make to speed up the code as we go. Like true geeks, before we begin, we pay homage to Optimus Prime, the Autobot from Transformers that most people think of first when anyone says "Prime" - our nod to pop culture.

Optimus Prime patent: Hiroyuki Obara, Public domain, via Wikimedia Commons

I recently came across the book below and was (re)fascinated with primes while reading some of the articles in it. The book was published by the same people behind Quanta magazine, and James Gleick of Chaos fame wrote the foreword. I like his work a lot. Some references: 

 

Solution Approaches & Python Code

Starting out, the naive approach below was the first one I considered: factorize every integer and print as primes those that have only themselves and 1 as factors. 

def factorize(n):
    r, f, t = [], 2, n  # factors found, trial divisor (start at 2), remaining value
    while t > 1:
        while t % f == 0:
            r += [f]
            t //= f  # integer division so t stays an int
        f += 1
    return r

n = 100  # search for primes <= 100
print([i for i in range(2, n + 1) if len(factorize(i)) == 1])

As you can probably imagine, this will not work for finding larger primes (say up to 50 million) in a timely way. The next approach we can consider is the Sieve of Eratosthenes. The way this works is by writing out all the numbers from 2 up to the top end of your range, marking them all as prime to start with. Then you take the next unmarked candidate (starting with 2), confirm it as a prime, and mark all of its multiples in the range as non-primes. You repeat this with the next surviving number until there are no multiples left to mark, and you're left with just the primes. The linked article animates this process, so you can see visually how it works. This is an ancient method (from Greece, as the name indicates), and it can be sped up with computers. However, you still need to create a large array (e.g. if you are mining for all primes less than 1 billion), so some optimizations are called for to squeeze out efficiencies in code.
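To make the mechanics concrete, here is a minimal, non-segmented version of the sieve in plain Python. This is my own illustrative sketch, not the code used later in this post:

def sieve(limit):
    # boolean flags: is_prime[i] stays True until i is crossed off as composite
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            # start crossing off at p*p; smaller multiples of p were already
            # removed by smaller primes (see optimization 2 further below)
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
    return [i for i in range(limit + 1) if is_prime[i]]

print(sieve(100))  # primes <= 100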


An approach called the "Segmented Sieve of Eratosthenes" is used, which is illustrated in the code below. There are also several additional optimizations in this code, which are explained in what follows. Note, however, that the code below can be further improved by using NumPy arrays if it stays in Python, or by rewriting it entirely in C, a compiled language. I leave the code in its raw form with simple data structures to make the logic easier to follow.

def extend(n):  # run n segmented range extensions
    v = [2, 3, 5]  # seed with the first three primes
    for lc in range(n):
        sv = v[-1]       # largest prime found so far
        tn = sv * sv     # top of the next segment
        if sv > 250000:  # keep segments to roughly a million once sv gets large
            tn = round(sv + 1000000, -6)
        print(len(v), tn)  # progress: primes found so far, top of next segment
        # candidates: odd numbers in [sv, tn) of the form 6k-1 or 6k+1
        u = [i for i in range(sv, tn, 2) if (i + 1) % 6 == 0 or (i - 1) % 6 == 0]
        # only primes up to sqrt of the largest candidate are needed as trial divisors
        ul = round(u[-1] ** 0.5)
        ul = max(i for i in v if i <= ul)
        for j in v[1:v.index(ul) + 1]:  # skip 2; candidates are already odd
            u = [i for i in u if i % j != 0]
        if u[0] in v:
            v += u[1:]  # avoid repeats across range extensions
        else:
            v += u
    return v

Let's examine this line by line. I first seed the code with a vector v containing the first three primes. Then the loop counter lc iterates through the logic the indicated n number of times. On each pass, the code sieves for primes between the largest prime captured so far (v[-1]) and the top of the segment being examined, which is either the square of that largest prime or the largest prime plus 1 million, rounded to the nearest million - we do this to keep each segment a reasonable size. Within that segment, we generate all odd numbers that can be expressed in the form (6k+1) or (6k-1) where k is an integer; every prime greater than 3 MUST satisfy this condition, so these are the prime candidates in the segment. Finally, we check whether each candidate is divisible by any of the primes we have already found, and those that are divisible by none of them are primes we add to our list. Very intuitive, really. 
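As an illustrative run (the argument and the numbers in the comment are my own, not from the original code):

p = extend(4)        # four range extensions from the seed [2, 3, 5]; reaches primes just under a million
print(len(p), p[-1]) # how many primes were found, and the largest one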

Optimizations used: 
  1. Let's discuss the (6k+1) or (6k-1) constraint. If a number is of the form 6k, it is divisible by 6 (duh!); 6k+2 and 6k+4 are divisible by 2, and 6k+3 is divisible by 3. That leaves 6k+1 and 6k+5 (i.e. 6k-1) as the only residues modulo 6 that a prime greater than 3 can have. We use that idea here as an optimization.
  2. If we consider the Sieve, each prime p is first needed to eliminate multiples starting at p**2, since every smaller multiple of p has a smaller prime factor. Let's illustrate this with an example. Consider 5. Multiples of 5 are 10, 15, 20, 25, 30, 35, .... All the multiples below 25 (5**2) are also divisible by a prime smaller than 5, e.g. 10=2*5, 15=3*5, 20=2*2*5. But 25 can only be "disqualified" using 5 as a divisor. Turned around, this means that when checking for divisors of a number N, we only need to test the primes we have found from 2 up to sqrt(N). Both optimizations come together in the small stand-alone check sketched right after this list.
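Here is that stand-alone check, a sketch of my own (separate from the extend function above) that tests trial divisors of the form 6k-1 and 6k+1 and stops at sqrt(N):

def is_prime(n):
    # handle the small cases and divisibility by 2 or 3 directly
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    # every remaining divisor worth testing has the form 6k-1 or 6k+1,
    # and we only need to test divisors up to sqrt(n)
    f = 5
    while f * f <= n:
        if n % f == 0 or n % (f + 2) == 0:
            return False
        f += 6
    return True

print([i for i in range(2, 100) if is_prime(i)])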
Can we do better with the same data structures before we convert to NumPy arrays? Certainly. We can pickle the prime number list into a file, then load it and resume the search over new range extensions whenever we like, and pickle and save the data structure again afterwards. This is done with the code below: the first block pickles the data structure into a file, and the second loads it back into the program. 

import pickle  # pickle to save the data structure
g = open('primes_20211013.pkl', 'wb')
pickle.dump(p, g)  # p is the list of primes returned by extend(...)
g.close()

import pickle  # pickle to load the saved data structure back
f = open('primes_20211013.pkl', 'rb')
x = pickle.load(f)
f.close()

With this, we update our extend function from above into extend2 below, which loads the data from the pickled object before running the next set of extensions. So you can mine one range one day and continue mining the next. Of course, things will go blazingly fast(er) if you convert the data structures to NumPy arrays, if the code is to stay in Python, or better yet if you rewrite it in C.

def extend2(n):  # grows the prime set read from pickled data
    f = open('primes_20211013.pkl', 'rb')
    v = pickle.load(f)  # previously mined primes become the new seed
    f.close()
    for lc in range(n):
        sv = v[-1]       # largest prime found so far
        tn = sv * sv     # top of the next segment
        if sv > 250000:  # keep segments to roughly a million once sv gets large
            tn = round(sv + 1000000, -6)
        print(len(v), tn)  # progress: primes found so far, top of next segment
        # candidates: odd numbers in [sv, tn) of the form 6k-1 or 6k+1
        u = [i for i in range(sv, tn, 2) if (i + 1) % 6 == 0 or (i - 1) % 6 == 0]
        ul = round(u[-1] ** 0.5)
        ul = max(i for i in v if i <= ul)
        for j in v[1:v.index(ul) + 1]:  # trial-divide by primes up to sqrt
            u = [i for i in u if i % j != 0]
        if u[0] in v:
            v += u[1:]  # avoid repeats across range extensions
        else:
            v += u
    return v
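As mentioned above, switching the data structures to NumPy arrays should speed things up considerably while staying in Python. Below is a minimal sketch of a NumPy-based Sieve of Eratosthenes for comparison; it is my own illustration, not a drop-in replacement for extend2.

import numpy as np

def np_sieve(limit):
    # one boolean flag per integer 0..limit; vectorized slicing does the crossing-off
    is_prime = np.ones(limit + 1, dtype=bool)
    is_prime[:2] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            is_prime[p * p::p] = False  # cross off p*p, p*p+p, ...
    return np.flatnonzero(is_prime)  # indices that are still flagged prime

print(np_sieve(100))               # primes <= 100
print(len(np_sieve(50_000_000)))   # count of primes below 50 million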

With this, let's look at a couple of videos where people explore some of these concepts in code.

    

In addition to deterministic methods, there are also probabilistic methods for prime number detection. A popular one is called the Hawkins sieve. There are others too. Some interesting papers [1], [2] and StackOverflow questions [3], [4]. As indicated there, for larger ranges the Sieve of Atkin may be more efficient.
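As a concrete illustration of a probabilistic primality check, here is a short sketch of the well-known Miller-Rabin test. This is a different method from the Hawkins sieve mentioned above, shown only because it is compact and widely used:

import random

def miller_rabin(n, rounds=20):
    # probabilistic primality test: a composite n is reported "prime"
    # with probability at most 4**(-rounds)
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # write n - 1 as d * 2**s with d odd
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True  # probably prime

print(miller_rabin(2_147_483_647))  # a known Mersenne prime, should print True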

Have you tried other techniques to generate primes that are more efficient? Please share in the comments.
