Filling a hard drive

filling_hd_pic

For a while I had wondered if I could fill a hard drive with prime numbers. Turns out the answer is yes.

Of course the answer is yes, since the hard drive doesn’t care what’s saved on it. One could easily fill a hard drive with videos or just pictures of their dog. But at the start I had nothing to generate prime numbers. And what I could find just spit out primes without a care about where they went. So I needed to make something that could handle them such that they were stored.

I could’ve just made a simple script that writes the primes to text files but that would’ve been a mess. No, I needed something that could store them in a folder structure that made it easy to look through. It additionally needed to name the files and folders correctly and be able to restart wherever it left off should it stop for any reason.

Some of the previous pieces of code I’ve shown here were used in this project. It’s gone through several iterations and I’ll likely add to it after this post as well. The first version used a fast prime number generator I had found online but when I timed it and extrapolated that, my estimate for filling a hard drive would take well over a month to complete.

A prime number is a number that is only divisible by 1 and itself. There’s no method that can just spit out the Nth prime number, so you need to calculate all the previous ones before you can get to the Nth prime. There’s some ways around that but I’m not getting into it.

After the first version I started looking for faster methods. I started adding wheels to indefinite sieves but that still wasn’t fast enough. Eventually I wrapped my code around primesieve which can generate primes far faster than any Python code I could find or come up with. Python doesn’t really compile its code to a really fast form whereas the primesieve is compiled. Details to explain differences in programming languages isn’t going to happen here.

Anyway, 200 billion primes took around 2 days and 4 hours to generate and stored in text files takes up 2,777,781,707,715 bytes (2.53 terabytes). From this I figure it would take 13 – 15 TB to store a trillion primes and around 11 days to generate. If the files were compressed and their file size reduced by at least 80% then it could be possible to store a trillion prime numbers on a 2 or 3 TB hard drive. My estimates for this put it around 1.17 TB – 1.33 TB using 7zip. Oh yeah the 200 billionth prime is 5,665,449,960,167 – 5.6 trillion.

You could write your compression method to push it even smaller– first find the differences between the primes, then compress just those. That can push it down to under one terabyte for storing a trillion primes. But then you have a compression method that only works for text files filled with prime numbers and no one would have access to them because it’s a custom format that only you know how to deal with. On the other hand it would let you store a trillion primes per terabyte. Following this logic of a trillion primes per TB you’d need some 416 thousand terabytes (416 petabytes) to store all the primes that primesieve is capable of generating. Or ≈7.5 exabytes uncompressed.

import os
try:
    import lzma as xz
except ImportError:
    import pylzma as xz

def compress(filename):
    'compression method for files containing integer sequences'
    with open(filename)as src:lines=src.readlines()
    fst,nums=lines[0],[]#first line has any info about the file, not numbers
    for ln in lines[1:]:nums.extend(int(n)for n in ln.split())

    diff=[str(nums[0])]#determine differences
    for k,m in enumerate(nums[1:],1):diff.append(str(m-nums[k-1]))

    gn=os.path.splitext(filename)[0]+'.knf'#krushed number sequence file
    with open(gn,'wb')as out:out.write(xz.compress(fst+'\n'.join(diff)))
    st=os.stat(filename);os.utime(gn,(st.st_atime,st.st_mtime))
    os.remove(filename);return gn

def decompress(filename):
    'decompresses files produced by the compress function'
    with open(filename,'rb')as src:dat=xz.decompress(src.read())
    dat=dat.split('\n');fst,nums,cn=dat[0]+'\n',[],0
    for n in dat[1:]:cn+=int(n);nums.append(str(cn))

    tmp=[nums[i:i+10]for i in range(0,len(nums),10)]
    lines=[fst]+['\t'.join(k)+'\n'for k in tmp]

    gn=os.path.splitext(filename)[0]+'.txt'
    with open(gn,'w')as dst:dst.writelines(lines)
    st=os.stat(filename);os.utime(gn,(st.st_atime,st.st_mtime))
    os.remove(filename);return gn

 

So below is the prime generator and storage code I’ve written. I haven’t tested it on Linux or Unix or Raspbian but it should work if you have all your write permissions, paths set correctly and the shell=True is added in on line 218 or so (hopefully). Requires the primesieve program as mentioned above. Some things will need to be set first before you run it, they’re at the top just inside the class. This will normally just be the folder to put all the primes and the primesieve path. Initialize it and then call savePrimes with the number of million primes to generate. So savePrimes(12) will generate 12 million primes and then stop. Calling it again will start generating again where it last left off. Currently it’s set to stop generating after 200 billion primes but that is also changeable. This isn’t likely the last version of it as I’m already planning to generate those trillion primes…

from datetime import datetime
from itertools import count
from math import log, ceil
import subprocess as sp
import sys, os, pickle

class PrimeGen(object):
    """Automated prime number generator (via primesieve) and saver
    from Orthallelous"""
    PrimeFolder='P:\Prime Numbers'#where all files generated will be saved
    
    #primesieve console application path, get it from http://primesieve.org/
    External_Sieve = '\path\to\primesieve'

    #do not change these once the project has started!
    _logName='Generator.log'
    _fmt='%Y-%m-%d %H:%M:%S.%f'
    _stop=5665449960168#when project should end, default is 5665449960168 (enough for 200 billion primes)
    #18446744030759878627#largest prime that primesieve can generate
    #primesieve's limit is 2**64-((2**32)*10)
    _Print=True#whether print to stdout or not (log file is always written to)
    
    def Ord(self, n):
        'returns the ordinal of n\nSource: http://stackoverflow.com/a/16671271'
        Ord='{}{}'.format(n,'th'if 9<n%100<14 else{1:'st',2:'nd',3:'rd'}.get(n%10,'th'))
        return Ord

    def Chunk(self, l, n):
        'breaks a list l into chunks of size n\nSource: http://stackoverflow.com/a/1751478'
        return[l[i:i+n]for i in range(0,len(l),n)]
    
    def RoundDown(self, x, p):
        'Rounds x down to the nearest p\nSource: http://stackoverflow.com/a/8866125'
        return x if not x%p else x-x%p

    def Approx(self, n, st=False):
        'approximates n - rounds up to nearest power of ten\nst is used only in .stats()'
        N={0:'',3:'thousand',6:'million',9:'billion',12:'trillion',15:'quadrillion',
        18:'quintillion',21:'sextillion',24:'septillion',27:'octillion',30:'nonillion',
        33:'decillion',36:'undecillion',39:'duodecillion'};G={0:'',1:'ten ',2:'hundred '}
        L=int(ceil(log(n,10)));g=G.get(L%3,'');L-=L%3
        ovfl=''if L '1 minute, 15 seconds'"
        if sec==0:return'no elasped time'
        S,Q,c,t=[],abs(sec),0,''
        N=['second','minute','hour','day','week']
        for j in[60,60,24,7]:Q,r=divmod(Q,j);S.append(r)
        S.append(Q)
        if st8 and t[0]in'0:':t=t[1:]
            return t
        for T in S:
            if T==0:c+=1;continue
            v=str(int(T))
            if c==0 and int(T)!=T:v=str(round(T,3))
            n=N[c];c+=1
            if T!=1 and not st:n+='s'
            if st:t=v+n[0]+' '+t
            else:t=v+' '+n+', '+t
        return t.rstrip(', ')
        
    def makeFolder(self, CurrentMillion):
        'returns the name and folder for the current millionth primes'
        c=C=abs(CurrentMillion)

        NestDepth=5# initial folder depth to nest (note: will auto nest further if current million grows past a power of ten)
        leng=4# digit length for the range names
        p=Q=1# one-tens-hundreds place, millions place
        
        prms='Primes'
        SNam='{} {} {}'#single name (current number, number name, primes)
        RNam='{{:0{0}d}} - {{:0{0}d}} {{}}th {{}}'.format(leng)#range name (rangelow, rangehigh, number name, primes)
        Name=Fold=''

        C+=(1+pow(10,NestDepth))#correction factor (for auto nesting further)
        while C:
            C//=10#whittle current million down to zero

            x=self.RoundDown(c,p*Q)#round current million down to nearest current power
            NumNam=self.Approx(Q*1e6)#current number name

            if p==1:N=SNam.format(self.Ord(x//Q+1),NumNam,prms)#single name
            else:N=RNam.format(x//Q+1,x//Q+p,NumNam,prms)#range name

            if p==1 and Q==1:Name=N#bottom level, file name
            else:Fold=os.path.join(N,Fold)#all other levels, build folder names

            p*=10
            if p>100:Q*=p;p=1#reached a new number name, adjust the millions place and reset the one-tens-hundreds place

        Fold=os.path.join(self.PrimeFolder,Fold)
        return Fold, Name

    def writePrimes(self, Primes, CurrentMillion):
        'Writes a list of primes to a text file'
        if not Primes:return#empty list

        #generate folder path and file name
        Path,Name=self.makeFolder(CurrentMillion)
        if not os.path.exists(Path):os.makedirs(Path)
        File=os.path.join(Path, Name+'.txt')
        Txt=open(File,'w')

        #write
        Txt.write('Generated at {:{}}\n'.format(datetime.now(),self._fmt))
        cnks=self.Chunk(Primes,10)
        lines = ['\t'.join(k)+'\n'for k in cnks]
        Txt.writelines(lines)
        Txt.close()

    def DumpInternals(self, data):
        'Dumps the internal data of the prime generator to a file'
        # or the endogenous data! a new word for the day!
        #data is [total running million, where to start generator]
        
        C=data[0]-1#get current millionth prime
        if not C:return# nothing to save
        File=os.path.join(self.PrimeFolder,'Internals.dmp')
        
        Fyle=open(File,'wb')
        pickle.dump(data,Fyle)
        Fyle.close()

    def LoadInternals(self):
        'loads most recent internal data file - returns True if successful'
        try:#to find the internal dmp file
            File=os.path.join(self.PrimeFolder,'Internals.dmp')

            Fyle=open(File,'rb')
            intrns=pickle.load(Fyle)
            Fyle.close()

        except:#no internal data found, check backup method
            intrns = self.seeLast()

        return intrns
    
    def seeLast(self, fld=PrimeFolder, lvl=0):
        'sees which was most recent millionth prime file written'
        flds=[i for i in os.listdir(fld)if os.path.isdir(os.path.join(fld,i))]

        if flds:#another folder level
            mil, lst = self.seeLast(os.path.join(fld, sorted(flds)[-1]),lvl+1)
        else:#bottom folder level, find most recent file
            fls=[i for i in os.listdir(fld)if os.path.isfile(os.path.join(fld,i))]
            fls=sorted([i for i in fls if 'Primes.txt' in i])

            if not fls:return 0, 0#no files
            else:#valid file
                mil=fls[-1].split(' ')[0]
                mil=int(mil.rstrip('stndrh'))
                fl =os.path.join(fld,fls[-1]) 

                fil=open(fl, 'rb');fil.seek(-2, 2)
                while fil.read(1)!='\n':fil.seek(-2, 1)
                lst=fil.readline();fil.close()
                lst = int(lst.split()[-1])
                
                return mil, lst+1
       
        return mil, lst

    def savePrimes(self, MaxMil):
        '''generates and saves prime numbers
        MaxMil is the number of million primes to generate and save.'''
        date='{:{}}'.format(datetime.now(),self._fmt)
        if not os.path.exists(self.PrimeFolder):os.makedirs(self.PrimeFolder)

        STrT = 0#check if project had started
        if not os.path.exists(os.path.join(self.PrimeFolder, self._logName)):
            STrT=1
        else:#log file exists, check is project already ended
            l=open(os.path.join(self.PrimeFolder, self._logName),'r')
            l.seek(-4,2)
            while l.read(1)!=b'\n':l.seek(-2,1)
            end=l.readline()
            l.close();ed='  * * *  Project end'
            if ed in end:
                end=end.replace(ed,'')
                print('Project ended on {}'.format(end))
                return
        
        #start logging
        Log=open(os.path.join(self.PrimeFolder, self._logName),'a',1)
        if STrT:
            Log.write(date + '* * *'.center(9)+'Project start\n')

        #log codes:
        #  o    started generator
        #  x    stopped generator
        #  |    Nth million primes saved
        #  -    resuming generator
        # ***   project start/end

        #definitions
        date='{:{}}'.format(datetime.now(),self._fmt)
        run_mil, start=self.LoadInternals()
        cur_mil=0#current total of millions for this run
        stop=self._stop
        
        if run_mil:
            line='\n'+date+'-'.center(9)+'Resuming Generator'
            Log.write(line+'\n')
            if self._Print:print(line)
        else:
            line=date+'o'.center(9)+'Starting Generator'
            Log.write(line+'\n')
            if self._Print:print(line)


        command =[self.External_Sieve, str(start), str(stop), '--print']
        generator = sp.Popen(' '.join(command), #shell=True,
                         bufsize=1,
                         stdout=sp.PIPE)

        primes=[]
        with generator.stdout:#http://stackoverflow.com/a/17698359
            for line in iter(generator.stdout.readline, ''):
                
                p = line.strip()
                
                primes.append(p)

                #store primes in blocks of a million
                if len(primes)>=1e6:

                    #log date and which million primes were written
                    genTime='{:{}}'.format(datetime.now(),self._fmt)
                    Title=self.Ord(run_mil+1)+' million primes'
                    DateHead=genTime+('|'.center(9))
                    line=DateHead+Title
                    Log.write(line+'\n')
                    if self._Print:print(line)

                    #save current block of primes
                    self.writePrimes(primes,run_mil)

                    #bookkeeping
                    run_mil+=1
                    cur_mil+=1
                    recentP=primes[-1]
                    primes=[]

                    #check to see if should stop
                    if cur_mil>=MaxMil:
                        generator.terminate()
                        generator.kill()
                        break
     
        #outside with generator block
        generator.wait()

        self.writePrimes(primes,run_mil)#this only happens at the upper limit of the generator

        #save where one stopped
        startPt = int(recentP)+1#where to start the prime generator for next time
        self.DumpInternals([run_mil, startPt])
        
        #close
        date='{:{}}'.format(datetime.now(),self._fmt)
        line=date+'x'.center(9)+'Stopping Generator'
        Log.write(line+'\n')
        if self._Print:print(line)
        
        if stop//int(1e6)< recentP:#project has reaching finishing point
            date='{:{}}'.format(datetime.now(),self._fmt)
            Log.write(date + '* * *'.center(9)+'Project end\n')
        Log.close()

    def getPrime(self, n):
        'returns the Nth prime number (if it was generated beforehand)'
        if n1:reptime.append(tt)

        numreps=len(repeats)-len(set(repeats))

        #print out the stats---------------------------------------------------
        amt=repeats[-1]
        
        print('Have generated {} million primes so far'.format(amt))
        if amt>=1000:
            ammt=amt*1000000
            c=int(round(log(ammt,1000),7))-1;a=(ammt//1000**(c))/1000.
            readable='{} {}'.format(int(a),self.Approx(ammt,1))
            print('              ({})'.format(readable))
        
        print('\nFirst million time: '+self.DeltaT(ones[0]))
        minMil, maxMil = min(ones), max(ones)
        minOrd,maxOrd=self.Ord(ones.index(minMil)),self.Ord(ones.index(maxMil))
        fil = max(len(minOrd), len(maxOrd))
        print('Shortest million time ({:>{}}): '.format(minOrd, fil)+
            self.DeltaT(minMil))
        print('Longest  million time ({:>{}}): '.format(maxOrd, fil)+
            self.DeltaT(maxMil))
        print('Last  million time: '+self.DeltaT(ones[-1]))

        #for i,t in enumerate(ones,1):#times of every million
        #    print(self.Ord(i)+' million primes: '+self.DeltaT(t))

        groups=[1];Gnums=4#extendable grouping system
        groups.extend([i*10**j for i in[10,20,50]for j in range(Gnums)])
        groups.sort();groups.append(10**(Gnums+1))

        print('\nTimes of last Nth million and last ten average:')
        for N in groups:

            if N==1:Nth=ones
            else:Nth=[sum(i)for i in self.Chunk(ones,N)if len(i)==N]
    
            avg = Nth[-10:]
            if len(avg):
                line='{:>6}: {:=10:line+=' ({})'.format(self.DeltaT(sum(avg)/float(len(avg)),-1))
                print(line)

        if On:
            print('\nTotal on time : '+self.DeltaT(sum(On)))
            if len(On)>1:
                print('    longest  run: '+self.DeltaT(max(On), -1))
                print('    shortest run: '+self.DeltaT(min(On), -1))
                print('    most  recent: '+self.DeltaT(On[-1], -1))
        if Off:
            print('\nTotal off time: '+self.DeltaT(sum(Off)))
            if len(Off)>1:
                print('    longest  run: '+self.DeltaT(max(Off), -1))
                print('    shortest run: '+self.DeltaT(min(Off), -1))
                print('    most  recent: '+self.DeltaT(Off[-1], -1))
        if numreps:
            print('\nRepeated prime generations: '+str(numreps))
            print('Total repeated generation time: '+self.DeltaT(sum(reptime)))

        if pro:
            st=pro[0].split('  ')[0]
            
            print('\nproject began on: '+st)
            stt = datetime.strptime(st, self._fmt)
            if len(pro)==1:
                print('    {} ago'.format(self.DeltaT((datetime.now()-stt).total_seconds(),1)))

            if len(pro)>1:
                sp=pro[1].split('  ')[0]
                spt = datetime.strptime(sp, self._fmt)
                print('Total project time: {}'.format(self.DeltaT((spt-stt).total_seconds(),-1)))
                print('Project ended on: '+sp)
                print('    {} ago'.format(self.DeltaT((datetime.now()-spt).total_seconds(),1)))
        
        print('\ntimes started: '+str(starts+resumes))
        print('times stopped: '+str(stops))
        #print('times resumed: '+str(resumes))
        statT=(datetime.now()-Tst).total_seconds()
        if statT>5:print('\nanalysis time: {}'.format(self.DeltaT(statT, 1)))
        return

 

Tagged with:
Posted in Primes, Python

Leave a comment

In Archive
Design a site like this with WordPress.com
Get started