Ilya Zakharevich on Thu, 12 Sep 2024 15:04:47 +0200 |
[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]
Re: [pre-announcement of a PATCH 2.16.1-alpha] 90x speedup of forprime() above 2^64 |
All the timing below is with low end of mid tier CPU: 13th Gen Intel(R) Core(TM) i5-13500T SUMMARY: sieving turns out to be much quicker than (even) I thought. On Thu, Aug 29, 2024 at 11:13:57PM -0700, Ilya Zakharevich wrote: > With this patch, I can test where is the REAL crossover (since, > INDEED, eventually nextprime() IS going to become quicker!). Well, this turned to be very short-sighted to write! > ANSWER: on my computer, I can allocate about 12GB for a prime table, > and (with primediff) this supports prime table up to 320 > billions, so one can sieve up to 10^23. > > The current implementation is not fully optimized, but still > it performs about 18 times quicker than the stock forprime() > even at 10^23. (With 128-bit numbers closer to 2^64 it > performs up to >90 times quicker!) > > Moreover, the dependence of speed on the size of numbers “follows a > simple law”. Extrapolating this law suggests that the crossover¹⁾ > happens about 0.9·10^26. > CONCLUSION: it makes sense to support prime tables up to 10^13. > > ¹⁽ This is the crossover with the CURRENT IMPLEMENTATION of > forprime()-via-nextprime(). As I explained earlier > http://megrez.math.u-bordeaux.fr/archives/pari-dev-2401/msg00049.html > I expect that this may be sped up about 5 times (see the line > in the middle of the attached image). > > If such an optimization is possible, the crossover happens at ∼2.5·10^24. OUPS! Thinking about this more: when investigating these questions, I did not try²⁾ to inspect the dependence of the speed of sieving on the size of the sieving arena! ¹⁾ My logic must have been that the larger the arena (which is forced to be larger than L3 cache due to slowness of calculation of remainder in the loop) the slower is going to be the memory access. WRONG! ⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ KIND OF SHORT SUMMARY (“theoretical”) In a “theoretical situation”, when the available memory size is unbounded (but we assume that memory access performance is not hindered by this!), sieving is always going to be WAY quicker than nextprime(): when running through a region of numbers, the time per 10^9 numbers (“a giganumber”) near N is: the measured timing for forprime␣via␣nextprime() is >400 sec/giganumber; it is about 0.025 * log10(N)^2.5 + 310 sec/giganumber (N ≫ 10^19) the measured³⁾ timing for forprime␣via␣sieving() is about 0.19 * log10(N)^1.6 + 2 sec/giganumber (10^19 ≤ N ≤ 10^23) (with very raw optimizations only). This means⁴⁾ 5.4 … 6.6 sec/giganumber in the range above. Note the smaller powers and the smaller coefficient (since log10(N) ≫ 10). ³⁾ for 10^23 a bit of extrapolation in the arena size has been needed. ⁴⁾ … not counting sieving the initial run of primes up to √N. This takes a bit more than 1 sec/giganumber (since these primes are much smaller!). (About 6min for N ∼ 10²³.) ⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ VERY Practical flavor I (with 10GB for the sieving arena) This way one can sieve via chunks of 160 giganumbers. This gives an extra slowdown — but it is only about 1sec/giganumber at ∼4·10²³. This is going to be quicker than forprime␣via␣nextprime up to the crossover at ∼5·10²⁹ (OR⁵⁾ at ∼1.6·10²⁷). ⁵⁾ This second number supposes a significant rewriting of the algorithm for forprime␣via␣nextprime (which I explained elsewhere). IN ALL THESE COMPARISON WE ASSUME that memory is enough for this and for the table of primes up to √N (requiring about 2√N/log N bytes with diffprimes). ⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ Same with 60GB for the sieving arena The slowdown due to finite arena (1 teranumber) is about 1sec/giganumber at ∼ 9·10²⁴. The crossover with forprime␣via␣nextprime is at ∼ 1.3·10³¹ (OR⁵⁾ at ∼4·10²⁸). ⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ Same with the sieving arena which is 10% of the size of diffprimes table (Slowdown is always >1 sec/giganumber.) The slowdown is <4 sec/GN (speed 11.3 sec/GN) up to 10²⁵. (This takes 110GB for diffprimes, and 11GB for the arena). The crossover with forprime␣via␣nextprime happens at ∼7·10⁷¹ (or 2·10⁴⁵). (Caveat: this is with unphysical amount of memory!). ⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ Same with the sieving arena which is 20% of the size of diffprimes table Slowdown is 1 sec/giganumber already at 4·10¹⁹. The slowdown is 2.6 sec/GN (speed 10 sec/GN) at 9e25 (this takes 320GB for diffprimes, and 64GB for the arena). The crossover with forprime␣via␣nextprime happens at ∼10⁷⁸ (OR⁵⁾ 10⁵¹) (with the same caveat). ⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ Same with the sieving arena which is 100% of the size of diffprimes table Slowdown is 1 sec/giganumber at ∼10²⁸ (this requires 0.1 petabytes for the primediff table!). Hence for all practical purposes the slowdown is negligible. The crossover with forprime␣via␣nextprime happens at ∼3·10⁹¹ (OR⁵⁾ 10⁶⁴) (with the same caveat). ⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ The main term of the slowdown With the arena size α (requiring α/16 bytes of memory) the slowdown in sec/GN is ∼ 0.04 · α^-.78 · N^.44. The alternative way to describe this: In range 10²³..10⁴⁰ the overhead of small arenas is 1 sec/GN at the arena size α = 0.01·N^.56 ≈ 0.2 · √N · (N/5e21)^(1/17). The overhead decays as const * α^-.78 for smaller arenas. ⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ YET ANOTHER CONCLUSION For primes above 2⁶⁴, the sieving gives speed up to 2 orders of magnitude better than forprime␣via␣nextprime — and (if one believes my extrapolations) the order(s)-of-magnitude advantage may be preserved at least up to 10³⁰ (given — possibly insane — available amount of memory needed to store the table of “small” primes — up to √N). Hope this helps, Ilya