Ilya Zakharevich on Thu, 12 Sep 2024 15:49:01 +0200 |
[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]
Re: [PATCH 2.16-2-beta]: about 20x speedup of forprime() in 10^12 .. 2^64 |
On Sun, Sep 08, 2024 at 03:24:53PM +0200, Bill Allombert wrote: > On Sun, Aug 18, 2024 at 06:17:50AM -0700, Ilya Zakharevich wrote: > > On a contemporary low-midlevel CPU (13th Gen Intel(R) Core(TM) i5-13500T), with > > > > timeit(s,f) = my(t=gettime(), o=f()); if(#o, o=Str("\t",o)); printf("%s%7.3f%s\n",s,gettime()/1000,o); > > my(c, L=2*10^9); for(k=8,19, c=0; timeit(Strprintf("10^%d\t",k), (()->forprime(p=10^k,10^k+L,c++);c))) > > > > SUMMARY: with this patch, the overhead of forprime in the range > > 2^32..2^64 is a constant overhead of “1.5 increments (as in c++)” per > > a found prime, plus the overhead of “√(N/10^19) increments” for > > sieving near N. > > Could explain what rem_half does ? Half-width divll() — less storing the result of division. (I did not check — maybe the “whole ½-width analogue” of divll() would be compiled into the same assembler code — by optimizing away the whole bruhaha with hiremainder… Just to be safe, I modified the version from x86 assembler to return only the remainder.) > If I understand correctly, > you are suggesting to change is optimal_chunk: > ulong chunk = 0x80000UL; > to > ulong chunk = 0x800000UL; > > Is it correct ? No. The main bug was in “what are the units of the ‘chunk’ variable”. The code in one place thinks that it counts “numbers-to-sieve”, in the other that it counts “bytes-in-the-arena”. (These differ 16 times, which led to ∼ order of magnitude slowdown. — Which in turn led to — ridiculous — claims about no-need-for-longer-prime-tables… And nobody thought that “extraordinary claims require extraordinary proofs ;―) .) Yours, Ilya