Mike Day on Sun, 14 Jul 2019 18:30:02 +0200



Re: Using local variable for memoizing in recursion


Thanks. Quite a lot to study. I wondered about your definition of a vector within a loop, as I'd assumed its creation would be expensive in time and space, but I'm happy to accept that its sort-of-implied loop is more efficient than an explicit for loop.

Cheers,

Mike

Please reply to mike_liz.day@tiscali.co.uk.      
Sent from my iPad

> On 14 Jul 2019, at 17:11, Karim Belabas <Karim.Belabas@math.u-bordeaux.fr> wrote:
> 
> * Mike Day [2019-07-14 15:03]:
>> I've just noticed an edge-condition failure in Karim's version (below):
>> need to replace
>>    vecsum(t[a-1]);  \\ at end of function by
>>    vecsum(t[kx]);   \\ was problem when a=1 !
> 
> Good catch :-)
> 
>> I'd accepted Dmitry's assertion that execution of a particular line took
>> half the execution time, so sought a way to avoid the calculation.
> 
> The assertion is plausible: the innermost loop is going to take 100% of CPU
> time and contains only 2 lines
> 
>     ky=1+(k1-j+i)%(k1); /* to be optimized, takes half the CPU time */
>     t[k2][j]=t[kx][j-1]+t[ky][j]);
> 
> which look comparably costly (see below: more atomic steps in the first one,
> but on tiny inputs, compared to the large addition and assignment in the
> second line).
> 
>> I still wonder, though,  how I can _examine_ a function's performance,
>> line by line, in order to optimise by hand/eye/brain...
> 
> I don't see an easy way in GP (esp. under Windows). The following usually
> works but is cumbersome: replace in the program the instruction block
> to be timed by a loop repeating the same block enough times that an
> individual timing becomes meaningful, and surround the loop by something like
> 
>  gettime();
>  for(cnt = 1, MAXCNT, ...);
>  T += gettime();
> 
> Make sure the loop operates on dummy variables (say, my_t) so as to have no
> effect on the program flow. Then print T / MAXCNT and compare. One must also
> check that an empty loop for(cnt = 1, MAXCNT, ) takes negligible time [it will
> be the case here], or else take that into account as well...
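> 
> Concretely, the harness might look like this (a sketch only: MAXCNT, the
> dummy variable my_t and the timed expression are placeholders, not the real
> code):
> 
>     MAXCNT = 10^5;
>     T = 0;
>     gettime();  \\ reset GP's millisecond timer
>     for (cnt = 1, MAXCNT,
>       my_t = 1 + (100 - 7 + 3) % 100);  \\ dummy copy of the block to time
>     T += gettime();  \\ time elapsed since the last gettime(), in ms
>     print(T / MAXCNT);  \\ average cost of one iteration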
> 
> What I sometimes do is have a look at gprof [external profiling tool] and
> decide from there. Just did it: in this case, it's hard to identify what takes
> time from the C function names:
> 
> < gprof excerpt, ran on the original pnkv_3(10000,2000) >
> index % time    self  children    called     name
> [3]     92.0    3.05    3.64 20010027+1       closure_eval <cycle 2> [3]
>                0.94    0.53 60014000/60014000     gadd [4]
>                0.05    0.35 20008000/20008000     change_compo [7]
>                0.01    0.25 20008003/20008003     changelex [9]
>                0.07    0.17 20007999/20007999     gmod [10]
>                0.12    0.07 40016003/80028023     gunclone_deep [8]
>                0.19    0.00 120008001/120008001     check_array_index [12]
>                0.09    0.09 40011998/40011998     copylex [13]
>                0.14    0.00 80002000/80002000     closure_castgen [16]
>                0.10    0.00 40016010/40016010     stoi [22]
>                0.05    0.00 39996001/39996001     gsub [28]
> 
> (N.B. The timings from gprof are usually inaccurate but give a rough idea
> of relative cost; the number of calls is more useful.)
> 
> A line such as
> 
>     ky=1+(k1-j+i)%(k1); /* to be optimized, takes half the CPU time */
> 
> translates to 2 gadd, 1 gsub, 1 gmod and 1 changelex. It is run ~ 2.10^7 times
> (the number of 'gmod' calls in the gprof output, but also ~ n * k from direct
> code inspection).
> 
> The other line in the innermost loop
> 
>     t[k2][j]=t[kx][j-1]+t[ky][j]);
> 
> translates to 1 gadd, 1 gsub, 1 change_compo (the assignment) and 6
> check_array_index (the [] constructs). Note that this latter gadd operates on
> much larger inputs than the preceding ones [or the gsub and gmod] from the
> other line, but it's impossible to estimate that from gprof's output alone.
> 
> 
> My version returns this:
> 
> < gprof excerpt, ran on my pnk_4(10000,2000) >
> index % time    self  children    called     name
> [3]     91.0    3.06    3.09 40018025+1       closure_eval <cycle 2> [3]
>                0.96    0.25 19988002/19990001     gadd [4]
>                0.02    0.31 20038000/20038000     changelex [6]
>                0.17    0.07 40055999/40074019     gunclone_deep [8]
>                0.19    0.00 79972003/79972003     closure_castgen [10]
>                0.04    0.13 19998000/19998000     geq [11]
>                0.14    0.00 40028027/40028027     pari_stack_alloc [13]
>                0.03    0.10 19998000/19998000     gsub1e [16]
>                0.11    0.00 79982004/79982004     check_array_index [18]
>                0.10    0.00 19988002/19988002     gsub [19]
>                0.08    0.00 39986017/39986017     stoi [22]
> 
> and 100% of the CPU time is now concentrated on the following three lines:
> 
>     if (!(ky--), ky = k1);
>     if (j == 1, t[ky][1]
>               , t[kx][j-1] + t[ky][j]));
> 
> The reasons it's a little faster:
> 1) I removed the assignments to the t[k+2] accumulator by using the
>   t[a] = vector() construct, which replaces t[a] in one atomic step
>   instead of changing the t[a][j] in place (which wouldn't work anyway,
>   since the loop still needs the old values).
> 2) I removed many operations on small operands, letting the implicit loop
> in vector() do some of them in a faster way (the + and the assignment).
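> 
> As a toy illustration of point 2) (made-up vectors u, v, w, not the actual
> code), the two styles compare as
> 
>     \\ explicit loop: each u[j] = ... is a separate interpreted assignment
>     u = vector(n); for (j = 1, n, u[j] = v[j] + w[j]);
>     \\ vector() construct: the loop and the stores into u happen at C level,
>     \\ only the expression v[j] + w[j] is still interpreted
>     u = vector(n, j, v[j] + w[j]);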
> 
>> And while we're at it,  is Dmitry's use of a vector of vectors, t, to be
>> preferred to my earlier use of a matrix, also t,  with similar dimensions?
> 
> A vector of vectors will be a little faster, but it should not make a major
> difference. What made a difference was replacing a for() loop by a vector()
> construct in the innermost loop and removing the "deep" assignments
> t[k2][j] = ...
> (we still have an equivalent of the global copy t[kx] = t[k2] in the
> t[a] = vector(...)).
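> 
> Schematically (a simplified sketch ignoring the j == 1 boundary case and the
> ky bookkeeping), the change amounts to replacing
> 
>     for (j = 1, k, t[k2][j] = t[kx][j-1] + t[ky][j]);  \\ then copy t[kx] = t[k2]
> 
> by the single atomic row update
> 
>     t[a] = vector(k, j, t[kx][j-1] + t[ky][j]);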
> 
> Cheers,
> 
>    K.B.
> --
> Karim Belabas, IMB (UMR 5251)  Tel: (+33) (0)5 40 00 26 17
> Universite de Bordeaux         Fax: (+33) (0)5 40 00 21 23
> 351, cours de la Liberation    http://www.math.u-bordeaux.fr/~kbelabas/
> F-33405 Talence (France)       http://pari.math.u-bordeaux.fr/  [PARI/GP]
> `