Karim Belabas on Sun, 14 Jul 2019 18:12:01 +0200



Re: Fwd: Re: Using local variable for memoizing in recursion


* Mike Day [2019-07-14 15:03]:
> I've just noticed an edge-condition failure in Karim's version (below):
> need to replace
>    vecsum(t[a-1]);  \\ at end of function by
>    vecsum(t[kx]);   \\ was problem when a=1 !

Good catch :-)

> I'd accepted Dmitry's assertion that execution of a particular line took
> half the execution time, so sought a way to avoid the calculation.

The assertion is plausible: the innermost loop is going to take 100% of CPU
time and contains only 2 lines

     ky=1+(k1-j+i)%(k1); /* to be optimized, takes half the CPU time */
     t[k2][j]=t[kx][j-1]+t[ky][j];

which look comparable in cost (see below: the first involves more atomic
steps, but on tiny inputs, compared to the large addition and assignment in
the second line).

> I still wonder, though,  how I can _examine_ a function's performance,
> line by line, in order to optimise by hand/eye/brain...

I don't see an easy way in GP (esp. under Windows). The following usually
works but is cumbersome: replace the instruction block to be timed by a loop
repeating the same block enough times that an individual timing becomes
meaningful, and surround the loop by something like

  gettime();
  for(cnt = 1, MAXCNT, ...);
  T += gettime();

Make sure the loop operates on dummy variables (say, my_t) so as to have no
effect on the program flow. Then print T / MAXCNT and compare. One must also
check that an empty loop for(cnt = 1, MAXCNT, ) takes negligible time [which
will be the case here], or take that into account as well...
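Concretely, such a harness might look like the following sketch (MAXCNT, my_t
and the dummy expression are placeholders for the block actually being timed;
gettime() returns the milliseconds elapsed since its last call):

  MAXCNT = 10^6;
  gettime();                        \\ reset the timer
  for (cnt = 1, MAXCNT, );          \\ empty loop: measure pure loop overhead
  T0 = gettime();
  for (cnt = 1, MAXCNT,
    my_t = 1 + (17 - 5 + 3) % 17    \\ dummy copy of the block being timed
  );
  T = gettime();
  print((T - T0) / MAXCNT, " ms per iteration, overhead subtracted");

Repeating the measurement a few times and keeping the smallest value helps
filter out timing noise from the rest of the system.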

What I sometimes do is have a look at gprof [external profiling tool] and
decide from there. Just did it: in this case, it's hard to identify what takes
time from the C function names:

< gprof excerpt, run on the original pnkv_3(10000,2000) >
index % time    self  children    called     name
[3]     92.0    3.05    3.64 20010027+1       closure_eval <cycle 2> [3]
                0.94    0.53 60014000/60014000     gadd [4]
                0.05    0.35 20008000/20008000     change_compo [7]
                0.01    0.25 20008003/20008003     changelex [9]
                0.07    0.17 20007999/20007999     gmod [10]
                0.12    0.07 40016003/80028023     gunclone_deep [8]
                0.19    0.00 120008001/120008001     check_array_index [12]
                0.09    0.09 40011998/40011998     copylex [13]
                0.14    0.00 80002000/80002000     closure_castgen [16]
                0.10    0.00 40016010/40016010     stoi [22]
                0.05    0.00 39996001/39996001     gsub [28]

(N.B. The timings from gprof are usually inaccurate but give a rough idea
of relative cost; the number of calls is more useful.)

A line such as

     ky=1+(k1-j+i)%(k1); /* to be optimized, takes half the CPU time */

translates to 2 gadd, 1 gsub, 1 gmod and 1 changelex. It is run ~ 2.10^7 times
(the number of 'gmod' calls in the gprof output, but also ~ n * k by direct
code inspection).

The other line in the innermost loop

     t[k2][j]=t[kx][j-1]+t[ky][j];

translates to 1 gadd, 1 gsub, 1 change_compo (the assignment) and 6
check_array_index (the [] constructs). Note that this latter gadd operates on
much larger inputs than the preceding ones [or the gsub and gmod from the
other line], but it's impossible to estimate that from gprof's output alone.


My version returns this:

< gprof excerpt, run on my pnk_4(10000,2000) >
index % time    self  children    called     name
[3]     91.0    3.06    3.09 40018025+1       closure_eval <cycle 2> [3]
                0.96    0.25 19988002/19990001     gadd [4]
                0.02    0.31 20038000/20038000     changelex [6]
                0.17    0.07 40055999/40074019     gunclone_deep [8]
                0.19    0.00 79972003/79972003     closure_castgen [10]
                0.04    0.13 19998000/19998000     geq [11]
                0.14    0.00 40028027/40028027     pari_stack_alloc [13]
                0.03    0.10 19998000/19998000     gsub1e [16]
                0.11    0.00 79982004/79982004     check_array_index [18]
                0.10    0.00 19988002/19988002     gsub [19]
                0.08    0.00 39986017/39986017     stoi [22]

and 100% of the CPU time is now concentrated on the following three lines:

     if (!(ky--), ky = k1);
     if (j == 1, t[ky][1]
               , t[kx][j-1] + t[ky][j]);

The reasons it's a little faster:
1) I removed the assignments to the t[k2] accumulator by using the
   t[a] = vector() construct [which replaces t[a] in one atomic step
   instead of changing the t[a][j] in place, which wouldn't work since the
   loop still needs the old values];
2) I removed many operations on small operands, letting the implicit loop
   in vector() do some of them in a faster way (+ and assignment).
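For illustration, here is the general shape of the modular-index optimization:
the value 1+(k1-j+i)%k1 decreases by exactly 1 each time j increases, wrapping
from 1 back to k1, so it can be maintained by a running decrement instead of
being recomputed. This is only a sketch (the loop bounds i, k, k1 stand in for
those of the quoted code):

  \\ instead of:  for (j = 1, k, ky = 1 + (k1 - j + i) % k1; ...)
  ky = 1 + (k1 - 1 + i) % k1;      \\ value of the index for j = 1
  for (j = 1, k,
    \\ ... use ky here ...
    ky--; if (!ky, ky = k1)        \\ cheap wrap-around, no gmod call
  );

The decrement-and-test costs one small subtraction and one comparison per
iteration, versus an addition, a subtraction and a gmod in the original.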

> And while we're at it,  is Dmitry's use of a vector of vectors, t, to be
> preferred to my earlier use of a matrix, also t,  with similar dimensions?

A vector of vectors will be a little faster, but it should not make a major
difference. What made a difference was replacing a for() loop by a vector()
construct in the innermost loop and removing the "deep" assignments
 t[k2][j] = ...
(we still have an equivalent of the global copy t[kx] = t[k2] in the
t[a] = vector(...)).
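As a toy illustration of that difference (names t, k2, kx, ky as in the
quoted code, with the recurrence simplified so that ky is fixed; in the real
function ky varies with j):

  \\ deep, element-by-element assignment: one change_compo per j
  for (j = 2, k, t[k2][j] = t[kx][j-1] + t[ky][j]);

  \\ versus building the whole row with vector() and assigning it atomically
  t[k2] = vector(k, j, if (j == 1, t[ky][1], t[kx][j-1] + t[ky][j]));

The second form avoids the per-element write into a nested structure and lets
the interpreter's built-in vector() loop handle the iteration.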

Cheers,

    K.B.
--
Karim Belabas, IMB (UMR 5251)  Tel: (+33) (0)5 40 00 26 17
Universite de Bordeaux         Fax: (+33) (0)5 40 00 21 23
351, cours de la Liberation    http://www.math.u-bordeaux.fr/~kbelabas/
F-33405 Talence (France)       http://pari.math.u-bordeaux.fr/  [PARI/GP]