Gonzalo Tornaria on Tue, 20 May 2003 11:19:59 -0500

[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]

Re: charpoly using too much stack space !

On Mon, May 19, 2003 at 07:05:26PM +0200, Karim BELABAS wrote:
> By killing a clone a bit earlier, I can reduce memory use by a further 30%,
> but that wouldn't work in the presence of t_INTMOD / t_PADIC / t_POLMOD
> because in general such objects obtained from out-of-stack data (such as
> clones) contain pointers to that data, not copies. So I can't delete the
> initial clone before creating the new one [ which transforms all such
> pointers into hard copies ].

I wouldn't stop sleeping for this 30%. The O(n) factor you and Bill just
removed is enough for me :-)

> > Notably, if the matrix has only 400 1's, and the rest 0's, it only
> > takes 8,5 secs; on the other hand, I can do it in 1,2 secs with a GP
> > program that "knows" how to deal with such a sparse matrix.
> This is an interesting observation. I have tweaked general matrix
> multiplication so as to take advantage of many zero entries [ there's still
> no specialized treatment and representation for sparse matrices ! ]. It has a
> negligible impact on dense multiplication and a huge one on very sparse
> matrices. This rather improves the charpoly() routine, though not to the
> extent you mention. Committed to CVS.

The difference was impressive (a factor of 4 to 5, see below).

> A further optimization for +1 entries is complicated/unsafe ( there are many
> things which can be "equal to 1" and still not be neutral wrt multiplication ),
> and only yields marginal improvements ( say 15% at most ), so I've not
> committed this one.

Well, the t_INT 1 is certainly neutral wrt multiplication (or isn't?).
That alone would be useful, but 15% is not that important. What I think
is that just one multiplication cannot be improved too much more, BUT if
we need to multiply several times by the same sparse matrix, we should
"study" the matrix first (just once), and then use this every time. This
is perfect for charpoly, where all the work amounts to "n"
multiplications by the same (presumably sparse) matrix.  Note that the
intermediate results are NOT sparse (in general).

I'll follow this in a different message, since I'm getting very
interesting results.

> Can you update from CVS and check the speed improvement your GP
> implementation provides ? [ and could you post the latter in the case it
> still outperforms the native routine ? ]

Here are my results (P4 2.0 GHz) my matrices are adjacency matrices of
4-regular directed graphs, which is to say that they have 4 1's in each
row, and the rest are 0's.

100x100: PARI-before=  8,3s, PARI-after=  1,9s, GP=  1,0s
200x200: PARI-before=132,8s, PARI-after= 26,1s, GP=  7,7s
500x500: PARI-before=  ?,?h, PARI-after=882,6s, GP=153,6s

[Note: everything computed with 16e6 stack, except the PARI 500x500 for
which I needed to enlarge the stack; I believe it worked with 64e6]

GM/CS/S d? a-- C++(+++)  UL+++(++++)  P++>+++  L+++>++++ E---
W-(+) N+(++) w--- O M-(--) V-(--) PGP(--) b++(+++) G++ e>++++

... A mathematician is a machine for converting coffee into theorems.