Karim Belabas on Wed, 01 Mar 2023 11:43:46 +0100


Re: unable to load a large vector from a file


* Max Alekseyev [2023-03-01 03:15]:
> Hi Nicolas,
> 
> My data fits Vecsmall() and your suggestion does work nicely. However, I
> don't think the file size is the issue here. While my original file has
> size ~10GB, the one created by writebin() is ~8GB, which is not much
> smaller. So the file size alone does not explain the failure of read(). I
> got an impression that read() on binary files does a better job on parsing
> the data and properly allocating memory.

There's no parsing involved when handling 'writebin' data and (very) few
computations: it's a straight memory dump / read (where all pointers are
serialized using a translation with respect to a base address). Everything is
done in place and the memory needed for read or write is
 2 * size of the GEN saved + O(1)
(and time is guaranteed to be linear in the size of the output).
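
As a rough illustration of the binary round trip described above, here is a
minimal sketch on a small test vector. The file name "myvector.bin" and the
sizes are arbitrary choices for the example, not taken from the original
report:

```gp
\\ Build a small test vector; entries are kept below 2^20 so that they fit
\\ in a Vecsmall on both 32-bit and 64-bit builds.
v = Vecsmall(vector(10^5, i, random(2^20)));

\\ Dump the object in binary form. writebin() appends to an existing file,
\\ so this assumes "myvector.bin" does not exist yet.
writebin("myvector.bin", v);

\\ Read it back: no text parsing, no base conversion.
w = read("myvector.bin");

\\ Check that the round trip preserved the data.
print(w == v);
```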

When you 'write' or read back character strings, on the other hand, lots
of expensive conversions occur (between base 2^64 and base 10 for
integers; much worse for polynomials or more complicated objects).
This time, we struggle to get almost linear behaviour for integers
(but we do, due to non-trivial divide-and-conquer implementations)
and always get quadratic time behaviour for polynomials (think of how a
*computation* such as x^10000 + x^9999 + ... + 1 will take place, as written).
And memory use is no longer optimal; in principle it could easily get
quadratic and this is avoided by random garbage collection occurring in the
evaluator. How this fares in practice depends on parisizemax, RAM and
swap space usage, the actual memory layout, etc. In particular, if
parisizemax is too large, you'll enter swapping hell.

Upshot: just don't use write() for anything which is not meant to be
human readable. (In particular, anything large.)

Cheers,

    K.B.

P.S. In general, reading in a large vector guarantees worst case memory use.
It's much better to write one object per line and use readvec().
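
A minimal sketch of that layout, on a toy vector. The file name is an
arbitrary choice; write() appends, so the file is assumed not to exist
beforehand. Calling write() once per entry is only meant to show the file
format, it is not meant to be fast:

```gp
\\ Toy data standing in for the large vector.
v = vector(10, i, i^2);

\\ One object per line: each write() call appends its argument and a newline.
for (i = 1, #v, write("myvector-lines.txt", v[i]));

\\ readvec() evaluates the file line by line and collects the results
\\ in a vector.
w = readvec("myvector-lines.txt");

\\ Check that the data read back matches the original.
print(w == v);
```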

> Or maybe it's the Vecsmall() type
> that saved the day. read() was also quite fast here.
> Anyway, it'd be helpful to understand why my initial approach did not work.
> 
> Regards,
> Max
> 
> On Tue, Feb 28, 2023 at 2:41 PM Nicolas Mascot <mascotn@maths.tcd.ie> wrote:
> 
> > Hi Max,
> >
> > You could try to use writebin() instead of write(); this usually results
> > in smaller files.
> >
> > If your integers are small enough to fit on a long (I am not 100% sure,
> > but I think that the range is -2^63 to 2^63-1), you would also save a
> > lot of space by converting your vector to a vecsmall ( = vector of small
> > integers):
> > writebin("myvector",Vecsmall(v));
> > If needed, you can then turn the result back into a (regular) vector
> > with Vec():
> > Vec(read("myvector"))
> > but in most cases that should not be necessary.
> >
> > Best regards,
> > Nicolas
> >
> > On 28/02/2023 19:30, Max Alekseyev wrote:
> > > I have a vector v of size 10^9 with moderately-sized integers, which I
> > > saved to a file with
> > > write("myvector.txt",v);
> > > However, I'm unable to read it back. PARI/GP either runs out of memory
> > > (even with 96GB memory allocation!) or is just killed by the system.
> > > I tried both
> > > v = read("myvector.txt");
> > > and
> > > \r myvector.txt
> > >
> > > Is there a way around?
> > >
> > > PS. The resulting file size is just 10GB and so it should not be an
> > > issue to read it as a whole into memory and then parse it, but PARI/GP
> > > somehow fails to do that.
> > >
> > > Regards,
> > > Max
> >
> >
> >

    K.B.
--
Pr Karim Belabas, U. Bordeaux, Vice-président en charge du Numérique
Institut de Mathématiques de Bordeaux UMR 5251 - (+33) 05 40 00 29 77
http://www.math.u-bordeaux.fr/~kbelabas/