Peter Bruin on Thu, 09 Mar 2017 11:01:26 +0100


[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]

Linear algebra via CUP decomposition and reduction to matrix multiplication


Hello,

Here is a set of patches that speeds up many linear algebra computations
over fields of small characteristic (Flm, FlxqM) by using the CUP matrix
decomposition [1].  This decomposes any m × n matrix of rank r over a
field as C*U*P with C in column echelon form of size m × r, U upper
triangular of size r × n and P a permutation matrix of size n × n.

Computing such a decomposition, and applying it to solving linear
systems and other linear algebra operations, can be done using a
divide-and-conquer approach that finally reduces everything to matrix
multiplication.  If ω is the exponent of matrix multiplication, then
computing a CUP decomposition and doing other linear algebra operations
can be done in O(mnr^{ω-2}) field operations.

So far I have only implemented the Flm and FlxqM variants; adding other
finite fields should be straightforward.  In principle it should be
possible to use the bb_field framework.  However, this is not the most
natural approach in my opinion, since the building blocks in this new
code are not field operations, but matrix operations (matrix addition,
matrix multiplication, scalar-matrix multiplication and solving
triangular systems).

Since matrix multiplication over Flxq fields is already reduced to
matrix multiplication over Z via Kronecker substition, and since we have
Strassen multiplication over Z, linear algebra over Flxq fields now
becomes asymptotically faster than O(n^3), namely O(n^{log_2(7)}).
Strassen multiplication over Fl fields is not implemented yet but would
be easy to add, although the matrix sizes for which it will have a
substantial effect will probably be larger than over Flxq fields.

The code also performs well in practice: almost all linear algebra
routines become several times faster for large matrices, and in the
benchmarks I used (see timings.gp, timings-old.txt, timings-new.txt),
the total time went down by a factor around 4.76.

Matrix operations over Z that are implemented using a multimodular
algorithm (determinant, kernel, pivots) also profit from the speedup of
the corresponding Fl algorithms.  The examples that I tested
(determinants and kernels of 120 x 120 matrices, computing a space of
modular symbols and its new subspace) are now about 30% faster.

The first two patches add the following functions that are used by the
new code: perm_sign (sign of a permutation), FlxC_neg, FlxC_sub,
FlxC_to_ZXC, FlxqC_Flxq_mul, FlxqM_Flxq_mul, zero_FlxC, zero_FlxM.

Most of the new code is contained in the third patch.  It is quite big
(1234 insertions/deletions in alglin1.c), but I have tried to make it
fit as cleanly as possible into the existing code.  Some linear algebra
functions only need an echelon form, not a full CUP decomposition.  The
echelon form could simply be computed via the CUP decomposition, which
would not be much slower but uses much more memory, so I wrote separate
functions for computing echelon forms.

In the fourth patch, I added a number of tests to make sure the new code
is covered by the testsuite as much as possible (up to some gerepile
code and a few lines that are hard to reach from GP).  Besides that,
throughout the writing of this code I have already used it extensively
in my computations with curves over finite fields.

There is one small user-visible change: the various gauss_pivot and
indexrank functions used to return the list of pivot columns in the row
echelon form, and for each pivot column the first possible pivot row.
The new algorithm instead returns the list of pivot rows in the column
echelon from, and for each pivot row some choice of pivot column.  This
explains the changed output in the "nf" and "rnfkummer" tests in the
last patch.  I checked that the new output is mathematically equivalent
to the old output.

Thanks,

Peter


[1] C.-P. Jeannerod, C. Pernet and A. Storjohann, Rank-profile revealing
Gaussian elimination and the CUP matrix decomposition.  J. Symbolic
Comput. 56 (2013), 46–68.


>From 5e46827e76565c843406ca548bc68e3185299585 Mon Sep 17 00:00:00 2001
From: Peter Bruin <P.J.Bruin@math.leidenuniv.nl>
Date: Thu, 16 Feb 2017 07:45:17 +0100
Subject: [PATCH 1/5] add function perm_sign

---
 doc/usersch5.tex       |  2 ++
 src/basemath/perm.c    | 12 ++++++++++++
 src/headers/paridecl.h |  1 +
 3 files changed, 15 insertions(+)

diff --git a/doc/usersch5.tex b/doc/usersch5.tex
index 029e1f3..8dfd02f 100644
--- a/doc/usersch5.tex
+++ b/doc/usersch5.tex
@@ -11629,6 +11629,8 @@ a product of disjoint cycles (\kbd{cyc}); return $p^n$ (as a \kbd{cyc}).
 \fun{long}{perm_order}{GEN p} returns the order of the permutation $p$
 (as the lcm of its cycle lengths).
 
+\fun{long}{perm_sign}{GEN p} returns the sign of the permutation $p$.
+
 \fun{GEN}{vecperm_orbits}{GEN p, long n} the permutation $p\in S_n$ being
 given as a product of disjoint cycles, return the orbits of the subgroup
 generated by $p$ on $\{1,2,\ldots,n\}$.
diff --git a/src/basemath/perm.c b/src/basemath/perm.c
index 7931397..291e1ca 100644
--- a/src/basemath/perm.c
+++ b/src/basemath/perm.c
@@ -316,6 +316,18 @@ perm_order(GEN v)
   avma = ltop; return d;
 }
 
+/* sign of a permutation */
+long
+perm_sign(GEN v)
+{
+  pari_sp av = avma;
+  GEN c = vecperm_orbits_i(mkvec(v), lg(v)-1);
+  long i, l = lg(c), s = 1;
+  for (i = 1; i < l; i++)
+    if (odd(lg(gel(c, i)))) s = -s;
+  avma = av; return s;
+}
+
 static long
 isperm(GEN v)
 {
diff --git a/src/headers/paridecl.h b/src/headers/paridecl.h
index e97d275..d048f22 100644
--- a/src/headers/paridecl.h
+++ b/src/headers/paridecl.h
@@ -3916,6 +3916,7 @@ int     perm_commute(GEN p, GEN q);
 GEN     perm_cycles(GEN v);
 long    perm_order(GEN perm);
 GEN     perm_pow(GEN perm, long exp);
+long    perm_sign(GEN perm);
 long    permorder(GEN perm);
 GEN     quotient_group(GEN C, GEN G);
 GEN     quotient_perm(GEN C, GEN p);
-- 
2.7.3

>From af6892eb0708a6c1c1b216562e1dd3d44c9dee76 Mon Sep 17 00:00:00 2001
From: Peter Bruin <P.J.Bruin@math.leidenuniv.nl>
Date: Tue, 10 Jan 2017 16:09:10 +0100
Subject: [PATCH 2/5] add various functions for FlxV, FlxM, FlxqV and FlxqM

---
 doc/usersch5.tex       | 19 +++++++++++-
 src/basemath/Flx.c     | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++
 src/headers/paridecl.h | 10 +++++-
 3 files changed, 109 insertions(+), 2 deletions(-)

diff --git a/doc/usersch5.tex b/doc/usersch5.tex
index 8dfd02f..4d24ea9 100644
--- a/doc/usersch5.tex
+++ b/doc/usersch5.tex
@@ -4256,7 +4256,8 @@ and \kbd{y}
 \kbd{deplin=0}) or \kbd{F2m\_deplin} (if \kbd{deplin=1}), in place
 (destroys~\kbd{x}).
 
-\subsec{\kbd{FlxqV}, \kbd{FlxqM}} See \kbd{FqV}, \kbd{FqM} operations.
+\subsec{\kbd{FlxqV}, \kbd{FlxqC}, \kbd{FlxqM}}
+See \kbd{FqV}, \kbd{FqC}, \kbd{FqM} operations.
 
 \fun{GEN}{FlxqV_dotproduct}{GEN x, GEN y, GEN T, ulong p} as
 \kbd{FpV\_dotproduct}.
@@ -4264,6 +4265,10 @@ and \kbd{y}
 \fun{GEN}{FlxM_Flx_add_shallow}{GEN x, GEN y, ulong p} as
 \kbd{RgM\_Rg\_add\_shallow}.
 
+\fun{GEN}{FlxqC_Flxq_mul}{GEN x, GEN y, GEN T, ulong p}
+
+\fun{GEN}{FlxqM_Flxq_mul}{GEN x, GEN y, GEN T, ulong p}
+
 \fun{GEN}{FlxqM_FlxqC_gauss}{GEN a, GEN b, GEN T, ulong p}
 
 \fun{GEN}{FlxqM_FlxqC_invimage}{GEN a, GEN b, GEN T, ulong p}
@@ -5506,11 +5511,23 @@ returns their product.
 \fun{ulong}{FlxC_eval_powers_pre}{GEN x, GEN y, ulong p, ulong pi}
 apply \kbd{Flx\_eval\_powers\_pre} to all elements of \kbd{x}.
 
+\fun{GEN}{FlxC_neg}{GEN x, ulong p}
+
+\fun{GEN}{FlxC_sub}{GEN x, GEN y, ulong p}
+
+\fun{GEN}{zero_FlxC}{long n, long sv}
+
 \subsec{\kbd{FlxM}} See \kbd{FpXM} operations.
 
 \fun{ulong}{FlxM_eval_powers_pre}{GEN M, GEN y, ulong p, ulong pi}
 this function applies \kbd{FlxC\_eval\_powers\_pre} to all entries of \kbd{M}.
 
+\fun{GEN}{FlxM_neg}{GEN x, ulong p}
+
+\fun{GEN}{FlxM_sub}{GEN x, GEN y, ulong p}
+
+\fun{GEN}{zero_FlxM}{long r, long c, long sv}
+
 \subsec{\kbd{FlxT}} See \kbd{FpXT} operations.
 
 \fun{GEN}{FlxT_red}{GEN V, ulong p} reduces each leaf with \kbd{Flx\_red}.
diff --git a/src/basemath/Flx.c b/src/basemath/Flx.c
index a9f2c1b..ec9eb7b 100644
--- a/src/basemath/Flx.c
+++ b/src/basemath/Flx.c
@@ -3567,6 +3567,68 @@ FlxM_eval_powers_pre(GEN z, GEN x, ulong p, ulong pi)
   return y;
 }
 
+GEN
+zero_FlxC(long n, long sv)
+{
+  long i;
+  GEN x = cgetg(n + 1, t_COL);
+  GEN z = zero_Flx(sv);
+  for (i = 1; i <= n; i++)
+    gel(x, i) = z;
+  return x;
+}
+
+GEN
+FlxC_neg(GEN x, ulong p)
+{
+  long i, l = lg(x);
+  GEN z = cgetg(l, t_COL);
+  for (i = 1; i < l; i++)
+    gel(z, i) = Flx_neg(gel(x, i), p);
+  return z;
+}
+
+GEN
+FlxC_sub(GEN x, GEN y, ulong p)
+{
+  long i, l = lg(x);
+  GEN z = cgetg(l, t_COL);
+  for (i = 1; i < l; i++)
+    gel(z, i) = Flx_sub(gel(x, i), gel(y, i), p);
+  return z;
+}
+
+GEN
+zero_FlxM(long r, long c, long sv)
+{
+  long j;
+  GEN x = cgetg(c + 1, t_MAT);
+  GEN z = zero_FlxC(r, sv);
+  for (j = 1; j <= c; j++)
+    gel(x, j) = z;
+  return x;
+}
+
+GEN
+FlxM_neg(GEN x, ulong p)
+{
+  long j, l = lg(x);
+  GEN z = cgetg(l, t_MAT);
+  for (j = 1; j < l; j++)
+    gel(z, j) = FlxC_neg(gel(x, j), p);
+  return z;
+}
+
+GEN
+FlxM_sub(GEN x, GEN y, ulong p)
+{
+  long j, l = lg(x);
+  GEN z = cgetg(l, t_MAT);
+  for (j = 1; j < l; j++)
+    gel(z, j) = FlxC_sub(gel(x, j), gel(y, j), p);
+  return z;
+}
+
 /***********************************************************************/
 /**                                                                   **/
 /**                               FlxX                                **/
@@ -4831,6 +4893,26 @@ FlxqV_roots_to_pol(GEN V, GEN T, ulong p, long v)
 
 /*** FlxqM ***/
 
+GEN
+FlxqC_Flxq_mul(GEN x, GEN y, GEN T, ulong p)
+{
+  long i, l = lg(x);
+  GEN z = cgetg(l, t_COL);
+  for (i = 1; i < l; i++)
+    gel(z, i) = Flxq_mul(gel(x, i), y, T, p);
+  return z;
+}
+
+GEN
+FlxqM_Flxq_mul(GEN x, GEN y, GEN T, ulong p)
+{
+  long j, l = lg(x);
+  GEN z = cgetg(l, t_MAT);
+  for (j = 1; j < l; j++)
+    gel(z, j) = FlxqC_Flxq_mul(gel(x, j), y, T, p);
+  return z;
+}
+
 static GEN
 kron_pack_Flx_spec_half(GEN x, long l) {
   if (l == 0)
diff --git a/src/headers/paridecl.h b/src/headers/paridecl.h
index d048f22..d53b8e0 100644
--- a/src/headers/paridecl.h
+++ b/src/headers/paridecl.h
@@ -320,10 +320,14 @@ GEN     Flx_to_ZX_inplace(GEN z);
 GEN     Flx_triple(GEN y, ulong p);
 long    Flx_val(GEN x);
 long    Flx_valrem(GEN x, GEN *Z);
-GEN     FlxC_to_ZXC(GEN x);
 GEN     FlxC_eval_powers_pre(GEN z, GEN x, ulong p, ulong pi);
+GEN     FlxC_neg(GEN x, ulong p);
+GEN     FlxC_sub(GEN x, GEN y, ulong p);
+GEN     FlxC_to_ZXC(GEN x);
 GEN     FlxM_Flx_add_shallow(GEN x, GEN y, ulong p);
 GEN     FlxM_eval_powers_pre(GEN z, GEN x, ulong p, ulong pi);
+GEN     FlxM_neg(GEN x, ulong p);
+GEN     FlxM_sub(GEN x, GEN y, ulong p);
 GEN     FlxM_to_ZXM(GEN z);
 GEN     FlxT_red(GEN z, ulong p);
 GEN     FlxV_Flc_mul(GEN V, GEN W, ulong p);
@@ -385,6 +389,8 @@ GEN     Flxq_sqr(GEN y,GEN T,ulong p);
 GEN     Flxq_sqrt(GEN a, GEN T, ulong p);
 GEN     Flxq_sqrtn(GEN a, GEN n, GEN T, ulong p, GEN *zetan);
 ulong   Flxq_trace(GEN x, GEN T, ulong p);
+GEN     FlxqC_Flxq_mul(GEN x, GEN y, GEN T, ulong p);
+GEN     FlxqM_Flxq_mul(GEN x, GEN y, GEN T, ulong p);
 GEN     FlxqV_dotproduct(GEN x, GEN y, GEN T, ulong p);
 GEN     FlxqV_roots_to_pol(GEN V, GEN T, ulong p, long v);
 GEN     FlxqX_FlxqXQ_eval(GEN Q, GEN x, GEN S, GEN T, ulong p);
@@ -443,6 +449,8 @@ GEN     pol1_FlxX(long v, long sv);
 GEN     polx_FlxX(long v, long sv);
 GEN     random_Flx(long d1, long v, ulong p);
 GEN     random_FlxqX(long d1, long v, GEN T, ulong p);
+GEN     zero_FlxC(long n, long sv);
+GEN     zero_FlxM(long r, long c, long sv);
 GEN     zx_to_Flx(GEN x, ulong p);
 GEN     zxX_to_FlxX(GEN B, ulong p);
 GEN     zxX_to_Kronecker(GEN P, GEN Q);
-- 
2.7.3

>From 7eac9551987a0bf59c25bef1de14dcafd480f42c Mon Sep 17 00:00:00 2001
From: Peter Bruin <P.J.Bruin@math.leidenuniv.nl>
Date: Thu, 16 Feb 2017 07:11:34 +0100
Subject: [PATCH 3/5] implement Flm_echelon, Flm_CUP, FlxqM_echelon, FlxqM_CUP

use echelon form/CUP factorisation for deplin, det, gauss, image,
indexrank, inv, invimage, ker, rank, suppl
---
 src/basemath/alglin1.c | 1234 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 1191 insertions(+), 43 deletions(-)

diff --git a/src/basemath/alglin1.c b/src/basemath/alglin1.c
index 83ecc62..a9ee388 100644
--- a/src/basemath/alglin1.c
+++ b/src/basemath/alglin1.c
@@ -527,6 +527,779 @@ image_from_pivot(GEN x, GEN d, long r)
 
 /*******************************************************************/
 /*                                                                 */
+/*                Echelon form and CUP decomposition               */
+/*                                                                 */
+/*******************************************************************/
+
+/*
+  Decompose an m x n-matrix A of rank r as C*U*P, with
+  - C: m x r-matrix in column echelon form (not necessarily reduced)
+       with all pivots equal to 1
+  - U: upper-triangular r x n-matrix
+  - P: permutation matrix
+  The pivots of C and the known zeroes in C and U are not necessarily
+  filled in; instead, we also return the vector R of pivot rows.
+  Instead of the matrix P, we return the permutation p of [1..n]
+  (t_VECSMALL) such that P[i,j] = 1 if and only if j = p[i].
+*/
+
+/* complement of a strictly increasing subsequence of (1, 2, ..., n) */
+static GEN
+indexcompl(GEN v, long n) {
+  long i, j, k, m = lg(v) - 1;
+  GEN w = cgetg(n - m + 1, t_VECSMALL);
+  for (i = j = k = 1; i <= n; i++) {
+    if (j <= m && v[j] == i)
+      j++;
+    else
+      w[k++] = i;
+  }
+  return w;
+}
+
+static GEN
+Flm_rsolve_upper_1(GEN U, GEN B, ulong p) {
+  return Flm_Fl_mul(B, Fl_inv(ucoeff(U, 1, 1), p), p);
+}
+
+static GEN
+Flm_rsolve_upper_2(GEN U, GEN B, ulong p) {
+  ulong a = ucoeff(U, 1, 1), b = ucoeff(U, 1, 2), d = ucoeff(U, 2, 2);
+  ulong D = Fl_mul(a, d, p), Dinv = Fl_inv(D, p);
+  ulong ainv = Fl_mul(d, Dinv, p), dinv = Fl_mul(a, Dinv, p);
+  GEN B1 = rowslice(B, 1, 1);
+  GEN B2 = rowslice(B, 2, 2);
+  GEN X2 = Flm_Fl_mul(B2, dinv, p);
+  GEN X1 = Flm_Fl_mul(Flm_sub(B1, Flm_Fl_mul(X2, b, p), p),
+                      ainv, p);
+  return vconcat(X1, X2);
+}
+
+/* solve U*X = B,  U upper triangular and invertible */
+static GEN
+Flm_rsolve_upper(GEN U, GEN B, ulong p) {
+  long n = lg(U) - 1, n1;
+  GEN U1, U2, U11, U12, U22;
+  GEN B1, B2, X1, X2, X;
+  pari_sp av = avma;
+
+  if (n == 0)
+    return B;
+  if (n == 1)
+    return Flm_rsolve_upper_1(U, B, p);
+  if (n == 2)
+    return Flm_rsolve_upper_2(U, B, p);
+  n1 = (n + 1)/2;
+  U1 = vecslice(U, 1, n1);
+  U2 = vecslice(U, n1 + 1, n);
+  U11 = rowslice(U1, 1, n1);
+  U12 = rowslice(U2, 1, n1);
+  U22 = rowslice(U2, n1 + 1, n);
+  B1 = rowslice(B, 1, n1);
+  B2 = rowslice(B, n1 + 1, n);
+  X2 = Flm_rsolve_upper(U22, B2, p);
+  if (gc_needed(av, 1))
+    gerepileall(av, 4, &B1, &U11, &U12, &X2);
+  B1 = Flm_sub(B1, Flm_mul(U12, X2, p), p);
+  if (gc_needed(av, 1))
+    gerepileall(av, 3, &B1, &U11, &X2);
+  X1 = Flm_rsolve_upper(U11, B1, p);
+  X = vconcat(X1, X2);
+  if (gc_needed(av, 1))
+    X = gerepilecopy(av, X);
+  return X;
+}
+
+static GEN
+Flm_lsolve_upper_1(GEN U, GEN B, ulong p) {
+  return Flm_Fl_mul(B, Fl_inv(ucoeff(U, 1, 1), p), p);
+}
+
+static GEN
+Flm_lsolve_upper_2(GEN U, GEN B, ulong p) {
+  ulong a = ucoeff(U, 1, 1), b = ucoeff(U, 1, 2), d = ucoeff(U, 2, 2);
+  ulong D = Fl_mul(a, d, p), Dinv = Fl_inv(D, p);
+  ulong ainv = Fl_mul(d, Dinv, p), dinv = Fl_mul(a, Dinv, p);
+  GEN B1 = vecslice(B, 1, 1);
+  GEN B2 = vecslice(B, 2, 2);
+  GEN X1 = Flm_Fl_mul(B1, ainv, p);
+  GEN X2 = Flm_Fl_mul(Flm_sub(B2, Flm_Fl_mul(X1, b, p), p),
+                      dinv, p);
+  return shallowconcat(X1, X2);
+}
+
+/* solve X*U = B,  U upper triangular and invertible */
+static GEN
+Flm_lsolve_upper(GEN U, GEN B, ulong p) {
+  long n = lg(U) - 1, n1;
+  GEN U1, U2, U11, U12, U22;
+  GEN B1, B2, X1, X2, X;
+  pari_sp av = avma;
+
+  if (n == 0)
+    return B;
+  if (n == 1)
+    return Flm_lsolve_upper_1(U, B, p);
+  if (n == 2)
+    return Flm_lsolve_upper_2(U, B, p);
+  n1 = (n + 1)/2;
+  U1 = vecslice(U, 1, n1);
+  U2 = vecslice(U, n1 + 1, n);
+  U11 = rowslice(U1, 1, n1);
+  U12 = rowslice(U2, 1, n1);
+  U22 = rowslice(U2, n1 + 1, n);
+  B1 = vecslice(B, 1, n1);
+  B2 = vecslice(B, n1 + 1, n);
+  X1 = Flm_lsolve_upper(U11, B1, p);
+  if (gc_needed(av, 1))
+    gerepileall(av, 4, &B2, &U12, &U22, &X1);
+  B2 = Flm_sub(B2, Flm_mul(X1, U12, p), p);
+  if (gc_needed(av, 1))
+    gerepileall(av, 3, &B2, &U22, &X1);
+  X2 = Flm_lsolve_upper(U22, B2, p);
+  X = shallowconcat(X1, X2);
+  if (gc_needed(av, 1))
+    X = gerepilecopy(av, X);
+  return X;
+}
+
+static GEN
+Flm_rsolve_lower_unit_1(GEN L, GEN A, ulong p) {
+  return rowslice(A, 1, 1);
+}
+
+static GEN
+Flm_rsolve_lower_unit_2(GEN L, GEN A, ulong p) {
+  GEN X1 = rowslice(A, 1, 1);
+  GEN X2 = Flm_sub(rowslice(A, 2, 2),
+                   Flm_Fl_mul(X1, ucoeff(L, 2, 1), p), p);
+  return vconcat(X1, X2);
+}
+
+/*
+  solve L*X = A,  L lower triangular with ones on the diagonal
+  (at least as many rows as columns)
+*/
+static GEN
+Flm_rsolve_lower_unit(GEN L, GEN A, ulong p) {
+  long m = lg(L) - 1, m1, n;
+  GEN L1, L2, L11, L21, L22, A1, A2, X1, X2, X;
+  pari_sp av = avma;
+
+  if (m == 0)
+    return zero_Flm(0, lg(A) - 1);
+  if (m == 1)
+    return Flm_rsolve_lower_unit_1(L, A, p);
+  if (m == 2)
+    return Flm_rsolve_lower_unit_2(L, A, p);
+  m1 = (m + 1)/2;
+  n = nbrows(L);
+  L1 = vecslice(L, 1, m1);
+  L2 = vecslice(L, m1 + 1, m);
+  L11 = rowslice(L1, 1, m1);
+  L21 = rowslice(L1, m1 + 1, n);
+  L22 = rowslice(L2, m1 + 1, n);
+  A1 = rowslice(A, 1, m1);
+  A2 = rowslice(A, m1 + 1, n);
+  X1 = Flm_rsolve_lower_unit(L11, A1, p);
+  if (gc_needed(av, 1))
+    gerepileall(av, 4, &A2, &L21, &L22, &X1);
+  A2 = Flm_sub(A2, Flm_mul(L21, X1, p), p);
+  if (gc_needed(av, 1))
+    gerepileall(av, 3, &A2, &L22, &X1);
+  X2 = Flm_rsolve_lower_unit(L22, A2, p);
+  X = vconcat(X1, X2);
+  if (gc_needed(av, 1))
+    X = gerepilecopy(av, X);
+  return X;
+}
+
+static GEN
+Flm_lsolve_lower_unit_2(GEN L, GEN A, ulong p) {
+  GEN X2 = vecslice(A, 2, 2);
+  GEN X1 = Flm_sub(vecslice(A, 1, 1),
+                   Flm_Fl_mul(X2, ucoeff(L, 2, 1), p), p);
+  return shallowconcat(X1, X2);
+}
+
+/* solve L*X = A,  L square lower triangular with ones on the diagonal */
+static GEN
+Flm_lsolve_lower_unit(GEN L, GEN A, ulong p) {
+  long m = lg(L) - 1, m1;
+  GEN L1, L2, L11, L21, L22, A1, A2, X1, X2, X;
+  pari_sp av = avma;
+
+  if (m <= 1)
+    return A;
+  if (m == 2)
+    return Flm_lsolve_lower_unit_2(L, A, p);
+  m1 = (m + 1)/2;
+  L1 = vecslice(L, 1, m1);
+  L2 = vecslice(L, m1 + 1, m);
+  L11 = rowslice(L1, 1, m1);
+  L21 = rowslice(L1, m1 + 1, m);
+  L22 = rowslice(L2, m1 + 1, m);
+  A1 = vecslice(A, 1, m1);
+  A2 = vecslice(A, m1 + 1, m);
+  X2 = Flm_lsolve_lower_unit(L22, A2, p);
+  if (gc_needed(av, 1))
+    gerepileall(av, 4, &A1, &L11, &L21, &X2);
+  A1 = Flm_sub(A1, Flm_mul(X2, L21, p), p);
+  if (gc_needed(av, 1))
+    gerepileall(av, 3, &A1, &L11, &X2);
+  X1 = Flm_lsolve_lower_unit(L11, A1, p);
+  X = shallowconcat(X1, X2);
+  if (gc_needed(av, 1))
+    X = gerepilecopy(av, X);
+  return X;
+}
+
+/* destroy A */
+static long
+Flm_CUP_gauss(GEN A, GEN *R, GEN *C, GEN *U, GEN *P, ulong p) {
+  long i, j, k, m = nbrows(A), n = lg(A) - 1, pr, pc, u, v;
+  pari_sp av;
+
+  *P = identity_perm(n);
+  *R = cgetg(m + 1, t_VECSMALL);
+  av = avma;
+  for (j = 1, pr = 0; j <= n; j++) {
+    for (pr++, pc = 0; pr <= m; pr++) {
+      for (k = j; k <= n; k++) {
+        v = ucoeff(A, pr, k);
+        if (!pc && v)
+          pc = k;
+      }
+      if (pc)
+        break;
+    }
+    if (!pc)
+      break;
+    (*R)[j] = pr;
+    if (pc != j) {
+      swap(gel(A, j), gel(A, pc));
+      lswap((*P)[j], (*P)[pc]);
+    }
+    if (pr != j) {
+      for (k = j; k <= n; k++)
+        ucoeff(A, j, k) = ucoeff(A, pr, k);
+    }
+    u = Fl_inv(ucoeff(A, j, j), p);
+    for (i = pr + 1; i <= m; i++) {
+      v = Fl_mul(ucoeff(A, i, j), u, p);
+      ucoeff(A, i, j) = v;
+      for (k = j + 1; k <= n; k++) {
+        ucoeff(A, i, k) = Fl_sub(ucoeff(A, i, k),
+                                 Fl_mul(ucoeff(A, j, k), v, p), p);
+      }
+    }
+  }
+  setlg(*R, j);
+  *C = vecslice(A, 1, j - 1);
+  *U = rowslice(A, 1, j - 1);
+  if (gc_needed(av, 1))
+    gerepileall(av, 2, C, U);
+  return j - 1;
+}
+
+static const long Flm_CUP_LIMIT = 8;
+
+static ulong
+Flm_CUP(GEN A, GEN *R, GEN *C, GEN *U, GEN *P, ulong p) {
+  long m = nbrows(A), m1, n = lg(A) - 1, i, r1, r2, r;
+  GEN R1, C1, U1, P1, R2, C2, U2, P2;
+  GEN A1, A2, B2, C21, U11, U12, T21, T22;
+  pari_sp av = avma;
+
+  if (m < Flm_CUP_LIMIT || n < Flm_CUP_LIMIT)
+    /* destroy A; not called at the outermost recursion level */
+    return Flm_CUP_gauss(A, R, C, U, P, p);
+  m1 = (minss(m, n) + 1)/2;
+  A1 = rowslice(A, 1, m1);
+  A2 = rowslice(A, m1 + 1, m);
+  r1 = Flm_CUP(A1, &R1, &C1, &U1, &P1, p);
+  if (r1 == 0) {
+    r2 = Flm_CUP(A2, &R2, &C2, &U2, &P2, p);
+    *R = cgetg(r2 + 1, t_VECSMALL);
+    for (i = 1; i <= r2; i++)
+      (*R)[i] = R2[i] + m1;
+    *C = vconcat(zero_Flm(m1, r2), C2);
+    *U = U2;
+    *P = P2;
+    r = r2;
+  } else {
+    U11 = vecslice(U1, 1, r1);
+    U12 = vecslice(U1, r1 + 1, n);
+    T21 = vecslicepermute(A2, P1, 1, r1);
+    T22 = vecslicepermute(A2, P1, r1 + 1, n);
+    C21 = Flm_lsolve_upper(U11, T21, p);
+    if (gc_needed(av, 1))
+      gerepileall(av, 7, &R1, &C1, &P1, &U11, &U12, &T22, &C21);
+    B2 = Flm_sub(T22, Flm_mul(C21, U12, p), p);
+    r2 = Flm_CUP(B2, &R2, &C2, &U2, &P2, p);
+    r = r1 + r2;
+    *R = cgetg(r + 1, t_VECSMALL);
+    for (i = 1; i <= r1; i++)
+      (*R)[i] = R1[i];
+    for (; i <= r; i++)
+      (*R)[i] = R2[i - r1] + m1;
+    *C = shallowconcat(vconcat(C1, C21),
+                       vconcat(zero_Flm(m1, r2), C2));
+    *U = shallowconcat(vconcat(U11, zero_Flm(r2, r1)),
+                       vconcat(vecpermute(U12, P2), U2));
+    *P = cgetg(n + 1, t_VECSMALL);
+    for (i = 1; i <= r1; i++)
+      (*P)[i] = P1[i];
+    for (; i <= n; i++)
+      (*P)[i] = P1[P2[i - r1] + r1];
+  }
+  if (gc_needed(av, 1))
+    gerepileall(av, 4, R, C, U, P);
+  return r;
+}
+
+static ulong
+Flm_echelon_gauss(GEN A, GEN *R, GEN *C, ulong p) {
+  GEN U, P;
+  return Flm_CUP_gauss(A, R, C, &U, &P, p);
+}
+
+/* column echelon form */
+static ulong
+Flm_echelon(GEN A, GEN *R, GEN *C, ulong p) {
+  long i, j, j1, j2, m = nbrows(A), n = lg(A) - 1, n1, r, r1, r2;
+  GEN A1, A2, R1, R1c, C1, R2, C2;
+  GEN A12, A22, B2, C11, C21, M12;
+  pari_sp av = avma;
+
+  if (m < Flm_CUP_LIMIT || n < Flm_CUP_LIMIT)
+    return Flm_echelon_gauss(Flm_copy(A), R, C, p);
+
+  n1 = (n + 1)/2;
+  A1 = vecslice(A, 1, n1);
+  A2 = vecslice(A, n1 + 1, n);
+  r1 = Flm_echelon(A1, &R1, &C1, p);
+  if (!r1)
+    return Flm_echelon(A2, R, C, p);
+  else if (r1 == m) {
+    *R = R1;
+    *C = C1;
+    return r1;
+  }
+  R1c = indexcompl(R1, m);
+  C11 = rowpermute(C1, R1);
+  C21 = rowpermute(C1, R1c);
+  if (R1[r1] != r1) {
+    for (j = 1; j <= r1; j++)
+      for (i = 1; i <= R1[j] - j; i++)
+        ucoeff(C21, i, j) = 0;
+  }
+  A12 = rowpermute(A2, R1);
+  A22 = rowpermute(A2, R1c);
+  M12 = Flm_rsolve_lower_unit(C11, A12, p);
+  B2 = Flm_sub(A22, Flm_mul(C21, M12, p), p);
+  r2 = Flm_echelon(B2, &R2, &C2, p);
+  if (!r2) {
+    *R = R1;
+    *C = C1;
+    r = r1;
+  } else {
+    R2 = perm_mul(R1c, R2);
+    C2 = rowpermute(vconcat(zero_Flm(r1, r2), C2),
+                    perm_inv(vecsmall_concat(R1, R1c)));
+    r = r1 + r2;
+    *R = cgetg(r + 1, t_VECSMALL);
+    *C = cgetg(r + 1, t_MAT);
+    for (j = j1 = j2 = 1; j <= r; j++) {
+      if (j2 > r2 || (j1 <= r1 && R1[j1] < R2[j2])) {
+        gel(*C, j) = gel(C1, j1);
+        (*R)[j] = R1[j1++];
+      } else {
+        gel(*C, j) = gel(C2, j2);
+        (*R)[j] = R2[j2++];
+      }
+    }
+  }
+  if (gc_needed(av, 1))
+    gerepileall(av, 2, R, C);
+  return r;
+}
+
+static GEN
+FlxqM_rsolve_upper_1(GEN U, GEN B, GEN T, ulong p) {
+  return FlxqM_Flxq_mul(B, Flxq_inv(gcoeff(U, 1, 1), T, p), T, p);
+}
+
+static GEN
+FlxqM_rsolve_upper_2(GEN U, GEN B, GEN T, ulong p) {
+  GEN a = gcoeff(U, 1, 1), b = gcoeff(U, 1, 2), d = gcoeff(U, 2, 2);
+  GEN D = Flxq_mul(a, d, T, p), Dinv = Flxq_inv(D, T, p);
+  GEN ainv = Flxq_mul(d, Dinv, T, p), dinv = Flxq_mul(a, Dinv, T, p);
+  GEN B1 = rowslice(B, 1, 1);
+  GEN B2 = rowslice(B, 2, 2);
+  GEN X2 = FlxqM_Flxq_mul(B2, dinv, T, p);
+  GEN X1 = FlxqM_Flxq_mul(FlxM_sub(B1, FlxqM_Flxq_mul(X2, b, T, p), p),
+                          ainv, T, p);
+  return vconcat(X1, X2);
+}
+
+/* solve U*X = B,  U upper triangular and invertible */
+static GEN
+FlxqM_rsolve_upper(GEN U, GEN B, GEN T, ulong p) {
+  long n = lg(U) - 1, n1;
+  GEN U1, U2, U11, U12, U22;
+  GEN B1, B2, X1, X2, X;
+  pari_sp av = avma;
+
+  if (n == 0)
+    return B;
+  if (n == 1)
+    return FlxqM_rsolve_upper_1(U, B, T, p);
+  if (n == 2)
+    return FlxqM_rsolve_upper_2(U, B, T, p);
+  n1 = (n + 1)/2;
+  U1 = vecslice(U, 1, n1);
+  U2 = vecslice(U, n1 + 1, n);
+  U11 = rowslice(U1, 1, n1);
+  U12 = rowslice(U2, 1, n1);
+  U22 = rowslice(U2, n1 + 1, n);
+  B1 = rowslice(B, 1, n1);
+  B2 = rowslice(B, n1 + 1, n);
+  X2 = FlxqM_rsolve_upper(U22, B2, T, p);
+  if (gc_needed(av, 1))
+    gerepileall(av, 4, &B1, &U11, &U12, &X2);
+  B1 = FlxM_sub(B1, FlxqM_mul(U12, X2, T, p), p);
+  if (gc_needed(av, 1))
+    gerepileall(av, 3, &B1, &U11, &X2);
+  X1 = FlxqM_rsolve_upper(U11, B1, T, p);
+  X = vconcat(X1, X2);
+  if (gc_needed(av, 1))
+    X = gerepilecopy(av, X);
+  return X;
+}
+
+static GEN
+FlxqM_lsolve_upper_1(GEN U, GEN B, GEN T, ulong p) {
+  return FlxqM_Flxq_mul(B, Flxq_inv(gcoeff(U, 1, 1), T, p), T, p);
+}
+
+static GEN
+FlxqM_lsolve_upper_2(GEN U, GEN B, GEN T, ulong p) {
+  GEN a = gcoeff(U, 1, 1), b = gcoeff(U, 1, 2), d = gcoeff(U, 2, 2);
+  GEN D = Flxq_mul(a, d, T, p), Dinv = Flxq_inv(D, T, p);
+  GEN ainv = Flxq_mul(d, Dinv, T, p), dinv = Flxq_mul(a, Dinv, T, p);
+  GEN B1 = vecslice(B, 1, 1);
+  GEN B2 = vecslice(B, 2, 2);
+  GEN X1 = FlxqM_Flxq_mul(B1, ainv, T, p);
+  GEN X2 = FlxqM_Flxq_mul(FlxM_sub(B2, FlxqM_Flxq_mul(X1, b, T, p), p),
+                          dinv, T, p);
+  return shallowconcat(X1, X2);
+}
+
+/* solve X*U = B,  U upper triangular and invertible */
+static GEN
+FlxqM_lsolve_upper(GEN U, GEN B, GEN T, ulong p) {
+  long n = lg(U) - 1, n1;
+  GEN U1, U2, U11, U12, U22;
+  GEN B1, B2, X1, X2, X;
+  pari_sp av = avma;
+
+  if (n == 0)
+    return B;
+  if (n == 1)
+    return FlxqM_lsolve_upper_1(U, B, T, p);
+  if (n == 2)
+    return FlxqM_lsolve_upper_2(U, B, T, p);
+  n1 = (n + 1)/2;
+  U1 = vecslice(U, 1, n1);
+  U2 = vecslice(U, n1 + 1, n);
+  U11 = rowslice(U1, 1, n1);
+  U12 = rowslice(U2, 1, n1);
+  U22 = rowslice(U2, n1 + 1, n);
+  B1 = vecslice(B, 1, n1);
+  B2 = vecslice(B, n1 + 1, n);
+  X1 = FlxqM_lsolve_upper(U11, B1, T, p);
+  if (gc_needed(av, 1))
+    gerepileall(av, 4, &B2, &U12, &U22, &X1);
+  B2 = FlxM_sub(B2, FlxqM_mul(X1, U12, T, p), p);
+  if (gc_needed(av, 1))
+    gerepileall(av, 3, &B2, &U22, &X1);
+  X2 = FlxqM_lsolve_upper(U22, B2, T, p);
+  X = shallowconcat(X1, X2);
+  if (gc_needed(av, 1))
+    X = gerepilecopy(av, X);
+  return X;
+}
+
+static GEN
+FlxqM_rsolve_lower_unit_1(GEN L, GEN A, GEN T, ulong p) {
+  return rowslice(A, 1, 1);
+}
+
+static GEN
+FlxqM_rsolve_lower_unit_2(GEN L, GEN A, GEN T, ulong p) {
+  GEN X1 = rowslice(A, 1, 1);
+  GEN X2 = FlxM_sub(rowslice(A, 2, 2),
+                    FlxqM_Flxq_mul(X1, gcoeff(L, 2, 1), T, p), p);
+  return vconcat(X1, X2);
+}
+
+/*
+  solve L*X = A,  L lower triangular with ones on the diagonal
+  (at least as many rows as columns)
+*/
+static GEN
+FlxqM_rsolve_lower_unit(GEN L, GEN A, GEN T, ulong p) {
+  long m = lg(L) - 1, m1, n;
+  GEN L1, L2, L11, L21, L22, A1, A2, X1, X2, X;
+  pari_sp av = avma;
+
+  if (m == 0)
+    return zeromat(0, lg(A) - 1);
+  if (m == 1)
+    return FlxqM_rsolve_lower_unit_1(L, A, T, p);
+  if (m == 2)
+    return FlxqM_rsolve_lower_unit_2(L, A, T, p);
+  m1 = (m + 1)/2;
+  n = nbrows(L);
+  L1 = vecslice(L, 1, m1);
+  L2 = vecslice(L, m1 + 1, m);
+  L11 = rowslice(L1, 1, m1);
+  L21 = rowslice(L1, m1 + 1, n);
+  L22 = rowslice(L2, m1 + 1, n);
+  A1 = rowslice(A, 1, m1);
+  A2 = rowslice(A, m1 + 1, n);
+  X1 = FlxqM_rsolve_lower_unit(L11, A1, T, p);
+  if (gc_needed(av, 1))
+    gerepileall(av, 4, &A2, &L21, &L22, &X1);
+  A2 = FlxM_sub(A2, FlxqM_mul(L21, X1, T, p), p);
+  if (gc_needed(av, 1))
+    gerepileall(av, 3, &A2, &L22, &X1);
+  X2 = FlxqM_rsolve_lower_unit(L22, A2, T, p);
+  X = vconcat(X1, X2);
+  if (gc_needed(av, 1))
+    X = gerepilecopy(av, X);
+  return X;
+}
+
+static GEN
+FlxqM_lsolve_lower_unit_2(GEN L, GEN A, GEN T, ulong p) {
+  GEN X2 = vecslice(A, 2, 2);
+  GEN X1 = FlxM_sub(vecslice(A, 1, 1),
+                    FlxqM_Flxq_mul(X2, gcoeff(L, 2, 1), T, p), p);
+  return shallowconcat(X1, X2);
+}
+
+/* solve L*X = A,  L square lower triangular with ones on the diagonal */
+static GEN
+FlxqM_lsolve_lower_unit(GEN L, GEN A, GEN T, ulong p) {
+  long m = lg(L) - 1, m1;
+  GEN L1, L2, L11, L21, L22, A1, A2, X1, X2, X;
+  pari_sp av = avma;
+
+  if (m <= 1)
+    return A;
+  if (m == 2)
+    return FlxqM_lsolve_lower_unit_2(L, A, T, p);
+  m1 = (m + 1)/2;
+  L1 = vecslice(L, 1, m1);
+  L2 = vecslice(L, m1 + 1, m);
+  L11 = rowslice(L1, 1, m1);
+  L21 = rowslice(L1, m1 + 1, m);
+  L22 = rowslice(L2, m1 + 1, m);
+  A1 = vecslice(A, 1, m1);
+  A2 = vecslice(A, m1 + 1, m);
+  X2 = FlxqM_lsolve_lower_unit(L22, A2, T, p);
+  if (gc_needed(av, 1))
+    gerepileall(av, 4, &A1, &L11, &L21, &X2);
+  A1 = FlxM_sub(A1, FlxqM_mul(X2, L21, T, p), p);
+  if (gc_needed(av, 1))
+    gerepileall(av, 3, &A1, &L11, &X2);
+  X1 = FlxqM_lsolve_lower_unit(L11, A1, T, p);
+  X = shallowconcat(X1, X2);
+  if (gc_needed(av, 1))
+    X = gerepilecopy(av, X);
+  return X;
+}
+
+/* destroy A */
+static long
+FlxqM_CUP_gauss(GEN A, GEN *R, GEN *C, GEN *U, GEN *P, GEN T, ulong p) {
+  long i, j, k, m = nbrows(A), n = lg(A) - 1, pr, pc;
+  pari_sp av;
+  GEN u, v;
+
+  *P = identity_perm(n);
+  *R = cgetg(m + 1, t_VECSMALL);
+  av = avma;
+  for (j = 1, pr = 0; j <= n; j++) {
+    for (pr++, pc = 0; pr <= m; pr++) {
+      for (k = j; k <= n; k++) {
+        v = Flx_rem(gcoeff(A, pr, k), T, p);
+        gcoeff(A, pr, k) = v;
+        if (!pc && lgpol(v) > 0)
+          pc = k;
+      }
+      if (pc)
+        break;
+    }
+    if (!pc)
+      break;
+    (*R)[j] = pr;
+    if (pc != j) {
+      swap(gel(A, j), gel(A, pc));
+      lswap((*P)[j], (*P)[pc]);
+    }
+    if (pr != j) {
+      for (k = j; k <= n; k++)
+        gcoeff(A, j, k) = gcoeff(A, pr, k);
+    }
+    u = Flxq_inv(gcoeff(A, j, j), T, p);
+    for (i = pr + 1; i <= m; i++) {
+      v = Flxq_mul(gcoeff(A, i, j), u, T, p);
+      gcoeff(A, i, j) = v;
+      for (k = j + 1; k <= n; k++) {
+        gcoeff(A, i, k) = Flx_sub(gcoeff(A, i, k),
+                                  Flx_mul(gcoeff(A, j, k), v, p), p);
+      }
+    }
+  }
+  setlg(*R, j);
+  *C = vecslice(A, 1, j - 1);
+  *U = rowslice(A, 1, j - 1);
+  if (gc_needed(av, 1))
+    gerepileall(av, 2, C, U);
+  return j - 1;
+}
+
+static const long FlxqM_CUP_LIMIT = 5;
+
+static ulong
+FlxqM_CUP(GEN A, GEN *R, GEN *C, GEN *U, GEN *P, GEN T, ulong p) {
+  long m = nbrows(A), m1, n = lg(A) - 1, i, r1, r2, r, sv;
+  GEN R1, C1, U1, P1, R2, C2, U2, P2;
+  GEN A1, A2, B2, C21, U11, U12, T21, T22;
+  pari_sp av = avma;
+
+  if (m < FlxqM_CUP_LIMIT || n < FlxqM_CUP_LIMIT)
+    /* destroy A; not called at the outermost recursion level */
+    return FlxqM_CUP_gauss(A, R, C, U, P, T, p);
+  sv = get_Flx_var(T);
+  m1 = (minss(m, n) + 1)/2;
+  A1 = rowslice(A, 1, m1);
+  A2 = rowslice(A, m1 + 1, m);
+  r1 = FlxqM_CUP(A1, &R1, &C1, &U1, &P1, T, p);
+  if (r1 == 0) {
+    r2 = FlxqM_CUP(A2, &R2, &C2, &U2, &P2, T, p);
+    *R = cgetg(r2 + 1, t_VECSMALL);
+    for (i = 1; i <= r2; i++)
+      (*R)[i] = R2[i] + m1;
+    *C = vconcat(zero_FlxM(m1, r2, sv), C2);
+    *U = U2;
+    *P = P2;
+    r = r2;
+  } else {
+    U11 = vecslice(U1, 1, r1);
+    U12 = vecslice(U1, r1 + 1, n);
+    T21 = vecslicepermute(A2, P1, 1, r1);
+    T22 = vecslicepermute(A2, P1, r1 + 1, n);
+    C21 = FlxqM_lsolve_upper(U11, T21, T, p);
+    if (gc_needed(av, 1))
+      gerepileall(av, 7, &R1, &C1, &P1, &U11, &U12, &T22, &C21);
+    B2 = FlxM_sub(T22, FlxqM_mul(C21, U12, T, p), p);
+    r2 = FlxqM_CUP(B2, &R2, &C2, &U2, &P2, T, p);
+    r = r1 + r2;
+    *R = cgetg(r + 1, t_VECSMALL);
+    for (i = 1; i <= r1; i++)
+      (*R)[i] = R1[i];
+    for (; i <= r; i++)
+      (*R)[i] = R2[i - r1] + m1;
+    *C = shallowmatconcat(mkmat2(mkcol2(C1, C21),
+                                 mkcol2(zero_FlxM(m1, r2, sv), C2)));
+    *U = shallowmatconcat(mkmat2(mkcol2(U11, zero_FlxM(r2, r1, sv)),
+                                 mkcol2(vecpermute(U12, P2), U2)));
+    *P = cgetg(n + 1, t_VECSMALL);
+    for (i = 1; i <= r1; i++)
+      (*P)[i] = P1[i];
+    for (; i <= n; i++)
+      (*P)[i] = P1[P2[i - r1] + r1];
+  }
+  if (gc_needed(av, 1))
+    gerepileall(av, 4, R, C, U, P);
+  return r;
+}
+
+static ulong
+FlxqM_echelon_gauss(GEN A, GEN *R, GEN *C, GEN T, ulong p) {
+  GEN U, P;
+  return FlxqM_CUP_gauss(A, R, C, &U, &P, T, p);
+}
+
+/* column echelon form */
+static ulong
+FlxqM_echelon(GEN A, GEN *R, GEN *C, GEN T, ulong p) {
+  long i, j, j1, j2, m = nbrows(A), n = lg(A) - 1, n1, r, r1, r2;
+  GEN A1, A2, R1, R1c, C1, R2, C2;
+  GEN A12, A22, B2, C11, C21, M12;
+  pari_sp av = avma;
+
+  if (m < FlxqM_CUP_LIMIT || n < FlxqM_CUP_LIMIT)
+    return FlxqM_echelon_gauss(shallowcopy(A), R, C, T, p);
+
+  n1 = (n + 1)/2;
+  A1 = vecslice(A, 1, n1);
+  A2 = vecslice(A, n1 + 1, n);
+  r1 = FlxqM_echelon(A1, &R1, &C1, T, p);
+  if (!r1)
+    return FlxqM_echelon(A2, R, C, T, p);
+  else if (r1 == m) {
+    *R = R1;
+    *C = C1;
+    return r1;
+  }
+  R1c = indexcompl(R1, m);
+  C11 = rowpermute(C1, R1);
+  C21 = rowpermute(C1, R1c);
+  if (R1[r1] != r1) {
+    GEN zero = zero_Flx(get_Flx_var(T));
+    for (j = 1; j <= r1; j++)
+      for (i = 1; i <= R1[j] - j; i++)
+        gcoeff(C21, i, j) = zero;
+  }
+  A12 = rowpermute(A2, R1);
+  A22 = rowpermute(A2, R1c);
+  M12 = FlxqM_rsolve_lower_unit(C11, A12, T, p);
+  B2 = FlxM_sub(A22, FlxqM_mul(C21, M12, T, p), p);
+  r2 = FlxqM_echelon(B2, &R2, &C2, T, p);
+  if (!r2) {
+    *R = R1;
+    *C = C1;
+    r = r1;
+  } else {
+    R2 = perm_mul(R1c, R2);
+    C2 = rowpermute(vconcat(zero_FlxM(r1, r2, get_Flx_var(T)), C2),
+                    perm_inv(vecsmall_concat(R1, R1c)));
+    r = r1 + r2;
+    *R = cgetg(r + 1, t_VECSMALL);
+    *C = cgetg(r + 1, t_MAT);
+    for (j = j1 = j2 = 1; j <= r; j++) {
+      if (j2 > r2 || (j1 <= r1 && R1[j1] < R2[j2])) {
+        gel(*C, j) = gel(C1, j1);
+        (*R)[j] = R1[j1++];
+      } else {
+        gel(*C, j) = gel(C2, j2);
+        (*R)[j] = R2[j2++];
+      }
+    }
+  }
+  if (gc_needed(av, 1))
+    gerepileall(av, 2, R, C);
+  return r;
+}
+
+
+/*******************************************************************/
+/*                                                                 */
 /*                    LINEAR ALGEBRA MODULO P                      */
 /*                                                                 */
 /*******************************************************************/
@@ -623,7 +1396,7 @@ _Fl_add_OK(GEN b, long k, long i, ulong p)
 }
 
 static GEN
-Flm_ker_sp_OK(GEN x, ulong p, long deplin)
+Flm_ker_gauss_OK(GEN x, ulong p, long deplin)
 {
   GEN y, c, d;
   long i, j, k, r, t, m, n;
@@ -698,15 +1471,15 @@ Flm_ker_sp_OK(GEN x, ulong p, long deplin)
 }
 
 /* in place, destroy x */
-GEN
-Flm_ker_sp(GEN x, ulong p, long deplin)
+static GEN
+Flm_ker_gauss(GEN x, ulong p, long deplin)
 {
   GEN y, c, d;
   long i, j, k, r, t, m, n;
   ulong a, pi;
   n = lg(x)-1;
   if (!n) return cgetg(1,t_MAT);
-  if (SMALL_ULONG(p)) return Flm_ker_sp_OK(x, p, deplin);
+  if (SMALL_ULONG(p)) return Flm_ker_gauss_OK(x, p, deplin);
   pi = get_Fl_red(p);
 
   m=nbrows(x); r=0;
@@ -807,10 +1580,75 @@ GEN
 F2m_ker(GEN x) { return F2m_ker_sp(F2m_copy(x), 0); }
 GEN
 F2m_deplin(GEN x) { return F2m_ker_sp(F2m_copy(x), 1); }
+
+static GEN
+Flm_ker_echelon(GEN x, ulong p, long pivots) {
+  pari_sp av = avma;
+  GEN R, Rc, C, C1, C2, S, K;
+  long n = lg(x) - 1, r;
+  r = Flm_echelon(Flm_transpose(x), &R, &C, p);
+  Rc = indexcompl(R, n);
+  C1 = rowpermute(C, R);
+  C2 = rowpermute(C, Rc);
+  if (R[r] != r) {
+    long i, j;
+    for (j = 1; j <= r; j++)
+      for (i = 1; i <= R[j] - j; i++)
+        ucoeff(C2, i, j) = 0;
+  }
+  S = Flm_lsolve_lower_unit(C1, C2, p);
+  K = vecpermute(shallowconcat(Flm_neg(S, p), matid_Flm(n - r)),
+                 perm_inv(vecsmall_concat(R, Rc)));
+  K = Flm_transpose(K);
+  if (pivots)
+    return gerepilecopy(av, mkvec2(K, R));
+  return gerepileupto(av, K);
+}
+
+static GEN
+Flm_deplin_echelon(GEN x, ulong p) {
+  pari_sp av = avma;
+  GEN R, Rc, C, C1, C2, s, v;
+  long i, j, n = lg(x) - 1, r;
+  r = Flm_echelon(Flm_transpose(x), &R, &C, p);
+  if (r == n) { avma = av; return NULL; }
+  Rc = indexcompl(R, n);
+  i = Rc[1];
+  C1 = rowpermute(C, R);
+  C2 = rowslice(C, i, i);
+  for (j = i; j <= r; j++)
+    ucoeff(C2, 1, j) = 0UL;
+  s = Flm_row(Flm_lsolve_lower_unit(C1, C2, p), 1);
+  v = vecpermute(vecsmall_concat(Flv_neg(s, p), vecsmall_ei(n - r, 1)),
+                 perm_inv(vecsmall_concat(R, Rc)));
+  return gerepileuptoleaf(av, v);
+}
+
+static GEN
+Flm_ker_i(GEN x, ulong p, long deplin, long inplace) {
+  if (lg(x) - 1 >= Flm_CUP_LIMIT && nbrows(x) >= Flm_CUP_LIMIT)
+    switch(deplin) {
+    case 0: return Flm_ker_echelon(x, p, 0);
+    case 1: return Flm_deplin_echelon(x, p);
+    case 2: return Flm_ker_echelon(x, p, 1);
+    }
+  return Flm_ker_gauss(inplace? x: Flm_copy(x), p, deplin);
+}
+
 GEN
-Flm_ker(GEN x, ulong p) { return Flm_ker_sp(Flm_copy(x), p, 0); }
+Flm_ker_sp(GEN x, ulong p, long deplin) {
+  return Flm_ker_i(x, p, deplin, 1);
+}
+
 GEN
-Flm_deplin(GEN x, ulong p) { return Flm_ker_sp(Flm_copy(x), p, 1); }
+Flm_ker(GEN x, ulong p) {
+  return Flm_ker_i(x, p, 0, 0);
+}
+
+GEN
+Flm_deplin(GEN x, ulong p) {
+  return Flm_ker_i(x, p, 1, 0);
+}
 
 ulong
 F2m_det_sp(GEN x) { return !F2m_ker_sp(x, 1); }
@@ -825,7 +1663,7 @@ F2m_det(GEN x)
 
 /* in place, destroy a, SMALL_ULONG(p) is TRUE */
 static ulong
-Flm_det_sp_OK(GEN a, long nbco, ulong p)
+Flm_det_gauss_OK(GEN a, long nbco, ulong p)
 {
   long i,j,k, s = 1;
   ulong q, x = 1;
@@ -871,14 +1709,15 @@ Flm_det_sp_OK(GEN a, long nbco, ulong p)
   if (s < 0 && x) x = p - x;
   return x;
 }
+
 /* in place, destroy a */
-ulong
-Flm_det_sp(GEN a, ulong p)
+static ulong
+Flm_det_gauss(GEN a, ulong p)
 {
   long i,j,k, s = 1, nbco = lg(a)-1;
   ulong pi, q, x = 1;
 
-  if (SMALL_ULONG(p)) return Flm_det_sp_OK(a, nbco, p);
+  if (SMALL_ULONG(p)) return Flm_det_gauss_OK(a, nbco, p);
   pi = get_Fl_red(p);
   for (i=1; i<nbco; i++)
   {
@@ -908,14 +1747,42 @@ Flm_det_sp(GEN a, ulong p)
   return Fl_mul(x, ucoeff(a,nbco,nbco), p);
 }
 
-ulong
-Flm_det(GEN x, ulong p)
-{
+static ulong
+Flm_det_CUP(GEN a, ulong p) {
+  GEN R, C, U, P;
+  long d, i, n = lg(a) - 1, r;
+  r = Flm_CUP(a, &R, &C, &U, &P, p);
+  if (r < n)
+    d = 0;
+  else {
+    d = perm_sign(P) == 1? 1: p-1;
+    for (i = 1; i <= n; i++)
+      d = Fl_mul(d, ucoeff(U, i, i), p);
+  }
+  return d;
+}
+
+static ulong
+Flm_det_i(GEN x, ulong p, long inplace) {
   pari_sp av = avma;
-  ulong d = Flm_det_sp(Flm_copy(x), p);
+  ulong d;
+  if (lg(x) - 1 >= Flm_CUP_LIMIT)
+    d = Flm_det_CUP(x, p);
+  else
+    d = Flm_det_gauss(inplace? x: Flm_copy(x), p);
   avma = av; return d;
 }
 
+ulong
+Flm_det_sp(GEN x, ulong p) {
+  return Flm_det_i(x, p, 1);
+}
+
+ulong
+Flm_det(GEN x, ulong p) {
+  return Flm_det_i(x, p, 0);
+}
+
 static GEN
 FpM_init(GEN a, GEN p, ulong *pp)
 {
@@ -1036,6 +1903,27 @@ Flm_gauss_pivot(GEN x, ulong p, long *rr)
 }
 
 static GEN
+Flm_pivots_CUP(GEN x, ulong p, long *rr) {
+  pari_sp av;
+  long i, n = lg(x) - 1, r;
+  GEN R, C, U, P, d = zero_zv(n);
+  av = avma;
+  r = Flm_CUP(x, &R, &C, &U, &P, p);
+  for(i = 1; i <= r; i++)
+    d[P[i]] = R[i];
+  avma = av;
+  *rr = n - r;
+  return d;
+}
+
+static GEN
+Flm_pivots(GEN x, ulong p, long *rr, long inplace) {
+  if (lg(x) - 1 >= Flm_CUP_LIMIT && nbrows(x) >= Flm_CUP_LIMIT)
+    return Flm_pivots_CUP(x, p, rr);
+  return Flm_gauss_pivot(inplace? x: Flm_copy(x), p, rr);
+}
+
+static GEN
 FpM_gauss_pivot_gen(GEN x, GEN p, long *rr)
 {
   void *E;
@@ -1052,7 +1940,7 @@ FpM_gauss_pivot(GEN x, GEN p, long *rr)
   {
   case 0: return FpM_gauss_pivot_gen(x, p, rr);
   case 2: return F2m_gauss_pivot(x, rr);
-  default:return Flm_gauss_pivot(x, pp, rr);
+  default:return Flm_pivots(x, pp, rr, 1);
   }
 }
 
@@ -1063,13 +1951,15 @@ FpM_image(GEN x, GEN p)
   GEN d = FpM_gauss_pivot(x,p,&r); /* d left on stack for efficiency */
   return image_from_pivot(x,d,r);
 }
+
 GEN
 Flm_image(GEN x, ulong p)
 {
   long r;
-  GEN d = Flm_gauss_pivot(Flm_copy(x),p,&r); /* d left on stack for efficiency */
+  GEN d = Flm_pivots(x, p, &r, 0); /* d left on stack for efficiency */
   return image_from_pivot(x,d,r);
 }
+
 GEN
 F2m_image(GEN x)
 {
@@ -1086,14 +1976,21 @@ FpM_rank(GEN x, GEN p)
   (void)FpM_gauss_pivot(x,p,&r);
   avma = av; return lg(x)-1 - r;
 }
+
 long
 Flm_rank(GEN x, ulong p)
 {
   pari_sp av = avma;
   long r;
-  (void)Flm_gauss_pivot(Flm_copy(x),p,&r);
+  if (lg(x) - 1 >= Flm_CUP_LIMIT && nbrows(x) >= Flm_CUP_LIMIT) {
+    GEN R, C;
+    r = Flm_echelon(x, &R, &C, p);
+    avma = av; return r;
+  }
+  (void) Flm_pivots(x, p, &r, 0);
   avma = av; return lg(x)-1 - r;
 }
+
 long
 F2m_rank(GEN x)
 {
@@ -1103,7 +2000,6 @@ F2m_rank(GEN x)
   avma = av; return lg(x)-1 - r;
 }
 
-
 static GEN
 FlxqM_gauss_pivot(GEN x, GEN T, ulong p, long *rr)
 {
@@ -1112,11 +2008,32 @@ FlxqM_gauss_pivot(GEN x, GEN T, ulong p, long *rr)
   return gen_Gauss_pivot(x, rr, E, S);
 }
 
+static GEN
+FlxqM_pivots_CUP(GEN x, GEN T, ulong p, long *rr) {
+  pari_sp av;
+  long i, n = lg(x) - 1, r;
+  GEN R, C, U, P, d = zero_zv(n);
+  av = avma;
+  r = FlxqM_CUP(x, &R, &C, &U, &P, T, p);
+  for(i = 1; i <= r; i++)
+    d[P[i]] = R[i];
+  avma = av;
+  *rr = n - r;
+  return d;
+}
+
+static GEN
+FlxqM_pivots(GEN x, GEN T, ulong p, long *rr) {
+  if (lg(x) - 1 >= FlxqM_CUP_LIMIT && nbrows(x) >= FlxqM_CUP_LIMIT)
+    return FlxqM_pivots_CUP(x, T, p, rr);
+  return FlxqM_gauss_pivot(x, T, p, rr);
+}
+
 GEN
 FlxqM_image(GEN x, GEN T, ulong p)
 {
   long r;
-  GEN d = FlxqM_gauss_pivot(x,T,p,&r); /* d left on stack for efficiency */
+  GEN d = FlxqM_pivots(x, T, p, &r); /* d left on stack for efficiency */
   return image_from_pivot(x,d,r);
 }
 
@@ -1125,17 +2042,47 @@ FlxqM_rank(GEN x, GEN T, ulong p)
 {
   pari_sp av = avma;
   long r;
-  (void)FlxqM_gauss_pivot(x,T,p,&r);
+  if (lg(x) - 1 >= FlxqM_CUP_LIMIT && nbrows(x) >= FlxqM_CUP_LIMIT) {
+    GEN R, C;
+    r = FlxqM_echelon(x, &R, &C, T, p);
+    avma = av; return r;
+  }
+  (void) FlxqM_pivots(x, T, p, &r);
   avma = av; return lg(x)-1 - r;
 }
-GEN
-FlxqM_det(GEN a, GEN T, ulong p)
+
+static GEN
+FlxqM_det_gen(GEN a, GEN T, ulong p)
 {
   void *E;
   const struct bb_field *S = get_Flxq_field(&E, T, p);
   return gen_det(a, E, S);
 }
 
+static GEN
+FlxqM_det_CUP(GEN a, GEN T, ulong p) {
+  pari_sp av = avma;
+  GEN R, C, U, P, d;
+  long i, n = lg(a) - 1, r, sv = get_Flx_var(T);
+  r = FlxqM_CUP(a, &R, &C, &U, &P, T, p);
+  if (r < n)
+    d = pol0_Flx(sv);
+  else {
+    d = mkvecsmall2(sv, perm_sign(P) == 1? 1: p - 1);
+    for (i = 1; i <= n; i++)
+      d = Flxq_mul(d, gcoeff(U, i, i), T, p);
+  }
+  return gerepileuptoleaf(av, d);
+}
+
+GEN
+FlxqM_det(GEN a, GEN T, ulong p) {
+  if (lg(a) - 1 >= FlxqM_CUP_LIMIT)
+    return FlxqM_det_CUP(a, T, p);
+  else
+    return FlxqM_det_gen(a, T, p);
+}
+
 GEN
 FlxqM_FlxqC_invimage(GEN A, GEN B, GEN T, ulong p) {
   void *E;
@@ -1164,13 +2111,50 @@ FlxqM_mul(GEN A, GEN B, GEN T, ulong p) {
   return gen_matmul(A, B, E, ff);
 }
 
-GEN
-FlxqM_invimage(GEN A, GEN B, GEN T, ulong p) {
+static GEN
+FlxqM_invimage_gen(GEN A, GEN B, GEN T, ulong p) {
   void *E;
   const struct bb_field *ff = get_Flxq_field(&E, T, p);
   return gen_matinvimage(A, B, E, ff);
 }
 
+static GEN
+FlxqM_invimage_CUP(GEN A, GEN B, GEN T, ulong p) {
+  pari_sp av = avma;
+  GEN R, Rc, C, U, P, B1, B2, C1, C2, X, Y, Z;
+  long r, sv = get_Flx_var(T);
+  r = FlxqM_CUP(A, &R, &C, &U, &P, T, p);
+  Rc = indexcompl(R, nbrows(B));
+  C1 = rowpermute(C, R);
+  C2 = rowpermute(C, Rc);
+  if (R[r] != r) {
+    long i, j;
+    GEN zero = zero_Flx(sv);
+    for (j = 1; j <= r; j++)
+      for (i = 1; i <= R[j] - j; i++)
+        gcoeff(C2, i, j) = zero;
+  }
+  B1 = rowpermute(B, R);
+  B2 = rowpermute(B, Rc);
+  Z = FlxqM_rsolve_lower_unit(C1, B1, T, p);
+  if (!gequal(FlxqM_mul(C2, Z, T, p), B2))
+    return NULL;
+  Y = vconcat(FlxqM_rsolve_upper(vecslice(U, 1, r), Z, T, p),
+              zero_FlxM(lg(A) - 1 - r, lg(B) - 1, sv));
+  X = rowpermute(Y, perm_inv(P));
+  return gerepilecopy(av, X);
+}
+
+GEN
+FlxqM_invimage(GEN A, GEN B, GEN T, ulong p) {
+  long nA = lg(A)-1, nB = lg(B)-1;
+
+  if (!nB) return cgetg(1, t_MAT);
+  if (nA + nB >= FlxqM_CUP_LIMIT && nbrows(B) >= FlxqM_CUP_LIMIT)
+    return FlxqM_invimage_CUP(A, B, T, p);
+  return FlxqM_invimage_gen(A, B, T, p);
+}
+
 GEN
 F2xqM_det(GEN a, GEN T)
 {
@@ -1395,7 +2379,7 @@ GEN
 FqM_deplin(GEN x, GEN T, GEN p) { return FqM_ker_i(x,T,p,1); }
 
 static GEN
-FlxqM_ker_i(GEN x, GEN T, ulong p, long deplin)
+FlxqM_ker_gen(GEN x, GEN T, ulong p, long deplin)
 {
   const struct bb_field *ff;
   void *E;
@@ -1405,6 +2389,66 @@ FlxqM_ker_i(GEN x, GEN T, ulong p, long deplin)
   return gen_ker(x,deplin, E, ff);
 }
 
+static GEN
+FlxqM_ker_echelon(GEN x, GEN T, ulong p) {
+  pari_sp av = avma;
+  GEN R, Rc, C, C1, C2, S, K;
+  long n = lg(x) - 1, r;
+  r = FlxqM_echelon(shallowtrans(x), &R, &C, T, p);
+  Rc = indexcompl(R, n);
+  C1 = rowpermute(C, R);
+  C2 = rowpermute(C, Rc);
+  if (R[r] != r) {
+    long i, j;
+    GEN zero = zero_Flx(get_Flx_var(T));
+    for (j = 1; j <= r; j++)
+      for (i = 1; i <= R[j] - j; i++)
+        gcoeff(C2, i, j) = zero;
+  }
+  S = FlxqM_lsolve_lower_unit(C1, C2, T, p);
+  K = vecpermute(shallowconcat(FlxM_neg(S, p), matid_FlxqM(n - r, T, p)),
+                 perm_inv(vecsmall_concat(R, Rc)));
+  K = shallowtrans(K);
+  return gerepilecopy(av, K);
+}
+
+static GEN
+col_ei_FlxC(long n, long i, long sv) {
+  GEN v = zero_FlxC(n, sv);
+  gel(v, i) = pol1_Flx(sv);
+  return v;
+}
+
+static GEN
+FlxqM_deplin_echelon(GEN x, GEN T, ulong p) {
+  pari_sp av = avma;
+  GEN R, Rc, C, C1, C2, s, v;
+  long i, j, n = lg(x) - 1, r, sv = get_Flx_var(T);
+  r = FlxqM_echelon(shallowtrans(x), &R, &C, T, p);
+  if (r == n) { avma = av; return NULL; }
+  Rc = indexcompl(R, n);
+  i = Rc[1];
+  C1 = rowpermute(C, R);
+  C2 = rowslice(C, i, i);
+  if (i <= r) {
+    GEN zero = zero_Flx(sv);
+    for (j = i; j <= r; j++)
+      gcoeff(C2, 1, j) = zero;
+  }
+  s = row(FlxqM_lsolve_lower_unit(C1, C2, T, p), 1);
+  settyp(s, t_COL);
+  v = vecpermute(shallowconcat(FlxC_neg(s, p), col_ei_FlxC(n - r, 1, sv)),
+                 perm_inv(vecsmall_concat(R, Rc)));
+  return gerepilecopy(av, v);
+}
+
+static GEN
+FlxqM_ker_i(GEN x, GEN T, ulong p, long deplin) {
+  if (lg(x) - 1 >= FlxqM_CUP_LIMIT && nbrows(x) >= FlxqM_CUP_LIMIT)
+    return deplin? FlxqM_deplin_echelon(x, T, p): FlxqM_ker_echelon(x, T, p);
+  return FlxqM_ker_gen(x, T, p, deplin);
+}
+
 GEN
 FlxqM_ker(GEN x, GEN T, ulong p)
 {
@@ -1694,6 +2738,8 @@ init_gauss(GEN a, GEN *b, long *aco, long *li, int *iscol)
   return 1;
 }
 
+static GEN Flm_inv_sp(GEN a, ulong *detp, ulong p);
+
 static int
 is_modular_solve(GEN a, GEN b, GEN *u)
 {
@@ -1714,7 +2760,7 @@ is_modular_solve(GEN a, GEN b, GEN *u)
       if (a) a = F2m_to_mod(a);
       break;
     default:
-      a = Flm_inv(a,pp);
+      a = Flm_inv_sp(a, NULL, pp);
       if (a) a = Flm_to_mod(a, pp);
     }
   }
@@ -2231,19 +3277,68 @@ Flm_gauss_sp(GEN a, GEN b, ulong *detp, ulong p)
   return u;
 }
 
+static GEN
+Flm_gauss_CUP(GEN a, GEN b, ulong *detp, ulong p) {
+  GEN R, C, U, P, X, Y;
+  long i, n = lg(a) - 1, r;
+  if (nbrows(a) < n || (r = Flm_CUP(a, &R, &C, &U, &P, p)) < n)
+    return NULL;
+  Y = Flm_rsolve_lower_unit(rowpermute(C, R),
+                            rowpermute(b, R), p);
+  X = rowpermute(Flm_rsolve_upper(U, Y, p),
+                 perm_inv(P));
+  if (detp) {
+    ulong d = perm_sign(P) == 1? 1: p-1;
+    for (i = 1; i <= r; i++)
+      d = Fl_mul(d, ucoeff(U, i, i), p);
+    *detp = d;
+  }
+  return X;
+}
+
 GEN
 Flm_gauss(GEN a, GEN b, ulong p) {
-  return Flm_gauss_sp(RgM_shallowcopy(a), RgM_shallowcopy(b), NULL, p);
+  pari_sp av = avma;
+  GEN x;
+  if (lg(a) - 1 >= Flm_CUP_LIMIT)
+    x = Flm_gauss_CUP(a, b, NULL, p);
+  else {
+    a = RgM_shallowcopy(a);
+    b = RgM_shallowcopy(b);
+    x = Flm_gauss_sp(a, b, NULL, p);
+  }
+  if (!x) { avma = av; return NULL; }
+  return gerepileupto(av, x);
 }
+
+static GEN
+Flm_inv_i(GEN a, ulong *detp, ulong p, long inplace) {
+  pari_sp av = avma;
+  long n = lg(a) - 1;
+  GEN b, x;
+  if (n == 0) return cgetg(1, t_MAT);
+  b = matid_Flm(nbrows(a));
+  if (n >= Flm_CUP_LIMIT)
+    x = Flm_gauss_CUP(a, b, detp, p);
+  else {
+    if (!inplace)
+      a = RgM_shallowcopy(a);
+    x = Flm_gauss_sp(a, b, detp, p);
+  }
+  if (!x) { avma = av; return NULL; }
+  return gerepileupto(av, x);
+}
+
 static GEN
 Flm_inv_sp(GEN a, ulong *detp, ulong p) {
-  if (lg(a) == 1) return cgetg(1,t_MAT);
-  return Flm_gauss_sp(a, matid_Flm(nbrows(a)), detp, p);
+  return Flm_inv_i(a, detp, p, 1);
 }
+
 GEN
 Flm_inv(GEN a, ulong p) {
-  return Flm_inv_sp(RgM_shallowcopy(a), NULL, p);
+  return Flm_inv_i(a, NULL, p, 0);
 }
+
 GEN
 Flm_Flc_gauss(GEN a, GEN b, ulong p) {
   pari_sp av = avma;
@@ -2339,6 +3434,26 @@ FlxqM_gauss_gen(GEN a, GEN b, GEN T, ulong p)
   const struct bb_field *S = get_Flxq_field(&E, T, p);
   return gen_Gauss(a, b, E, S);
 }
+
+static GEN
+FlxqM_gauss_CUP(GEN a, GEN b, GEN T, ulong p) {
+  GEN R, C, U, P, Y;
+  long n = lg(a) - 1, r;
+  if (nbrows(a) < n || (r = FlxqM_CUP(a, &R, &C, &U, &P, T, p)) < n)
+    return NULL;
+  Y = FlxqM_rsolve_lower_unit(rowpermute(C, R),
+                              rowpermute(b, R), T, p);
+  return rowpermute(FlxqM_rsolve_upper(U, Y, T, p),
+                    perm_inv(P));
+}
+
+static GEN
+FlxqM_gauss_i(GEN a, GEN b, GEN T, ulong p) {
+  if (lg(a) - 1 >= FlxqM_CUP_LIMIT)
+    return FlxqM_gauss_CUP(a, b, T, p);
+  return FlxqM_gauss_gen(a, b, T, p);
+}
+
 GEN
 FlxqM_gauss(GEN a, GEN b, GEN T, ulong p)
 {
@@ -2346,7 +3461,7 @@ FlxqM_gauss(GEN a, GEN b, GEN T, ulong p)
   long n = lg(a)-1;
   GEN u;
   if (!n || lg(b)==1) { avma = av; return cgetg(1, t_MAT); }
-  u = FlxqM_gauss_gen(a, b, T, p);
+  u = FlxqM_gauss_i(a, b, T, p);
   if (!u) { avma = av; return NULL; }
   return gerepilecopy(av, u);
 }
@@ -2356,7 +3471,7 @@ FlxqM_inv(GEN a, GEN T, ulong p)
   pari_sp av = avma;
   GEN u;
   if (lg(a) == 1) { avma = av; return cgetg(1, t_MAT); }
-  u = FlxqM_gauss_gen(a, matid_FlxqM(nbrows(a),T,p), T,p);
+  u = FlxqM_gauss_i(a, matid_FlxqM(nbrows(a),T,p), T,p);
   if (!u) { avma = av; return NULL; }
   return gerepilecopy(av, u);
 }
@@ -2366,7 +3481,7 @@ FlxqM_FlxqC_gauss(GEN a, GEN b, GEN T, ulong p)
   pari_sp av = avma;
   GEN u;
   if (lg(a) == 1) return cgetg(1, t_COL);
-  u = FlxqM_gauss_gen(a, mkmat(b), T, p);
+  u = FlxqM_gauss_i(a, mkmat(b), T, p);
   if (!u) { avma = av; return NULL; }
   return gerepilecopy(av, gel(u,1));
 }
@@ -2451,7 +3566,7 @@ ZM_gauss(GEN a, GEN b0)
   for(;;)
   {
     p = u_forprime_next(&S);
-    C = Flm_inv(ZM_to_Flm(a, p), p);
+    C = Flm_inv_sp(ZM_to_Flm(a, p), NULL, p);
     if (C) break;
     elim -= expu(p);
     if (elim < 0) return NULL;
@@ -2919,7 +4034,6 @@ gauss_pivot_ker(GEN x, GEN x0, GEN *dd, long *rr)
 static GEN FpM_gauss_pivot(GEN x, GEN p, long *rr);
 static GEN FqM_gauss_pivot(GEN x, GEN T, GEN p, long *rr);
 static GEN F2m_gauss_pivot(GEN x, long *rr);
-static GEN Flm_gauss_pivot(GEN x, ulong p, long *rry);
 
 /* r = dim ker(x).
  * Returns d:
@@ -3002,7 +4116,7 @@ ZM_pivots(GEN M0, long *rr)
       ulong p = u_forprime_next(&S);
       long rp;
       if (!p) pari_err_OVERFLOW("ZM_pivots [ran out of primes]");
-      d = Flm_gauss_pivot(ZM_to_Flm(M0, p), p, &rp);
+      d = Flm_pivots(ZM_to_Flm(M0, p), p, &rp, 1);
       if (rp == rmin) { rbest = rp; goto END; } /* maximal rank, return */
       if (rp < rbest) { /* save best r so far */
         rbest = rp;
@@ -3287,10 +4401,41 @@ inverseimage(GEN m, GEN v)
 }
 
 static GEN
+Flm_invimage_CUP(GEN A, GEN B, ulong p) {
+  pari_sp av = avma;
+  GEN R, Rc, C, U, P, B1, B2, C1, C2, X, Y, Z;
+  long r;
+  r = Flm_CUP(A, &R, &C, &U, &P, p);
+  Rc = indexcompl(R, nbrows(B));
+  C1 = rowpermute(C, R);
+  C2 = rowpermute(C, Rc);
+  if (R[r] != r) {
+    long i, j;
+    for (j = 1; j <= r; j++)
+      for (i = 1; i <= R[j] - j; i++)
+        ucoeff(C2, i, j) = 0UL;
+  }
+  B1 = rowpermute(B, R);
+  B2 = rowpermute(B, Rc);
+  Z = Flm_rsolve_lower_unit(C1, B1, p);
+  if (!gequal(Flm_mul(C2, Z, p), B2))
+    return NULL;
+  Y = vconcat(Flm_rsolve_upper(vecslice(U, 1, r), Z, p),
+              zero_Flm(lg(A) - 1 - r, lg(B) - 1));
+  X = rowpermute(Y, perm_inv(P));
+  return gerepileupto(av, X);
+}
+
+static GEN
 Flm_invimage_i(GEN A, GEN B, ulong p)
 {
   GEN d, x, X, Y;
   long i, j, nY, nA = lg(A)-1, nB = lg(B)-1;
+
+  if (!nB) return cgetg(1, t_MAT);
+  if (nA + nB >= Flm_CUP_LIMIT && nbrows(B) >= Flm_CUP_LIMIT)
+    return Flm_invimage_CUP(A, B, p);
+
   x = Flm_ker_sp(shallowconcat(Flm_neg(A,p), B), p, 0);
   /* AX = BY, Y in strict upper echelon form with pivots = 1.
    * We must find T such that Y T = Id_nB then X T = Z. This exists iff
@@ -3524,16 +4669,17 @@ FpM_suppl(GEN x, GEN p)
   init_suppl(x); d = FpM_gauss_pivot(x,p, &r);
   avma = av; return get_suppl(x,d,nbrows(x),r,&col_ei);
 }
+
 GEN
 Flm_suppl(GEN x, ulong p)
 {
   pari_sp av = avma;
   GEN d;
   long r;
-  init_suppl(x); d = Flm_gauss_pivot(Flm_copy(x),p, &r);
+  init_suppl(x); d = Flm_pivots(x, p, &r, 0);
   avma = av; return get_suppl(x,d,nbrows(x),r,&vecsmall_ei);
-
 }
+
 GEN
 F2m_suppl(GEN x)
 {
@@ -3756,15 +4902,17 @@ FpM_indexrank(GEN x, GEN p) {
   d = FpM_gauss_pivot(x,p,&r);
   avma = av; return indexrank0(lg(x)-1, r, d);
 }
+
 GEN
 Flm_indexrank(GEN x, ulong p) {
   pari_sp av = avma;
   long r;
   GEN d;
   init_indexrank(x);
-  d = Flm_gauss_pivot(Flm_copy(x),p,&r);
+  d = Flm_pivots(x, p, &r, 0);
   avma = av; return indexrank0(lg(x)-1, r, d);
 }
+
 GEN
 F2m_indexrank(GEN x) {
   pari_sp av = avma;
@@ -3892,7 +5040,7 @@ FlkM_inv(GEN M, GEN P, ulong p)
   GEN V = cgetg(l, t_VEC);
   for(i=1; i<l; i++)
   {
-    gel(V, i) = Flm_inv(FlxM_eval_powers_pre(M, Fl_powers_pre(uel(R,i), degpol(P), p, pi), p, pi), p);
+    gel(V, i) = Flm_inv_sp(FlxM_eval_powers_pre(M, Fl_powers_pre(uel(R,i), degpol(P), p, pi), p, pi), NULL, p);
     if (!gel(V, i)) return NULL;
   }
   return FlmV_recover(V,W,p);
@@ -4632,7 +5780,7 @@ ZM_det_i(GEN M, long n)
   p = 0; /* -Wall */
   while( cmpii(q, h) <= 0 && (p = u_forprime_next(&S)) )
   {
-    av2 = avma; Dp = Flm_det(ZM_to_Flm(M, p), p);
+    av2 = avma; Dp = Flm_det_sp(ZM_to_Flm(M, p), p);
     avma = av2;
     if (Dp) break;
     q = muliu(q, p);
@@ -4669,7 +5817,7 @@ ZM_det_i(GEN M, long n)
     Dp = umodiu(D, p);
     if (!Dp) continue;
     av2 = avma;
-    compp = Fl_div(Flm_det(ZM_to_Flm(M, p), p), Dp, p);
+    compp = Fl_div(Flm_det_sp(ZM_to_Flm(M, p), p), Dp, p);
     avma = av2;
     (void) Z_incremental_CRT(&comp, compp, &q, p);
   }
@@ -4709,7 +5857,7 @@ det(GEN a)
     {
     case 0: return gerepileupto(av, Fp_to_mod(FpM_det(a,p),p)); break;
     case 2: d = F2m_det(a); break;
-    default:d = Flm_det(a,pp); break;
+    default:d = Flm_det_sp(a, pp); break;
     }
     avma = av; return mkintmodu(d, pp);
   }
-- 
2.7.3

>From 2404f455d8ad02acf946a71bda6ac8386c83d7f1 Mon Sep 17 00:00:00 2001
From: Peter Bruin <P.J.Bruin@math.leidenuniv.nl>
Date: Thu, 16 Feb 2017 07:14:59 +0100
Subject: [PATCH 4/5] add tests

---
 src/test/32/mat | 32 ++++++++++++++++++++++++++++++++
 src/test/in/mat | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+)

diff --git a/src/test/32/mat b/src/test/32/mat
index 9a3854e..2eaaff7 100644
--- a/src/test/32/mat
+++ b/src/test/32/mat
@@ -463,4 +463,36 @@ Vecsmall([2, 2, 0, -4, -10])
 [;]
 [;]
 [;]
+1
+1
+1
+1
+1
+1
+1
+1
+1
+1
+1
+[]~
+[;]
+[;]
+[;]
+[;]
+1
+1
+1
+1
+1
+1
+1
+1
+1
+1
+1
+[]~
+[;]
+[;]
+[;]
+[;]
 Total time spent: 104
diff --git a/src/test/in/mat b/src/test/in/mat
index f4c124a..22da0ea 100644
--- a/src/test/in/mat
+++ b/src/test/in/mat
@@ -154,3 +154,36 @@ q = nextprime(p + 1);
           matker([p^v[1]*q^v[2], p^v[3]*q^v[4];
                   p^v[5]*q^v[6], p^v[7]*q^v[8]], 1));
 }
+
+test(x) =
+{
+   n = 22; r = 12;
+   M = matrix(n, n, i, j, random(x));
+   P = matrix(n, r, i, j, if(2*i >= 3*j + 8, random(x), 0*x));
+   N = P*matrix(r, n, i, j, random(x));
+   Q = N[,1..r];
+   R = mattranspose(P);
+   X = matsolve(M, N);
+   K = matker(N);
+   J = matimage(N);
+   print(M*X == N);
+   print(N*K == 0);
+   print(N*lindep(N) == 0);
+   print(R*lindep(R) == 0);
+   print(matimage(J) == J);
+   print(matrank(J) == r);
+   print(matrank(K) == n - r);
+   print(matrank(N) == r);
+   print(J^-1 * J == matid(r));
+   print(matdet(M^-1) == matdet(M)^-1);
+   print(P*matinverseimage(P, Q) == Q);
+   print(lindep(J));
+   print(matker(K));
+   print(matker(M));
+   print(matinverseimage(N, M));
+   print(matinverseimage(P, M));
+   iferr(N^-1, e,, errname(e) == "e_INV");
+   iferr(matsolve(N, M), e,, errname(e) == "e_INV");
+}
+test(Mod(1, 8161));
+test(ffgen(2017^3));
-- 
2.7.3

>From fb5eedb09e5021c80acca2bf59d2c1840549017c Mon Sep 17 00:00:00 2001
From: Peter Bruin <P.J.Bruin@math.leidenuniv.nl>
Date: Mon, 6 Mar 2017 12:01:15 +0100
Subject: [PATCH 5/5] update test output

---
 src/test/32/nf        |  2 +-
 src/test/32/rnfkummer | 49 ++++++++++++++++++++++++-------------------------
 2 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/src/test/32/nf b/src/test/32/nf
index 408f4e8..ae08b2c 100644
--- a/src/test/32/nf
+++ b/src/test/32/nf
@@ -7,7 +7,7 @@
 [1, [], []]
 20915648110955829231381594293324156411897455346679838307589120000
 571459344155975480004612560667633185714077696
-[54898, [54898], [[7, 0; 0, 1]]]
+[54898, [54898], [[17, 15; 0, 1]]]
 [26, [26], [[19, 15, 18, 5; 0, 1, 0, 0; 0, 0, 1, 0; 0, 0, 0, 1]]]
 [1, [], []]
 [3, 3]
diff --git a/src/test/32/rnfkummer b/src/test/32/rnfkummer
index a179c1e..6923dae 100644
--- a/src/test/32/rnfkummer
+++ b/src/test/32/rnfkummer
@@ -9,25 +9,24 @@ x^3 + (-455492756088*y^3 + 72746635392*y^2 + 30103312785*y - 532729124469)*x
  + (-4059137465271111971*y^3 + 652777677080133659*y^2 + 262952422699146889*y
  - 4742293189040084766)
 3
-x^5 + (-2672499745438128324166624371403175/229919*y^3 - 10809720543044940554
-5848004057516450/229919*y^2 + 1946314387168194463838027397756897075/229919*y
- + 74382434311247931914387696921552317765/229919)*x^3 + (5129305801893146709
-05707673158636170489515232134558120277188077675/229919*y^3 + 207207789641531
-13686108930191955787281793237860995272607395332803700/229919*y^2 - 375311020
-623795738763355889825099665721475609297058319117285472881825/229919*y - 1430
-3665596945640348838133533759481776279298319916718872305953671726070/229919)*
-x + (-2689189441675545313240964630807739710646668174590848349012607345577703
-904627145140/229919*y^3 - 10863446392599003895107039111912213295761589155826
-4911346511142382718596917639198030/229919*y^2 + 1967699497774698568707239554
-459334515216099647613836963121675081938147097045947197610/229919*y + 7499150
-4294653703573704123086807127325197706131462485408742266590128280995815043988
-790/229919)
+x^5 + (338906430826501923385346607707925/229919*y^3 - 1457798214872402183464
+6220260998400/229919*y^2 - 202240490266682646928234413452546325/229919*y + 9
+311818671232806283499820722904784405/229919)*x^3 + (-97462976500434619194588
+15391166956054900191851407121368976375225/229919*y^3 + 419234752507246797754
+440638037503934193147867526951273244896871800/229919*y^2 + 58160478163139363
+18046410689887796592739326364262547764388484603275/229919*y - 26779001130318
+2612612928927273255972900605400487205732555912057079030/229919)*x + (1383785
+9122129794978630768313810944358518067010816386533939954568750960963976700/22
+9919*y^3 - 59523232848017545627423707739853544444403314870247075668056585441
+8672632321333000/229919*y^2 - 8257663906803008790492508500548575219481689071
+133261565693002578393264223628510850/229919*y + 3802100637374297497910983850
+11734464444507295841035234635984867875452558433080616962/229919)
 4
 x^5 + 110*x^3 - 385*x^2 + (-6875*y + 13310)*x + (20625*y + 32197)
 5
-x^5 + 1497313810*x^3 - 25448437930974984193005709510660440*x^2 + 14522162707
-1192240665553733936628852906585*x + 4539521583971768757621570094824005976310
-535075196404282324328
+x^5 + 3004517690*x^3 - 28917505076965983010807657580*x^2 - 80540351781678421
+251251538235548863606045415*x + 31011472834856538584168163496443071927439210
+5424319898991797804
 6
 x^3 + (6*y^4 - 6*y^3 + 6*y^2 - 6*y - 6)*x + (18*y^5 + 22*y^3 + 16*y^2 + 6*y 
 + 2)
@@ -51,15 +50,15 @@ x^3 + (22049832980412*y - 801716887536675)*x + (321027751930293073102*y - 11
 10
 x^5 - 10*x^3 + 20*x + 10
 11
-x^5 + (51814919566041140/27*y^3 - 512130951433266020/27*y^2 - 12296256927237
-71840/9*y + 367446491842204360)*x^3 + (3414154904347712686931950/81*y^3 - 33
-744998817726993590949550/81*y^2 - 81021694609866878539643860/27*y + 80705149
-06462415911392980)*x^2 + (-15185325298365785928052588484315210/27*y^3 + 1500
-89494647008327107727889840385730/27*y^2 + 3603646651500540639118318823844391
-40/9*y - 107687024414631416745446738401726620)*x + (-57823304198737588116852
-2135931847246530148/81*y^3 + 5715169306872036759851381101453859510319964/81*
-y^2 + 13722113452314566641868897163626950825646644/27*y - 136685022077514790
-3609605218580985801810226)
+x^5 + (29680475428049860/9*y^3 + 34921354500260180/9*y^2 - 15477290427050761
+40/3*y - 2577261173805358400)*x^3 + (-12081003902993919799182130/81*y^3 - 14
+214245862026090743389440/81*y^2 + 629980400564212569840119770/27*y + 1165595
+99110534944226492390)*x^2 + (-41713649969867415576551797389166070/27*y^3 - 4
+9079370978551972052667611917864830/27*y^2 + 21752150855081820680445678911540
+81150/9*y + 1207381363290003193446783411707755890)*x + (-1624409998713015666
+0037949992862545643869538/81*y^3 - 19112453838427485951263179674887910269786
+026/81*y^2 + 847070715893511796620187790773814796016991384/27*y + 1567258647
+31306991463206177661754686074438714)
 12
 x^5 + (-81813555181352222150051800*y - 758707698512046332750174210)*x^3 + (-
 694598776821165938962621533366567137400*y - 64414440636774569791913727277961
-- 
2.7.3

Attachment: timings.gp
Description: Binary data

matdet([480, 480])
48
matrank([480, 480])
606
matrank([480, 320])
303
matindexrank([480, 480])
606
matindexrank([480, 320])
303
matker([480, 480])
130
matker([480, 320])
47
lindep([480, 480])
130
lindep([480, 320])
45
matimage([480, 480])
613
matimage([480, 320])
304
matsupplement([480, 480])
613
matsupplement([480, 320])
306
(t)->my(m=480,n=320,r=240,x=Mod(1, 2017));t^-1([480, 480])
475
(t)->my(m=480,n=320,r=240,x=Mod(1, 2017));matsolve(t,id)([480, 480])
472
matsolve([480, 480], [480, 320])
426
matinverseimage([480, 480], [480, 320])
434
matinverseimage([480, 320], [480, 240])
168
(t)->my(m=480,n=320,r=240,x=Mod(1, 2017));matinverseimage(t,id)([480, 480])
692
matdet([480, 480])
295
matrank([480, 480])
1347
matrank([480, 320])
670
matindexrank([480, 480])
1347
matindexrank([480, 320])
670
matker([480, 480])
555
matker([480, 320])
228
lindep([480, 480])
554
lindep([480, 320])
228
matimage([480, 480])
1356
matimage([480, 320])
674
matsupplement([480, 480])
1356
matsupplement([480, 320])
677
(t)->my(m=480,n=320,r=240,x=Mod(1, 9223372036854775837));t^-1([480, 480])
2851
(t)->my(m=480,n=320,r=240,x=Mod(1, 9223372036854775837));matsolve(t,id)([480, 480])
2840
matsolve([480, 480], [480, 320])
2006
matinverseimage([480, 480], [480, 320])
1574
matinverseimage([480, 320], [480, 240])
618
(t)->my(m=480,n=320,r=240,x=Mod(1, 9223372036854775837));matinverseimage(t,id)([480, 480])
2489
matdet([120, 120])
2097
matrank([120, 120])
121
matrank([120, 80])
59
matindexrank([120, 120])
120
matindexrank([120, 80])
59
matker([120, 120])
178
matker([120, 80])
74
lindep([120, 120])
178
lindep([120, 80])
74
matimage([120, 120])
122
matimage([120, 80])
61
matsupplement([120, 120])
124
matsupplement([120, 80])
61
(t)->my(m=120,n=80,r=60,x=x);t^-1([120, 120])
407
(t)->my(m=120,n=80,r=60,x=x);matsolve(t,id)([120, 120])
410
matsolve([120, 120], [120, 80])
395
matinverseimage([120, 120], [120, 80])
447
matinverseimage([120, 80], [120, 60])
176
(t)->my(m=120,n=80,r=60,x=x);matinverseimage(t,id)([120, 120])
444
matdet([120, 120])
119
(x)->matker(x,1)([120, 120])
936
msnew(msinit(800, 2, 1))
464
total time: 36182
matdet([480, 480])
52
matrank([480, 480])
54
matrank([480, 320])
30
matindexrank([480, 480])
54
matindexrank([480, 320])
29
matker([480, 480])
57
matker([480, 320])
32
lindep([480, 480])
54
lindep([480, 320])
28
matimage([480, 480])
63
matimage([480, 320])
33
matsupplement([480, 480])
63
matsupplement([480, 320])
35
(t)->my(m=480,n=320,r=240,x=Mod(1, 2017));t^-1([480, 480])
209
(t)->my(m=480,n=320,r=240,x=Mod(1, 2017));matsolve(t,id)([480, 480])
219
matsolve([480, 480], [480, 320])
163
matinverseimage([480, 480], [480, 320])
165
matinverseimage([480, 320], [480, 240])
70
(t)->my(m=480,n=320,r=240,x=Mod(1, 2017));matinverseimage(t,id)([480, 480])
221
matdet([480, 480])
114
matrank([480, 480])
120
matrank([480, 320])
62
matindexrank([480, 480])
114
matindexrank([480, 320])
59
matker([480, 480])
122
matker([480, 320])
62
lindep([480, 480])
120
lindep([480, 320])
56
matimage([480, 480])
123
matimage([480, 320])
64
matsupplement([480, 480])
123
matsupplement([480, 320])
66
(t)->my(m=480,n=320,r=240,x=Mod(1, 9223372036854775837));t^-1([480, 480])
440
(t)->my(m=480,n=320,r=240,x=Mod(1, 9223372036854775837));matsolve(t,id)([480, 480])
440
matsolve([480, 480], [480, 320])
334
matinverseimage([480, 480], [480, 320])
338
matinverseimage([480, 320], [480, 240])
143
(t)->my(m=480,n=320,r=240,x=Mod(1, 9223372036854775837));matinverseimage(t,id)([480, 480])
442
matdet([120, 120])
49
matrank([120, 120])
49
matrank([120, 80])
26
matindexrank([120, 120])
120
matindexrank([120, 80])
61
matker([120, 120])
50
matker([120, 80])
30
lindep([120, 120])
50
lindep([120, 80])
26
matimage([120, 120])
51
matimage([120, 80])
27
matsupplement([120, 120])
123
matsupplement([120, 80])
62
(t)->my(m=120,n=80,r=60,x=x);t^-1([120, 120])
151
(t)->my(m=120,n=80,r=60,x=x);matsolve(t,id)([120, 120])
152
matsolve([120, 120], [120, 80])
141
matinverseimage([120, 120], [120, 80])
142
matinverseimage([120, 80], [120, 60])
60
(t)->my(m=120,n=80,r=60,x=x);matinverseimage(t,id)([120, 120])
152
matdet([120, 120])
85
(x)->matker(x,1)([120, 120])
697
msnew(msinit(800, 2, 1))
314
total time: 7591