
The difference of two products and the sum of two products are two primitives found in a variety of common computations. diff_of_products(a, b, c, d) := ab - cd and sum_of_products(a, b, c, d) := ab + cd are closely related companion functions that differ only in the sign of some of their operands. Examples of the use of these primitives are:

Computation of a complex multiplication with x = (a + i b) and y = (c + i d):

x*y = diff_of_products (a, c, b, d) + i sum_of_products (a, d, b, c)
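
In code, this might look like the following sketch (the function name and output-parameter interface are mine; it assumes accurate diff_of_products() / sum_of_products() implementations such as those discussed in the answer below):

/* real and imaginary parts of (a + i*b) * (c + i*d); assumes accurate
   diff_of_products() / sum_of_products() primitives are available */
void complex_mul (float a, float b, float c, float d, float *re, float *im)
{
    *re = diff_of_products (a, c, b, d);  // a*c - b*d
    *im = sum_of_products (a, d, b, c);   // a*d + b*c
}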

Computation of the determinant of a 2x2 matrix: diff_of_products (a, d, b, c):

| a  b |
| c  d |

In a right triangle, computation of the length of the opposite cathetus from the hypotenuse h and the adjacent cathetus a: sqrt (diff_of_products (h, h, a, a))

Computation of the two real solutions of a quadratic equation with positive discriminant:

q = -(b + copysign (sqrt (diff_of_products (b, b, 4a, c)), b)) / 2
x0 = q / a
x1 = c / q
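
Spelled out as code, such a solver might look like this sketch (the function name is mine; it assumes a != 0, a positive discriminant, and an accurate diff_of_products() primitive like the ones in the answer below):

#include <math.h>

/* both real roots of a*x*x + b*x + c = 0; assumes a != 0, a positive
   discriminant, and an accurate diff_of_products() primitive */
void solve_quadratic (float a, float b, float c, float *x0, float *x1)
{
    float discr = diff_of_products (b, b, 4.0f * a, c);   // b*b - 4*a*c
    float q = -0.5f * (b + copysignf (sqrtf (discr), b)); // larger-magnitude numerator
    *x0 = q / a;   // this form avoids subtractive cancellation in either root
    *x1 = c / q;
}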

Computation of a 3D cross product a = b ⨯ c:

ax = diff_of_products (by, cz, bz, cy)
ay = diff_of_products (bz, cx, bx, cz)
az = diff_of_products (bx, cy, by, cx)
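
A corresponding sketch (the vector type and function name are mine), where each component is one diff_of_products() call:

typedef struct { float x, y, z; } vec3;

/* cross product a = b x c, computed component-wise with the accurate primitive */
vec3 cross3 (vec3 b, vec3 c)
{
    vec3 a;
    a.x = diff_of_products (b.y, c.z, b.z, c.y);
    a.y = diff_of_products (b.z, c.x, b.x, c.z);
    a.z = diff_of_products (b.x, c.y, b.y, c.x);
    return a;
}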

When computing with IEEE-754 binary floating-point formats, besides obvious issues with potential overflow and underflow, naive implementations of either function can suffer from catastrophic cancellation when the two products are similar in magnitude but of opposite sign for sum_of_products(), or of the same sign for diff_of_products().

Focusing only on the accuracy aspect, how can one implement these functions robustly in the context of IEEE-754 binary arithmetic? The availability of fused multiply-add operations can be assumed, as this operation is supported by most modern processor architectures and exposed, via standard functions, in many programming languages. Without loss of generality, discussion can be restricted to single precision (IEEE-754 binary32) format for ease of exposition and testing.


1 Answer


The utility of the fused multiply-add (FMA) operation in providing protection against subtractive cancellation stems from the participation of the full double-width product in the final addition. To my knowledge, the first public record of its utility for accurately and robustly computing the solutions of quadratic equations is a pair of informal notes by renowned floating-point expert William Kahan:

William Kahan, "Matlab’s Loss is Nobody’s Gain". August 1998, revised July 2004 (online)
William Kahan, "On the Cost of Floating-Point Computation Without Extra-Precise Arithmetic". November 2004 (online)

Higham's standard work on numerical computing was the first place where I encountered Kahan's algorithm applied to the computation of the determinant of a 2x2 matrix (p. 65):

Nicholas J. Higham, "Accuracy and Stability of Numerical Algorithms", SIAM 1996

A different algorithm for the computation of ab+cd, also based on FMA, was published by three Intel researchers in the context of Intel's first CPU with FMA support, the Itanium processor (p. 273):

Marius Cornea, John Harrison, and Ping Tak Peter Tang, "Scientific Computing on Itanium-based Systems", Intel Press 2002

In recent years, four papers by French researchers examined both algorithms in detail and provided error bounds with mathematical proofs. For binary floating-point arithmetic, provided there is no overflow or underflow in intermediate computations, the maximum relative error of both Kahan's algorithm and the Cornea-Harrison-Tang (CHT) algorithm was shown to be twice the unit round-off asymptotically, that is, 2u. For IEEE-754 binary32 or single precision this error bound is 2^-23, and for IEEE-754 binary64 or double precision it is 2^-52.

Furthermore, it was shown that the error of Kahan's algorithm is at most 1.5 ulp for binary floating-point arithmetic. I am not aware of an equivalent result in the literature, that is, a proven ulp error bound, for the CHT algorithm. My own experiments using the code below suggest an error bound of 1.25 ulp.

Sylvie Boldo, "Kahan’s algorithm for a correct discriminant computation at last formally proven", IEEE Transactions on Computers, Vol. 58, No. 2, February 2009, pp. 220-225 (online)

Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller, "Further Analysis of Kahan's Algorithm for the Accurate Computation of 2x2 Determinants", Mathematics of Computation, Vol. 82, No. 284, Oct. 2013, pp. 2245-2264 (online)

Jean-Michel Muller, "On the Error of Computing ab+cd using Cornea, Harrison and Tang's Method", ACM Transactions on Mathematical Software, Vol. 41, No. 2, January 2015, Article 7 (online)

Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software, Vol. 42, No. 3, May 2016, Article 19 (online)

While Kahan's algorithm requires four floating-point operations, two of which are FMAs, the CHT algorithm requires seven floating-point operations, two of which are FMAs. I constructed the test framework below to explore what other trade-offs may exist. I experimentally confirmed the bounds from the literature on the relative error of both algorithms and on the ulp error of Kahan's algorithm. My experiments indicate that the CHT algorithm provides a smaller ulp error bound of 1.25 ulp, but that it also produces incorrectly rounded results at roughly twice the rate of Kahan's algorithm.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <float.h>
#include <math.h>

#define TEST_SUM  (0)  // function under test. 0: a*b-c*d; 1: a*b+c*d 
#define USE_CHT   (0)  // algorithm. 0: Kahan; 1: Cornea-Harrison-Tang

/*
  Compute a*b-c*d with error <= 1.5 ulp. Maximum relative err = 2**-23

  Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller,
  "Further Analysis of Kahan's Algorithm for the Accurate Computation 
  of 2x2 Determinants", Mathematics of Computation, Vol. 82, No. 284, 
  Oct. 2013, pp. 2245-2264
*/
float diff_of_products_kahan (float a, float b, float c, float d)
{
    float w = d * c;              // rounded product c*d
    float e = fmaf (c, -d, w);    // e = w - c*d, computed exactly
    float f = fmaf (a, b, -w);    // f = a*b - w, rounded
    return f + e;                 // f + e = a*b - c*d
}

/*
  Compute a*b-c*d with error <= 1.25 ulp (?). Maximum relative err = 2**-23

  Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the 
  Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software
  Vol. 42, No. 3, Article 19 (May 2016).
*/
float diff_of_products_cht (float a, float b, float c, float d)
{
    float p1 = a * b;             // rounded product a*b
    float p2 = c * d;             // rounded product c*d
    float e1 = fmaf (a, b, -p1);  // e1 = a*b - p1, computed exactly
    float e2 = fmaf (c, -d, p2);  // e2 = p2 - c*d, computed exactly
    float r = p1 - p2;            // difference of the rounded products
    float e = e1 + e2;            // combined error term
    return r + e;                 // r + e = a*b - c*d
}

/*
  Compute a*b+c*d with error <= 1.5 ulp. Maximum relative err = 2**-23

  Jean-Michel Muller, "On the Error of Computing ab+cd using Cornea,
  Harrison and Tang's Method", ACM Transactions on Mathematical Software, 
  Vol. 41, No.2, Article 7, (January 2015)
*/
float sum_of_products_kahan (float a, float b, float c, float d)
{
    float w = c * d;              // rounded product c*d
    float e = fmaf (c, -d, w);    // e = w - c*d, computed exactly
    float f = fmaf (a, b, w);     // f = a*b + w, rounded
    return f - e;                 // f - e = a*b + c*d
}

/*
  Compute a*b+c*d with error <= 1.25 ulp (?). Maximum relative err = 2**-23

  Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the 
  Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software
  Vol. 42, No. 3, Article 19 (May 2016).
*/
float sum_of_products_cht (float a, float b, float c, float d)
{
    float p1 = a * b;             // rounded product a*b
    float p2 = c * d;             // rounded product c*d
    float e1 = fmaf (a, b, -p1);  // e1 = a*b - p1, computed exactly
    float e2 = fmaf (c, d, -p2);  // e2 = c*d - p2, computed exactly
    float r = p1 + p2;            // sum of the rounded products
    float e = e1 + e2;            // combined error term
    return r + e;                 // r + e = a*b + c*d
}

// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC  ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579)                     /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)

/* double-double operand: y holds the head (high-order part), x the tail
   (low-order part) */
typedef struct {
    double y;
    double x;
} dbldbl;

dbldbl make_dbldbl (double head, double tail)
{
    dbldbl z;
    z.x = tail;
    z.y = head;
    return z;
}

/* compute double-double sum z = a + b */
dbldbl add_dbldbl (dbldbl a, dbldbl b) {
    dbldbl z;
    double t1, t2, t3, t4, t5;
    t1 = a.y + b.y;
    t2 = t1 - a.y;
    t3 = (a.y + (t2 - t1)) + (b.y - t2);
    t4 = a.x + b.x;
    t2 = t4 - a.x;
    t5 = (a.x + (t2 - t4)) + (b.x - t2);
    t3 = t3 + t4;
    t4 = t1 + t3;
    t3 = (t1 - t4) + t3;
    t3 = t3 + t5;
    z.y = t4 + t3;
    z.x = (t4 - z.y) + t3;
    return z;
}

/* compute double-double difference z = a - b */
dbldbl sub_dbldbl (dbldbl a, dbldbl b)
{
    dbldbl z;
    double t1, t2, t3, t4, t5;
    t1 = a.y - b.y;
    t2 = t1 - a.y;
    t3 = (a.y + (t2 - t1)) - (b.y + t2);
    t4 = a.x - b.x;
    t2 = t4 - a.x;
    t5 = (a.x + (t2 - t4)) - (b.x + t2);
    t3 = t3 + t4;
    t4 = t1 + t3;
    t3 = (t1 - t4) + t3;
    t3 = t3 + t5;
    z.y = t4 + t3;
    z.x = (t4 - z.y) + t3;
    return z;
}

/* compute double-double product z = a * b, using FMA to capture rounding error */
dbldbl mul_dbldbl (dbldbl a, dbldbl b)
{
    dbldbl t, z;
    t.y = a.y * b.y;
    t.x = fma (a.y, b.y, -t.y);
    t.x = fma (a.x, b.x, t.x);
    t.x = fma (a.y, b.x, t.x);
    t.x = fma (a.x, b.y, t.x);
    z.y = t.y + t.x;
    z.x = (t.y - z.y) + t.x;
    return z;
}

/* compute a*b - c*d to roughly twice double precision, as a reference */
double prod_diff_ref (float a, float b, float c, float d)
{
    dbldbl t = sub_dbldbl (
        mul_dbldbl (make_dbldbl ((double)a, 0), make_dbldbl ((double)b, 0)),
        mul_dbldbl (make_dbldbl ((double)c, 0), make_dbldbl ((double)d, 0))
        );
    return t.x + t.y;
}

/* compute a*b + c*d to roughly twice double precision, as a reference */
double prod_sum_ref (float a, float b, float c, float d)
{
    dbldbl t = add_dbldbl (
        mul_dbldbl (make_dbldbl ((double)a, 0), make_dbldbl ((double)b, 0)),
        mul_dbldbl (make_dbldbl ((double)c, 0), make_dbldbl ((double)d, 0))
        );
    return t.x + t.y;
}

/* helpers to reinterpret bit patterns between integer and floating-point types */
float __uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint32_t __float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint64_t __double_as_uint64 (double a)
{
    uint64_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

static double floatUlpErr (float res, double ref)
{
    uint64_t i, j, err;
    int expoRef;
    
    /* ulp error cannot be computed if either operand is NaN, infinity, zero */
    if (isnan(res) || isnan (ref) || isinf(res) || isinf (ref) ||
        (res == 0.0f) || (ref == 0.0f)) {
        return 0.0;
    }
    /* Convert the float result to an "extended float". This is like a float
       with 56 instead of 24 effective mantissa bits.
    */
    i = ((uint64_t)__float_as_uint32(res)) << 32;
    /* Convert the double reference to an "extended float". If the reference is
       >= 2^129, we need to clamp to the maximum "extended float". If reference
       is < 2^-126, we need to denormalize because of float's limited exponent
       range.
    */
    expoRef = (int)(((__double_as_uint64(ref) >> 52) & 0x7ff) - 1023);
    if (expoRef >= 129) {
        j = (__double_as_uint64(ref) & 0x8000000000000000ULL) |
            0x7fffffffffffffffULL;
    } else if (expoRef < -126) {
        j = ((__double_as_uint64(ref) << 11) | 0x8000000000000000ULL) >> 8;
        j = j >> (-(expoRef + 126));
        j = j | (__double_as_uint64(ref) & 0x8000000000000000ULL);
    } else {
        j = ((__double_as_uint64(ref) << 11) & 0x7fffffffffffffffULL) >> 8;
        j = j | ((uint64_t)(expoRef + 127) << 55);
        j = j | (__double_as_uint64(ref) & 0x8000000000000000ULL);
    }
    err = (i < j) ? (j - i) : (i - j);
    return err / 4294967296.0;
}

int main (void)
{
    const float ULMT = sqrtf (FLT_MAX) / 2; // avoid overflow
    const float LLMT = sqrtf (FLT_MIN) * 2; // avoid underflow
    const uint64_t N = 1ULL << 38;
    double ref, ulp, relerr, maxrelerr = 0, maxulp = 0;
    uint64_t count = 0LL, incorrectly_rounded = 0LL;
    uint32_t ai, bi, ci, di;
    float af, bf, cf, df, resf;

#if TEST_SUM
    printf ("testing a*b+c*d ");
#else
    printf ("testing a*b-c*d ");
#endif // TEST_SUM
#if USE_CHT
    printf ("using Cornea-Harrison-Tang algorithm\n");
#else
    printf ("using Kahan algorithm\n");
#endif

    /* draw random finite operands whose magnitudes keep the products free
       of overflow and underflow */
    do {
        do {
            ai = KISS;
            af = __uint32_as_float (ai);
        } while (!isfinite(af) || (fabsf (af) > ULMT) || (fabsf (af) < LLMT));
        do {
            bi = KISS;
            bf = __uint32_as_float (bi);
        } while (!isfinite(bf) || (fabsf (bf) > ULMT) || (fabsf (bf) < LLMT));
        do {
            ci = KISS;
            cf = __uint32_as_float (ci);
        } while (!isfinite(cf) || (fabsf (cf) > ULMT) || (fabsf (cf) < LLMT));
        do {
            di = KISS;
            df = __uint32_as_float (di);
        } while (!isfinite(df) || (fabsf (df) > ULMT) || (fabsf (df) < LLMT));
        count++;
#if TEST_SUM        
#if USE_CHT
        resf = sum_of_products_cht (af, bf, cf, df);
#else // USE_CHT
        resf = sum_of_products_kahan (af, bf, cf, df);
#endif // USE_CHT
        ref = prod_sum_ref (af, bf, cf, df);
#else // TEST_SUM
#if USE_CHT
        resf = diff_of_products_cht (af, bf, cf, df);
#else // USE_CHT
        resf = diff_of_products_kahan (af, bf, cf, df);
#endif // USE_CHT
        ref = prod_diff_ref (af, bf, cf, df);
#endif // TEST_SUM
        ulp = floatUlpErr (resf, ref);
        incorrectly_rounded += ulp > 0.5;
        relerr = fabs ((resf - ref) / ref);
        if ((ulp > maxulp) || ((ulp == maxulp) && (relerr > maxrelerr))) {
            maxulp = ulp;
            maxrelerr = relerr;
            printf ("%13llu %12llu ulp=%.9f a=% 15.8e b=% 15.8e c=% 15.8e d=% 15.8e res=% 16.6a ref=% 23.13a relerr=%13.9e\n",
                    count, incorrectly_rounded, ulp, af, bf, cf, df, resf, ref, relerr);
        }
    } while (count <= N);

    return EXIT_SUCCESS;
}
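
For reference, one possible way to build and run the framework with gcc (the file name is hypothetical; -march=haswell enables hardware FMA on x86-64):

gcc -O2 -march=haswell prodtest.c -o prodtest -lm
./prodtest

The function under test and the algorithm are selected at compile time via the TEST_SUM and USE_CHT macros at the top of the file.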