Using awk/gawk, I need to perform numeric comparisons involving NaN floating point values. Although gawk seems to convert my input correctly to the numeric value NaN (i.e. not the string "NaN"), comparisons with the operators '<' and '>' don't give the results I would expect.

Expectation:

Comparisons such as x > y or x < y, where x is NaN and y is any floating point value (including NaN and +/-Infinity), should evaluate to false [citation to an IEEE document needed, but the Wikipedia article on NaN has the table].

Actual result:

NaN < 2.0 == 0, but NaN > 2.0 == 1

The following snippet takes the first field and adds 0 to it to force conversion to a number (as described in the GNU awk manual). It then uses printf to show the type of the variables and expressions (my particular version of gawk doesn't have typeof()).

$ echo -e "+nan\n-nan\nfoo\nnanny" | awk \
'{x=($1+0); printf "%s: float=%f str=%s x<2==%f x>2==%f\n",$1,x,x,(x<2.0),(x>2.0);}'

+nan: float=nan str=nan x<2==0.000000 x>2==1.000000
-nan: float=nan str=nan x<2==0.000000 x>2==1.000000
foo: float=0.000000 str=0 x<2==1.000000 x>2==0.000000
nanny: float=0.000000 str=0 x<2==1.000000 x>2==0.000000

$ echo -e "+nan\n-nan\nfoo\nnanny" | awk --posix \
'{x=($1+0); printf "%s: float=%f str=%s x<2==%f x>2==%f\n",$1,x,x,(x<2.0),(x>2.0);}'           

+nan: float=nan str=nan x<2==0.000000 x>2==1.000000
-nan: float=nan str=nan x<2==0.000000 x>2==1.000000
foo: float=0.000000 str=0 x<2==1.000000 x>2==0.000000
nanny: float=nan str=nan x<2==0.000000 x>2==1.000000

Running GNU Awk 4.1.3, API: 1.1

Is there a different way/option to have NaNs propagate properly? I read the page on standards vs practice which talks about NaN, and I think I'm going about it correctly. I get the feeling NaN is perhaps not super well baked into awk. I couldn't find a reliable way to test whether a value was a NaN, for instance (other than via printf).

1 Answer

What does POSIX have to say? First of all, POSIX allows, but does not require, awk to support NaN or Inf values. From the awk specification in the IEEE Std 1003.1-2017 POSIX standard:

Historical implementations of awk did not support floating-point infinities and NaNs in numeric strings; e.g., "-INF" and "NaN". However, implementations that use the atof() or strtod() functions to do the conversion picked up support for these values if they used an ISO/IEC 9899:1999 standard version of the function instead of an ISO/IEC 9899:1990 standard version. Due to an oversight, the 2001 through 2004 editions of this standard did not allow support for infinities and NaNs, but in this revision, support is allowed (but not required). This is a silent change to the behavior of awk programs; for example, in the POSIX locale the expression:

("-INF" + 0 < 0)

formerly had the value 0 because "-INF" converted to 0, but now it may have the value 0 or 1.

How does GNU awk handle such magic IEEE numbers? The GNU awk manual states:

  • Without --posix, gawk interprets the four string values "+inf", "-inf", "+nan", and "-nan" specially, producing the corresponding special numeric values. The leading sign acts as a signal to gawk (and the user) that the value is really numeric.
  • With the --posix command-line option, gawk becomes "hands off". String values are passed directly to the system library’s strtod() function, and if it successfully returns a numeric value, that is what’s used. By definition, the results are not portable across different systems.

So in short, GNU awk without the --posix option converts only the strings "+nan", "-nan", "+inf" and "-inf" into their floating point representations (see the function is_ieee_magic_val in the gawk source).

Surprisingly, it does not convert "nan" and "inf", even though the string conversion of "+nan"+0 is a signless "nan":

$ gawk 'BEGIN{print "+nan"+0, "nan"+0}'
nan 0

Remark: when using --posix, GNU awk might recognize the strings "nan" and "inf" as well as others such as "infinity" or, completely unexpectedly, "nano" or "info". The latter is probably the main reason why, without --posix, the sign is paramount and only the strings "+nan", "-nan", "+inf" and "-inf" are recognized.

How does GNU awk sort such magic IEEE numbers?

When digging into the source code of GNU awk, we find the following comment for the routine cmp_awknums:

/*
 * This routine is also used to sort numeric array indices or values.
 * For the purposes of sorting, NaN is considered greater than
 * any other value, and all NaN values are considered equivalent and equal.
 * This isn't in compliance with IEEE standard, but compliance w.r.t. NaN
 * comparison at the awk level is a different issue and needs to be dealt
 * within the interpreter for each opcode separately.
 */

This answers the OP's original question: awk-level numeric comparisons go through this routine rather than following IEEE semantics, so ("+nan"+0 < 2) is 0 (false) and ("+nan"+0 > 2) is 1 (true). (Remark: the zero is added to the string to force a numeric conversion.)

This can be demonstrated with the following code (no --posix):

BEGIN { s = "1.0 +nan 0.0 -1 +inf -0.0 1 1.0 -nan -inf 2.0"; split(s, a)
        PROCINFO["sorted_in"] = "@val_num_asc"
        for (i in a) printf "%s%s", a[i], OFS; printf "\n"
        PROCINFO["sorted_in"] = "@val_num_desc"
        for (i in a) printf "%s%s", a[i], OFS; printf "\n"
      }

This outputs the following orderings:

-inf -1 -0.0 0.0 1 1.0 1.0 2.0 +inf +nan -nan
-nan +nan +inf 2.0 1.0 1.0 1 0.0 -0.0 -1 -inf

If NaN followed the IEEE conventions, it should always appear at the beginning of the list, regardless of the sort order, but this is clearly not the case. The same holds when using --posix:

function arr_sort(arr, n,   x, y, z) {
  # insertion sort over indices 1..n; "for (x in arr)" has no guaranteed order
  for (x = 2; x <= n; x++) { y = arr[x]; z = x - 1
     # force numeric comp
     while (z && arr[z]+0 > y+0) { arr[z + 1] = arr[z]; z-- }
     arr[z + 1] = y
  }
}
BEGIN { s = "1.0 +nan 0.0 -1 +inf -0.0 1 1.0 -nan -inf 2.0"
        s = s" inf nan info -infinity"; n = split(s, a)
        arr_sort(a, n)
        for (i = 1; i <= n; i++) printf "%s%s", a[i], OFS; printf "\n"
}

This outputs:

-inf -infinity -1 0.0 -0.0 1.0 1 1.0 2.0 +inf inf info +nan -nan nan

Note that with --posix the string "info" is treated as an infinity, whereas without --posix it is converted to zero (ditto for "inf", "nan", ...).

What is the deal with ("+nan" < 2) and ("+nan"+0 < 2)?

In the first case, a pure string comparison is done, while in the second case the string is forced to a number and a numeric comparison is done. This is similar to ("2.0" == 2) and ("2.0"+0 == 2): the first returns false, the second returns true. The reason for this behaviour is that in the first case awk only knows that "2.0" is a string constant; it does not inspect its content, so it converts 2 to the string "2" and compares the two strings.

BEGIN { print ("-nan" < 2)  , ("-nan" > 2)  , ("+nan" < 2)  , ("+nan" > 2)
        print ("-nan"+0 < 2), ("-nan"+0 > 2), ("+nan"+0 < 2), ("+nan"+0> 2)
        print ("-nan"+0 )   , ("-nan"+0)    , ("+nan"+0)    , ("+nan"+0)   }

This prints:

1 0 1 0
0 1 0 1
nan nan nan nan

How to check for inf or nan:

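# true when x compares equal to its own numeric conversion
function isnum(x) { return x+0 == x }
# relies on gawk treating all NaN values as equal in numeric comparisons
function isnan(x) { return (x+0 == "+nan"+0) }
# inf - inf is NaN, while x - x is 0 for every finite x
function isinf(x) { return ! isnan(x) && isnan(x-x)  }
BEGIN{inf=log(0.0);nan=sqrt(-1.0);one=1;foo="nano";   # note: log(0) is -inf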
    print "INF",   inf , isnum(inf)   , isnan(inf)   , isinf(inf)
    print "INF",  -inf , isnum(-inf)  , isnan(-inf)  , isinf(-inf)
    print "INF", "+inf", isnum("+inf"), isnan("+inf"), isinf("+inf")
    print "INF", "-inf", isnum("-inf"), isnan("-inf"), isinf("-inf")
    print "NAN",   nan , isnum(nan)   , isnan(nan)   , isinf(nan)
    print "NAN",  -nan , isnum(-nan)  , isnan(-nan)  , isinf(-nan)
    print "NAN", "+nan", isnum("+nan"), isnan("+nan"), isinf("+nan")
    print "NAN", "-nan", isnum("-nan"), isnan("-nan"), isinf("-nan")
    print "ONE",   one , isnum(one)   , isnan(one)   , isinf(one)
    print "FOO",   foo , isnum(foo)   , isnan(foo)   , isinf(foo)
}

This returns:

INF -inf 1 0 1
INF inf 1 0 1
INF +inf 1 0 1
INF -inf 1 0 1
NAN -nan 1 1 0
NAN nan 1 1 0
NAN +nan 1 1 0
NAN -nan 1 1 0
ONE 1 1 0 0
FOO nano 0 0 0

We can convince ourselves that the function isnan(x) works as expected by looking at the source code of cmp_awknums (comments added to explain):

int cmp_awknums(const NODE *t1, const NODE *t2)
{
    // isnan is here the C version
    // t1 is NaN: return 0 (equal) if t2 is NaN too, otherwise 1 (t1 is greater),
    // so all NaNs compare equal to each other
    if (isnan(t1->numbr))
        return ! isnan(t2->numbr);
    // only t2 is NaN: t1 is smaller, so NaN is bigger than any other number
    if (isnan(t2->numbr))
        return -1;
    // <snip>
}