4
votes

I'm writing a compiler that, within {} scope, by and large matches C99's semantics. While trying to reverse-engineer how gcc handles certain 'undefined behaviour' (concretely, chained pre- and post-increments of variables), I noticed that it gets hopelessly confused if you combine this with modifying assignments (e.g., "*=") and array access. Reduced to the simplest point of apparent utter confusion, gcc 4.6.3 evaluates (with and without option -std=c99):

a[0] = 2;
a[0] *= a[0]++;

to

a[0] = 3.

Am I misremembering the standard? Is any use of a pre- or post-increment already undefined, not only chained use in a compound expression?

Also, even if the behaviour is 'undefined', the above seems like a particularly poor way of calculating the result. I could only see how you'd justify a result of 5 (= 2*2 + 1: apply the post-increment after the assignment statement, which is what I would have implemented) or 6 (= 3*2: use the variable, immediately post-increment it, and process in parsing order; the parser is almost certain to evaluate the "*=" after evaluating the RHS expression). Any insight into this, from a C or C++ angle?

I noticed this when trying to combine arrays with integer expression bounds with pre- and post-increments, and realize this is really hard; but still, the above seems a little bit like a cop-out considering the flagship status of gcc.

This is under Ubuntu 12.04.

Edit: I should have added that gcc's behavior can be reverse-engineered if the variable is not an array element - at least all examples I tried work as follows: (1) evaluate all compound expression pre-increments; (2) evaluate the expression; (3) evaluate all compound expression post-increments. So it probably has to do with the 'really hard' above as well.

Note: clang produces the philosophically reasonable value of 6. I ran more elaborate cases with clang, and am reasonably certain that it treats the array access and scalar case the same, and operates as in what I described above as the second philosophically reasonable way.

4
There's not much point in trying to get 'correct' undefined behavior, since there's really no such thing. One of the main reasons C and C++ have so much undefined behavior is to give implementers freedom to optimize (or not force them to pessimize) many operations. In other words, undefined behavior is permission for the compiler to 'cop out'. – Michael Burr
I don't see why you consider that result mysterious. The mutations in a modifying assignment and in a post-increment can happen at any time in the evaluation of the expression following the initial access; so the mutations in *= and ()++ are not ordered. Apparently in this case gcc chooses to do them left-to-right. – rici
@rici: And he's asking why. – Lightness Races in Orbit
@rici: I don't think he wants guesswork; he wants someone who knows the answer to post the answer. Said "someone" is more than likely a GCC contributor. Or someone with enough knowledge to make a good guess, and from the looks of your answer, that's you. :) – Lightness Races in Orbit
The compiler is doing __tmp = a[0] + 1; a[0] *= __tmp; a[0] == __tmp; which is just a valid ordering of instructions. – David Rodríguez - dribeas

4 Answers

4
votes

The mutations in assignments (including read-and-mutate assignments such as *=) and in post-increments can happen at any time in the evaluation of the expression following the initial access of the value of the mutated cell. Consequently, the mutations to a[0] in *= and a[0]++ are not ordered with respect to each other. Apparently in this case gcc chooses to do them left-to-right.

Or to put it more precisely, the expression:

x *= y++;

can be rewritten as:

tmp1 = y + 1;        tmp2 = x * y;
y = tmp1;            x = tmp2;

where the columns can be interleaved in any order.

Note that in practice, both of the tmp variables are probably machine registers. In fact, what is probably going on is more like this:

r1  = y;
r2  = r1 + 1;
r3  = x;
r4  = r3 * r1;
x = r4;
y = r2;

So why do the last two assignments in that order rather than the other order? Well, why not?

Obviously, the confusion here is that x and y are the same location, but gcc is not obliged to notice that. It probably uses an optimization heuristic based on the assumption that they are different locations.

But suppose that it did notice that they are the same. In that case, one or other of the assignments can be eliminated. And furthermore, the computation of the temporary which is to be stored can also be eliminated. So the compiler would have the options of eliminating r4 = r3 * r1; x = r4 or r2 = r1 + 1; y = r2. With the choice of eliminating an increment or a multiplication, what would a self-respecting optimizer do?

3
votes

Am I mis-remembering the standard incorrectly? Is any use of a pre- or post-increment already undefined, not only chained use in a compound expression?

From the horse's mouth (2011 draft):

6.5 Expressions
...
2 If a side effect on a scalar object is unsequenced relative to either a different side effect on the same scalar object or a value computation using the value of the same scalar object, the behavior is undefined. If there are multiple allowable orderings of the subexpressions of an expression, the behavior is undefined if such an unsequenced side effect occurs in any of the orderings.84)

84) This paragraph renders undefined statement expressions such as
  i = ++i + 1;
  a[i++] = i;
while allowing
  i = i + 1;
  a[i] = i;

Undefined means undefined; the compiler is not obligated to produce a meaningful or logical result (it can if it wants to, but it doesn't have to); it's not even required to reproduce the same result for each run. gcc does something, but that something is likely fairly arbitrary given all the freedom it has in ordering operations.

Note that a[0] *= a[1]++ would be well-defined (provided neither a[0] nor a[1] contain trap representations, anyway).

Edit

As to how you could get 3 out of all that, I have an idea...

  1. r1 <- a[0] (evaluate LHS)
  2. r2 <- a[0] (evaluate RHS)
  3. r1 <- r1 * r2 (perform multiplication)
  4. a[0] <- r1 (write result of multiplication to LHS)
  5. a[0] <- r2 + 1 (apply side effect to RHS)

Presto; you've clobbered the result of the multiplication because of how the side effect is applied. Whether gcc actually does something like that is an open issue, and either way it's irrelevant; undefined behavior doesn't have to be consistent or repeatable.

2
votes

The (sub-)expression a[0]++ has a value and a side effect: the value is 2 (the old value); the side effect is changing the value of a[0] to 3. Note that the side effect can occur anytime between the enclosing sequence points (the end of the previous statement and the end of the current statement).

The (sub-)expression a[0] *= <value> has a value and a side effect: the value is /* old value of */ a[0] * <value>, the side effect is changing a[0] to the previous value of a[0] multiplied by <value>.

Note that you have two (sub-)expressions attempting to change the same object between sequence points. The C Standard specifically says only 1 such change is allowed.

If you are writing a compiler, you are free to detect such use and report it (with a warning or error), or not detect it and do anything (or even nothing -- ignoring the statement is perfectly fine).
Anything includes making the compiler produce an executable that prints "Hello, World!", or aborting the compiler itself.

-1
votes

It's called "undefined behaviour". You are modifying the same object a[0] twice in the same statement (without intervening sequence points), once by using the *= operator, once by using the ++ operator. The behaviour is undefined. Anything can happen, including your computer crashing or worse.

Once behaviour is undefined, arguing about what the compiler did is completely pointless. It's a bug in your code. Fix it.