- Subject: Re: How does string.format handle undefined behavior?
- From: Lorenzo Donati <lorenzodonatibz@...>
- Date: Mon, 6 Sep 2021 19:17:59 +0200
On 06/09/2021 15:50, Roberto Ierusalimschy wrote:
The whole mess of UB is just that: people think "most implementations won't
do something silly in this case", then you find the "right" compiler
switch, the "right" compiler version, the "right" DLL linked-in, the
"right" C-lib version and some years down the road something goes horribly
wrong.
If you follow this line up to its logical end, it becomes impossible
to program in C.
A small illustration: as far as I can find, the standard says nothing
about the possibility of a stack overflow due to too many pending
calls. There is no way to check it, there is no ensured minimum,
it is not defined as undefined, nothing. I see two ways to interpret
this.
Well, I agree that a standard cannot cover absolutely everything, and
C doesn't even try. In fact everything the standard says is about the
abstract machine, not a real, physical machine, as stated in "5.1.2.3
Program execution".
However they went a long way to define what is "UB(TM)" versus what
"common people" call "undefined behavior" (in the sense that no-one has
defined it). Can we call it "plain UB"?
So I guess the committee explicitly marked as "UB(TM)" those areas where
they deemed that giving an implementation absolute freedom will
allow room for foreseeable optimizations or areas where defining a
behavior would have been too burdensome for compiler makers.
So, AFAIU, the committee defined as "UB(TM)" only those things that
actually can be avoided by a programmer (although sometimes with extreme
care). In fact they stated that if a program contains even an instance
of "UB(TM)" the program is considered erroneous, unless that "UB(TM)"
has been defined by the implementation as an extension, in which case the
program is declared "non-portable".
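
To make "can be avoided by a programmer" concrete, here is a minimal sketch for one well-known UB(TM): signed integer overflow. The helper name checked_add is hypothetical, not something from the standard library. It tests the operands against INT_MAX and INT_MIN first, so the overflowing addition is never evaluated.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: stores a + b in *out only when the sum is
 * representable as an int. Returns false (and leaves *out alone)
 * when the addition would overflow, so the UB is never reached. */
bool checked_add(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) ||
        (b < 0 && a < INT_MIN - b)) {
        return false;   /* a + b would overflow: refuse to compute it */
    }
    *out = a + b;       /* safe: the range check above passed */
    return true;
}
```

A caller checks the return value before using the result, e.g. `if (checked_add(x, y, &sum)) { ... }`. This is the "extreme care" part: every signed addition on untrusted values needs a check of this kind.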
In the case of "plain UB", i.e. those cases which could wreak havoc but
about which the standard is silent, then I assume they all fall under
the "implementation detail" hat.
So your counterargument is right if we consider "plain UB". However, if
we stick to just avoiding "UB(TM)", then it must be possible (by
definition), otherwise the committee would be implicitly declaring every
C program as erroneous because of this purported impossibility.
As I said, the standard terminology choice is unfortunate in that it
gives an extremely precise meaning (UB(TM)) to a general term used in
programming (plain UB). They could have chosen other terms, but alas we
are stuck with that.
In particular, see the definition in C99 (N1256 draft):
-----------------------------------------------------
3.4.3
1 undefined behavior
behavior, upon use of a nonportable or erroneous program construct or of
erroneous data, for which this International Standard imposes no
requirements
2 NOTE Possible undefined behavior ranges from ignoring the situation
completely with unpredictable results, to behaving during translation or
program execution in a documented manner characteristic of the
environment (with or without the issuance of a diagnostic message), to
terminating a translation or execution (with the issuance of a
diagnostic message).
-----------------------------------------------------
So a case of UB(TM) is NOT necessarily plain UB (ugh!), but is a term
used to flag erroneous or nonportable constructs or data.
Option 1: As the standard never mentions that a function call can
go wrong due to "stack overflow" (too many pending calls), then all
function calls should work as described, no matter how many pending
calls there are in the execution. As they don't, it follows that
all compilers we know about are badly buggy.
Option 2: We accept that the number of pending function calls has some
implicit limit, and once a program crosses that limit we have some
undefined behavior. As the standard does not set a minimum for this
limit (which does not even exist, according to the standard), it can
be any value. A single call to 'printf' in helloword.c can legitimately
cause a stack overflow and therefore undefined behavior. (The standard
also offers no way to check this limit.) If we cannot accept UB, no
matter what, then we should never call any functions in our programs.
It doesn't matter that such calls always worked in all compilers
we ever used; some years down the road something can go horribly
wrong, and we have only ourselves to blame.
-- Roberto
Your example about the stack depth limit is not covered by the standard
because the abstract machine doesn't even have a stack concept.
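
A small sketch of the pending-calls problem, with a hypothetical function name. Nothing in this code is UB(TM), and on the abstract machine it is well defined for any n. On a real machine each level usually costs some stack, and standard C offers no way to ask how many levels are available.

```c
/* Hypothetical example: one pending call per level of recursion.
 * Returns n, computed the hard way. */
unsigned long pending_calls(unsigned long n)
{
    if (n == 0) {
        return 0;
    }
    /* Not a tail call: the addition runs after the inner call returns,
     * so every level stays pending until the recursion bottoms out. */
    return 1 + pending_calls(n - 1);
}
```

A call such as pending_calls(100) works on typical hosted implementations. A call with a very large argument may crash, or may work fine if the compiler turns the recursion into a loop. Where the line falls is a property of the real machine and the implementation, not of the standard.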
FWIW, the abstract machine doesn't even have the concept of different
address spaces, so accessing data in different address spaces, e.g. in
the flash memory of an MCU instead of its RAM, usually uses non-portable
syntax that is compiler-specific.
So programming in C on a real machine requires knowledge of BOTH the
abstract machine AND the real machine. The standard only requires so
much from an implementation, and hopefully defines every relevant aspect
of the abstract machine, so that any UB(TM) can be avoided (possibly
with great effort by the programmer).
Once you get rid of all UB(TM) is your program necessarily correct? No,
because it could be non-erroneous from the standard perspective, but
still buggy because you didn't take into consideration the limits or the
capabilities of the real machine, about which the standard doesn't give
a damn. [1]
As I said, it took me literally years to grasp the "UB(TM)" meaning (and
sometimes I'm still puzzled), putting together pieces of information
found in lots of articles read here and there.
BTW, here's a nice article (by the renowned John Regehr and Pascal Cuoq) about
the problems of detecting and getting rid of UB in C and C++ programs.
Bottom line: sure it's (sometimes very) hard, but not impossible in
principle.
https://blog.regehr.org/archives/1520
It ends with this:
"Unfortunately, C and C++ are mostly taught the old way, as if
programming in them isn’t like walking in a minefield. Nor have the
books about C and C++ caught up with the current reality. These things
must change.
Good luck, everyone."
Cheers!
-- Lorenzo
[1] It is a common complaint from embedded system programmers that C
doesn't let you define the exact sequence of some operations as they
are performed when translated to machine code, which forces the use of
assembly snippets in critical code paths.
For example, assuming x and y are 16 bit quantities on an 8 bit MCU,
if you write:
x = <expr1>;
y = <expr2>;
there is no way in C99 to ensure that the updating of x happens
completely before the updating of y (the upper 8 bits and the lower 8
bits of each can be modified in any order, usually for optimization
purposes).
If x and y are HW registers that need to be accessed in a specific
order, you HAVE to use assembly.
And declaring x and y volatile doesn't help. This atomic updating
problem is addressed only in a later standard IIRC (C11), where atomic
types are introduced.
Getting the sequencing wrong could bring the system to a halt or
generate a HW exception (maybe depending on the timing of some external
event), for example. This is clearly "plain UB", but absolutely
not UB(TM), since the abstract machine is not concerned with what
x and y are mapped to.
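
For reference, a minimal sketch of the atomic types mentioned above, assuming they are the ones in C11's <stdatomic.h>. The variables and function names are hypothetical stand-ins for x and y. Each store is indivisible as seen by other threads, and with the default sequentially consistent ordering the store to x is ordered before the store to y. This covers ordinary shared memory only. It says nothing about memory-mapped HW registers, and whether a 16 bit atomic is lock-free on an 8 bit MCU is up to the implementation.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared variables standing in for x and y. */
static _Atomic uint16_t x;
static _Atomic uint16_t y;

/* Each atomic_store writes all 16 bits as one indivisible update;
 * the default memory order (seq_cst) keeps x's store before y's. */
void update_both(uint16_t vx, uint16_t vy)
{
    atomic_store(&x, vx);
    atomic_store(&y, vy);
}

/* Readers never observe a half-updated value. */
uint16_t read_x(void) { return atomic_load(&x); }
uint16_t read_y(void) { return atomic_load(&y); }
```

On a given target, atomic_is_lock_free(&x) reports whether the implementation can do this without a hidden lock.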