Re: How does string.format handle undefined behavior?

lua-l archive


Subject: Re: How does string.format handle undefined behavior?
From: Lorenzo Donati <lorenzodonatibz@...>
Date: Mon, 6 Sep 2021 19:17:59 +0200

On 06/09/2021 15:50, Roberto Ierusalimschy wrote:

The whole mess of UB is just that: people think "most implementations won't
do something silly in this case", then you find the "right" compiler
switch, the "right" compiler version, the "right" DLL linked-in, the
"right" C-lib version and some years down the road something goes horribly
wrong.


If you follow this line up to its logical end, it becomes impossible
to program in C.

A small illustration: as far as I can find, the standard says nothing
about the possibility of a stack overflow due to too many pending
calls. There is no way to check it, there is no ensured minimum,
it is not defined as undefined, nothing. I see two ways to interpret
this.


Well, I agree that a standard cannot cover absolutely everything, and C doesn't even try. In fact everything the standard says is about the abstract machine, not a real, physical machine, as stated in "5.1.2.3 Program execution".

However, they went a long way to define what is "UB(TM)" versus what "common people" call "undefined behavior" (in the sense that no one has defined it). Can we call it "plain UB"?

So I guess the committee explicitly marked as "UB(TM)" those areas where they deemed that giving an implementation absolute freedom would allow room for foreseeable optimizations, or areas where defining a behavior would have been too burdensome for compiler makers.

So, AFAIU, the committee defined as "UB(TM)" only those things that actually can be avoided by a programmer (although sometimes with extreme care). In fact they stated that if a program contains even an instance of "UB(TM)" the program is considered erroneous, unless that "UB(TM)" has been defined by the implementation as an extension, in which case the program is declared "non-portable".
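
To illustrate "can be avoided by a programmer (with care)", here is a minimal sketch in C. Signed integer overflow is one of the classic cases of UB(TM); the helper name checked_add is made up for this example, and the approach (testing the operands before adding) is just one common way to stay clear of it.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: adds a and b only when the result fits in an int.
 * The range test is done BEFORE the addition, so the overflowing
 * expression a + b is never evaluated. Returns false when the sum
 * would overflow; *out is left untouched in that case. */
bool checked_add(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) ||
        (b < 0 && a < INT_MIN - b)) {
        return false;
    }
    *out = a + b;
    return true;
}
```

The point is only that the check is expressible in standard C: the program never performs the operation the standard leaves undefined.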

In the case of "plain UB", i.e. those cases which could wreak havoc but about which the standard is silent, then I assume they all fall under the "implementation detail" hat.

So your counterargument is right if we consider "plain UB". However, if we stick to just avoiding "UB(TM)", then it must be possible (by definition), otherwise the committee would be implicitly declaring every C program as erroneous because of this purported impossibility.

As I said, the standard terminology choice is unfortunate in that it gives an extremely precise meaning (UB(TM)) to a general term used in programming (plain UB). They could have chosen other terms, but alas we are stuck with that.


In particular, see the definition in C99 (N1256 draft):

-----------------------------------------------------
3.4.3

1 undefined behavior

behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements

2 NOTE Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).

-----------------------------------------------------

So a case of UB(TM) is NOT necessarily plain UB (ugh!), but is a term used to flag erroneous or nonportable constructs or data.

Option 1: As the standard never mentions that a function call can
go wrong due to "stack overflow" (too many pending calls), then all
function calls should work as described, no matter how many pending
calls there are in the execution. As they don't, it follows that
all compilers we know about are badly buggy.

Option 2: We accept that the number of pending function calls has some
implicit limit, and once a program crosses that limit we have some
undefined behavior. As the standard does not set a minimum for this
limit (which does not even exist, according to the standard), it can
be any value. A single call to 'printf' in helloword.c can legitimately
cause a stack overflow and therefore undefined behavior. (The standard
also offers no way to check this limit.) If we cannot accept UB, no
matter what, then we should never call any functions in our programs.
It doesn't matter that such calls always worked in all compilers
we ever used; some years down the road something can go horribly
wrong, and we have only ourselves to blame.

-- Roberto

Your example about the stack depth limit is not covered by the standard because the abstract machine doesn't even have a stack concept.
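
To make the stack example concrete, here is a minimal sketch in C (the helper name sum_to is made up for illustration). In the abstract machine every call below is well defined for any n, and the unsigned arithmetic simply wraps around if the sum gets too large. Whether a real implementation survives a very large n depends on its stack size, which the standard never mentions.

```c
/* Hypothetical helper: sums 1..n recursively.
 * The addition happens after the recursive call returns, so every
 * pending call has to be remembered somewhere by the implementation. */
unsigned long sum_to(unsigned long n)
{
    if (n == 0) {
        return 0;
    }
    return n + sum_to(n - 1);
}
```

With a small n (say 1000) this works on typical hosted implementations. With n in the hundreds of millions it may exhaust the stack, depending on the compiler and its optimization settings, yet nothing in the source is UB(TM).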

FWIW, the abstract machine doesn't even have the concept of different address spaces, so accessing data in different address spaces, e.g. in the flash memory of an MCU instead of its RAM, usually uses non-portable syntax that is compiler-specific.

So programming in C on a real machine requires the knowledge of BOTH the abstract machine AND the real machine. The standard only requires so much from an implementation and hopefully defines every relevant aspect of the abstract machine that allows avoiding (possibly with great effort by the programmer) any UB(TM).

Once you get rid of all UB(TM), is your program necessarily correct? No, because it could be non-erroneous from the standard perspective, but still buggy because you didn't take into consideration the limits or the capabilities of the real machine, about which the standard doesn't give a damn. [1]

As I said, it took me literally years to grasp the "UB(TM)" meaning (and sometimes I'm still puzzled), putting together pieces of information found in lots of articles read here and there.

BTW, here's a nice article (by the renowned John Regehr and Pascal Cuoq) about the problems of detecting and getting rid of UB in C and C++ programs. Bottom line: sure it's (sometimes very) hard, but not impossible in principle.


https://blog.regehr.org/archives/1520

It ends with this:

"Unfortunately, C and C++ are mostly taught the old way, as if programming in them isn’t like walking in a minefield. Nor have the books about C and C++ caught up with the current reality. These things must change.


Good luck, everyone."


Cheers!

-- Lorenzo

[1] It is a common complaint from embedded system programmers that C doesn't let you define the exact sequence of some operations as they are performed when translated to machine code, which forces the use of assembly snippets in critical code paths.


For example, assuming x and y are 16 bit quantities on an 8 bit MCU,
if you write:

x = <expr1>;
y = <expr2>;

there is no way in C99 to ensure that the updating of x happens completely before the updating of y (the upper 8 bits and the lower 8 bits of each can be modified in any order, usually for optimization purposes).

If x and y are HW registers that need to be accessed in a specific order, you HAVE to use assembly. And declaring x and y volatile doesn't help. This atomic updating problem is addressed only in a later standard (C11, IIRC), where some atomic types are introduced.

Getting the sequencing wrong could bring the system to a halt or generate a HW exception (maybe depending on the timing of some external event), for example, and this is clearly a "plain UB", but absolutely not a UB(TM), since the abstract machine state is not concerned by what x and y are mapped to.
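
For what it's worth, here is a hedged sketch of what the atomic types mentioned above look like, using C11's <stdatomic.h>. The names x, y and update_in_order are illustrative only. I'm assuming a hosted implementation with ordinary objects in RAM; whether an _Atomic object can be mapped onto real HW registers, and whether the operations are lock-free on a given 8 bit MCU, is an implementation detail that this sketch does not settle.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Two 16 bit quantities declared atomic (plain RAM objects here,
 * NOT real hardware registers). */
static _Atomic uint16_t x;
static _Atomic uint16_t y;

/* Hypothetical helper: each atomic_store is indivisible as seen by
 * other threads, and with the default sequentially consistent ordering
 * the store to x is ordered before the store to y. */
void update_in_order(uint16_t vx, uint16_t vy)
{
    atomic_store(&x, vx);
    atomic_store(&y, vy);
}
```

This only addresses the abstract machine side of the problem (indivisible, ordered updates); what a peripheral sees on the bus is still up to the real machine.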
