Behavior of utf8.offset overflow

lua-l archive


Subject: Behavior of utf8.offset overflow
From: Hisham <h@...>
Date: Sun, 23 Apr 2017 04:06:27 -0300

The index argument of string.sub overflows gracefully and gives me an
empty string:

> string.sub("hello", 10)

> string.sub("hello", 5)
o

To match on UTF-8 characters, however, I need to convert the index via
utf8.offset.

> string.sub("héllo", 5)
lo
> string.sub("héllo", utf8.offset("héllo", 5))
o
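
To make the character-to-byte mapping concrete, here is a small sketch that prints what utf8.offset returns for each character position of "héllo". It assumes Lua 5.3 or later and a source file saved as UTF-8, so that "é" occupies two bytes.

```lua
-- Assumes Lua 5.3+ and UTF-8 source encoding ("é" is bytes 2 and 3).
local s = "héllo"

-- Walk one position past the last character on purpose.
for n = 1, utf8.len(s) + 1 do
  print(n, utf8.offset(s, n))
end
-- character 3 ("l") starts at byte 4, character 5 ("o") at byte 6,
-- and position 6 (one past the last character) maps to #s + 1 == 7.

print(utf8.offset(s, 7))  -- nil: two or more positions past the end
```

Note that one position past the last character already yields #s + 1; only positions beyond that give nil.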

However, the combination of string.sub and utf8.offset is not as
forgiving as the ASCII version:

> string.sub("héllo", utf8.offset("héllo", 10))
stdin:1: bad argument #2 to 'sub' (number expected, got nil)
stack traceback:
    [C]: in function 'string.sub'
    stdin:1: in main chunk
    [C]: in ?

Using utf8.offset to convert character indices to byte indices for
string.find produces an even more surprising result:

> string.find("hello", "o", 10)
nil

When we begin our search past the end of the string, we simply don't
find what we're looking for, as expected.

> string.find("hello", "o", utf8.offset("hello", 10))
5    5

When we replace the init argument with utf8.offset(s, init), an
overflow now yields nil, and string.find treats a nil init like an
omitted one: the search silently starts from position 1.

Since the utility of utf8.offset is essentially to give UTF-8-aware
indices to the other string.* functions, it could be a nicer behavior
if utf8.offset(s, i) returned #s+1 in case the desired offset is past
the end of the string, instead of the current nil. This would make
utf8.offset a drop-in replacement for the indices in all string
functions.
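
As a rough illustration of the suggested behavior, the wrapper below falls back to #s+1 whenever utf8.offset returns nil. The name offset_or_end is hypothetical (it is not part of the standard library), and the sketch only considers positive character positions; it assumes Lua 5.3+ and UTF-8 source encoding.

```lua
-- Hypothetical helper, not a standard function: behaves like
-- utf8.offset(s, n) for positive n, but returns #s + 1 instead of nil
-- when the requested character is past the end of the string.
local function offset_or_end(s, n)
  return utf8.offset(s, n) or (#s + 1)
end

print(string.sub("héllo", offset_or_end("héllo", 5)))         -- o
print(string.sub("héllo", offset_or_end("héllo", 10)))        -- (empty string)
print(string.find("hello", "o", offset_or_end("hello", 10)))  -- nil
```

With this fallback the overflow cases behave like their plain-index counterparts: string.sub gives an empty string and string.find finds nothing.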

(Side note: string.byte is a bit funny in that it returns _no value_
in case of overflow, not nil:

> string.byte("hello", 10)
> type(string.byte("hello", 10))
stdin:1: bad argument #1 to 'type' (value expected)
stack traceback:
    [C]: in function 'type'
    stdin:1: in main chunk
    [C]: in ?
> x = string.byte("hello", 10); print(x)
nil

It's not clear to me what the criteria are for a function to return
no value versus nil -- as seen above, string.find returns an explicit
nil.)
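
One way to see the difference between "no value" and an explicit nil is to count the results with select("#", ...). The helper name count below is only for illustration.

```lua
-- select("#", ...) counts the values actually received, so it can tell
-- "no value" apart from a single explicit nil.
local function count(...)
  return select("#", ...)
end

print(count(string.byte("hello", 10)))       -- 0: no value at all
print(count(string.find("hello", "o", 10)))  -- 1: one explicit nil
print(count(string.byte("hello", 1)))        -- 1: the byte value 104
```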

In any case, this is what we get if we don't check for nil with the
current behavior:

> string.byte("hello", utf8.offset("hello", 10))
104

For completeness, the other string function that uses an index is
string.match, and it behaves like string.find:

> string.match("hello", "o", 10)
nil
> string.match("hello", "o", utf8.offset("hello", 10))
o

I'm not sure if this is an improvement worth breaking compatibility
for, but I thought I'd share the observation. Adding extra nil-checks
in the code is annoying (and forgetting them is easy).
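
For reference, this is roughly what such a nil-check looks like under the current behavior. The wrapper name find_from_char is hypothetical, and the sketch assumes Lua 5.3+ with UTF-8 source encoding; the positions it returns are byte positions, as with string.find itself.

```lua
-- Hypothetical wrapper: string.find with a character-based start
-- position, guarding against utf8.offset returning nil.
local function find_from_char(s, pattern, n)
  local init = utf8.offset(s, n)
  if init == nil then
    return nil  -- start position is past the end: nothing to find
  end
  return string.find(s, pattern, init)
end

print(find_from_char("hello", "o", 10))  -- nil
print(find_from_char("hello", "o", 5))   -- 5    5
print(find_from_char("héllo", "o", 5))   -- 6    6
```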

-- Hisham

Follow-Ups:
- Re: Behavior of utf8.offset overflow, Roberto Ierusalimschy
