- Subject: Behavior of utf8.offset overflow
- From: Hisham <h@...>
- Date: Sun, 23 Apr 2017 04:06:27 -0300
The index argument of string.sub overflows gracefully and gives me an
empty string:
> string.sub("hello", 10)
> string.sub("hello", 5)
o
To match on UTF-8 characters, however, I need to convert the index via
utf8.offset.
> string.sub("héllo", 5)
lo
> string.sub("héllo", utf8.offset("héllo", 5))
o
However, the combination of string.sub and utf8.offset is not as
forgiving as the ASCII version:
> string.sub("héllo", utf8.offset("héllo", 10))
stdin:1: bad argument #2 to 'sub' (number expected, got nil)
stack traceback:
[C]: in function 'string.sub'
stdin:1: in main chunk
[C]: in ?
Using utf8.offset to convert character indices to byte offsets for
string.find produces an even more surprising result:
> string.find("hello", "o", 10)
nil
When we begin our search past the end of the string, we simply don't
find what we're looking for, as expected.
> string.find("hello", "o", utf8.offset("hello", 10))
5 5
When we replace the init argument with utf8.offset(s, init), we now
get a nil in case of overflow. string.find treats a nil init as if it
were omitted, so the search silently starts from position 1.
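A minimal sketch of why the init argument disappears, assuming Lua 5.3
or later (where the utf8 library exists): a nil init is handled the
same way as a missing one, so both calls below search from the start.

```lua
-- Assumes Lua 5.3+ for the utf8 library.
local s = "hello"
local init = utf8.offset(s, 10)   -- nil: there is no 10th character

-- A nil init is treated like a missing argument, so both calls
-- search from position 1 and find the "o" at byte 5.
local a1, a2 = string.find(s, "o", init)
local b1, b2 = string.find(s, "o")
print(init, a1, a2, b1, b2)
```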
Since the purpose of utf8.offset is essentially to give UTF-8-aware
indices to the other string.* functions, it would be nicer if
utf8.offset(s, i) returned #s+1 when the desired offset is past
the end of the string, instead of the current nil. This would make
utf8.offset a drop-in replacement for the indices in all string
functions.
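The proposed behavior can be approximated in user code today. The
sketch below is only an illustration: offset_or_end is a hypothetical
helper name, not part of the utf8 library, and it assumes Lua 5.3 or
later and a positive character index n.

```lua
-- Hypothetical helper (not a standard function): like utf8.offset,
-- but returns #s + 1 instead of nil when character n is past the end.
-- Only meant for positive n; negative n would need a different fallback.
local function offset_or_end(s, n)
  return utf8.offset(s, n) or #s + 1
end

-- In range: same result as utf8.offset.
print(string.sub("héllo", offset_or_end("héllo", 5)))

-- Past the end: behaves like the plain ASCII calls with a large index.
print(string.sub("héllo", offset_or_end("héllo", 10)))
print(string.find("hello", "o", offset_or_end("hello", 10)))
print(string.match("hello", "o", offset_or_end("hello", 10)))
```

The fallback is #s + 1 rather than some larger number because that is
also what utf8.offset itself returns for the position just after the
last character.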
(Side note: string.byte is a bit funny in that it returns _no value_
in case of overflow (not nil):
> string.byte("hello", 10)
> type(string.byte("hello", 10))
stdin:1: bad argument #1 to 'type' (value expected)
stack traceback:
[C]: in function 'type'
stdin:1: in main chunk
[C]: in ?
> x = string.byte("hello", 10); print(x)
nil
It's not clear to me what the criteria are for functions to return no
value versus nil -- as seen above, string.find returns an explicit
nil.)
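One way to make the difference visible is to count the returned values
with select("#", ...). This is just a small illustration of the two
cases from the side note:

```lua
-- Count how many values each call returns.
local byte_count = select("#", string.byte("hello", 10))      -- no values
local find_count = select("#", string.find("hello", "o", 10)) -- one nil
print(byte_count, find_count)

-- In an assignment, a missing value is padded with nil, which is
-- why x ends up nil even though string.byte returned nothing.
local x = string.byte("hello", 10)
print(x)
```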
In any case, this is what we get if we don't check for nil with the
current behavior:
> string.byte("hello", utf8.offset("hello", 10))
104
For completeness, the other string function that uses an index is
string.match, and it behaves like string.find:
> string.match("hello", "o", 10)
nil
> string.match("hello", "o", utf8.offset("hello", 10))
o
I'm not sure if this is an improvement worth breaking compatibility
for, but I thought I'd share the observation. Adding extra nil-checks
in the code is annoying (and forgetting them is easy).
-- Hisham