- Subject: Re: proposal for reading individual characters from strings faster
- From: Coroutines <coroutines@...>
- Date: Sat, 3 May 2014 05:09:14 -0700
On Sat, May 3, 2014 at 4:26 AM, Philipp Janda <siffiejoe@gmx.net> wrote:
> We already have long strings and short strings in Lua 5.2. What about an
> unhashed "very short" string (7 bytes plus NUL byte) that lives directly in
> a TValue?
Firstly, I want to say that I think your proposal is an interesting
tangent worth pursuing -- but please note that it does not supersede
anything related to the first proposal I made: `?` -> string.byte('?')
as compile-time syntax sugar.
I like your proposal, but I feel (without benchmarking) that comparing
2 integers would be quicker than first checking whether a short string
is long enough to turn the comparison into one between 2
integers/doubles (word/dword comparison). Most libc strcmp()
implementations still do byte-by-byte comparisons, which would be
slower than comparing 4 or 8 bytes between 2 lua_Numbers. Sidenote:
Let's make it 16-byte short strings with long-long comparisons -
possible on x86_64 anyway... :(
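To make the word-comparison idea concrete, here is a minimal C sketch. It assumes a hypothetical 8-byte inline string (7 bytes plus NUL, zero-padded) with no embedded NULs; the `VeryShortStr` type and the `vss_*` helpers are invented for illustration and are not Lua's actual TValue layout or API. It has not been benchmarked.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical "very short" string: up to 7 bytes plus a NUL terminator,
   stored inline in 8 bytes. Illustration only, not Lua's TValue. */
typedef struct {
    char bytes[8];
} VeryShortStr;

/* Build from a C string. Returns 0 on success, -1 if it does not fit. */
int vss_init(VeryShortStr *out, const char *s) {
    size_t len = strlen(s);
    if (len > 7) return -1;
    /* Zero the whole buffer so unused trailing bytes always compare equal. */
    memset(out->bytes, 0, sizeof out->bytes);
    memcpy(out->bytes, s, len);
    return 0;
}

/* Equality as one 64-bit integer comparison. */
int vss_equal_word(const VeryShortStr *a, const VeryShortStr *b) {
    uint64_t x, y;
    /* memcpy sidesteps alignment and strict-aliasing problems. */
    memcpy(&x, a->bytes, sizeof x);
    memcpy(&y, b->bytes, sizeof y);
    return x == y;
}

/* Equality through strcmp(), for contrast. */
int vss_equal_strcmp(const VeryShortStr *a, const VeryShortStr *b) {
    return strcmp(a->bytes, b->bytes) == 0;
}
```

Both functions give the same answer for strings built with vss_init(); whether the word version is actually faster on a given libc would have to be measured.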
> It should get rid of the hashing overhead for single character
> strings (not sure how much hashing there is for single-byte strings), but
> not the call overhead of string.byte, though ...
Between two "short strings" it would be a strcmp()? If it were
between a short string and a long string you would still have to hash
the short string for the comparison.
-----
Unrelated: I think I'd modify my first proposal -- `a` should be
UTF8-aware and return the integer codepoint (the facility is in 5.3!),
but some_string[3] should return the integer at byte index 3, and not
look for a UTF8 codepoint.
My reasoning is that the 2nd proposal is in the interest of reading
bytes as quickly as possible, without the call overhead of
string.byte(). Adding UTF8 detection to that would only slow it back
down. If one wants to index by UTF8 character they can do that with
the function available in 5.3, but it will always be slower than a
single-byte index. I would want the backtick syntax (``) to be
UTF8-aware because you can use that 5.3 function with `` for
comparisons if you are working with non-ASCII text.
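The cost difference argued above can be sketched in C. The two functions below are hypothetical stand-ins, not Lua source: byte_at() is the single bounds-checked load a byte index would need, while utf8_decode() has to inspect the lead byte and loop over continuation bytes. The decoder is deliberately simplified and does not reject overlong encodings or surrogates.

```c
#include <stddef.h>

/* Byte at 1-based index i (Lua-style indexing), or -1 if out of range. */
int byte_at(const char *s, size_t len, size_t i) {
    if (i < 1 || i > len) return -1;
    return (unsigned char)s[i - 1];
}

/* Decode the UTF-8 sequence starting at s[0].
   Returns the code point and stores the number of bytes consumed in *used,
   or returns -1 on truncated or malformed input. */
long utf8_decode(const char *s, size_t len, size_t *used) {
    const unsigned char *p = (const unsigned char *)s;
    size_t n, k;
    long cp;

    if (len == 0) return -1;
    if (p[0] < 0x80)                { n = 1; cp = p[0]; }        /* ASCII */
    else if ((p[0] & 0xE0) == 0xC0) { n = 2; cp = p[0] & 0x1F; } /* 2-byte lead */
    else if ((p[0] & 0xF0) == 0xE0) { n = 3; cp = p[0] & 0x0F; } /* 3-byte lead */
    else if ((p[0] & 0xF8) == 0xF0) { n = 4; cp = p[0] & 0x07; } /* 4-byte lead */
    else return -1;                 /* stray continuation or invalid byte */

    if (n > len) return -1;         /* sequence runs past the end */
    for (k = 1; k < n; k++) {
        if ((p[k] & 0xC0) != 0x80) return -1; /* not a continuation byte */
        cp = (cp << 6) | (p[k] & 0x3F);
    }
    *used = n;
    return cp;
}
```

Even for pure ASCII input the decoder does extra branching per character, which is why some_string[3] is better left as a plain byte read.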