- Subject: proposal for reading individual characters from strings faster
- From: Coroutines <coroutines@...>
- Date: Sat, 3 May 2014 03:59:50 -0700
1) I'd like to see `a` (note the backticks) be a run-once form of
string.byte('a') -- a single-character transform that happens at
compile-time. When you call string.byte('a') in a loop it is evaluated
on every iteration, and I argue that a local created just to cache the
result is difficult to name and largely a waste of a local.
2) I'd like the __index of the string type to change so that, when
indexed with a numeric key, it does this [essentially] in a C function:
('cat')[3] -> string.byte('cat', 3, 3). This could be done more cheaply
in C than with the additional call overhead of string.byte() from Lua
in tight/large loops.
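For comparison, here is a minimal sketch of the closest equivalent in stock Lua today: the byte value is hoisted into a local once, and the loop compares integers returned by string.byte(). The names BYTE_A and count_a are made up for illustration and are not part of the proposal.

```lua
-- Stock-Lua approximation of proposal 1: evaluate string.byte('a') once
-- and keep the result in a local (hypothetical name).
local BYTE_A = string.byte('a')   -- 97 in ASCII

-- Count occurrences of 'a' by comparing integers, without creating any
-- single-character substrings.
local function count_a(s)
  local n = 0
  for i = 1, #s do
    if string.byte(s, i) == BYTE_A then
      n = n + 1
    end
  end
  return n
end

print(count_a('banana'))   --> 3
```

This is the pattern the proposal aims to shorten: `a` would replace the BYTE_A local, and s[i] would replace the string.byte(s, i) call.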
It's very costly to create a lot of single-character Lua strings that
get hashed and then quickly discarded after a comparison -- the two
proposals above are what I've come up with to help.
Usually I'm iterating through a string on a character basis like this
when I'm doing something with parsers -- and it's much easier to
prototype algorithmically in Lua before (if ever) moving it to C.
Single-character modifications to the string are not possible -- the
hash would need to change and if it's referenced from several places
this would be a CoW issue. We can do single-character reads as
integers for quick comparisons, though.
The string library's metatable would have to change like so:
  __index =
    function (self, k)
      if type(k) == 'number' then
        return cFunc_char_at(self, k)
      end
      return string[k]
    end
If there's a faster way to detect a number I would love to hear it.
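To try out the indexing behaviour without patching the interpreter, the same __index can be installed from plain Lua. This is only a sketch: string.byte() stands in for the hypothetical cFunc_char_at(), so it shows the semantics but keeps the call overhead the proposal wants to remove.

```lua
-- Pure-Lua stand-in for the proposed __index. All strings share one
-- metatable, which getmetatable('') returns.
local string_mt = getmetatable('')

string_mt.__index = function (self, k)
  if type(k) == 'number' then
    return string.byte(self, k)   -- yields nil when k is out of range
  end
  return string[k]                -- keep method lookup such as s:upper()
end

local s = 'cat'
assert(s[3] == string.byte('t'))  -- numeric key gives the byte value
assert(s[4] == nil)               -- past the end of the string
assert(s:upper() == 'CAT')        -- string methods still resolve
```

Replacing the default __index (the string table itself) with a function makes every method lookup go through a Lua call as well, which is one reason the proposal asks for this to be done in C.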
These 2 feature requests would suit the switch in a basic recursive
descent parser well:
  impl.read_value =
    function (peek, read)
      local c = peek() -- self[next_idx]
      if     c == `"` then return impl.read_string(peek, read)
      elseif c == `{` then return impl.read_object(peek, read)
      elseif c == `[` then return impl.read_array(peek, read)
      elseif c == `t` then return impl.read_constant(peek, read, 'true', true)
      elseif c == `f` then return impl.read_constant(peek, read, 'false', false)
      elseif c == `n` then return impl.read_constant(peek, read, 'null', ljson.null)
      elseif impl.is_number_char(c) then return impl.read_number(peek, read)
      else
        error('unexpected character') -- no JSON value starts with c
      end
    end
~Okay, not a very good example -- I pulled it from one of my projects
that needs more work refining~
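A runnable approximation of that switch in current Lua might look like the sketch below. It swaps the if/elseif chain for a table keyed by byte value, which is a different technique from the one proposed, and every name in it (kind_of, classify) is hypothetical rather than taken from the project quoted above.

```lua
-- Byte values are computed once at load time and used as integer keys,
-- so dispatch is one table lookup per character.
local byte = string.byte

local kind_of = {
  [byte('"')] = 'string',
  [byte('{')] = 'object',
  [byte('[')] = 'array',
  [byte('t')] = 'true',
  [byte('f')] = 'false',
  [byte('n')] = 'null',
}

local MINUS, ZERO, NINE = byte('-'), byte('0'), byte('9')

-- Classify the JSON value that starts at position i of s.
-- Returns the kind, or nil plus an error message.
local function classify(s, i)
  local c = byte(s, i)
  if c == nil then
    return nil, 'unexpected end of input'
  end
  local kind = kind_of[c]
  if kind then
    return kind
  end
  if c == MINUS or (c >= ZERO and c <= NINE) then
    return 'number'
  end
  return nil, 'unexpected character at position ' .. i
end

print(classify('{"a":1}', 1))   --> object
```

With the proposed syntax the table keys could be written as [`"`], [`{`] and so on, and byte(s, i) as s[i].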
While you can do this with string.gsub(), gsub() would create those
dreadfully small, temporary, single-character substrings to pass to the
function/table you'd use for its 3rd argument.
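A small sketch of the gsub() behaviour described here: with the pattern '.', the callback receives each character as a one-character Lua string rather than as an integer. The variable names are illustrative only.

```lua
-- Record the type and value of every argument gsub() hands to the callback.
local seen = {}

local result = string.gsub('a{b', '.', function (ch)
  seen[#seen + 1] = type(ch) .. ':' .. ch
  -- returning nothing keeps the original match in the result
end)

for _, entry in ipairs(seen) do
  print(entry)
end
```

Each callback argument is a string value, so comparing it against a byte constant would first require a string.byte() call inside the callback.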
The objective of these proposals is to make individual character
matching faster, without the call overhead of string.byte().
Thoughts? (I thought this would be better than some of the more
radical ideas from the past of introducing a 'char' type)
PS: Please don't just reply "You can do this with gsub()!" I contend
that it can be natural to use either approach, depending on the
situation. Sometimes the functional form of gsub() makes sense;
sometimes I feel like this would look cleaner (and be faster than
invoking the pattern matching facilities of gsub).