Mike Pall <mikelu-0703@mike.de> writes:
David Kastrup wrote:
Such strings would not get hashed unless one used them as a table
index, and they would not get unified or copied either unless
necessary.
The problematic words are "unless necessary". So how/where do you
decide when the memory holding the string data is no longer valid
and needs to be copied? (Well, a pooling approach might work.)
Stop. I was talking about hashing, you are talking about copying.
Two different things. While I admit that avoiding copying might at
some point also be interesting, this is a problem that affects
performance orders of magnitude less.
If one fears that memory will fill up with identical strings
otherwise, one can hash and fold strings as part of garbage
collection, but that's a different story.
But what about dangling pointers on the C side from previous calls
to lua_tostring()?
If we are talking about copy avoidance, it might be possible to have
the C routine pass a callback to lua_tostring_inplace for when the
string is no longer needed (if allocated via malloc, this could be
free when C itself does not need to retain the string separately from
Lua). And have a routine that C can call in order to tell Lua that
the memory containing string is going out of scope _now_ (like when it
is allocated with alloca and C is about to exit the block where this
has been done), in which case Lua will create its own copy of the
contents if it still needs the string and has not copied it yet.
If the string is a string constant, the callback would be a routine
doing nothing.
And you still need to allocate a fixed size object on the heap for
every lua_pushlightstring(). And then you have the added complexity
of dealing with two different string types everywhere in the Lua
core. I'm not sure this pays off, especially for small strings,
which make up the majority in most scenarios.
Lazy strings could be the default, then no separate type would be
required.
For huge strings (like whole files) a string buffer approach (using
userdata objects) might be more appropriate. But then you'll face
the problem that most library functions expect strings and won't be
able to use this object type. Some ideas about how to rectify this
have been floating around (but one really needs to avoid copying the
data).
Being able to avoid _hashing_ the data until needed would already make
quite a difference. Avoiding copying appears possible with some
effort and additional string creation functions, at the cost of laying
some more management tasks at the feet of the programmer wanting to
make use of this.
OTOH you could try to speed up string hashing (luaS_newlstr). I've
personally given up on this. Either the hash quality deteriorates too
much or the compiler produces slower code.
One should open-code the special case of stride 1, and one should do
the hashing from the start rather than from the end of the string.
That would already speed up the most common cases.
But of course being able to avoid calculation of the hash code and
looking it up in the global string table (with the occasional
resulting need to grow this table) would save much much more time in
the case of throwaway strings not used for indexing.