Speeding Up Strings |
|
There have been repeated discussions on lua-l (for example [1]) about the cost of passing strings from C
into Lua
, with the implication that it is a major overhead. This page seeks to document an experiment in modifying Lua to reduce string creation overhead by implementing "prestrings" -- lazily interned strings. A sample API is outlined, and the results of a benchmark test of a proof-of-concept implementation of the API are presented.
The proposed API cannot be used safely in a threaded environment without severe restrictions which might be difficult to guarantee. In addition, the benchmark results do not indicate the potential for significant speed-up in common use cases.
In short, this is more of a failed experhiment than a suggestion, but it's useful to report failures as well in order to avoid reinventing flat wheels.
Basically, the process the Lua API follows when importing a string into Lua, either with lua_pushlstring()
or as the result of a primitive which creates a new string, is:
1) A hash is computed out of a sample of no more than 32 characters of the string.
2) Using the hash, the string is looked up in the global interned-string hash table.
3) If an existing string is found, then it is used.
4) Otherwise, a new TString
object is allocated and the string is copied into it.
The new object is inserted into the intern table using the hash value computed
at step 1.
This works particularly well for shortish strings which are likely to already exist in the intern table, such as global variable names and other table keys, since it avoids unnecessarily allocating or copying the provided string, although it will incur the cost of a full comparison of the new string with one existing string.
It also works reasonably well for strings which are unlikely to already exist in the intern table since the hash lookup will probably never compare more than a few characters of any existing string on the chain, although it will incur the cost of a memory allocation and a copy.
Nonetheless, there remain some use cases where these overheads seem like they
might be excessive. Some people worry that the cost of interning is too high
for a string which will never be used as a table key, while others focus on
the copy in step 4, above, which may be unnecessary if the string is going to
be constructed piecemeal or created by some other library which expects a
fixed-length buffer (for example, fread
.)
Since I belong to the second group, I thought it might be useful to find a way to allow the creation of a new string without copying. On the surface, this looks easy: you simply create something which looks like a string (or at least is the same size as a string would be) and provide the starting address; the API consumer can then fill the contents in as desired and then convert the object into a real string, by interning it.
That suggested a simple API:
lua_newprestring(L, size); lua_tostring(L, index); /* Note that lua_tostring is a macro which expands * to lua_tolstring(L, index, NULL); */
lua_tolstring
for this purpose seemed clear and obvious.
Now, the prestring is a real Lua object, so it could theoretically be returned
from a C function as an uninterned string. And since it is turned into a real
string by calling lua_tolstring
on it, and since lua_tostring
is the API
for extracting a string's address and length, no modification of standard
library functions is necessary in order to use prestrings, although it turns
out to be useful to have an API call which allows the use of the prestring
without interning it:
lua_rawtolstring(L, index, &len);
I could have hijacked lua_topointer
, but lua_topointer
doesn't return
the string length; furthermore, it's good to know that the object at a
stack index is actually a string, and not just anything which might
have a pointer.
My first attempt to use the API was to write a file
method :buffers
,
which is a fixed-length analog to :lines
:
for b in f:buffers(2000) do -- #b is 2000 unless there are fewer than -- 2000 characters to read end
:buffers
could just return a string, but it seemed more interesting
to return a prestring, in case the use of the return value didn't require
interning: say, if it were just going to be sent to some output. In that
case, :buffer
could even reuse the same prestring on each iteration,
provided that it hadn't been stringified. (This is not entirely safe since the
client program could keep a reference to the buffer, even though it
hasn't been interned, and might get surprised that the value changes.)
That meant that the implementation of :buffer
needs to be able to know
whether or not the prestring has been stringified, and that required
another API:
lua_rawtoprestring(L, index, &len);
Using lua_rawtoprestring
and lua_newprestring
, :buffer
can
decide to either recycle a prestring or create a new one, but for reasons
which now seem hard to justify, I added yet another API which combines
the two actions; the most semantically complicated API call in the
implementation:
lua_toprestring(L, index, &len);
index
and:
1) if it is a prestring, does not modify it;
2) if it is a string, creates a new prestring with the same length
(without copying contents) and replaces stack slot index
with
the new prestring;
3) if it is anything else, creates a new prestring whose length is
the value currently pointed to by the &len
argument, and
replaces stack slot index
with the new prestring;
4) finally, regardless of which of the above was performed,
returns the address and length of the prestring at stack slot
index
Although that's hard to explain, and maybe hard to understand, it's quite easy to use. For example, the implementation of :buffers is as follows:
static int io_readbuf (lua_State *L) { FILE *f = *(FILE **)lua_touserdata(L, lua_upvalueindex(1)); if (f == NULL) luaL_error(L, "file is already closed"); else { size_t nread; size_t toread = lua_tointeger(L, lua_upvalueindex(2)); char *buf = lua_toprestring(L, lua_upvalueindex(3), &toread); nread = fread(buf, 1, toread, f); if (nread == toread) lua_pushvalue(L, lua_upvalueindex(3)); else if (nread > 0) lua_pushlstring(L, buf, nread); else if (ferror(f)) luaL_error(L, "%s", strerror(errno)); else lua_pushnil(L); } return 1; } static int f_buffers (lua_State *L) { size_t bufsize; tofile(L); bufsize = luaL_optinteger(L, 2, LUAL_BUFFERSIZE); lua_settop(L, 1); lua_pushinteger(L, bufsize); lua_pushnil(L); lua_pushcclosure(L, io_readbuf, 3); return 1; }
A key part of this entire process is that the prestring is converted
in place into a string either explicitly or when needed. Prestrings
don't have a separate type; they look at all times like ordinary strings,
unless one of the above prestring-aware API calls is used. Existing
C modules and Lua programs do not need to be modified in any way to take
into account the existence of prestrings, and if you want a C module to avoid
autointerning a prestring, one only needs to replace lua_tolstring
with
lua_rawtolstring
.
Since a prestring is really a lazily-interned string, it has to act as though it were a string the Lua environment. That means that it has to be interned in order to be used as a table key. (The proof-of-concept implementation also auto-interns prestrings which are arguments to ==.)
Once a prestring is interned into a string, it must not be changed. So it's not really safe to modify a prestring which has been released into the Lua environment. You must finish creating the contents of the prestring before letting any Lua or CFunction see the prestring.
That's fine in a non-threaded environment, and the above implementation
of :buffers
will work; it is careful not to modify the prestring without
checking to make sure that it still is a prestring. But it cannot be made
to work safely in a threaded environment; the :buffers
iteration function
does not run inside of the lua_lock
and another process which had a
reference to the buffer could stringify it while the call to fread
were
in progress.
That's unfortunate because the benchmarks I ran against the proof-of-concept implementation show that the only significant speed-up available is through recycling prestrings in a hard loop. The overhead of interning and even copying strings is just not enough to justify the complexity of the implementation (in my opinion).
The first benchmark I tried was reading and writing file buffers, using
a function rather similar to :buffers
as shown above. The iolib
was modified by adding the :buffers
iterator and replacing a few
calls to lua_tolstring
with lua_rawtostring
. However, all of
the timings were dominated by the I/O system calls, so the results
were not conclusive, except in the sense that they demonstrate that in
a normal usage with file I/O, the savings are not even going to be
noticeable.
In order to try to measure more exactly what the benefits might be, I created a sort of pseudo-file, which consists of just under 16K random printable characters in a ringbuffer. The ring buffer is "read" by copying the number of requested bytes and advancing the internal "file" position, always modulo the ringbuffer size (which is a prime). Periodically, the ring buffer is rerandomized; the end result is that it is extremely unlikely that a given "read" will result in an existing string.
Using the above API, there are at least four possible ways of returning a buffer of characters:
1) "recycle": Always return the same prestring, with new contents
2) "prestring": Return a new prestring each time, but don't intern it
3) "inplace": Create the new string by first creating a prestring of the correct size, then reading data into it, and finally turning the prestring into a string before returning it.
4) "string": The current mechanism: the file is read into a prestring
(although it could just as well have been a preallocated buffer), and
then lua_pushlstring
is called to create a string in the normal way.
Comparing the times for these four strategies reveals the cost of
1) allocating memory (the difference between "prestring" and "recycle")
2) interning (the difference between "inplace" and "prestring"); and
3) copying (the difference between "string" and "inplace").
The benchmark was done rather unscientifically on a laptop running
Xubuntu (I tried to avoid using the machine while the benchmark was
running, but there were still lots of processes active). Lua was
compiled with -O2 -DLUA_USE_APICHECK
(it should have been
compiled without LUA_USE_APICHECK
although I don't think that
affects the relative timings; obviously, the execution times reported
are slower than they should be in a production build. The machine is
an AMD Turion running in a 64-bit environment. For each buffer size,
the simulated file size was set to 1,000,000 buffers and execution
time was collected with os.clock
, so the results can be read as
µs per buffer. Each test was run ten times and the fastest time
was selected for the summary; the results are summarized in
Files:wiki_insecure/users/rici/buftest.pdf. As can be seen, the
interning time is on the order of 200-250 nanoseconds per string,
which is only noticeable for very short strings; the copying time
is at worst 13% of the total, with the largest buffers tested.
The implementation is not particularly polished, and since the results were not spectacular, I'm unlikely to continue develop it. In case anyone is interested, I've put a patch at Files:wiki_insecure/users/rici/prestring.diff
Some notes on the implementation:
Every garbage collectable object has a next
member which Lua uses
to perform the sweep phase of the garbage collection. Objects other
than strings are kept in a singly-linked list whose head is in the
global state. Strings, however, are kept in the string intern table,
and the next
member is used to link together hash chains. (In effect,
the head of each hash chain is part of the garbage collection root.)
Once a prestring has been interned, it must be removed from the allocated
object linked list and linked into the appropriate hash chain in the intern
table. The only way I could think of doing that was to doubly-link the list
of allocated prestrings. The back pointer is kept in a union
with the
hash
member, which is not needed by prestrings. On 32-bit architectures,
this has no cost, but on 64-bit architectures it increases the string overhead
from 24 bytes to 32 bytes, since hash
is an int
. Possibly it would
have been better to just assume that there will never be a lot of prestrings,
and to keep them in a singly-linked list which is just searched linearly when
a prestring is interned.
Other than that, I don't think there are any GC issues. Like strings, prestrings are always created on the stack and can be created white since the stack is marked atomically.
The lock mechanism offered by the lua_lock
and lua_unlock
macros is
designed to avoid garbage collection or table resizing from interfering
with API calls or VM execution. Lua does not attempt, even in theory, to
prevent race conditions from other object mutations.
Prestrings introduce a new form of mutation: changing a mutable object
(a prestring) into an immutable object (a string). As noted above, the
lua_lock
mechanism cannot prevent a race between changing the
contents of a prestring and changing a prestring to a string.
Prestrings have the same type as strings (LUA_TSTRING
) and have the
same memory layout other than the hash
member. One flag was added to
the object header to distinguish prestrings from strings.
When a prestring is interned, it is possible that the result will be an existing string. In this case, we'll end up with two objects: the prestring and the old string. This could result in the same prestring being interned multiple times. To reduce the possibility of this happening, the autointerning done by key lookup and the equality operator both reset the string pointer to the old string; implementing that required changing a number of internal prototypes from const to non-const, and without doing much to solve the problem.
I think it would probably better to stash a forwarding pointer in the prestring (using the overloaded back pointer/hash) and rewriting the string pointer during garbage collection. (The garbage collector could not rewrite pointers in the stack frame of a C call but ought to be able to rewrite them elsewhere. Rewriting a pointer in a C call stack frame could result in the prestring being freed by the garbage collector while the C function was using a pointer to the prestring.)
--RiciLake