- Subject: Iteration protocol
- From: Rici Lake <lua@...>
- Date: Sun, 27 Nov 2005 02:16:59 -0500
Based on my comment in the thread on nil as a table key, I did a
proof-of-concept implementation of the suggested iteration protocol.
The benchmarking I did shows speedups of 6-10%, but one test seems to
indicate a speedup of 33%.
The diff is not very big; it can be viewed here:
<https://trac.bulix.org/projects/rici/changeset/84> or downloaded as a
diff <https://trac.bulix.org/projects/rici/changeset/84?format=diff>.
The latter URL includes the benchmarking program I used
(test/benchmark.lua).
It does not include my patch for yieldable iterators, but the
equivalent patch would be much simpler for the modified iteration
protocol since the termination test could be implemented with an
unmodified OP_TEST. I didn't spend a lot of time thinking about
optimizations or things that could be removed. All I did was change
OP_TFORLOOP to use the new protocol, and modify pairs(),
string.gmatch() and io.lines(). (ipairs() needs to be modified as well,
but I didn't do that. Oops.)
The implementation differs slightly from the description in my earlier
message: it stops on any false return from the iterator, rather than
only on a nil return. The change seemed natural once the state is
separated from the iteration variables, but it may not be the best
choice.
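The difference between the two protocols can be desugared into plain Lua. The "standard" form below is the real Lua 5.1 semantics; the "modified" form is my reading of the proposal above (state kept apart from the loop variables, any false value terminating), a sketch rather than code from the patch:

```lua
-- Standard protocol: the iterator gets (state, control); its first result
-- is copied back into the control variable, and nil ends the loop.
local function next_std(t, i)
  i = i + 1
  if t[i] ~= nil then return i, t[i] end
end

-- Modified protocol (sketch): the iterator owns its position inside the
-- state, so no copy-back is needed; returning false (or nil) ends the loop.
local function next_mod(state)
  state.i = state.i + 1
  local v = state.t[state.i]
  if v == nil then return false end
  return state.i, v
end

local t = { "a", "b", "c" }

-- Driving the standard protocol by hand:
local out1 = {}
local i = 0                      -- control variable lives in the loop
while true do
  local k, v = next_std(t, i)
  if k == nil then break end
  i = k                          -- the per-step copy the patch avoids
  out1[#out1 + 1] = v
end

-- Driving the modified protocol by hand:
local out2 = {}
local state = { t = t, i = 0 }   -- state separated from loop variables
while true do
  local k, v = next_mod(state)
  if not k then break end        -- an unmodified OP_TEST suffices here
  out2[#out2 + 1] = v
end
```

Both loops visit the same elements; the modified form just never writes the key back into a control slot, and its termination test is an ordinary truth test.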
Here are the benchmark results, compiled with -O2 on gcc 3.3.3 running
on FreeBSD:
1) Simple for loop; iterates over 10 numeric and 10 non-numeric keys in
the same table. Presumably the speedup is simply from not having to
copy the control variable in OP_TFORLOOP.
beta-5.1: 7.02 seconds (1e6 repetitions)
modified: 6.59 seconds
saving: 6%
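For concreteness, test 1 presumably has roughly this shape (the actual code is in test/benchmark.lua from the changeset; the key names here are my guess):

```lua
-- A table with 10 numeric and 10 non-numeric keys; under the standard
-- protocol, every OP_TFORLOOP step copies the key back into the control
-- variable before the next call to next().
local t = {}
for i = 1, 10 do t[i] = i end
for i = 1, 10 do t["key" .. i] = i end

local count = 0
for k, v in pairs(t) do
  count = count + 1
end
```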
2) string.gmatch. Uses the standard idiom str:gmatch("%S+") to iterate
over a string with 12 short words. Here the speedup is rather more
marked, since we can avoid creating a closure on each loop.
beta-5.1: 9.02 seconds (1e6 repetitions)
modified: 8.18 seconds
saving: 9.3%
Note 1: Of course, you can't actually implement gmatch() without
creating a closure even with the suggested iteration protocol. However,
you don't have to create a closure every time; you only need one for
each pattern. The sample implementation provides the API
string.gmatcher(pattern) which returns a closure; this can be provided
to string.gmatch instead of the pattern argument. (It also lets you
specify a starting position in the subject string when calling
string.gmatch().) I
didn't spend a lot of time thinking about the API, but the results can
be inspected (see below).
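A pure-Lua illustration of the idea behind string.gmatcher (the real patch does this in C, and should be consulted for the actual API): build one matcher closure per pattern, cache it, and reuse it across loops instead of allocating a fresh closure on every gmatch() call. The cache and helper names here are hypothetical:

```lua
local matcher_cache = {}

local function gmatcher(pattern)
  local m = matcher_cache[pattern]
  if m == nil then
    -- (subject, pos) -> match, next-pos; under the modified protocol the
    -- position can ride along as explicit iteration state.
    m = function(s, pos)
      local a, b = string.find(s, pattern, pos)
      if a == nil then return nil end
      return string.sub(s, a, b), b + 1
    end
    matcher_cache[pattern] = m
  end
  return m
end

-- Reusing the cached matcher:
local words = gmatcher("%S+")
local out = {}
local pos, w = 1, nil
repeat
  w, pos = words("two words", pos)
  if w then out[#out + 1] = w end
until w == nil
assert(gmatcher("%S+") == words)  -- one closure per pattern, not per loop
```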
Note 2: The above timings are done with os.clock(), but I also timed
the complete benchmark of a million iterations of each of the above
tests. The results:
beta-5.1: 16.12 seconds (user)
modified: 14.84 seconds
saving: 7.9%
3) A simple(ish) implementation of insertion-ordered multiple-value
tables, such as the data structure you might want to use for storing
HTTP headers or LDAP entries. This is a fairly complex piece of code,
so it ought to give some indication of the impact on a real program.
The implementation differs only in the iterator: the 5.1-beta iterator
has to create a closure for each iteration, while the modified-lua
iterator does not; a single integer suffices to record the iteration
state.
The test:
a) converts an LDIF record (taken from the RFC) into an insert-ordered
table. The LDIF record has 12 entries and nine distinct keys.
b) deletes all entries (3 of them) with a key.
c) adds two new entries with the same key
d) makes a shallow copy (which iterates over all key/value pairs)
e) iterates over all values for each of three keys (a total of six
key/value pairs)
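A minimal sketch of such a structure (not the benchmark's actual code, which also handles deletion and shallow copies): entries sit in the array part in insertion order, and per-key iteration scans that array. In stock 5.1 the iterator below must close over its index; under the modified protocol that integer alone could serve as the iteration state:

```lua
local function omt_new()
  return {}
end

local function omt_add(t, key, value)
  t[#t + 1] = { key = key, value = value }
end

-- Iterate every value stored under `key`, in insertion order.
local function omt_values(t, key)
  local i = 0                     -- the closed-over state
  return function()
    while true do
      i = i + 1
      local e = t[i]
      if e == nil then return nil end
      if e.key == key then return e.value end
    end
  end
end

-- e.g. HTTP headers, which may legitimately repeat:
local h = omt_new()
omt_add(h, "Via", "1.1 proxy-a")
omt_add(h, "Content-Type", "text/plain")
omt_add(h, "Via", "1.1 proxy-b")

local vias = {}
for v in omt_values(h, "Via") do vias[#vias + 1] = v end
```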
The results are a bit odd, so I'll just report them. There are two
measures: the first is simply the time reported by os.clock(); the
second is the total time to run the benchmark as reported by
/usr/bin/time (which therefore includes parsing, test setup, etc., but
most importantly closing the lua_State at the end of the run, which
garbage-collects everything). The sample was executed 100,000 times:
beta-5.1: (os.clock) 27.78   (/usr/bin/time) 37.47
modified:            26.32                   25.12 (!)
saving:               5.2%                    33%
There is a highly noticeable gc pause at the end of the beta-5.1 run,
so I presume that this is the result of garbage collecting all the
closures which were created. On the other hand, each iteration of the
test allocates about 10 tables, which you would think would add up
quite a bit, too; furthermore, the memory usage reported by the rusage
structure is not significantly different between the two tests. My
guess is that closures are small but gc-intensive.