[Date Prev][Date Next][Thread Prev][Thread Next]
[Date Index]
[Thread Index]
- Subject: Re: Lua is faster than Java?
- From: Mike Pall <mikelu-0507@...>
- Date: Tue, 26 Jul 2005 00:07:52 +0200
Hi,
Mark Hamburg wrote:
> I don't know of a good way to optimize this sort of thing automatically. The
> introduction of
>
> local math_sqrt = math.sqrt
>
> Moves two table lookups out of the inner-loop, but how would the runtime
> system know that every time it looks up "math" in the globals it will get
> the same table and every time it looks up "sqrt" in that table it will get
> the same function?
It doesn't need to. It just needs to make a good guess and
then inline the 'most likely suspect'. Of course some code
to check the assumptions and to defer to the generic case
is needed, too (but this is branch-predicted as not-taken
and has little impact on speed).
This is pretty easy for a JIT compiler since it's called in
the context of the first function invocation. So it can have
a look at 'live' upvalues and globals (plus sub-tables).
Of course it must make conservative assumptions. E.g. an __index
function (and not another table) is a dead-end. But it can
make some brave assumptions, too. Like when it knows from
code analysis that a particular variable or upvalue is
write-once/read-many.
Even if the assumptions change over time (e.g. on the next
invocation) it can either adapt or just let branch prediction
do the dirty work and make the generic case the preferred
fast path.
> But inability to optimize it automatically doesn't mean that there isn't an
> interesting opportunity for a source code analyzer that could observe
> repeated global lookups and repeated lookups into values that aren't
> changing and suggest that they might be candidates for caching.
My guess is that caching a hash lookup might not pay off.
A much simpler scheme is to just do all of the table lookups
needed to check the assumptions. A single inlined lookup of
a string constant takes around 20 nanoseconds for a 1 GHz PIII.
Scale that up to a state-of-the-art CPU (P4 or K8) and you
probably get less than 5 nanoseconds. This means 200 million
lookups per second. Phew. :-)
I think it was Rici who suggested to me on IRC that a
sufficiently optimized and inlined hash method dispatch
comes pretty close to the performance of multi-level linear
dispatch. The above data supports that reasoning.
Bye,
Mike