- Subject: Re: LuaJIT2 vs. vararg
- From: Mike Pall <mikelu-0911@...>
- Date: Fri, 27 Nov 2009 00:57:31 +0100
Mark Hamburg wrote:
> Here were the opportunities I saw for optimization in my broadcast
> routine:
Here's a test case with a slight rewrite to get rid of the vararg
and the pcall (neither of which affects the argument below):
function broadcast(array, msg)
  for i=1,#array do
    local obj = array[i]
    obj[msg](obj)
  end
end

local x = 0
local obj = { foo = function(o) x = x + 1 end }
local t = {}
for i=1,100 do t[i] = obj end
broadcast(t, "foo")
Have a look at TRACE 2 when run with: luajit -jdump=i test.lua
---- TRACE 2 IR
0001 int SLOAD #4 RI loop bound (#array)
0002 > int LE 0001 +2147483646 guard for narrowed loop
0003 int SLOAD #3 I i
0004 > tab SLOAD #1 array
0005 int FLOAD 0004 tab.asize \
0006 > int ABC 0005 0003 / array[i] bounds check
0007 ptr FLOAD 0004 tab.array \
0008 ptr AREF 0007 0003 ) obj = array[i]
0009 > tab ALOAD 0008 /
0010 > str SLOAD #2 msg
0011 ptr HREF 0009 0010 \
0012 > fun HLOAD 0011 / method = obj[msg]
0013 > fun FRAME 0012 test.lua:9 specialize to method foo
0014 > ptr UREFO test.lua:9 #0 \
0015 > num ULOAD 0014 \ inlined
0016 + num ADD 0015 +1 / method foo
0017 num USTORE 0014 0016 /
0018 + int ADD 0003 +1 i = i + 1
0019 > int LE 0018 0001 i < loop bound?
0020 ------ LOOP ------------
0021 > int ABC 0005 0018 array[i] bounds check
0022 ptr AREF 0007 0018 \
0023 > tab ALOAD 0022 / obj = array[i]
0024 ptr HREF 0023 0010 \
0025 > fun HLOAD 0024 / method = obj[msg]
0026 > fun FRAME 0025 test.lua:9 specialize to method foo
0027 + num ADD 0016 +1 \
0028 num USTORE 0014 0027 / inlined method foo
0029 + int ADD 0018 +1 i = i + 1
0030 > int LE 0029 0001 i < loop bound?
0031 int PHI 0018 0029
0032 num PHI 0016 0027
> 1. The dynamic dispatch is overall a killer for cross-routine
> optimization, but can parts of the method lookup be hoisted based
> on the fact that the key isn't changing within the loop? We still
> have to do the method lookup per object, but I would think that
> parts of it might be amenable to moving outside the loop.
The method lookup (0011/0012) is dependent on an invariant part
(0010 msg) and a variant part (0009 obj). This means it must be
variant and cannot be hoisted. It's duplicated inside the loop
(0024/0025).
The following method dispatch is specialized to the receiver
(0013), which allows inlining of the method body (0014-0017). But
the specialization itself cannot be hoisted (duplicated to 0026).
Often parts of the inlined method can still be hoisted (here
0027/0028 remain).
Since the method specialization cannot be hoisted, this leads to a
megamorphic dispatch inside the loop as soon as you use 'broadcast'
to send different messages. And there's nothing the compiler can do
about it due to the way the abstraction is written.
The Lua way to solve this is to use memoized code generation keyed
on the message. Table lookups with constant keys are faster, too.
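A minimal sketch of such memoization (the names `broadcasters` and `get_broadcaster` are illustrative, not from the original post): compile one specialized loop per message, so each generated closure performs its method lookup with a constant key. This assumes `msg` is a valid Lua identifier.

```lua
-- Cache of generated broadcast loops, keyed on the message name.
local broadcasters = {}

-- loadstring is Lua 5.1/LuaJIT; load takes a string in Lua 5.2+.
local loadstr = loadstring or load

local function get_broadcaster(msg)
  local b = broadcasters[msg]
  if not b then
    -- Generate a loop specialized to this message. The key is a
    -- compile-time constant in the generated chunk, so the lookup
    -- is cheaper and the dispatch stays monomorphic per closure.
    b = assert(loadstr(([[
      return function(array)
        for i=1,#array do
          local obj = array[i]
          obj.%s(obj)
        end
      end
    ]]):format(msg)))()
    broadcasters[msg] = b
  end
  return b
end
```

Call sites then use `get_broadcaster("foo")(t)` instead of `broadcast(t, "foo")`; each message pays the code-generation cost once.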
If you pass arrays of homogeneous objects, you have to tell the
compiler about it; it cannot infer that on its own. Manually hoist
the method lookup out of the loop.
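A sketch of that manual hoisting (the function name is hypothetical; the precondition is that every object in the array implements `msg` the same way):

```lua
-- Valid only for homogeneous arrays: all elements share one
-- implementation of msg, so one lookup stands in for all of them.
local function broadcast_homogeneous(array, msg)
  local n = #array
  if n == 0 then return end
  local method = array[1][msg]  -- hoisted: one hash lookup, outside the loop
  for i=1,n do
    method(array[i])            -- no per-iteration lookup or dispatch guard
  end
end
```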
> 2. Are there optimizations available for passing varargs of
> particular lengths to the next level -- particularly in the 0 and
> 1 cases? Beyond that I would expect a simple loop copying the
> values from stack frame to stack frame to probably win, but I
> could imagine some opportunities with respect to the probably not
> uncommon cases where the varargs list is short or empty.
If the receiver is a fixarg function, it would necessarily
specialize to the number of arguments. As I said, this is the easy
case for varargs.
The above argument about the megamorphic dispatch is completely
orthogonal to the vararg issue. The 'broadcast' abstraction is at
fault; the vararg problem is a side issue.
> 3. The array iteration itself is an obvious opportunity for some
> optimization, though would it be better to write it as an ipairs
> call?
As you can see from the trace, the compiler can already hoist
quite a bit of the iteration overhead. Do not use an iterator when
you can use a plain 'for' loop; the latter is much easier for the
compiler to optimize.
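For comparison, the two traversal styles (a generic illustration, not code from the thread): the numeric 'for' gives the compiler a known trip count and a simple induction variable, while ipairs goes through an iterator-protocol call per element.

```lua
-- Iterator version: each step is a call to the ipairs iterator.
local function sum_ipairs(array)
  local s = 0
  for _, v in ipairs(array) do s = s + v end
  return s
end

-- Plain numeric loop: a counted loop with direct indexing,
-- which is the easy case for trace compilation.
local function sum_numeric(array)
  local s = 0
  for i=1,#array do s = s + array[i] end
  return s
end
```

Both compute the same result for array-like tables without nil holes; only the loop form differs.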
> 4. If we keep broadcasting the same message, are there
> opportunities to optimize based on that? That risks explosive
> growth if we don't routinely use the same message, so are there
> ways that the code could be structured to indicate that
> specialization based on message is likely to be useful?
Use memoized code generation, see above.
> In particular, consider the case where we are sending a
> particular message with 0 or 1 arguments and most objects will
> have the same implementation for that message. Can a tracing JIT
> pick this up?
Yes, it will automatically do this. But as you can see from above:
if you use inhomogeneous objects, the method lookup cannot be
hoisted. C++ would use a vtable dispatch in this case.
> Are there ways to help it pick it up? For example, if I've got a
> recompute graph, I might want to do something like:
>
> broadcast( self.dependents, "markDirty" )
>
> markDirty may be dynamic to allow for overrides, but most of the
> time it will probably execute fairly standard behavior. It would
> presumably be good to have this result in optimized code for the
> common case.
Performance of dynamic dispatch depends on the predictability of
the dispatch branch. If the dominant receiver is selected >90% of
the time, it's probably ok. If 10 receivers are selected with 10%
probability each, consider a redesign.
--Mike