- Subject: Re: Computed goto optimization of vanilla Lua
- From: Jean-Luc Jumpertz <jean-luc@...>
- Date: Fri, 5 Feb 2016 15:29:05 +0100
> On 4 Feb 2016, at 19:24, Roberto Ierusalimschy <roberto@inf.puc-rio.br> wrote:
>
> Well, I still hope to get some feedback.
Well, I compared the "computed goto" patch with the vanilla Lua version this morning and got some interesting results.
The context:
- Mac OS X, 2012 Intel Core i7 processor, Xcode 7.2 / clang (corresponding to clang 3.7, I guess), -O2 -Os (the standard compile options used for a ‘release’ build)
- Lua 5.2.4
- Benchmark lua-image-ramp-bench.lua (see https://gist.github.com/jlj/de7d8be6f1160ea2963c), which exercises various Lua opcodes, including lots of table creation, table gets, table sets, and GC activity.
- Low-overhead profiling using Xcode Instruments
Tests have been run multiple times inside my CodeFlow IDE, with CALL and RETURN hooks active, so the values given here are average measurements.
1) Vanilla Lua (without computed gotos)
==============================
Running Time Self (ms) Symbol Name
------------------------------------------------------------
19558.0ms 99.4% 2239.0 luaV_execute
 7292.0ms 37.0%  144.0 luaC_forcestep
 4380.0ms 22.2%  193.0 luaD_precall
 1788.0ms  9.0%  138.0 luaH_resize
 1438.0ms  7.3%  426.0 luaV_settable
 1230.0ms  6.2%   59.0 luaH_new
 1107.0ms  5.6%  407.0 luaV_gettable
   43.0ms  0.2%    0.0 <Unknown Address>
   24.0ms  0.1%   24.0 luaO_fb2int
   10.0ms  0.0%   10.0 luaC_step
    5.0ms  0.0%    5.0 luaH_get
    2.0ms  0.0%    2.0 luaC_barrierback_
Some highlights on this profiling result:
- luaV_execute runs for 19558.0 ms (99.4% of total profile time), of which 2239.0 ms is spent internally;
- the other functions below are called from luaV_execute and are sorted by decreasing running time;
- luaC_forcestep represents most of the cost of the GC, due to the high number of short-lived tables created;
- luaD_precall has a rather high running cost, caused mainly by the activity of the debug hook; actually the only function called is math.floor (once per loop).
Inside luaV_execute, we can see what takes significant time:
---------------------------------------------------
53.38% vmcase(OP_NEWTABLE,
22.78% vmcase(OP_CALL,
7.55% vmcase(OP_SETTABUP,
5.93% vmcase(OP_GETTABUP,
1.60% ra = RA(i);
1.38% vmcase(OP_GETTABLE,
1.37% vmcase(OP_ADD,
1.31% vmcase(OP_MUL,
0.85% vmcase(OP_SETTABLE,
0.69% vmcase(OP_GETUPVAL,
0.52% Instruction i = *(ci->u.l.savedpc++);
0.49% lua_assert(base <= L->top && L->top < L->stack + L->stacksize);
0.46% vmdispatch (GET_OPCODE(i)) {
0.42% vmcase(OP_FORLOOP,
0.33% int counthook = ((mask & LUA_MASKCOUNT) && L->hookcount == 0);
0.29% base = ci->u.l.base;
0.23% vmcase(OP_LE,
0.18% lua_assert(base == ci->u.l.base);
0.17% if ((L->hookmask & (LUA_MASKLINE | LUA_MASKCOUNT)) &&
0.07% vmcase(OP_LOADNIL,
2) lvm.c modified with computed goto
============================
Running Time Self (ms) Symbol Name
------------------------------------------------------------
18589.0ms 99.3% 2024.0 luaV_execute
 7254.0ms 38.7%  136.0 luaC_forcestep
 4045.0ms 21.6%  227.0 luaD_precall
 1553.0ms  8.2%  129.0 luaH_resize
 1389.0ms  7.4%  455.0 luaV_settable
 1223.0ms  6.5%   59.0 luaH_new
 1016.0ms  5.4%  374.0 luaV_gettable
   40.0ms  0.2%    0.0 <Unknown Address>
   34.0ms  0.1%   34.0 luaO_fb2int
    7.0ms  0.0%    7.0 luaH_get
    4.0ms  0.0%    4.0 luaC_step
Highlights:
- the overall running time of luaV_execute is significantly reduced (18589.0 ms vs. 19558.0 ms, i.e. about 5%);
- the internal running time of luaV_execute is also reduced, by a smaller absolute amount (2024 ms vs. 2239 ms), but this is still a ~10% performance gain in the interpreter loop;
- where does the remaining ~800 ms of gain come from? I can't see any clear reason for it in the profiling info, so I would suspect better caching or branch prediction (to be confirmed by further benchmarks).
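On the branch-prediction hypothesis: with a plain `switch`, every opcode is dispatched through a single indirect branch, so the predictor shares one history slot across all opcodes. With computed gotos, every handler ends with its own `goto *disptab[...]`, giving the predictor a separate dispatch site per handler, which can learn opcode-to-opcode patterns. The contrast, sketched with macro names in the style of the patch (not the exact patch text):

```c
/* Plain-switch build: one dispatch point shared by all opcodes. */
#define vmdispatch(o)  switch (o)
#define vmcase(l)      case l:
#define vmbreak        break  /* back to the single switch at the loop top */

/* Computed-goto build: each handler re-dispatches from its own site. */
#define vmdispatch(o)  goto *disptab[o]
#define vmcase(l)      L_##l:
#define vmbreak        vmfetch(); vmdispatch(GET_OPCODE(i))
```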
And, if you are curious about it, here is how luaV_execute consumes running time in the computed-goto case:
---------------------------------------------------
39.25% checkGC(L, ra + 1);
21.84% if (luaD_precall(L, ra, nresults)) { /* C function? */
9.05% Protect(luaV_settable(L, ra, RKB(i), RKC(i)));
8.66% luaH_resize(L, t, luaO_fb2int(b), luaO_fb2int(c));
7.67% Protect(luaV_gettable(L, RB(i), RKC(i), ra));
6.58% Table *t = luaH_new(L);
1.45% arith_op(luai_numadd, TM_ADD);
1.08% arith_op(luai_nummul, TM_MUL);
0.78% } vmbreak; … after vmcase(OP_SETTABLE)
0.74% } vmbreak; … after vmcase(OP_GETTABLE)
0.51% } vmbreak; … after vmcase(OP_MUL)
0.40% int b = GETARG_B(i);
0.33% } vmbreak; … after vmcase(OP_ADD)
0.22% } vmbreak; … after vmcase(OP_CALL)
0.22% } vmbreak; … after vmcase(OP_NEWTABLE)
0.18% sethvalue(L, ra, t);
0.14% setobj2s(L, ra, cl->upvals[b]->v);
0.13% int nresults = GETARG_C(i) - 1;
0.13% } vmbreak; … after vmcase(OP_GETUPVAL)
0.12% lua_Number step = nvalue(ra+2);
0.10% } vmbreak; … after vmcase(OP_FORLOOP)
0.06% lua_Number limit = nvalue(ra+1);
0.05% if (b != 0) L->top = ra+b; /* else previous instruction set top */
0.05% if (luai_numlt(L, 0, step) ? luai_numle(L, idx, limit)
0.04% if (nresults >= 0) L->top = ci->top; /* adjust results */
0.04% ci->u.l.savedpc += GETARG_sBx(i); /* jump back */
0.03% setnvalue(ra+3, idx); /* ...and external index */
0.03% lua_Number idx = luai_numadd(L, nvalue(ra), step); /* increment index */
0.03% if (b != 0 || c != 0)
0.02% int c = GETARG_C(i);
0.02% int b = GETARG_B(i);
0.02% base = ci->u.l.base;
0.01% int b = GETARG_B(i);
0.01% vmdispatch (GET_OPCODE(i)) {
0.01% } vmbreak;
0.01% arith_op(luai_numsub, TM_SUB);
(If you are still reading at this point, you are very brave :-)
Next step for me: do similar measurements on ARM-based devices, and with the debug hook, to check if I can get the same level of improvement in real-world embedded Lua apps.
Note: it would be really interesting if other people could report profiling results on this topic…
Regards,
Jean-Luc