- Subject: Re: Computed goto optimization of vanilla Lua
- From: Jean-Luc Jumpertz <jean-luc@...>
- Date: Fri, 5 Feb 2016 15:32:58 +0100
Sorry. There is a typo in the last sentence below. It should read:
Next step for me: do similar measurements on ARM-based devices, and WITHOUT debug hook, to check if I can get the same level of improvement on real-world embedded Lua apps.
Jean-Luc
> On 5 Feb 2016, at 15:29, Jean-Luc Jumpertz <jean-luc@celedev.eu> wrote:
>
>
>> On 4 Feb 2016, at 19:24, Roberto Ierusalimschy <roberto@inf.puc-rio.br> wrote:
>>
>> Well, I still hope to get some feedback.
>
> Well, I compared the "computed goto" patch with the vanilla Lua version this morning and got some interesting results.
>
> The context:
> - Mac OS X, 2012 Intel Core i7 processor, Xcode 7.2 / clang (corresponding to clang version 3.7, I guess), -O2 -Os (standard compile options used for ‘release’ build)
> - Lua 5.2.4
> - Benchmark lua-image-ramp-bench.lua (see https://gist.github.com/jlj/de7d8be6f1160ea2963c), which exercises various Lua opcodes, including lots of table creation, table get, table set, and GC.
> - Low-overhead profiling using Xcode Instruments
>
> Tests have been run multiple times inside my CodeFlow IDE, with CALL and RETURN hooks active. So the values given here are average measurements.
>
> 1) Vanilla Lua (without computed gotos)
> ==============================
>
> Running Time         Self (ms)  Symbol Name
> ------------------------------------------------------------
> 19558.0ms   99.4%       2239.0  luaV_execute
>  7292.0ms   37.0%        144.0  luaC_forcestep
>  4380.0ms   22.2%        193.0  luaD_precall
>  1788.0ms    9.0%        138.0  luaH_resize
>  1438.0ms    7.3%        426.0  luaV_settable
>  1230.0ms    6.2%         59.0  luaH_new
>  1107.0ms    5.6%        407.0  luaV_gettable
>    43.0ms    0.2%          0.0  <Unknown Address>
>    24.0ms    0.1%         24.0  luaO_fb2int
>    10.0ms    0.0%         10.0  luaC_step
>     5.0ms    0.0%          5.0  luaH_get
>     2.0ms    0.0%          2.0  luaC_barrierback_
>
> Some highlights on this profiling result:
> - luaV_execute runs for 19558.0ms (99.4% of total profile time), of which 2239.0ms is spent in its own code (self time).
> - the other functions listed below it are called from luaV_execute and are sorted by decreasing running time
> - luaC_forcestep represents most of the cost of the GC, due to the high number of short-lived tables created
> - luaD_precall has a rather high running cost, caused mainly by the activity of the debug hook. The only function actually called is math.floor (once per loop iteration)
>
> Inside luaV_execute, we can see what takes significant time:
> ---------------------------------------------------
> 53.38% vmcase(OP_NEWTABLE,
> 22.78% vmcase(OP_CALL,
> 7.55% vmcase(OP_SETTABUP,
> 5.93% vmcase(OP_GETTABUP,
> 1.60% ra = RA(i);
> 1.38% vmcase(OP_GETTABLE,
> 1.37% vmcase(OP_ADD,
> 1.31% vmcase(OP_MUL,
> 0.85% vmcase(OP_SETTABLE,
> 0.69% vmcase(OP_GETUPVAL,
> 0.52% Instruction i = *(ci->u.l.savedpc++);
> 0.49% lua_assert(base <= L->top && L->top < L->stack + L->stacksize);
> 0.46% vmdispatch (GET_OPCODE(i)) {
> 0.42% vmcase(OP_FORLOOP,
> 0.33% int counthook = ((mask & LUA_MASKCOUNT) && L->hookcount == 0);
> 0.29% base = ci->u.l.base;
> 0.23% vmcase(OP_LE,
> 0.18% lua_assert(base == ci->u.l.base);
> 0.17% if ((L->hookmask & (LUA_MASKLINE | LUA_MASKCOUNT)) &&
> 0.07% vmcase(OP_LOADNIL,
>
>
>
> 2) lvm.c modified with computed goto
> ============================
>
> Running Time         Self (ms)  Symbol Name
> ------------------------------------------------------------
> 18589.0ms   99.3%       2024.0  luaV_execute
>  7254.0ms   38.7%        136.0  luaC_forcestep
>  4045.0ms   21.6%        227.0  luaD_precall
>  1553.0ms    8.2%        129.0  luaH_resize
>  1389.0ms    7.4%        455.0  luaV_settable
>  1223.0ms    6.5%         59.0  luaH_new
>  1016.0ms    5.4%        374.0  luaV_gettable
>    40.0ms    0.2%          0.0  <Unknown Address>
>    34.0ms    0.1%         34.0  luaO_fb2int
>     7.0ms    0.0%          7.0  luaH_get
>     4.0ms    0.0%          4.0  luaC_step
>
> Highlights:
> - the overall running time of luaV_execute is noticeably reduced (18589.0ms vs. 19558.0ms, i.e. about 5%);
> - the internal (self) running time of luaV_execute is reduced too, by a smaller absolute amount (2024ms vs. 2239ms, i.e. 215ms), but this is still a gain of about 10% in the interpreter loop itself;
> - where does the remaining gain of about 750ms come from? I can’t see any clear reason for this in the profiling info, so I would suspect better caching or branch prediction (to be confirmed by further benchmarks).
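The percentages in the highlights above can be re-derived from the two profile tables. Here is a minimal sketch of that arithmetic; the function names are made up for illustration and the inputs are simply the luaV_execute rows of the two tables.

```c
#include <assert.h>

/* Relative gain, in percent, when a running time drops from before_ms to after_ms. */
double gain_percent(double before_ms, double after_ms) {
    assert(before_ms > 0.0);
    return 100.0 * (before_ms - after_ms) / before_ms;
}

/* Part of the total gain that is NOT explained by the function's own (self) time,
 * i.e. the gain that shows up in the functions it calls. */
double gain_outside_self_ms(double total_before_ms, double total_after_ms,
                            double self_before_ms, double self_after_ms) {
    return (total_before_ms - total_after_ms) - (self_before_ms - self_after_ms);
}
```

With the measured values, gain_percent(19558.0, 18589.0) is just under 5%, gain_percent(2239.0, 2024.0) is about 9.6%, and gain_outside_self_ms(19558.0, 18589.0, 2239.0, 2024.0) is 754 ms.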
>
> And, if you are curious about it, here is how luaV_execute consumes running time in this (computed goto) case:
> ---------------------------------------------------
> 39.25% checkGC(L, ra + 1);
> 21.84% if (luaD_precall(L, ra, nresults)) { /* C function? */
> 9.05% Protect(luaV_settable(L, ra, RKB(i), RKC(i)));
> 8.66% luaH_resize(L, t, luaO_fb2int(b), luaO_fb2int(c));
> 7.67% Protect(luaV_gettable(L, RB(i), RKC(i), ra));
> 6.58% Table *t = luaH_new(L);
> 1.45% arith_op(luai_numadd, TM_ADD);
> 1.08% arith_op(luai_nummul, TM_MUL);
> 0.78% } vmbreak; … after vmcase(OP_SETTABLE)
> 0.74% } vmbreak; … after vmcase(OP_GETTABLE)
> 0.51% } vmbreak; … after vmcase(OP_MUL)
> 0.40% int b = GETARG_B(i);
> 0.33% } vmbreak; … after vmcase(OP_ADD)
> 0.22% } vmbreak; … after vmcase(OP_CALL)
> 0.22% } vmbreak; … after vmcase(OP_NEWTABLE)
> 0.18% sethvalue(L, ra, t);
> 0.14% setobj2s(L, ra, cl->upvals[b]->v);
> 0.13% int nresults = GETARG_C(i) - 1;
> 0.13% } vmbreak; … after vmcase(OP_GETUPVAL)
> 0.12% lua_Number step = nvalue(ra+2);
> 0.10% } vmbreak; … after vmcase(OP_FORLOOP)
> 0.06% lua_Number limit = nvalue(ra+1);
> 0.05% if (b != 0) L->top = ra+b; /* else previous instruction set top */
> 0.05% if (luai_numlt(L, 0, step) ? luai_numle(L, idx, limit)
> 0.04% if (nresults >= 0) L->top = ci->top; /* adjust results */
> 0.04% ci->u.l.savedpc += GETARG_sBx(i); /* jump back */
> 0.03% setnvalue(ra+3, idx); /* ...and external index */
> 0.03% lua_Number idx = luai_numadd(L, nvalue(ra), step); /* increment index */
> 0.03% if (b != 0 || c != 0)
> 0.02% int c = GETARG_C(i);
> 0.02% int b = GETARG_B(i);
> 0.02% base = ci->u.l.base;
> 0.01% int b = GETARG_B(i);
> 0.01% vmdispatch (GET_OPCODE(i)) {
> 0.01% } vmbreak;
> 0.01% arith_op(luai_numsub, TM_SUB);
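For readers who have not looked at the patch: vmdispatch, vmcase and vmbreak in the listings above are the dispatch macros of lvm.c. The sketch below is NOT the patch itself. It is a toy three-opcode interpreter (the opcodes, run and disptab are invented for illustration) showing how macros with those names could select between a plain switch and computed goto, assuming the "labels as values" extension offered by GCC and clang.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy accumulator VM (not Lua's opcodes). */
enum { OP_INC, OP_DOUBLE, OP_HALT, NUM_OPCODES };

/* Computed goto relies on the GCC/clang "labels as values" extension. */
#if defined(__GNUC__) || defined(__clang__)
#define USE_COMPUTED_GOTO 1
#else
#define USE_COMPUTED_GOTO 0
#endif

long run(const unsigned char *code) {
    long acc = 0;
    size_t pc = 0;
#if USE_COMPUTED_GOTO
    /* One label address per opcode, indexed by opcode value. */
    static void *const disptab[NUM_OPCODES] = {
        &&L_OP_INC, &&L_OP_DOUBLE, &&L_OP_HALT
    };
#define vmdispatch(o) goto *disptab[o];
#define vmcase(l)     L_##l:
    /* Each handler fetches and jumps to the next opcode by itself,
     * so every opcode gets its own indirect branch. */
#define vmbreak \
    do { op = code[pc++]; assert(op < NUM_OPCODES); goto *disptab[op]; } while (0)
#else
    /* Portable fallback: one shared switch at the top of the loop. */
#define vmdispatch(o) switch (o)
#define vmcase(l)     case l:
#define vmbreak       break
#endif
    for (;;) {
        unsigned char op = code[pc++];
        assert(op < NUM_OPCODES);
        vmdispatch(op) {
            vmcase(OP_INC)    { acc += 1; vmbreak; }
            vmcase(OP_DOUBLE) { acc *= 2; vmbreak; }
            vmcase(OP_HALT)   { return acc; }
        }
    }
}
```

With the program { OP_INC, OP_INC, OP_DOUBLE, OP_INC, OP_HALT }, run returns 5 under either dispatch strategy. The only difference is where the indirect jump is issued from: once per handler with computed goto, versus a single shared jump with the switch, which is why branch prediction is a plausible suspect for the gains measured above.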
>
>
> (If you are still reading at this point, you are very brave :-)
>
>
> Next step for me: do similar measurements on ARM-based devices, and with debug hook, to check if I can get the same level of improvement on real-world embedded Lua apps.
>
> Note: it would be really interesting if other people could report profiling results on this topic…
>
> Regards,
> Jean-Luc
>
>