Ask yourself why these closure objects exist:
functions may be recursive (and not necessarily
through tail calls), so the variables captured in
closures may refer not to the immediate parent
caller's frame but to some ancestor frame an
arbitrary number of levels up the call stack. To
avoid every access to a closure variable having to
walk a chain of frames, a mapping is needed,
created and initialized before each call, to rebind
each variable for the next inner call (according to
the closure's prototype): such an object is
allocated only if the function uses external
variables that are not local to the function
itself, i.e. not directly within its stack window.
This also explains why the bytecode needs
separate opcodes for accessing registers and
upvalues: if upvalues could live directly on the
stack, it would be enough to reference them with
negative register indexes and treat the stack as
a sliding window, as C/C++ calling conventions do
(except that in C/C++ stack indexes are negative
for parameters and positive for local variables,
with some parameters cached in actual registers
but still backed by "shadow" variables on the
stack, allocated at fixed positions in the stack
frame by the compiler, or simply pushed before the
actual parameters and popped after the call to
restore those registers).
There have been performance tests showing that
closures are not so fast: they can create massive
amounts of garbage-collected objects (with
internal type LUA_TCLOSURE). I find this behavior
very curious, and the current implementation,
which allocates the LUA_TCLOSURE objects on the
heap, is not the best option: the mapping could be
allocated directly in the stack of local
variables/registers of the caller, and all the
closure objects used by the caller could be merged
into a single one, i.e. sized as the largest
closure object needed for inner calls, merged like
a union. The closure objects themselves do not
hold any variable values; they are just simple
mappings from a small fixed set of integers
(between 1 and the number of upvalues of the
called function) to variable integers (absolute
indexes in the thread's stack where the actual
variable is located).
The bytecode is not as optimized as it could
be: register numbers are only positive, and
upvalue numbers are also only positive, yet they
could form a single set (positive integers for
local registers, negative integers for upvalues,
the latter used to index the entries of the
closure object in order to reach the actual
variable located anywhere on the stack, outside
the immediate parent frame). The generated
bytecode is also suboptimal because various
operations can only work on registers or
constants (like ADD r1,r2,r3), so temporary
registers must be allocated by the compiler
(remember that the number of registers is limited).
As well, Lua's default engine treats all
registers the same, when most instructions could
work with a single r0 register (an "accumulator")
that would be implicit for most of them; this
would reduce the instruction size (currently 32
or 64 bits), which is inefficient as it puts too
much pressure on the CPU's L1 data cache.
I'm convinced that the current approach in the
existing Lua VM engine and its internal
instruction set can be largely improved for better
performance, without really changing the language
itself: better data locality (smaller instruction
sizes, fewer wasted unused bit fields) and
elimination of heap allocation for closures (to
dramatically reduce the stress on the garbage
collector).