And in that case there's never any need to allocate an extra buffer: short strings are simply copied by value rather than by reference. This considerably boosts the speed of various Lua libraries that use very short strings, notably including the Lua parser and compiler themselves, and greatly reduces the stress on the garbage collector (note that these short strings no longer need any complex hashing for use as keys in tables). The same applies to most integer keys, which can be represented exactly as IEEE 64-bit doubles; and I still allow special floating-point values, notably denormals, signed zeroes, the two infinities, and large ranges of NaN values (signaling and non-signaling, also keeping their sign bit; only half of the non-signaling NaN values are used to represent all other objects, including other strings and reference types, where a smaller "object identifier" replaces 64-bit pointers, with just one small alignment constraint, which also offers advantages).
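To make the idea concrete, here is a minimal NaN-boxing sketch of my own (it is NOT the actual layout described above: the tag bit, the 5-byte capacity and the 3-bit length field are all hypothetical choices for illustration). Plain doubles are stored as-is; short strings ride inside the quiet-NaN payload, copied by value with no heap allocation:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Any 64-bit pattern whose top 13 bits are all 1s is a quiet NaN with the
 * sign bit set; the remaining payload bits can carry a tag plus data.    */
#define BOX_MASK     0xFFF8000000000000ull /* quiet-NaN prefix              */
#define TAG_SHORTSTR 0x0004000000000000ull /* hypothetical short-string tag */

typedef uint64_t Value;

static Value box_double(double d) {
    Value v;
    memcpy(&v, &d, sizeof v);   /* ordinary doubles are stored unchanged */
    return v;
}

static int is_double(Value v) {
    /* anything not matching the boxed quiet-NaN pattern is a real double */
    return (v & BOX_MASK) != BOX_MASK;
}

/* Pack up to 5 bytes of string data plus a 3-bit length into the payload:
 * the string is a value, never a reference, never hashed or allocated.   */
static Value box_shortstr(const char *s, size_t len) {
    Value payload = (Value)len;             /* low bits hold the length */
    for (size_t i = 0; i < len; i++)
        payload |= (Value)(unsigned char)s[i] << (8 * (i + 1));
    return BOX_MASK | TAG_SHORTSTR | payload;
}

static size_t unbox_shortstr(Value v, char *out) {
    size_t len = v & 0x7;
    for (size_t i = 0; i < len; i++)
        out[i] = (char)((v >> (8 * (i + 1))) & 0xFF);
    out[len] = '\0';
    return len;
}
```

Comparing two such short strings for equality is then a single 64-bit integer comparison, which is also why table keys no longer need hashing.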
This is a complete success (and already in production: it passes all existing Lua self-tests). It currently targets 64-bit platforms.
A test using IEEE 32-bit floats is in progress for targeting 32-bit platforms, but it adds a constraint on the maximum size of the user memory space for each object type. For most practical cases the limit is not reached, and I still get a measurable speed improvement, but the code is not yet optimal and some worst cases still occur where performance could be a bit slower. However, short strings encoded as values rather than references are smaller: only 2 bytes at most for now, but I think I could encode some interesting ranges of 3-byte strings (notably ASCII-only), and some 4-byte strings using a subset of ASCII such as digits, letters, dot and minus-hyphen, plus another subset for hexadecimal strings (this test code is not finished). This 32-bit code is much trickier, even though I want it to remain general-purpose for most practical uses.
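The trick that makes 4-byte strings fit is treating them as a base-N number over a restricted alphabet. A sketch of that packing, under assumed choices (a 38-symbol alphabet of digits, lowercase letters, dot and hyphen; the real alphabets and widths are still undecided):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical restricted alphabet: 38 symbols, so 4 characters fit in
 * 38^4 = 2,085,136 codes, i.e. 21 bits -- small enough for a float payload. */
static const char ALPHA[] = "0123456789abcdefghijklmnopqrstuvwxyz.-";
#define BASE 38

/* Pack a 4-char string as a base-38 integer; UINT32_MAX means "not
 * encodable", so the caller falls back to the reference encoding.   */
static uint32_t pack4(const char *s) {
    uint32_t code = 0;
    for (int i = 0; i < 4; i++) {
        if (s[i] == '\0') return UINT32_MAX;
        const char *p = strchr(ALPHA, s[i]);
        if (!p) return UINT32_MAX;
        code = code * BASE + (uint32_t)(p - ALPHA);
    }
    return code;
}

static void unpack4(uint32_t code, char out[5]) {
    for (int i = 3; i >= 0; i--) {
        out[i] = ALPHA[code % BASE];
        code /= BASE;
    }
    out[4] = '\0';
}
```

A hexadecimal-only alphabet (16 symbols) would do even better: 4 characters in exactly 16 bits, which is why a separate subset for hex strings looks worthwhile.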
Another thing I have improved is the object allocator: it uses separate memory pools for small objects, segregated by size.
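The principle of such a size-segregated pool allocator can be sketched as follows (an assumed minimal design, not the actual allocator: single-threaded, per-class free lists refilled in chunks, so allocation is a pop and deallocation a push):

```c
#include <assert.h>
#include <stdlib.h>

#define NCLASSES 8   /* size classes: 8, 16, ..., 64 bytes */
#define CHUNK    64  /* objects carved out per refill      */

typedef union FreeNode {
    union FreeNode *next;
    char pad[8];                    /* every class is at least 8 bytes */
} FreeNode;

static FreeNode *freelist[NCLASSES];

static void *pool_alloc(size_t size) {
    size_t c = (size + 7) / 8 - 1;          /* round up to an 8-byte class */
    if (c >= NCLASSES) return malloc(size); /* large objects: fall back    */
    if (!freelist[c]) {                     /* refill: carve one big block */
        size_t osize = (c + 1) * 8;
        char *block = malloc(osize * CHUNK);
        if (!block) return NULL;
        for (size_t i = 0; i < CHUNK; i++) {
            FreeNode *n = (FreeNode *)(block + i * osize);
            n->next = freelist[c];
            freelist[c] = n;
        }
    }
    FreeNode *n = freelist[c];              /* common case: just a pop */
    freelist[c] = n->next;
    return n;
}

static void pool_free(void *p, size_t size) {
    size_t c = (size + 7) / 8 - 1;
    if (c >= NCLASSES) { free(p); return; }
    FreeNode *n = p;                        /* and freeing is just a push */
    n->next = freelist[c];
    freelist[c] = n;
}
```

Because all objects in a pool share one size, there is no per-object header and no fragmentation within a chunk, which is exactly what helps small, short-lived objects.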
The next thing I will work on is data compression, notably for Unicode text. Basically, the code will be able to choose between several internal UTF forms; string values in Lua will still NOT be restricted to valid text in a valid UTF form. However, I need to reduce the scanner time needed to detect strings that are not in any known valid UTF form, so that I can fall back quickly to a generic form using unconstrained byte values that will not be compressed/decompressed. Some tuning is also needed to choose the size limit, which should probably not exceed 4 kibibytes (i.e. a single memory page), and maybe an autotuning algorithm that keeps statistics on string sizes, plus possibly a way for the garbage collector to dynamically reencode/compress/decompress strings with higher frequencies of use; the caches will also use weak-referenced pages from segregated memory pools.
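The fallback decision hinges on that validity pre-scan. A straightforward (not yet SIMD-optimized) sketch of such a scanner for UTF-8, rejecting overlong forms, surrogates and out-of-range code points so that anything else drops to the raw-bytes form:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when s[0..n) is well-formed UTF-8, 0 otherwise. */
static int is_valid_utf8(const unsigned char *s, size_t n) {
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t len;
        unsigned min;          /* smallest code point for this length
                                  (rejects overlong encodings)        */
        if (b < 0x80) { i++; continue; }             /* ASCII fast path */
        else if ((b & 0xE0) == 0xC0) { len = 2; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; min = 0x10000; }
        else return 0;                               /* stray byte */
        if (i + len > n) return 0;                   /* truncated  */
        unsigned cp = b & (0x7F >> len);
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0; /* bad continuation */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))          /* no surrogates */
            return 0;
        i += len;
    }
    return 1;
}
```

The ASCII fast path is the part worth optimizing hard (e.g. checking 8 bytes at once for high bits), since typical source code and keys are pure ASCII and should bail out of the scanner almost immediately.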
Also, I no longer use the basic hash tables implemented in core Lua at all: ALL tables now use B-trees, with no segregation between an indexed part and a hash part. They are much more efficient, including for garbage collection, use less memory, and have much better locality (i.e. they are faster in caches); the undesirable side effects of the existing segregation are unacceptable and make the code very fragile (and easily attackable). The pairs() and ipairs() iterators still work as expected. I am still working on security (i.e. avoiding attacks on the caches: they still use a basic LRU strategy, which I already know is attackable; I will redesign the caches using security domains, so that no concurrent thread can flush the cache used by other critical threads or detect side channels, and the caches will also use pools). I'm also designing and tuning several storage formats for pages in the B-tree (notably, pages contain BOTH the keys and the values, except values encoded by references, so individual rows may vary in size, and I want to maximize the fill level of each page, except possibly the root page). Finally, I am working on table constructors, to get faster serialization/deserialization of large datasets for use with common libraries such as those handling JSON data.
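To illustrate one design point (fixed-size integer rows only; the real pages hold variable-size rows, and all sizes here are assumed): a B-tree leaf page keeping keys and values together in sorted rows, so a lookup is one binary search inside a single cache-friendly block, and in-order iteration for pairs() comes for free:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_ROWS 32   /* fan-out chosen so a page stays cache-friendly */

typedef struct {
    int     n;                  /* rows in use, kept sorted by key       */
    int64_t key[PAGE_ROWS];
    int64_t val[PAGE_ROWS];     /* inline value (a reference type would
                                   store an object identifier instead)  */
} LeafPage;

/* Binary search within the page: row index of k, or -1 if absent. */
static int leaf_find(const LeafPage *p, int64_t k) {
    int lo = 0, hi = p->n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (p->key[mid] == k) return mid;
        if (p->key[mid] < k) lo = mid + 1; else hi = mid - 1;
    }
    return -1;
}

/* Sorted insert; returns 0 when the page is full and must be split
 * by the (omitted) interior-node logic. */
static int leaf_insert(LeafPage *p, int64_t k, int64_t v) {
    if (p->n == PAGE_ROWS) return 0;
    int i = p->n;
    while (i > 0 && p->key[i - 1] > k) {   /* shift larger rows right */
        p->key[i] = p->key[i - 1];
        p->val[i] = p->val[i - 1];
        i--;
    }
    p->key[i] = k;
    p->val[i] = v;
    p->n++;
    return 1;
}
```

Since key and value sit in the same row, one cache-line fetch serves both, which is where the locality win over segregated array/hash parts comes from.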
For tuning parameters, I hope to find a reasonable strategy that allows automatic tuning (i.e. collecting some runtime statistics and adapting the parameters automatically) without needing any special code in Lua applications. Extensive security testing is needed, however, so that changes of tuning parameters cannot have exploitable side effects (I will probably add some randomness to the threshold conditions under which tuning parameters may change at runtime; but I still want the system to be reactive and adapt itself relatively fast, i.e. converge to a stable solution in less than about one hour).
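The two ingredients above can be sketched in a few lines (everything here is a hypothetical illustration, not the planned implementation: a slow exponential moving average for stable convergence, plus random jitter on the comparison so the exact flip point of a threshold cannot be probed from outside):

```c
#include <assert.h>
#include <stdlib.h>

static double avg = 0.0;                /* runtime statistic being tracked */

/* Exponential moving average: a small factor converges slowly but
 * stably, so one burst of adversarial samples cannot swing it far. */
static void observe(double sample) {
    avg += 0.01 * (sample - avg);
}

/* Threshold test with +/-5% random jitter: near the boundary the
 * outcome is noisy, hiding the exact decision point from probing. */
static int above_threshold(double x, double threshold) {
    double j = 1.0 + 0.05 * ((double)rand() / RAND_MAX * 2.0 - 1.0);
    return x > threshold * j;
}
```

Far from the boundary the behavior is deterministic, so the system stays predictable for normal workloads while the jittered band only affects values that sit right at a tuning threshold.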
For the rest, the Lua engine code is almost the same. But it allows Lua to be integrated more safely (including as a library in other language environments or in webservers). Existing basic Lua uses too much memory and is too easily attackable: it crashes too often, even with "stable" Lua code and web pages, by exhausting maximum time or space quotas, depending on the current load on the servers, or on server capabilities when using a heterogeneous set of servers. You can already see this bad effect in Wikimedia, notably in WM Commons, even though it still uses an old version of Lua.