lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


That's surprizing: such fast hash is too much easy to create collisions in very common cases, like indexing keys that have many common prefixes and suffixes.
This then suggests that applications need to create their own objects to compute or store their own hashes, in addition to the string value of the object.
That's why I suggested that tables could have their own meta function (defined in their metatable) for its accessors, that would compute the hash they need. String objects contain a storage field for their computed hash, but if a string was hashed by a different function, it can no longer be interned with others using a different hashing function.
The choice of the "middle" 4 characters for strings that are longer than 12 may be quite arbitrary (I suppose that for all strings with at most 12 chars, they are simply fully hashed in a row, or GUIDs).
The common cases where this won't work is for various database objects (e.g. indexing timestamps if they are not compacted in a binary number format)

The only solution I see would be that applications will compute the hash of the string and will prepend this hash in the string itself (but they have to be aware of the placement of significant bytes). And the 12 bytes limit means that hashes cannot distinguish more than 96-bit; most secure hashes are longer (at least 128 bit). If the application computes a secure hash, it has to compact it to 96bits by using XOR at least for MD5/SHA1 (for SHA2 or SHA3 this is probably not needed, you can just truncate the secure hash, but there's no use os SHA2/SHA3 in that case).





Le jeu. 14 mai 2020 à 02:08, Javier Guerra Giraldez <javier@guerrag.com> a écrit :
On Wed, 13 May 2020 at 18:11, Andrea <andrea.l.vitali@gmail.com> wrote:

> - is there a limit on the max length for strings to be interned? (it seems that LuaJIT applies interning only on short strings)

LuaJIT hashes and interns all strings, but uses only 12 bytes of each
(first 4, last 4 and 4 at the middle.  these might overlap on short
strings, of course), so the hash time is O(1), but trivial to get a
collision, even by accident.


> - it seems the hash function does not look at all characters in the string, or does it? (not looking at all characters increase speed but also collisions)

Lua interns only "short" strings, (40 bytes by default, IIRC), other
strings are only interned when used as table keys (and when comparing?
not sure).  at that point, i think it uses all bytes (could be wrong),
but it won't "deduplicate" it, so "long" strings are never "fully
interned"


in short, you have it almost right, just reversed between Lua and LuaJIT

--
Javier
_______________________________________________
lua-l mailing list -- lua-l@lists.lua.org
To unsubscribe send an email to lua-l-leave@lists.lua.org
_______________________________________________
lua-l mailing list -- lua-l@lists.lua.org
To unsubscribe send an email to lua-l-leave@lists.lua.org