

And if a small fixed number of tables with varying sizes is still not enough for the goals, you can extend this scheme to use B-trees: each page of a B-tree can be a sequential or hashed subtable, and pages can have a predefined maximum size, storing for example up to 1024 values in sequential subtables, or up to 512 values in hashed subtables. (In pages of B-trees used as hashed subtables, the mapping from a key's hash value to a slot should use different bits of the hash value depending on the page's depth in the tree: rehashing keys will then be needed only for keys that are moved up or down from one level to another during B-tree management.)
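To make this concrete, here is a minimal C sketch of such a page, assuming the illustrative capacities above (1024 sequential / 512 hashed) and a depth-dependent choice of hash bits; the names and layout are invented for the example, not taken from any actual implementation:

    #include <stdint.h>

    #define SEQ_PAGE_CAP  1024   /* values per sequential page (from the text) */
    #define HASH_PAGE_CAP 512    /* key/value pairs per hashed page */
    #define HASH_BITS     9      /* log2(HASH_PAGE_CAP): hash bits used per level */

    typedef enum { PAGE_SEQ, PAGE_HASH } PageKind;

    typedef struct Page {
        PageKind kind;
        int depth;                       /* level in the B-tree, root = 0 */
        union {
            struct {                     /* sequential page: dense integer range */
                int64_t first_key;       /* covers first_key .. first_key+count-1 */
                int count;
                void *values[SEQ_PAGE_CAP];
            } seq;
            struct {                     /* hashed page: stores keys and values */
                void *keys[HASH_PAGE_CAP];
                void *values[HASH_PAGE_CAP];
            } hash;
        } u;
    } Page;

    /* Each depth reads a different 9-bit window of the same 32-bit hash, so
       a key needs rehashing only when a split or merge moves it to another
       level; a real implementation would also handle trees deeper than the
       32-bit hash allows, e.g. by extending the hash. */
    static int slot_for(const Page *p, uint32_t hash) {
        return (int)((hash >> (p->depth * HASH_BITS)) & (HASH_PAGE_CAP - 1));
    }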

B-trees with fixed but tunable page sizes and tunable fill factors (for all depths, or for some specific depths outside the root level) are used in relational databases for their storage. Their advantage is that they minimize the number of reshaping operations needed, and they can also optimize how keys in the same page are represented: for string keys, their common initial part may be compacted, allowing more keys to be stored in that page depending on each key's length and how their values are distinguished. Database engines can also use data compression in each page to pack large volumes of keys into their indexes, and they may store some or all values (not necessarily only other indexed keys) inside the same page for "clustered" indexes, by dedicating a minimum reserved size to the keys and placing values, if possible, in the rest of the page instead of in separate data-only pages.
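As a small, runnable illustration of that common-prefix compaction (the keys and function names here are invented for the example):

    #include <stdio.h>
    #include <string.h>

    /* Length of the longest prefix shared by every key in the set. */
    static size_t common_prefix_len(const char **keys, int n) {
        size_t len = strlen(keys[0]);
        for (int i = 1; i < n; i++) {
            size_t j = 0;
            while (j < len && keys[i][j] == keys[0][j]) j++;
            len = j;
        }
        return len;
    }

    int main(void) {
        const char *keys[] = { "customer_id", "customer_name", "customer_zip" };
        int n = 3;
        size_t plen = common_prefix_len(keys, n);   /* "customer_" */
        size_t raw = 0, packed = plen;
        for (int i = 0; i < n; i++) {
            raw += strlen(keys[i]);
            packed += strlen(keys[i]) - plen;       /* store only each suffix */
        }
        printf("shared prefix: %.*s  raw: %zu bytes, compacted: %zu bytes\n",
               (int)plen, keys[0], raw, packed);
        return 0;
    }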


On Thu, Apr 16, 2020 at 8:58 PM, Philippe Verdy <verdyp@gmail.com> wrote:
There are, however, good strategies intended to limit the volume of rehashing.
For example, instead of using a single hash table with a fixed size, you may want to use a set of subtables with different sizes (a sketch of the lookup follows this list):
- Each subtable will be used in the same order in which it was created. Their sizes do not necessarily grow. Each one has its own fill factor thresholds (minimum and maximum) defined according to its allocated size.
- Each subtable may be of one of two types: sequential (not storing keys when they are integers in a contiguous range), or hashed. Integer keys will be looked up in the sequential subtables (if they are in their range), otherwise in the hashed subtables; all other keys (including integer keys not matching the ranges of the sequential subtables) will live only in one of the hashed subtables (which store the keys and values, not just the values).
- When one of the hashed subtables exceeds its maximum threshold (meaning it has collision lists that are too long), you may want to rehash a part of its keys into one of the other subtables: first try relocating the value into one of the sequential subtables, if their range fits an integer key; otherwise relocate the (key, value) pair into one of the other hashed subtables that does not exceed its threshold. If all subtables have reached their maximum threshold, you'll need to merge the two smallest tables into a larger one to make room for a new hashed subtable.
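A minimal, self-contained C sketch of this lookup order, restricted to integer keys and toy sizes to stay short (all names are invented; a real version would handle arbitrary key types, and collision chains rather than the linear probing used here for brevity):

    #include <stdint.h>

    #define HSIZE 8   /* toy hashed-subtable size (power of two) */

    typedef struct { int64_t min_key; int size; int64_t values[16]; } SeqSub;
    typedef struct { int used[HSIZE]; int64_t keys[HSIZE], values[HSIZE]; } HashSub;

    static uint32_t hash_i64(int64_t k) {           /* toy multiplicative hash */
        return (uint32_t)((uint64_t)k * 2654435761u);
    }

    static int hash_get(const HashSub *h, int64_t key, int64_t *out) {
        int i = (int)(hash_i64(key) & (HSIZE - 1));
        for (int probe = 0; probe < HSIZE; probe++) {
            int j = (i + probe) & (HSIZE - 1);
            if (!h->used[j]) return 0;
            if (h->keys[j] == key) { *out = h->values[j]; return 1; }
        }
        return 0;
    }

    /* Lookup order from the list above: sequential ranges first for integer
       keys, then the hashed subtables in creation order. */
    int multi_get(const SeqSub *seq, int nseq, const HashSub *hsh, int nhsh,
                  int64_t key, int64_t *out) {
        for (int i = 0; i < nseq; i++)
            if (key >= seq[i].min_key && key < seq[i].min_key + seq[i].size) {
                *out = seq[i].values[key - seq[i].min_key];
                return 1;
            }
        for (int i = 0; i < nhsh; i++)
            if (hash_get(&hsh[i], key, out)) return 1;
        return 0;   /* absent keys read as nil */
    }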

When unsetting a key's value to nil, its previous slot in one of the subtables may become free:
- If that slot was in a sequential subtable, that subtable may become larger than needed, so that its minimum threshold is no longer met: you may want to truncate the table if there are enough nil slots at the start or end of the sequential subtable to reduce the range of integer keys it represents (the minimum integer key value is stored; the maximum is implied by the minimum and the subtable size; see the trimming sketch after this list). Otherwise you could split the table in two parts, merging one part into another sequential subtable whose range fits some of the keys, or moving the remaining isolated elements into the hashed subtables, before reducing the sequential subtable's size.
- If that slot was in a hashed subtable, that subtable may also end up with too many free slots for its minimum fill factor threshold, and here again you may want to determine whether its keys should be merged into the other sequential subtables (if the keys fit) or into other hashed subtables (if their current capacity allows it without exceeding their maximum threshold), or whether another hashed subtable should be merged into it. Which subtable you merge into another should be based on the number of keys they currently store. This can then allow one of the hashed subtables to be freed.
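For the truncation test mentioned in the first point, a sketch (illustrative names; the caller would then shrink the allocation and relocate any survivors as described):

    #include <stddef.h>

    typedef struct {
        long long min_key;   /* smallest integer key in the covered range */
        int size;            /* covers min_key .. min_key + size - 1 */
        void **values;       /* NULL slot = nil */
    } SeqSub;

    /* Compute the window that remains after stripping nil slots from both
       ends of the sequential subtable. */
    void seq_trim(const SeqSub *s, int *first, int *count) {
        int lo = 0, hi = s->size;
        while (lo < hi && s->values[lo] == NULL) lo++;       /* leading nils */
        while (hi > lo && s->values[hi - 1] == NULL) hi--;   /* trailing nils */
        *first = lo;          /* offset of first kept slot */
        *count = hi - lo;     /* new size; new min_key = min_key + lo */
    }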

All in all, the work needed for rehashing will be more limited, and the tables will auto-adjust themselves to optimize the sequences (including for the current worst case where a table is filled sequentially but in backward order, where almost everything goes to the hashed subtable). Fill factor thresholds in each subtable will help keep the representation compact, and the chains for key lookup will be minimized.


On Thu, Apr 16, 2020 at 7:48 PM, Philippe Verdy <verdyp@gmail.com> wrote:
In hashed tables, the main slot for a hash may already be used by keys with other hashes (not just because of collisions causing keys to be chained to another free slot, but also because the size of the hash table is typically much smaller than the full set of possible hash values: for example, when hash values are 32 bits, this does not mean you need a hash table with 2^32 slots).
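In code form, the reduction described above is just a masking (or modulo) step, so many distinct 32-bit hashes necessarily share the same main slot:

    #include <stdint.h>

    /* For a power-of-two table size, only the low bits of the hash are
       kept: a 64-slot table keeps 6 of the 32 bits, so 2^26 distinct hash
       values map to every slot. */
    static unsigned slot_index(uint32_t hash, unsigned size_pow2) {
        return hash & (size_pow2 - 1);
    }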

So it is legal (by definition) for the same chain to link keys having different hashes: each time you visit an item in the chain from a starting point, the key stored in that position is not necessarily the one you are looking for, so you need to compare the actual key to see if it matches, and if not, continue visiting the chain until its end. If no key in the chain matches, then the key is not mapped in the table and the implicit associated value returned for it will be nil/null (and not "undefined", which is another non-nil/non-null value specifically used for declared variables that are still unassigned: their key in the hash table is the variable name, or the property name in objects).
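Here is a sketch of that chain walk; the node layout is illustrative, not Lua's actual ltable.c structures. Every visited node's key must be compared, because nodes on the same chain can carry different hashes, and falling off the end of the chain means the key is unmapped (nil):

    #include <stdint.h>
    #include <string.h>

    typedef struct Node {
        const char *key;   /* NULL = never-used slot */
        void *value;
        int next;          /* index of next node in the chain, -1 = end */
    } Node;

    void *chain_lookup(const Node *nodes, int nslots, const char *key,
                       uint32_t hash) {
        int i = (int)(hash & (uint32_t)(nslots - 1));   /* main slot */
        while (i >= 0) {
            /* the node may carry a key with a different hash, so compare */
            if (nodes[i].key && strcmp(nodes[i].key, key) == 0)
                return nodes[i].value;
            i = nodes[i].next;
        }
        return NULL;   /* end of chain: key unmapped, reads as nil */
    }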

When you set a key to nil, the existing chain is not modified. Fixing it would require a table scan to find which slot links to the freed one, or would require storing chains in both directions, i.e. two chain pointers per slot instead of one, in addition to the pointer to the key; such a full table scan would be very slow in large hash tables. Typically the hash table structure maintains a counter of the used slots, which is incremented when a slot is allocated from a free slot; free slots can form their own chain (where the value pointer is just nil/null) to avoid table scans when looking for free slots. However, when you reset the value associated with a key to nil, the slot is NOT freed and its existing chaining pointers are preserved, so as not to break these chains.
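The consequence for deletion, as a sketch (same illustrative node layout as above): only the value is cleared, while the key and the chain link are deliberately left in place.

    typedef struct Node { const char *key; void *value; int next; } Node;

    /* Clearing a value must not break chains that pass through this slot:
       only the value pointer becomes nil; the key and the `next` link
       stay, because finding the predecessor would need a table scan (or a
       backward link per slot). */
    void chain_unset(Node *nodes, int slot) {
        nodes[slot].value = 0;
    }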

There are different strategies possible:
- if the hash table is very small, a table scan could be performed when resetting a key's value to nil, in order to locate the slot that still references the freed slot in its chain and update it to skip over the freed slot.
- if the hash table is large, this is not acceptable (and using doubly-linked chains would use too much memory: 4 pointers per slot, i.e. pointers to the key and to the value plus pointers for the forward and backward chains, instead of just 3 when the backward chain is omitted).
- how pointers for chains are represented can be tuned: for small tables, you don't necessarily need a real pointer, just an integer index for locating slots, and if the hash table is small it could be a single 16-bit word, so that backward chains could be managed (this would reduce the frequency of collisions between distinct hashes at moderate fill levels).
- for large tables, a "fill factor" threshold hint is used to determine when to reallocate a larger (or smaller) hash table, forcing keys to be rehashed and reloaded into new slot positions in the new hash table (here again, not all bits of the hash value are used, just a different number of bits, suitable for the hash table size). The fill factor thresholds (minimum and maximum) tend to be high for large tables, typically around 75% for the minimum and 85% for the maximum, but these thresholds should be low for small hash table sizes (the minimum threshold for very small tables could safely be 0%, while a maximum at 50% would be fine; see the decision sketch after this list).
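The two-threshold policy from the last point might look like this as a sketch; the 75%/85% and 0%/50% numbers are the illustrative values from the text, and the 1024-slot cutoff between "small" and "large" is an arbitrary assumption for the example:

    typedef struct { int used; int size; } HashStats;
    typedef enum { KEEP, GROW, SHRINK } Reshape;

    Reshape reshape_decision(HashStats t) {
        double fill = (double)t.used / (double)t.size;
        /* size-dependent thresholds: high for large tables, low for small */
        double min_fill = (t.size >= 1024) ? 0.75 : 0.0;
        double max_fill = (t.size >= 1024) ? 0.85 : 0.50;
        if (fill > max_fill) return GROW;     /* reallocate larger, rehash all */
        if (fill < min_fill) return SHRINK;   /* reallocate smaller, rehash all */
        return KEEP;
    }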

I don't know if Lua has experimented in its implementation with a way to manage the two fill factor thresholds at which a hash table has to be reallocated and all keys in the associative table have to be rehashed (rehashing can be very slow in large tables), or with a strategy allowing existing bits of hashes to be preserved (I've not seen successful experiments with such a strategy).

But keep this in mind: a hash table always uses an optimistic strategy where it is hoped that chains of collisions won't be very long. For this, it is critical that the hashing function (used to compute some bits from the actual keys) distributes the keys correctly.

What Lua has implemented, however, is segregating the part of the keys used by integer keys that form a "sequence" (keys from 1 to N, with very few holes in the middle): they are not stored in a hash with collision chains but in a simple vector where there can be unused slots whose value pointer is nil, but where there are no chain pointers at all. Access to keys in the sequence range is direct and does not require any hashing, but integer keys that fall outside the main sequence range are stored in the normal hash table (and those integers then have to be hashed). It is critical that the hashing function correctly shuffles the bits of the key into a randomized set: simple hashing functions using multiplication by a small prime before adding the value of new bytes may not be good enough if the chosen prime is much smaller than the hash table. You may want to use a stronger hashing function like SHA-1, but it is generally much slower and could be costly when populating very large tables; using a good large prime multiplier is the correct strategy, suitable for any hash table size that is not a multiple of that prime: mapping a hash value to a slot index is then simply stripping high bits (when the table size is a power of two), or computing a simple modulo otherwise.
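A sketch of both points, with invented structure names: direct indexing for sequence keys, and a multiplicative hash over the key's bytes using 2654435761 (Knuth's classic large prime multiplier, chosen here purely for illustration; Lua's actual string hash in lstring.c uses a different, seeded bit-mixing scheme):

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        void **array; int asize;   /* sequence part: integer keys 1..asize */
        /* hash part for all other keys elided */
    } Table;

    /* Direct access for sequence keys: no hashing, possibly a nil slot. */
    void *get_seq_key(const Table *t, int64_t k) {
        if (k >= 1 && k <= t->asize)
            return t->array[k - 1];
        return NULL;   /* out-of-range integers go through the hash part */
    }

    /* Multiplicative hash over the bytes of a string key: the large prime
       multiplier spreads each byte's influence across the whole 32-bit
       result before the table reduces it to a slot index. */
    static uint32_t hash_str(const char *s, size_t len) {
        uint32_t h = (uint32_t)len;
        for (size_t i = 0; i < len; i++)
            h = h * 2654435761u + (uint8_t)s[i];
        return h;
    }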






On Thu, Apr 16, 2020 at 1:51 PM, 彭健 <bzijiang@gmail.com> wrote:
Hello, is anyone looking at this? I have a lot of doubts. Maybe I can get some answers.

On Fri, Apr 10, 2020 at 10:10 AM, 彭健 <bzijiang@gmail.com> wrote:
[lua table] Different hashes are chained together; is this a bug? I can still find 101, 115, 108, 115. It looks normal, but is this allowed?

 
[attached screenshots: lua-1.png; gdb output: lua-2.png, lua-3.png]