[Date Prev][Date Next][Thread Prev][Thread Next]
[Date Index]
[Thread Index]
- Subject: Re: string hash
- From: spir <denis.spir@...>
- Date: Fri, 14 Dec 2012 10:22:04 +0100
On 13/12/2012 08:21, Richter, Jörg wrote:
Strings are stored in a hash table but the string hash value is reduced
modulo the size of the hash table and so it helps to compare string hash
values for colliding strings before resorting to memcmp. See
http://www.lua.org/source/5.2/lstring.c.html#luaS_newlstr
After reading this I applied the same optimization to my hash table
implementation.
Are you sure this is worthwhile? My unrepresentative performance tests are
somewhat slower (2-3%) than without this extra compare.
But I have no performance tests for Lua to double check.
Perhaps anyone can run their tests with "h == ts->tsv.hash &&" commented out.
Jörg
So, here are some measures. One issue is: what to choose as supposedly
representative input data? In particular, the proportion of equal strings (ones,
which are actually stored in the pool, either because we mean to look them up or
by chance) determines the number of petential memcmp's and thus the possible
influence of checking the hash before. So, I decided to feed the pool with
random strings of fix length=7 (so that len check does not pre-filter, but it's
unreal) among an alphabet of varying number of letters. Which gives:
without hash check:
#strings:1000000 #letters:6 %equal:27
time : 0.450
#strings:1000000 #letters:7 %equal:58
time : 0.470
#strings:1000000 #letters:8 %equal:80
time : 0.480
with hash check:
#strings:1000000 #letters:6 %equal:27
time : 0.450
#strings:1000000 #letters:7 %equal:58
time : 0.470
#strings:1000000 #letters:8 %equal:80
time : 0.490
%equal is the proportion of equal strings. Numbers are fairly constant among
different runs. Time is CPU time (just diff in clock()). So, about no influence
at all in this test case. At best (or worst) I get ~ 2% in disfavour of checking
the hash with high proportions of equal strings. Conclusion: in absence of more
info (such as the reason why it's there in Lua), let it down.
However, I initially placed the hash in the string struct (as is done in Lua, a
point I stepped on by chance and was the reason for my question) for another
reason: i keep the hash anyway to avoid re-hashing when the pool grows, so why
not put it in the string instead of on the cell struct (node)? If ever useful,
it's there, while someone having access to the string but not to the cell would
just not have it...
Denis