[Date Prev][Date Next][Thread Prev][Thread Next]
[Date Index]
[Thread Index]
- Subject: Re: Ideas about implementing a string type?
- From: David Kastrup <dak@...>
- Date: Fri, 30 Mar 2007 01:07:12 +0200
Rici Lake <lua@ricilake.net> writes:
> donothing: base case, does nothing but create the buffer.
> hashonly: calls into lstring.c to compute the hash
> copyonly: memcpy's the string into a buffer
> luapush: pushes and pops the string from the Lua stack
>
> I computed the data for this chart by using the command:
> $ for len in 124 7145 21583; do for x in donothing hash copy lua ; do
> time ./benchmark $x 1000000 $len ; done; done
>
> That performs the test a million times on various lengths pulled
> out of a hat. I condensed the data by reporting only user time. It's
> not a very good benchmark but it was the best i could do in a few
> minutes:
>
> length donothing hashonly copyonly luapush
> 124 0m0.008s 0m0.116s 0m0.056s 0m0.724s
> 7145 0m0.008s 0m0.120s 0m5.660s 0m7.468s
> 21583 0m0.012s 0m0.116s 0m18.061s 0m19.709s
>
> So, for shortish strings, hashing takes about twice as long as a
> copy (an order of magnitude in base 2, I guess)
Uhm, where hashing is of any use, the string is used for equality
comparison and/or table lookup. I'd be surprised if 124 was not
already about an order of magnitude above the typical size for those
applications. One exception I could think of: an application that
outputs only input lines that have not occured previously. In that
case, long keys for a table make sense.
If we take, for example, a look at the string pool of TeX, the
following string lengths appear here with the following frequencies:
0 1 17 17 33 5 49 10
2 45 18 19 34 5 50 23
3 28 19 13 35 8 51 10
4 60 20 14 36 4 52 27
5 52 21 11 37 5 53 14
6 55 22 9 38 4 54 19
7 59 23 8 39 1 55 13
8 58 24 11 40 6 56 12
9 62 25 8 41 6 57 15
10 34 26 9 42 11 58 18
11 38 27 5 43 11 59 17
12 41 28 7 44 9 60 5
13 28 29 5 45 11 61 1
14 20 30 5 46 8 62 1
15 20 31 12 47 7 64 1
16 24 32 4 48 14
The median of the string size is 13. This is likely somewhat shorter
than with other applications since TeX has neither multi-line strings,
nor does it use format control strings for output. Still, at 13 the
cost for copying from your benchmark would be about 6ms, that for
hashing about 40ms. A speedup of 5 for the median case is certainly
worth something. And that does not yet take into account cache misses
(which are pretty much invisible after a million repetitions with
constant data): the non-locality of hashing causes quite more misses
than copying.
> The luapush case is interesting, because in that case (only), the
> copy is being malloc'd and later free'd, as well as the garbage
> collector running from time to time. (I didn't control for those
> effects.) It appears that this is the dominant cost for short
> strings, but for longer strings the dominant cost is the cost of
> (ahem) copying.
Yes. For long strings, hashing is comparatively cheap. But I think
that it will also be comparatively useless for that case (since those
long strings are unlikely to be used as keys), so while it may be
cheap, it will still not give a good return of investment. The
amortization, if any, will rather be that one saves for strings that
are used for lots of indexing the check whether their hash has already
been computed. If one reserves the value "0" for an uncomputed hash,
the necessary check for 0 will be comparatively cheap (worst case is a
missed branch prediction).
Of course, LuaTeX is an extreme case, but I would not go as far as to
call it pathological.
--
David Kastrup, Kriemhildstr. 15, 44793 Bochum