[Date Prev][Date Next][Thread Prev][Thread Next]
[Date Index]
[Thread Index]
- Subject: Re: Ideas about implementing a string type?
- From: David Kastrup <dak@...>
- Date: Thu, 29 Mar 2007 14:49:28 +0200
Luiz Henrique de Figueiredo <lhf@tecgraf.puc-rio.br> writes:
>> Such strings would not get hashed unless one used them as a table
>> index, and they would not get unified or copied either unless
>> necessary.
>
> Define necessary.
>
> My first impression is that this would break string comparison,
Comparison could or could not be considered part of the "necessary"
definition.
> unless you make extensive changes in the core. String comparison in
> Lua is simply a pointer comparison, because of string
> unification.
Comparison for equality, yes. I have my doubts that the cost of
hashing will usually be amortized unless a string is used for hashing.
The "usual" string comparison will end after few bytes. One notable
exception is the sorting of a large number of strings: in that case,
long strings are probably going to be compared frequently. But if
they are not identical, the hashing won't buy anything, and if a large
number of long identical strings is going to be expected, then sorting
will be sped up by recording "multiplicity" for strings comparing
equal and folding them into a single item to be sorted whenever a
comparison finds equal strings. This is going to be quite more
efficient than the hashing behind the scenes.
> From what I understand of your proposal, for lazy strings this
> comparison would have to be on the strings contents.
Comparison for equality could trigger hashing, but I think that in
most use cases of relevance, strings would be used as indices into a
table, anyway.
> One way to avoid excessive string hashing from the C side is to use
> store strings in the array part of a Lua table. You'll still need a
> map from C strings to indices, but if you're willing to add a
> complication such as lazy strings, then that map shouldn't be too
> bad. Again, my impression is that you'll probably end up creating
> some sort of string unification...
Lisp has the concept of symbols. Symbols are basically unique
entities similar to Lua strings. However, they are unique within
their "obarray". You can also create "uninterned" symbols which,
while having a given printing name, are not unified at all with other
symbols having the same name. You can also intern symbols in a
different "obarray" rather than the default obarray.
Inside of an "obarray", a name can be looked up and resolved to a
symbol, in a mechanism similar to table lookup in Lua.
If people use string hashing, they usually create an obarray of their
own in order to avoid global obarray pollution.
If there was a possibility to associate strings with string tables
(rather than the global string table), including no string table at
all, then strings would compare as eq only when associated with the
same string table and identical in content. A string created in the
context of string table "nil" could not be used for indexing at all.
Strings interned into different string tables would not be identical
as indices.
While one might be of the opinion that a global hash table is O(1) in
access time regardless of size, this is not actually true since cache
misses are often a function of the table size.
These musings are somewhat orthogonal to the matter of lazy hashing.
However, the ability of creating non-interned strings by default from
application data might help to ensure that one has no unintended
occurences of hashing: using an uninterned string as a table index
should likely just lead to an error.
>> implementing this as a separate [...] library might also be an
>> option.
>
> I suggest you try it first as a library. I think there are one or
> two "mutable string" libraries around, though I don't have a pointer
> handy right now.
I'll have to see what I can find. But I can't help like feeling that
lazy hashing and the ability to specify the string table to use (if
any) might make a good fit into the core and would help keeping data
structures for contained problems contained.
--
David Kastrup