- Subject: (unicode) design questions
- From: spir <denis.spir@...>
- Date: Sun, 27 Dec 2009 13:42:21 +0100
Hello,
I'm building a Unicode library. Basically, a UniString would be a real sequence of characters, each defined mainly by its code (point). UniStrings would then have all the typical string methods. (This contrasts with common Unicode string libraries, which in fact provide methods on UTF-8 byte strings; for me, those are useless.)
Coming from an OO background, I've written a UniChar type that does the job. Now I'm wondering whether this is overkill. Maybe implementing UniStrings as sequences of plain codes (ints) would do the job, at least in most cases? Dunno...
Advantages of UniChar type I can imagine:
* naming (eg "letter A")
* sensible output (eg "letter y with umlaut: #ff"; in particular, the code in hex!)
* information (eg isLetter), possibly retrieved from the Unicode character database
Current methods of UniChar:
* clone (__call): new unichar from code
* view (__tostring), show (print view with name if available)
* equals (__eq)
* decode/encode (as of now, only from/to a UTF-8 string representing a single char)
Actually, these methods could just as well be plain functions that work on, or create, plain codes.
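To make the "plain funcs on plain codes" option concrete, here is a minimal sketch in pure Lua of encode/decode for a single character. The names `utf8_encode` and `utf8_decode` are hypothetical, not part of the package. It uses only arithmetic (no bit library), and it does not reject overlong forms or surrogate codes, so treat it as an illustration rather than a validating decoder.

```lua
-- Encode one code point (a plain integer) as a UTF-8 byte string.
-- Hypothetical helper; uses only arithmetic so it needs no bit library.
local function utf8_encode(code)
  assert(code >= 0 and code <= 0x10FFFF, "code point out of range")
  local floor = math.floor
  if code < 0x80 then
    return string.char(code)
  elseif code < 0x800 then
    return string.char(0xC0 + floor(code / 0x40),
                       0x80 + code % 0x40)
  elseif code < 0x10000 then
    return string.char(0xE0 + floor(code / 0x1000),
                       0x80 + floor(code / 0x40) % 0x40,
                       0x80 + code % 0x40)
  else
    return string.char(0xF0 + floor(code / 0x40000),
                       0x80 + floor(code / 0x1000) % 0x40,
                       0x80 + floor(code / 0x40) % 0x40,
                       0x80 + code % 0x40)
  end
end

-- Decode the character starting at byte i (default 1) of a UTF-8 string.
-- Returns the code point and the index of the next character,
-- or nil if i is past the end of the string.
-- Note: overlong encodings and surrogates are NOT rejected in this sketch.
local function utf8_decode(s, i)
  i = i or 1
  local b1 = s:byte(i)
  if not b1 then return nil end
  local n, code            -- n = number of continuation bytes expected
  if b1 < 0x80 then
    return b1, i + 1
  elseif b1 >= 0xC0 and b1 < 0xE0 then n, code = 1, b1 - 0xC0
  elseif b1 >= 0xE0 and b1 < 0xF0 then n, code = 2, b1 - 0xE0
  elseif b1 >= 0xF0 and b1 < 0xF8 then n, code = 3, b1 - 0xF0
  else
    error("invalid UTF-8 lead byte at position " .. i)
  end
  for k = 1, n do
    local b = s:byte(i + k)
    if not b or b < 0x80 or b >= 0xC0 then
      error("invalid UTF-8 continuation byte at position " .. (i + k))
    end
    code = code * 0x40 + (b - 0x80)
  end
  return code, i + n + 1
end
```

With functions like these, a character is just a number and a UniChar type is needed only if naming or character properties must travel with the value.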
Also, in both cases, I wonder whether it's worth coding characters/codes in C.
* In case of plain code: have a 32-bit unsigned int
* In case of UniChar: have a fast implementation of time-consuming tasks (esp. utf8 decode/encode).
Hints welcome.
Comments are also welcome on the list of UniString methods (half of them already written); see the file header below.
Denis
PS: decided to call the package 'lunistring' ;-)
________________________________
la vita e estrany
http://spir.wikidot.com/
================================================
--[[ type U n i S t r i n g
unicode character string
basically a sequence of UniChar's, with string methods
UniStrings show as a list of codes, eg "61 20 09 e9 ff 100 ffff 10ffff".
TODO: UniStrings can be built from kinds of literals using
hex codes '\xxxx' (up to 8 digits), like lua strings.
content:
~ chars() iterator on chars
~ char(i?) --> char, last by default
~ size() count?
~ holds(char/str) --> logical
~ findfirst(char/str) --> position/range
~ count(char/str) --> positions/ranges
~ equals(unistring2) __eq: --> logical
--> pairs & ipairs also work!
encode/decode to/from text:
encode(encoding) --> lua string
UniString.decode(string, encoding) --> UniString
--> use UniChar to/from UTF8/16/32 methods
modification:
~ put(i?) add char, end by default (=push)
~ change(i?) last by default
~ remove(i?) last by default
~ replace(char/str) != change!
new UniString:
~ clone(literal?) __call: (TODO: from literal)
~ concat(c2) __concat
~ slice(i1,i2)
~ multiply(n) (<=> string.rep)
higher order functions:
~ map(func)
~ filter(func)
output:
~ view() __tostring: list of codes
~ show() write name: view
TODO:
~ clone from literal
~ findlast/findall
~ trim/lefttrim/righttrim
possible TODO:
~ prototype --> subtype
~ startswith/endswith --> logical
~ sort (using custom sort func?) --> new UniString
~ find, count, & replace with regex?
~ replace using func?
-- ]]
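For discussion, a minimal sketch of how part of the header above could look if a UniString were simply a table of plain integer codes with a metatable. This is an assumption made for illustration, not the actual lunistring code: the constructor name `UniString.new` is hypothetical, and only a handful of the listed methods are shown.

```lua
-- Sketch: a UniString as a Lua table of integer code points.
local UniString = {}
UniString.__index = UniString

-- Hypothetical constructor: copy a list of codes into a new UniString.
function UniString.new(codes)
  local us = setmetatable({}, UniString)
  for i, code in ipairs(codes or {}) do us[i] = code end
  return us
end

function UniString:size() return #self end

-- char(i?): the code at position i, last one by default.
function UniString:char(i) return self[i or #self] end

-- view(): space-separated hex codes, at least two digits each.
function UniString:view()
  local parts = {}
  for i, code in ipairs(self) do
    parts[i] = string.format("%02x", code)
  end
  return table.concat(parts, " ")
end
UniString.__tostring = UniString.view

-- equals: same length and same code at every position.
function UniString.__eq(a, b)
  if #a ~= #b then return false end
  for i = 1, #a do
    if a[i] ~= b[i] then return false end
  end
  return true
end

-- concat: a new UniString holding the codes of a followed by those of b.
function UniString.__concat(a, b)
  local r = UniString.new(a)
  for _, code in ipairs(b) do r[#r + 1] = code end
  return r
end

-- slice(i1, i2): codes from i1 to i2 inclusive (i2 defaults to the end).
function UniString:slice(i1, i2)
  local r = UniString.new()
  for i = i1, i2 or #self do r[#r + 1] = self[i] end
  return r
end

-- map(func): apply func to every code, collecting the results.
function UniString:map(func)
  local r = UniString.new()
  for i, code in ipairs(self) do r[i] = func(code) end
  return r
end

-- filter(func): keep only the codes for which func returns true.
function UniString:filter(func)
  local r = UniString.new()
  for _, code in ipairs(self) do
    if func(code) then r[#r + 1] = code end
  end
  return r
end

local s = UniString.new{0x61, 0x20, 0x09, 0xe9, 0xff, 0x100, 0xffff, 0x10ffff}
print(s)                -- 61 20 09 e9 ff 100 ffff 10ffff
print(s:slice(1, 2))    -- 61 20
```

Because the codes sit at integer keys 1..n, `ipairs` and the `#` operator work on the object directly, which matches the remark in the header that pairs and ipairs also work.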