Hi Patrick!

I did quite a bit of experimenting with minimal UTF-8 handling here:

http://eonblast.com/trucount/
http://www.eonblast.com/trucount/lua-count-patch-0.1.tgz

I was mainly concerned with getting the length. For what it's worth, I ended up with this to count the string length:

    /* UTF-8 count */
    case LUA_TSTRING: {
      unsigned char *p = (unsigned char *)getstr(rawtsvalue(rb));
      unsigned char *q = p + tsvalue(rb)->len;
      size_t count = 0;
      while (p < q)
        if ((*p++ & 0xC0) ^ 0x80)  /* skip continuation bytes (10xxxxxx) */
          count++;                 /* count all lead bytes (ASCII included) */
      setnvalue(ra, cast_num(count));
      break;
    }

The rationale is spread out across this mailing list. Basically, corrupt UTF-8 should be allowed to have undefined results.

Best,
Henning

Patrick Rapin wrote:
> Essentially as an exercise, I tried to write the smaller possible