- Subject: Re: UTF-8 patterns in Lua 5.3
- From: Sean Conner <sean@...>
- Date: Thu, 17 Apr 2014 12:16:54 -0400
It was thus said that the Great Dirk Laurie once stated:
> 2014-04-17 9:40 GMT+02:00 Oliver Kroth <oliver.kroth@nec-i.de>:
> > to my knowledge in most "big" OS, there are already libraries for handling
> > Unicode semantics.
> > I'd like to propose to let Lua do the UTF-8 encoding matters, and use a
> > (probably OS-specific) glue library to refer the Unicode semantics to the
> > underlying OS. This library may e.g. be named "unicode" to avoid name
> > clashes with utf8.
> >
> > There is no sense in re-inventing the wheel.
>
> Invoking OS support is not as well supported in Lua as in e.g. Perl.
>
> One intensely annoying restriction that Lua suffers, out of portability
> considerations no doubt, is that we can only have a write-to pipe or
> a read-to pipe, not a filter. I'd love to write
>
> textout = os.filter("iconv -f windows-1250 -t utf8",textin)
First off, there may be Lua modules that link to iconv [1], so there really
shouldn't be a reason to pipe out to that program.
Secondly, having attempted to do filter-like stuff (piping to a program
for both reading and writing) and failing miserably [2], I will go into the
details of why it failed miserably (at least on Unix).
Several years ago I wrote a program where I was indexing a bunch of files
in a directory. I wanted to run file over each file to get its type (and
not necessarily rely upon the extension). I was already getting a list of
files, and file could accept filenames on stdin (the "-f-" option), so I
thought to myself, "Self, I could set up a pipe such that I write a list of
files to it, and read the file types back out."
I did that, and the program immediately locked up. It didn't crash, but
it wasn't running either. And the problem wasn't a bug in my code, nor a
bug in file, but in the semantics of C's handling of stdin and stdout with a
non-tty stream.
If I do:
GenericUnixPrompt> program1 | program2
stdout of program1 is a pipe and stdin of program2 is also a pipe.
Obviously. But what isn't quite so obvious (unless you looked it up) is
that when C's <stdio.h> finds stdout or stdin attached to something other
than a tty, the buffering defaults to fully buffered, meaning the data to
stdout isn't written until you reach some threshold (around 4k to 8k in a
typical Unix implementation), and the same holds for stdin, except in
reverse (no data is returned until there's around 4-8k read). And it does no
good to change the buffering of the output side to "nothing"
(setvbuf(stdout,NULL,_IONBF,BUFSIZ)) because you still have full buffering
on the input side.
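The output half of this can be seen without involving a second program. What follows is only a sketch, assuming a POSIX system and a typical stdio such as glibc; pipe_buffering_demo() and readable() are names made up for the illustration. It attaches a stdio stream to a pipe, writes one short line, and checks with poll() whether the bytes have reached the pipe before and after fflush().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if the descriptor has data waiting to be read, 0 otherwise. */
static int readable(int fd)
{
  struct pollfd p = { .fd = fd, .events = POLLIN };
  return poll(&p, 1, 0) == 1 && (p.revents & POLLIN);
}

/*
 * Hypothetical demo: write one line through a stdio stream attached to a
 * pipe and record whether it is visible on the read end before and after
 * fflush().  With use_line_buffering set, the stream is switched to
 * _IOLBF first; otherwise the library default is left alone.
 * Returns 0 on success, -1 if the pipe or the stream could not be set up.
 */
int pipe_buffering_demo(int use_line_buffering, int *before_flush, int *after_flush)
{
  int fds[2];
  FILE *out;

  assert(before_flush != NULL && after_flush != NULL);

  if (pipe(fds) < 0)
    return -1;
  out = fdopen(fds[1], "w");
  if (out == NULL) {
    close(fds[0]);
    close(fds[1]);
    return -1;
  }

  if (use_line_buffering)
    setvbuf(out, NULL, _IOLBF, BUFSIZ);

  fputs("hello\n", out);
  *before_flush = readable(fds[0]);   /* still sitting in the stdio buffer? */
  fflush(out);
  *after_flush = readable(fds[0]);

  fclose(out);                        /* also closes fds[1] */
  close(fds[0]);
  return 0;
}
```

With the default buffering the line should only show up after the fflush(); with _IOLBF it should already be there once the newline has been written. That is the difference the two programs on either end of a read/write pipe end up waiting on.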
I got around the issue by using some (Unix) linking magic. Basically, I
set LD_PRELOAD (an environment variable) to a shared library that was
nothing other than:
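  #include <stdio.h>

  /* Run by the dynamic loader before main() (GCC/Clang constructor attribute). */
  void __attribute__ ((constructor)) init(void);

  void init(void)
  {
    /* Line-buffer both streams so each line crosses the pipe as it completes. */
    setvbuf(stdin, NULL,_IOLBF,BUFSIZ);
    setvbuf(stdout,NULL,_IOLBF,BUFSIZ);
  }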
so that when I did (approximately):
fp = popen("magic -f-","r+");
the shared library was opened as the "magic" program was being loaded, and
the init() function was called to set up the buffering on stdin/stdout so
this whole mess would work (otherwise, I would have had to modify the source
code to "file" and I didn't want to go to that trouble).
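For the record, getting the environment variable in place for just the one child might look something like the following. This is a sketch and not the code from the original program: popen_preloaded() is a made-up helper, and the library path passed to it is whatever the caller has recorded for the shared library. It relies only on getenv(), setenv(), unsetenv() and popen() from POSIX, and on the fact that the child inherits the environment at the moment popen() starts it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: start `command` via popen() with LD_PRELOAD set to
 * `library`, then put the caller's own LD_PRELOAD back the way it was so
 * later children are not affected.  Returns the stream from popen(), or
 * NULL on failure.
 */
FILE *popen_preloaded(const char *library, const char *command, const char *mode)
{
  const char *old;
  char *saved = NULL;
  FILE *fp;

  assert(library != NULL && command != NULL && mode != NULL);

  /* Copy the old value; setenv() may invalidate the getenv() pointer. */
  old = getenv("LD_PRELOAD");
  if (old != NULL && (saved = strdup(old)) == NULL)
    return NULL;

  if (setenv("LD_PRELOAD", library, 1) != 0) {
    free(saved);
    return NULL;
  }

  fp = popen(command, mode);   /* the child picks up LD_PRELOAD here */

  /* Restore the parent's environment. */
  if (saved != NULL)
    setenv("LD_PRELOAD", saved, 1);
  else
    unsetenv("LD_PRELOAD");
  free(saved);

  return fp;
}
```

The mode string is passed straight through to popen(), so whether a read/write mode is accepted at all still depends on the platform.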
Yes, this is a form of monkeypatching (to tie this into another thread
around here).
Yes, this worked. But it required extra configuration (I needed to keep
track of where the special shared library was so I could load it) and I
never did feel exactly good about it (and later I found out about the
"magic" library that "file" is a wrapper around and fixed the program to
use that instead of this gross hack). So yes, there was a simpler, more
traditional way to do what I wanted. [3]
So yes, that's why doing a read/write pipe to a program is usually not
done.
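For the one-shot case in the quoted os.filter() example, where all the input exists before any output is needed, the usual workaround is to write everything, close the write side so the child sees EOF (and flushes when it exits), and only then read. A minimal sketch under those assumptions, using plain POSIX pipe()/fork()/execl(); run_filter() is a made-up name, not an existing Lua or C API:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run `command` through /bin/sh, feed it `inlen`
 * bytes from `in`, and collect up to `cap` bytes of its stdout in `out`.
 * Returns the number of bytes collected, or -1 if the pipes or the child
 * could not be created.
 */
ssize_t run_filter(const char *command, const char *in, size_t inlen,
                   char *out, size_t cap)
{
  int to_child[2], from_child[2];
  size_t total = 0;
  ssize_t n;
  pid_t pid;

  assert(command != NULL && out != NULL);

  if (pipe(to_child) < 0)
    return -1;
  if (pipe(from_child) < 0) {
    close(to_child[0]);   close(to_child[1]);
    return -1;
  }

  pid = fork();
  if (pid < 0) {
    close(to_child[0]);   close(to_child[1]);
    close(from_child[0]); close(from_child[1]);
    return -1;
  }

  if (pid == 0) {
    /* Child: wire the pipes to stdin/stdout, then become the command. */
    dup2(to_child[0], STDIN_FILENO);
    dup2(from_child[1], STDOUT_FILENO);
    close(to_child[0]);   close(to_child[1]);
    close(from_child[0]); close(from_child[1]);
    execl("/bin/sh", "sh", "-c", command, (char *)NULL);
    _exit(127);           /* only reached if execl() failed */
  }

  /* Parent: keep only the ends it uses. */
  close(to_child[0]);
  close(from_child[1]);

  /* Send all the input, then close so the child sees end-of-file. */
  while (inlen > 0 && (n = write(to_child[1], in, inlen)) > 0) {
    in += n;
    inlen -= (size_t)n;
  }
  close(to_child[1]);

  /* Only now read the output back. */
  while (total < cap && (n = read(from_child[0], out + total, cap - total)) > 0)
    total += (size_t)n;
  close(from_child[0]);

  waitpid(pid, NULL, 0);
  return (ssize_t)total;
}
```

Something like run_filter("tr a-z A-Z", "hello\n", 6, buf, sizeof buf) is the intended use. This only holds up for small inputs: if the child produces more output than the pipe can hold before it has consumed all of its input, the parent blocks in write() while the child blocks writing back, which is the same kind of lock-up described above. It also does nothing about SIGPIPE if the child exits early.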
-spc (Been there, done that, oddly, never got a tee-shirt)
[1] Oh, say, here:
https://github.com/spc476/lua-conmanorg/blob/master/src/iconv.c
[2] For various values of "miserably".
[3] "To the point that smart, experienced hackers reach for a monkey
patch as their tool of first resort, _even when a simpler, more
traditional solution is possible_."
http://devblog.avdi.org/2008/02/23/why-monkeypatching-is-destroying-ruby/