- Subject: Re: Stripping HTML tags
- From: Chris Marrin <chris@...>
- Date: Tue, 16 Aug 2005 07:49:05 -0700
Florian Berger wrote:
...
Chris Marrin wrote:
> You could just use luaexpat and then extract out what you need. This
> is especially easy with the Lua Object Model feature, which simply
> returns the HTML as a hierarchy of tables. Expat is very good at
> grokking all the twisty bits of HTML, so this could help get past all
> that...
How well does LuaExpat work if HTML is not clean or valid?
My experience with expat (NOT used with LuaExpat) is that it makes a
valiant effort to deal with a few things. But for the most part, invalid
HTML generates an error and aborts. I think there is a way to get expat
to continue if there is a validity error. For instance, I think you can
get it to handle the case where an EndElement has the wrong name. But
for the most part invalid HTML, like invalid C, is hard to fix and make
any sense of. And I don't know how easy it would be to get LuaExpat to
be tolerant of errors.
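
To make the failure mode concrete, here is a rough sketch that drives expat through Python's standard xml.parsers.expat binding (not LuaExpat, which I haven't tested this way). The strip_tags helper and the two sample strings are made up for illustration. It collects character data from well-formed markup, and shows expat raising an error as soon as an end tag does not match:

```python
import xml.parsers.expat


def strip_tags(markup):
    """Return the character data of `markup`.

    Hypothetical helper for illustration. Raises ExpatError if the
    input is not well-formed.
    """
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Collect the text between tags; start/end element events are ignored.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(markup, True)  # True marks this as the final chunk
    return "".join(chunks)


clean = "<p>Hello <b>world</b></p>"
broken = "<p>Hello <b>world</p>"  # <b> is never closed

print(strip_tags(clean))  # prints "Hello world"

try:
    strip_tags(broken)
except xml.parsers.expat.ExpatError as err:
    # expat stops at the first well-formedness error it sees.
    print("expat gave up:", err)
```

Anything a browser would quietly forgive (unclosed tags, unquoted attributes, undeclared entities) lands in that except branch, so real-world HTML would need cleaning up before expat sees it.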
My general rule is "always use valid HTML" :-)
--
chris marrin
chris@marrin.com
"As a general rule, don't solve puzzles that open portals to Hell"