- Subject: Re: Stripping HTML tags
- From: Florian Berger <fberger@...>
- Date: Tue, 16 Aug 2005 09:57:00 +0300
Thanks for the comments and tips.
Roberto Ierusalimschy wrote:
> <a href="http://www.example.com" alt="> example"> example </a>
> Maybe you could preprocess the string, finding all substrings inside
> quotes and escaping "dangerous" characters to something else;
> something like this (untested):
Interesting idea, that might be something to try.
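For reference, a minimal sketch of how that preprocessing step might look, assuming Lua 5.1 or later. The names escape_quoted and strip_tags are made up for illustration, and this is not the code from the quoted message. It escapes '<' and '>' inside double-quoted substrings before the naive tag pattern runs. It would also touch quoted text outside of tags, and it ignores single-quoted attributes, so it is a starting point only.

```lua
-- Replacement table for the "dangerous" characters inside quotes.
local entities = { ["<"] = "&lt;", [">"] = "&gt;" }

-- Hypothetical helper: escape '<' and '>' inside every double-quoted
-- substring so that a simple tag pattern does not stop too early.
local function escape_quoted(html)
  return (html:gsub('"[^"]*"', function(quoted)
    return (quoted:gsub("[<>]", entities))
  end))
end

-- Hypothetical helper: escape quoted substrings, then drop everything
-- that looks like a tag.
local function strip_tags(html)
  return (escape_quoted(html):gsub("<[^>]*>", ""))
end

print(strip_tags('<a href="http://www.example.com" alt="> example"> example </a>'))
```

With the example above, the '>' inside the alt attribute no longer ends the tag, so only the link text is left.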
Rici Lake wrote:
> However, you would have quite a bit of trouble with some other
> legitimate HTML constructions, particularly comments (<!-- I left out
> the <p> tag here -->) and embedded javascript. If you want a
> bullet-proof html parser, you should probably use a tokenizer.
I thought about that a bit, and I think the right order is to remove
scripts and comments first, and only after that remove the other tags.
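A rough sketch of that ordering, again assuming Lua 5.1 or later; strip_html is a hypothetical name, and the patterns are simple approximations, not a real tokenizer. Comments go first, then script elements together with their content, then whatever tags are left.

```lua
-- Hypothetical helper: strip comments, then scripts, then remaining tags.
local function strip_html(html)
  -- 1. Comments: "<!--" up to the nearest "-->" (".-" matches lazily).
  html = html:gsub("<!%-%-.-%-%->", "")
  -- 2. Script elements including their content. Lua patterns have no
  --    case-insensitive flag, so each letter is spelled out both ways.
  html = html:gsub(
    "<[sS][cC][rR][iI][pP][tT][^>]*>.-</[sS][cC][rR][iI][pP][tT]%s*>", "")
  -- 3. Everything else that looks like a tag.
  html = html:gsub("<[^>]*>", "")
  return html
end

print(strip_html(
  "Hello<!-- I left out the <p> tag here --> " ..
  "<script>if (a < b) { x() }</script><b>world</b>"))
```

If step 3 ran first, the "<p>" inside the comment and the "<" inside the script would be treated as the start of tags, which is the reason for doing the removal in this order.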
Chris Marring wrote:
> You could just use luaexpat and then extract out what you need. This
> is especially easy with the Lua Object Model feature, which simply
> returns the HTML as a hierarchy of tables. Expat is very good at
> grokking all the twisty bits of HTML, so this could help get past all
> that...
How well does LuaExpat work if the HTML is not clean or valid?
f