Re: Stripping HTML tags

lua-l archive

Subject: Re: Stripping HTML tags
From: Florian Berger <fberger@...>
Date: Tue, 16 Aug 2005 09:57:00 +0300


Thanks for comments and tips.

Roberto Ierusalimschy wrote:

> <a href="http://www.example.com" alt="> example"> example </a>
>
> Maybe you could preprocess the string, finding all substrings inside
> quotes and escaping "dangerous" characters to something else;
> something like this (untested):


Interesting idea, that might be something to try.
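
A minimal sketch of that preprocessing idea, for illustration only. It is not Roberto's original code, which is not preserved in this message. The function names and the placeholder bytes "\1" and "\2" are assumptions; the sketch further assumes Lua 5.1 or later, double-quoted attribute values only, and input that never contains those two bytes.

```lua
-- Hide '<' and '>' that occur inside double-quoted substrings by
-- replacing them with placeholder bytes (assumed absent from the input).
local function protect_quoted(s)
  return (s:gsub('"[^"]*"', function(q)
    return (q:gsub("[<>]", { ["<"] = "\1", [">"] = "\2" }))
  end))
end

-- Turn the placeholder bytes back into the original characters.
local function restore_quoted(s)
  return (s:gsub("[\1\2]", { ["\1"] = "<", ["\2"] = ">" }))
end

-- Strip tags with the naive pattern, which is now safe against a '>'
-- inside an attribute value because that '>' has been hidden.
local function strip_tags(s)
  local protected = protect_quoted(s)
  local stripped = protected:gsub("<[^>]*>", "")
  return restore_quoted(stripped)
end

print(strip_tags('<a href="http://www.example.com" alt="> example"> example </a>'))
```

Quoted text outside of tags passes through the same escaping and is restored afterwards, so it comes out unchanged. Single-quoted attributes and unbalanced quotes in the text are not handled.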

Rici Lake wrote:
> However, you would have quite a bit of trouble with some other
> legitimate HTML constructions, particularly comments (<!-- -->) and
> embedded javascript. If you want a bullet-proof html parser, you
> should probably use a tokenizer.

I thought a little bit about that, and I think the right order is to
remove scripts and comments first, and only after that remove the other
tags.
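
That ordering could be sketched roughly as follows. This is only an illustration under simplifying assumptions: the function name strip_html is made up, script elements are assumed to be closed by a plain </script> tag, and the '>' inside attribute values problem discussed earlier in the thread is not addressed here.

```lua
local function strip_html(s)
  -- 1. Remove comments: "<!--" up to the first following "-->".
  --    ".-" is a lazy match, and "." also matches newlines in Lua.
  s = s:gsub("<!%-%-.-%-%->", "")
  -- 2. Remove script elements together with their content, so that
  --    '<' or '>' inside the JavaScript cannot confuse the next step.
  --    Lua patterns have no case-insensitive flag, hence the classes.
  s = s:gsub("<[sS][cC][rR][iI][pP][tT][^>]*>.-</[sS][cC][rR][iI][pP][tT]%s*>", "")
  -- 3. Remove whatever tags are left.
  s = s:gsub("<[^>]*>", "")
  return s
end

print(strip_html('x<script>if (a<b) f("</p>")</script>y<b>z</b>'))
```

Doing step 3 first would cut the script body at the first '>' it contains and leave the rest of the JavaScript behind as text, which is why the scripts and comments go first.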


Chris Marrin wrote:
> You could just use luaexpat and then extract out what you need. This
> is especially easy with the Lua Object Model feature, which simply
> returns the HTML as a hierarchy of tables. Expat is very good at
> grokking all the twisty bits of HTML, so this could help get past all
> that...


How well does LuaExpat work if HTML is not clean or valid?

f

Follow-Ups:
- Re: Stripping HTML tags, Chris Marrin

References:
- Re: Stripping HTML tags, Roberto Ierusalimschy
