- Subject: Re: Stripping HTML tags
- From: Philippe Lhoste <PhiLho@...>
- Date: Tue, 16 Aug 2005 11:29:04 +0200
Florian Berger wrote:
I thought that stripping HTML tags was easy until I saw something like
this:
<a href="http://www.example.com" alt="> example"> example </a>
I believe this is not correct HTML code, even by old pre-XHTML
standards. It should be alt="&gt; example".
I am not sure this is valid XML either, although I believe XML is
stricter about encoding & and < than about >.
Of course, the problem is that browsers are quite tolerant of this kind
of error (IMHO, they should have been stricter to start with, the Web
would be much cleaner...), so you are likely to find these constructs in
real pages.
My code was:
local s = '<a href="http://www.example.com" alt="> example"> example </a>'
s = string.gsub(s, '<.->', ' ')
print(s)
-> example"> example
I have seen some examples using PHP and regular expressions. Programming
in Lua, section 20.1, says that Lua patterns cannot do everything a POSIX
implementation can (http://www.lua.org/pil/20.1.html). Can this be done
in Lua? All that comes to my mind is captures, but I'm not sure they help
at all. Of course my example works in most cases, but it would be nice to
have it work even better.
Note that there is a PCRE wrapper for Lua if you need more powerful
regular expressions. If, of course, you are not stuck with standard Lua.
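Untested, and assuming a Lrexlib build that exposes a rex_pcre module
with a gsub function (the module name and exact API depend on the
version you have), something like this should cope with a '>' inside
quoted attribute values:

local rex = require("rex_pcre")   -- module name is an assumption

local s = '<a href="http://www.example.com" alt="> example"> example </a>'

-- a tag: '<', then text without '>' or quotes, interleaved with complete
-- quoted strings (which may contain '>'), then the real closing '>'
local tag = [[<[^>"']*(?:"[^"]*"[^>"']*|'[^']*'[^>"']*)*>]]

local stripped = rex.gsub(s, tag, " ")
print(stripped)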
Looking at the other answers, I don't think an XML parser would do the
job. Regular HTML isn't even XML compliant, so an XML parser would
complain about unclosed tags like <br> or <p>.
A full tokeniser/parser could be a solution, though perhaps too costly
for your needs... HTML isn't really easy to parse, even more so if you
have to be as tolerant of errors as the browsers are... (like the above,
accepting -- in comments, raw & in text, etc.).
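To give an idea, here is an untested sketch of that in plain Lua, without
patterns: scan the string character by character and, while inside a tag,
jump over quoted attribute values so a '>' in them does not end the tag:

function strip_tags(s)
  local out = {}
  local i, n = 1, string.len(s)
  while i <= n do
    local c = string.sub(s, i, i)
    if c == "<" then
      -- inside a tag: look for the closing '>', skipping quoted values
      i = i + 1
      while i <= n do
        local d = string.sub(s, i, i)
        if d == '"' or d == "'" then
          -- plain find of the matching quote, no pattern magic
          local close = string.find(s, d, i + 1, true)
          i = (close or n) + 1
        elseif d == ">" then
          i = i + 1
          break
        else
          i = i + 1
        end
      end
      table.insert(out, " ")  -- replace the whole tag by a space, as above
    else
      table.insert(out, c)
      i = i + 1
    end
  end
  return table.concat(out)
end

print(strip_tags('<a href="http://www.example.com" alt="> example"> example </a>'))
-- should print "  example  " (tags replaced by spaces); the '>' inside
-- the alt attribute no longer truncates the tag

This is still nowhere near a real HTML parser (comments, CDATA, script
blocks...), but it handles quoted attributes, which is exactly what
breaks the simple gsub.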
--
Philippe Lhoste
-- (near) Paris -- France
-- http://Phi.Lho.free.fr
-- -- -- -- -- -- -- -- -- -- -- -- -- --