- Subject: Re: Stripping HTML tags
- From: Philippe Lhoste <PhiLho@...>
- Date: Tue, 16 Aug 2005 11:29:04 +0200
Florian Berger wrote:
I thought that stripping HTML tags was easy until I saw something like
this:
<a href="http://www.example.com" alt="> example"> example </a>
I believe this is not correct HTML code, even by old pre-XHTML
standards. It should be alt="&gt; example".
I am not sure this is valid XML either, although I believe XML is
stricter about encoding & and < than about >.
Of course, the problem is that browsers are quite tolerant of this kind
of error (IMHO, they should have been stricter to start with, the Web
would be much cleaner...), so you are likely to find these constructs in
real pages.
My code was:
local s = '<a href="http://www.example.com" alt="> example"> example </a>'
s = string.gsub(s, '<.->', ' ')
print(s)
-> example"> example
I have seen some examples using PHP and regular expressions. Programming
in Lua, section 20.1, says that Lua patterns cannot do everything a POSIX
implementation can (http://www.lua.org/pil/20.1.html). Can this be done
in Lua? All that comes to my mind is captures, but I'm not sure they help
at all. Of course my example works in most cases, but it would be nice to
have it work even better.
Note that there is a PCRE wrapper for Lua if you need more powerful
regular expressions. If, of course, you are not stuck with standard Lua.
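Untested, and assuming a Lrexlib build that exposes a rex_pcre module
with a gsub function (the module name and exact API depend on the
version you have), something like this should cope with a '>' inside
quoted attribute values:

local rex = require("rex_pcre")   -- module name is an assumption

local s = '<a href="http://www.example.com" alt="> example"> example </a>'

-- a tag: '<', then text without '>' or quotes, interleaved with complete
-- quoted strings (which may contain '>'), then the real closing '>'
local tag = [[<[^>"']*(?:"[^"]*"[^>"']*|'[^']*'[^>"']*)*>]]

local stripped = rex.gsub(s, tag, " ")
print(stripped)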
Looking at the other answers, I don't think an XML parser would do the
job. Regular HTML isn't even XML compliant, so an XML parser would
complain about unclosed tags like <br> or <p>.
A full tokeniser/parser could be a solution, though perhaps too costly
for your needs... HTML isn't really easy to parse, even more so if you
have to be as tolerant of errors as the browsers are... (like the above,
accepting -- in comments, raw & in text, etc.).
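To give an idea, here is an untested sketch of that in plain Lua, without
patterns: scan the string character by character and, while inside a tag,
jump over quoted attribute values so a '>' in them does not end the tag:

function strip_tags(s)
  local out = {}
  local i, n = 1, string.len(s)
  while i <= n do
    local c = string.sub(s, i, i)
    if c == "<" then
      -- inside a tag: look for the closing '>', skipping quoted values
      i = i + 1
      while i <= n do
        local d = string.sub(s, i, i)
        if d == '"' or d == "'" then
          -- plain find of the matching quote, no pattern magic
          local close = string.find(s, d, i + 1, true)
          i = (close or n) + 1
        elseif d == ">" then
          i = i + 1
          break
        else
          i = i + 1
        end
      end
      table.insert(out, " ")  -- replace the whole tag by a space, as above
    else
      table.insert(out, c)
      i = i + 1
    end
  end
  return table.concat(out)
end

print(strip_tags('<a href="http://www.example.com" alt="> example"> example </a>'))
-- should print "  example  " (tags replaced by spaces); the '>' inside
-- the alt attribute no longer truncates the tag

This is still nowhere near a real HTML parser (comments, CDATA, script
blocks...), but it handles quoted attributes, which is exactly what
breaks the simple gsub.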
--
Philippe Lhoste
-- (near) Paris -- France
-- http://Phi.Lho.free.fr
-- -- -- -- -- -- -- -- -- -- -- -- -- --