- Subject: Re: htmlentities table
- From: Rici Lake <lua@...>
- Date: Thu, 27 Oct 2005 22:50:27 -0500
On 27-Oct-05, at 2:30 PM, Walter Cruz wrote:
Hi all. Some days ago, someone sent a mail to the list asking for an
htmlentities function.
Well, I have been thinking about how I can get a more complete list of
htmlentities to use as the translation table.
There is a complete (and official) list at www.w3.org, for each version
of HTML (they are very similar).
For HTML 4.01, you can get the entities from:
http://www.w3.org/TR/html401/HTMLlat1.ent
http://www.w3.org/TR/html401/HTMLsymbol.ent
http://www.w3.org/TR/html401/HTMLspecial.ent
(The equivalent HTML 4 files are at similar URLs, with html4 in place
of html401.)
For XHTML:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml-lat1.ent
http://www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml-special.ent
http://www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml-symbol.ent
In each case, lat1 contains entities corresponding to all Unicode
code points from U+00A0 to U+00FF (the non-ASCII printable characters
of Latin-1); special.ent contains some special characters, including
< > & " (and ' in the case of XHTML) as well as the euro (U+20AC); and
symbol.ent contains a variety of symbols used in mathematics and
technical writing. Some of the special and symbol entities are needed
to cope with the 32 code points (0x80 to 0x9F) in CP 1252 (used by some
versions of Microsoft Windows) which differ from the ISO standards.
(This is the code page where the euro symbol has code point 0x80; you
can find a complete conversion chart at
http://www.microsoft.com/typography/unicode/1252.htm)
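To make the CP 1252 point concrete, here is a rough sketch that maps a few of the Windows-specific bytes to the Unicode code points named in the entity files. The table is deliberately partial (the conversion chart has more entries), and the names cp1252 and cp1252_to_unicode are invented for this illustration.

```lua
-- Partial map from CP 1252 byte values to Unicode code points.
-- Illustrative only: the full conversion chart has more entries.
cp1252 = {
  [0x80] = 0x20AC, -- euro sign              (&euro;)
  [0x85] = 0x2026, -- horizontal ellipsis    (&hellip;)
  [0x91] = 0x2018, -- left single quote      (&lsquo;)
  [0x92] = 0x2019, -- right single quote     (&rsquo;)
  [0x93] = 0x201C, -- left double quote      (&ldquo;)
  [0x94] = 0x201D, -- right double quote     (&rdquo;)
  [0x96] = 0x2013, -- en dash                (&ndash;)
  [0x97] = 0x2014, -- em dash                (&mdash;)
  [0x99] = 0x2122, -- trade mark sign        (&trade;)
}

-- Hypothetical helper: translate one byte value to a code point.
-- Unlisted bytes are returned unchanged, which is only right for
-- ASCII and for 0xA0-0xFF, where CP 1252 agrees with Latin-1.
function cp1252_to_unicode(byte)
  return cp1252[byte] or byte
end
```

With a table like this, a byte such as 0x80 can be looked up first and the resulting code point then used to find the entity name.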
The following code snippet may help interpret the entity definitions;
it should work with either the HTML 4 or the XHTML entity formats, but
I haven't tested it.
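ent2uni, uni2ent = {}, {}

-- Read one .ent file and record each entity in both lookup tables.
function readents(filename)
  for l in io.lines(filename) do
    -- %a* skips the CDATA keyword, which appears in the HTML 4 files
    -- but not in the XHTML ones. Double-escaped XHTML values such as
    -- "&#38;#60;" are not matched by this pattern.
    local name, val = l:match '^<!ENTITY%s+(%w+)%s+%a*%s*"&#(%d+);"'
    if name then
      val = tonumber(val)  -- store code points as numbers, not strings
      ent2uni[name], uni2ent[val] = val, name
    end
  end
end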
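As a possible way to use such tables for the htmlentities function asked about earlier, here is a minimal sketch. It makes several assumptions: Lua 5.1 or later; input text encoded as Latin-1, so that each byte is its own code point; and entity definitions supplied as a string rather than a file, so the example runs without downloading anything. The names readentstring and htmlentities are invented here, and the sample definitions are hand-written lines in the two formats with shortened comments, not copies of the W3C files.

```lua
ent2uni, uni2ent = {}, {}

-- Hypothetical variant of readents that parses the text of an .ent
-- file instead of opening a file by name.
function readentstring(text)
  for l in text:gmatch '[^\n]+' do
    local name, val = l:match '^<!ENTITY%s+(%w+)%s+%a*%s*"&#(%d+);"'
    if name then
      val = tonumber(val)
      ent2uni[name], uni2ent[val] = val, name
    end
  end
end

-- Sample definitions: the first three use the HTML 4 format (with
-- CDATA), the last two use the XHTML format (without it).
readentstring [[
<!ENTITY nbsp   CDATA "&#160;" -- no-break space -->
<!ENTITY eacute CDATA "&#233;" -- latin small letter e with acute -->
<!ENTITY amp    CDATA "&#38;"  -- ampersand -->
<!ENTITY quot    "&#34;"> <!--  quotation mark -->
<!ENTITY gt      "&#62;"> <!--  greater-than sign -->
]]

-- Replace every byte that has a known entity with its &name; form.
-- Assumes s is Latin-1, so a byte value is also a code point.
function htmlentities(s)
  return (s:gsub('.', function(c)
    local name = uni2ent[c:byte()]
    if name then return '&' .. name .. ';' end
    -- returning nothing leaves the character unchanged
  end))
end
```

Calling htmlentities on a Latin-1 string then rewrites only the characters that appear in uni2ent. UTF-8 input would need to be decoded into code points first, which this sketch does not attempt.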