Xml Tree
(This document was originally part of LazyKit.)
This document describes a mid-level representation of XML as trees in Lua. It is intended to be saved into the lua-users wiki as a place to remember discussion on this subject.
The representation is intended to describe data from the XML Infoset, yet remain simple to work with in idiomatic Lua code. The representation is an interface, allowing fancy implementations to use metatables to provide the same interface as bare tables.
LxpTree and LazyTree implement this.
<paragraph justify='centered'>first child<b>bold</b>second child</paragraph>
lz = {name="paragraph", attr={justify="centered"}, "first child", {name="b", "bold", n=1}, "second child", n=3}
A tree is a Lua table representation of an element and its contents. The table must have a name key, giving the element name.
The tree may have an attr key, which gives a table of all of the attributes of the element. Only string keys are relevant. (LuaExpat uses numeric keys to mark attributes that were defaulted from the DTD.) A convenience iterator like xattrpairs(tree) should be provided.
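As a rough illustration, such an iterator could be written as below. This is only a sketch: the name xattrpairs comes from the text above, but the body is an assumption, not the actual xmliter code. It skips every key that is not a string and yields nothing for elements without an attr table.

```lua
-- Sketch of xattrpairs: iterate over string-keyed attributes only.
-- (Assumed implementation; the real iterator may differ.)
local function xattrpairs(tree)
  local attr = tree.attr or {}        -- elements without attributes yield nothing
  local function iter(t, k)
    local v
    repeat
      k, v = next(t, k)               -- advance the raw table traversal
    until k == nil or type(k) == "string"
    return k, v
  end
  return iter, attr, nil
end

-- Usage: the numeric key stands in for parser bookkeeping and is skipped.
local tree = {name="paragraph", attr={justify="centered", lang="en", [1]="justify"}}
local seen = {}
for k, v in xattrpairs(tree) do
  seen[k] = v
end
```

Because it is built on next, the iteration order is unspecified, which matches the rule that attribute order carries no meaning.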
If the element is not empty, each child node is contained in tree[1], tree[2], etc. Child nodes may be either strings, denoting character data content, or other trees.
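To make the layout concrete, here is a small sketch that reads the example tree and gathers its character data. The helper text_of is hypothetical and not part of any library mentioned here. It walks the children until it reaches the first missing index, so it works on bare table literals. The # operator assumes Lua 5.1 or later.

```lua
-- Hypothetical helper: concatenate all character data beneath a node.
local function text_of(node)
  if type(node) == "string" then
    return node                        -- character data child
  end
  local parts = {}
  local i = 1
  while node[i] ~= nil do              -- children are tree[1], tree[2], ...
    parts[#parts + 1] = text_of(node[i])
    i = i + 1
  end
  return table.concat(parts)
end

local lz = {name="paragraph", attr={justify="centered"},
  "first child", {name="b", "bold", n=1}, "second child", n=3}

print(lz.name)           -- paragraph
print(lz.attr.justify)   -- centered
print(lz[2].name)        -- b
print(text_of(lz))       -- first childboldsecond child
```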
Parsers should try to merge adjacent character data content. That is, they should avoid producing something like:
{name="p", "Hello w", "orld"}
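One way a tree builder might do this is to merge at the moment a child is appended. The sketch below assumes a hypothetical helper append_child, which is not part of LuaExpat or LazyKit. It maintains the n key as it goes and uses the # operator as a fallback, so it assumes Lua 5.1 or later.

```lua
-- Hypothetical helper: append a child, merging adjacent character data.
local function append_child(tree, child)
  local n = tree.n or #tree
  if type(child) == "string" and type(tree[n]) == "string" then
    tree[n] = tree[n] .. child         -- extend the preceding text node
  else
    n = n + 1
    tree[n] = child                    -- new text node or element
  end
  tree.n = n
  return tree
end

local p = {name="p"}
append_child(p, "Hello w")
append_child(p, "orld")
append_child(p, {name="br", n=0})
append_child(p, "!")
-- p is now {name="p", "Hello world", {name="br", n=0}, "!", n=3}
```

Text on either side of an element stays separate, so only runs of character data that are truly adjacent are joined.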
Parsers should include an n key, giving the number of child nodes. However, to be tolerant of tree literals in code, general-purpose processing code should use code like tree.n or table.getn(tree) (found as xmliter.getn(tree)), in the same way they would use table.getn(list) on normal lists instead of list.n.
(Why a separate getn? This is necessary because table.getn(tree) does not perform an ordinary lookup of tree.n; it uses rawget(tree, "n") instead, which bypasses metatables. Fancy tree implementations may need to use a metatable call to find the number of children.)
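The difference can be sketched as follows. The getn below is an assumed stand-in for xmliter.getn, whose real implementation is not shown in this document, and proxy is a hypothetical "fancy" tree that computes n on demand. The # operator assumes Lua 5.1 or later; on Lua 5.0, table.getn would take its place.

```lua
-- Assumed stand-in for xmliter.getn: prefer tree.n, fall back to the length.
local function getn(tree)
  local n = tree.n                     -- ordinary index, so __index is honoured
  if n ~= nil then
    return n
  end
  return #tree                         -- bare literals written without n
end

-- Hypothetical fancy tree: children live in a hidden table.
local function proxy(name, children)
  return setmetatable({name = name}, {
    __index = function(_, key)
      if key == "n" then
        return #children               -- computed on demand
      end
      return children[key]
    end,
  })
end

local literal = {name="p", "Hello world"}
local lazy = proxy("p", {"Hello ", {name="b", "world"}})

print(getn(literal))       -- 1
print(getn(lazy))          -- 2
print(rawget(lazy, "n"))   -- nil
```

The last line shows why a raw lookup is not enough: the proxy table itself never stores n.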
Syntactic details of XML source files are out of scope. To wit:
The order of attributes on elements is unimportant.
The presence of a CDATA section is not interesting; it is just another way to write character data.
Comments are not interesting.
The source of attributes, whether explicit or specified in a DTD, is not interesting.
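Since these details are not recorded, code that writes a tree back out is free to choose them. The following is a minimal sketch of that idea; serialize and escape are hypothetical names, not functions from LazyKit or LuaExpat. It emits attributes in sorted order and writes all character data as escaped text, never as CDATA sections. It assumes Lua 5.1 or later and string-valued attributes.

```lua
-- Hypothetical helper: escape markup characters in text and attribute values.
local function escape(s)
  return (s:gsub('[&<>"]', {
    ["&"] = "&amp;", ["<"] = "&lt;", [">"] = "&gt;", ['"'] = "&quot;",
  }))
end

-- Hypothetical serializer: turn a tree back into XML text.
local function serialize(node)
  if type(node) == "string" then
    return escape(node)                -- character data, however it was written
  end
  local out = {"<", node.name}
  local names = {}
  for k in pairs(node.attr or {}) do
    if type(k) == "string" then        -- only string keys are attributes
      names[#names + 1] = k
    end
  end
  table.sort(names)                    -- any order is valid; sorting is deterministic
  for _, k in ipairs(names) do
    out[#out + 1] = " " .. k .. '="' .. escape(node.attr[k]) .. '"'
  end
  local n = node.n or #node
  if n == 0 then
    out[#out + 1] = "/>"
  else
    out[#out + 1] = ">"
    for i = 1, n do
      out[#out + 1] = serialize(node[i])
    end
    out[#out + 1] = "</" .. node.name .. ">"
  end
  return table.concat(out)
end

local lz = {name="paragraph", attr={justify="centered"},
  "first child", {name="b", "bold", n=1}, "second child", n=3}
print(serialize(lz))
-- <paragraph justify="centered">first child<b>bold</b>second child</paragraph>
```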
The following are in scope and are preserved:
All elements, regardless of duplicates.
All character data. That includes mixed content.
The order of the above.
The following are not yet handled:
DTD. This could go in root.dtd.
Encoding. However, declaring everything to be in UTF-8 might not be so bad, especially for US-ASCII users.
Namespaces. I don't have enough experience with them to propose a design.