- Subject: lazy xml teaser
- From: Jay Carlson <nop@...>
- Date: Sun, 22 Feb 2004 22:40:34 -0500
Here's the start of the docs for something I've been working on. Lemme
know what you think.
lazynode: lazily construct XML trees
Why is this interesting? Trees are a natural data model for processing
XML documents. A simple tree implementation reads the entire document
into memory at once. For large documents, this can be too expensive.
Although callback and event APIs are memory-efficient, they are
painful to program against.
Lazynode constructs a conventional-seeming XML tree as its contents
are referenced, parsing more of the source document as necessary to
fill out the contents on demand.
Given stylized iterators, memory consumption can be limited to
particular subtrees. Consider:
for i,child in xpairs_consume(lz) do
  if child.attr.externalref then
    print(child.name)
    table.insert(references, child)
  end
end
where the _consume family of iterators nils out elements from their
parent before returning them. If the body of the loop does not retain
a reference to the child elsewhere, it will become eligible for
garbage collection as soon as the next iteration begins.
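For illustration, here is a minimal sketch of what a consuming
iterator might look like, written against the plain-table view of a
node shown under the implementation details below (numbered children
plus an n count). The real xpairs_consume also has to cooperate with
the lazy parsing machinery; that part is omitted here.

function xpairs_consume(node)
  local i = 0
  return function()
    i = i + 1
    local child = node[i]
    if child == nil then return nil end
    node[i] = nil    -- drop the parent's reference so the child can
                     -- be collected once the loop body is done with it
    return i, child
  end
end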
Although not currently implemented, other consuming forms may interact
with the XML parser for greater savings:
<document>
  <firstname>Jay</firstname>
  <lastname>Carlson</lastname>
  <bodytext>Spending too much time listening to <ref>In Utero</ref>
  can be [...]
  <title>I Think I'm DOM</title>
</document>
lastname, title = xmlextract.strings_consume(tree, "lastname", "title")
The strings_consume filter can potentially turn off character data and
other events inside any node it knows it doesn't need (like bodytext),
as references to them cannot possibly affect the rest of the program.
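To make the idea concrete, here is a naive, hypothetical version of
strings_consume that skips the parser-level event suppression and just
consumes children with an iterator like the one sketched above.
node_text is a made-up helper that flattens an element's character
data; the flattening rules are a guess.

-- Naive sketch of strings_consume, without the event-suppression
-- optimization described above.
local function node_text(node)
  local parts = {}
  for _, child in xpairs_consume(node) do
    if type(child) == "string" then
      table.insert(parts, child)
    else
      table.insert(parts, node_text(child))
    end
  end
  return table.concat(parts)
end

local function strings_consume(tree, ...)
  local wanted, found = {}, {}
  for i = 1, arg.n do wanted[arg[i]] = i end   -- Lua 5.0 vararg table
  for _, child in xpairs_consume(tree) do
    if type(child) ~= "string" and wanted[child.name] then
      found[wanted[child.name]] = node_text(child)
    end
  end
  found.n = arg.n           -- so unpack returns one result per name
  return unpack(found)
end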
Implementation details:
Given the following XML:
<paragraph justify='centered'>first child<b>bold</b>second child</paragraph>
A lazynode will appear to have the following contents:
lz = {name="paragraph", attr={justify="centered"},
      "first child",
      {name="b", "bold", n=1},
      "second child",
      n=3
}
However, at the start of parsing, the actual underlying table will contain:
lz = {name="paragraph", attr={justify="centered"},
      _read_so_far=0
}
After a reference to lz[1], it will contain:
lz = {name="paragraph", attr={justify="centered"},
      "first child",
      _read_so_far=1
}
And after a reference to lz[2]:
lz = {name="paragraph", attr={justify="centered"},
      "first child",
      {name="b", _read_so_far=0},
      _read_so_far=2
}
Note that the child is read lazily as well. However, a reference to lz[3]
will force all of lz[2] to be completed:
lz = {name="paragraph", attr={justify="centered"},
      "first child",
      {name="b", "bold", n=1},
      "second child",
      _read_so_far=3
}
Reading either lz[4] (which is nil) or lz.n will force the completion
of the node. In particular, reading lz.n has to parse the rest of the
node, since we can't know how many children there are until the
closing tag is seen.
In general, calling normal iterators like pairs() on a lazy node is a
bad idea, as all kinds of bookkeeping goo (like _read_so_far) may be
present.
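For instance, with lz in the partially-parsed state shown above (only
lz[1] read so far), a plain pairs() loop sees the bookkeeping rather
than the document:

for k, v in pairs(lz) do print(k, v) end
-- possible output (order is unspecified):
--   1               first child
--   name            paragraph
--   attr            table: 0x...
--   _read_so_far    1
-- pairs() never triggers the lazy fill, so unread children don't
-- show up at all.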
Under the surface, this is implemented by pulling from a queue of
events generated by expat.
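To give a rough idea of the forcing logic, here is a minimal sketch
assuming laziness is done with an __index metamethod and a
next_event() function that pulls the next queued expat event. The
names and event shapes are made up; only the _read_so_far protocol
from the walkthrough above is real.

-- Rough sketch of the on-demand fill.  next_event() and the event
-- shapes are assumptions, not the real interface.
local lazy_mt = {}

local function incomplete(node)
  return rawget(node, "_read_so_far") ~= nil
end

lazy_mt.__index = function(node, key)
  while incomplete(node) do
    local i = rawget(node, "_read_so_far")
    -- Stop once the requested child index has been filled in.
    if type(key) == "number" and key <= i then break end
    -- Events for this node only resume after the previous child is
    -- done; reading its `n` field forces it to completion.
    local prev = rawget(node, i)
    if type(prev) == "table" and incomplete(prev) then
      local _ = prev.n
    end
    local ev = next_event()          -- next queued expat event
    if ev.type == "chardata" then
      rawset(node, i + 1, ev.text)
      rawset(node, "_read_so_far", i + 1)
    elseif ev.type == "starttag" then
      rawset(node, i + 1, setmetatable(
        {name = ev.name, attr = ev.attr, _read_so_far = 0}, lazy_mt))
      rawset(node, "_read_so_far", i + 1)
    else   -- end tag: this node is now complete
      rawset(node, "n", i)
      rawset(node, "_read_so_far", nil)
    end
  end
  return rawget(node, key)
end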