Consuming Xml
(This is part of LazyKit.)
See XmlIter for non-consuming versions of these functions.
xpairs_c(tree), xnpairs_c(tree), and xmliter.switch_c(parent, ftable, [opts]) also iterate over the children of tree, but they consume those children as they process them. The following two fragments have similar semantics:
for i,x in xpairs(parent) do
  parent[i] = nil
  do_something_with(x)
end

for i,x in xpairs_c(parent) do
  do_something_with(x)
end
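xmliter.switch_c can express the same kind of loop as a dispatch table. A minimal sketch, assuming ftable maps element names to handler functions called with each matching child (the element name and handler body are illustrative, not part of the documented API):

-- Hypothetical: dispatch each <entry> child to its handler,
-- consuming children as they are processed.
xmliter.switch_c(parent, {
  entry = function(x)
    do_something_with(x)
  end,
})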
Using a consuming iterator means that you do not care about accessing previously processed trees through parent. However, you can still save the current tree for later use:
for i,x,name in xnpairs_c(parent) do
  if x.name == "xref" then
    table.insert(references, x)
  end
end
The primary reason to use consuming iterators is to reduce memory usage. When using conventional XML trees, this may help a little if you are building up another data structure while tearing down the XML tree; parts of the tree you have already processed are eligible for garbage collection, saving space for your new structure.
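For example, a pass that tallies children by element name while tearing down the tree might look like the following sketch (only x.name, as used in the examples above, is assumed):

-- Count children by element name; each subtree becomes garbage
-- as soon as it has been counted.
local counts = {}
for i,x in xpairs_c(parent) do
  counts[x.name] = (counts[x.name] or 0) + 1
end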
However, when using LazyTree XML trees, memory usage can be vastly smaller. Consider processing a large log file:
<log>
  <entry>[....]</entry>
  <entry>[....]</entry>
  [...millions of elements later...]
  <entry>[....]</entry>
</log>
With a conventional XML tree, processing this requires space linearly proportional to the size of all the <entry> elements. With normal iterators and a lazy tree, it requires space linearly proportional to the size of all previously processed <entry> elements (future elements are only read on demand). With consuming iterators and a lazy tree, processing requires space proportional to the size of only a single <entry> element, as previously processed <entry>s have been forgotten.
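A scan of that log which keeps only one <entry> live at a time might look like this sketch (log_tree and process_entry are hypothetical):

-- Each <entry> is materialized on demand, handled, then forgotten.
for i,x in xpairs_c(log_tree) do
  if x.name == "entry" then
    process_entry(x)
  end
end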
A secondary benefit to consuming iterators is that they may reduce CPU usage a small amount, since the Lua 5.0 garbage collector does not have to work as hard during collections when less live data is present.
What is really going on here is that iterators provide an event-based interface to tables. Consuming iterators provide many of the same benefits as pure event-based XML parsers, while letting you switch fluidly between event-based and tree-based processing when that makes sense.
One concrete example is XML-RPC <value> nodes. If a <value> contains an element such as <int>, that element is the value of the <value> node. Otherwise, the character data content of the <value> element is a string-typed value. Code processing the element must potentially look ahead past an arbitrary number of initial character data nodes to find out whether there is an element lurking inside, or else default to string content. An event-based strategy can treat the list of children of an element as an arbitrary-length lookahead buffer.
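A sketch of that lookahead, assuming character data children appear as plain Lua strings and element children as tables with a name field (as in the examples above); decode_typed is a hypothetical helper:

-- Decode an XML-RPC <value> node, buffering character data until
-- we know whether an element child gives the value its type.
local function decode_value(value_node)
  local text = {}
  for i,x in xpairs_c(value_node) do
    if type(x) == "table" then
      -- A typed element such as <int>; it determines the value.
      return decode_typed(x)
    end
    -- Character data: buffer it in case no element turns up.
    table.insert(text, x)
  end
  -- Only character data was found: the value is a string.
  return table.concat(text)
end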
It is always safe to replace a consuming iterator with a non-consuming iterator; the only consequence may be memory exhaustion when processing huge documents.
It makes the most sense to use a consuming iterator only as the last step in processing a tree. Because of how lazy XML trees work, it is not an error to touch child nodes before calling a consuming iterator.
When recursively processing elements, you should only call a consuming iterator if you know your caller no longer cares about its contents. A rule of thumb is to only call a consuming iterator inside another consuming iterator.
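As a sketch of that rule of thumb, a recursive descent where every level consumes (book and the element structure are illustrative):

-- The outer loop consumes <chapter> elements; since the caller no
-- longer cares about their contents, the inner loop may consume too.
for i,chapter in xpairs_c(book) do
  for j,para in xpairs_c(chapter) do
    do_something_with(para)
  end
end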