[Date Prev][Date Next][Thread Prev][Thread Next]
[Date Index]
[Thread Index]
- Subject: Re: ANN: Serpent: Lua serializer and pretty printer
- From: Paul K <paulclinger@...>
- Date: Sat, 9 Jun 2012 21:38:34 -0700
Hi David,
Thank you for the detailed feedback.
> (1) Dumping _G would be a good test of this. The output of
> serpent.printmult(_G) is attached. As seen, the main downside is that
> some packages are expanded deeply inside _G.package.loaded rather than
> _G. Dumping tables at the lowest possible nesting level would be
> ideal though may add some complexity.
I agree; In fact, I tested it on _G as one of the use cases. The
packages are expanded inside _G.package.loaded because it does depth
first walk; I can see how much work it is to do a breadth first one.
This would made them expand at the lowest nesting level.
> (2) A "--[[table: 0x9aa7988]]" style comment is appended after each
> table, even when the table is non-recursive. A possible concern is
> that these addresses are not deterministic, so dumps of structurally
> equivalent tables in different program executions will give different
> results. This makes textual comparisons (diffs) of structures
> difficult and causes large diffs when serializations are maintained
> under revision control. For readability, you may wish to replace
> addresses (32- or 64-bit) with an integer counter that gives
> deterministic dumps on structural equality, though insertions could
> still cause massive renumbering. Some time ago I was working on a
> dumper like Perl Data::Dumper that would dump `x = {}; return
> {a={b=x},c={d=x}}` as "{a={b={}}, c={d=T.a.b}}", where "T" refers to
> the top-level table. Here, T could even have a metatable such that
> T.a.b evaluates to a node that is expanded on a subsequent walk
> through the table.
These comments can be disabled permanently, but they are too useful
for pretty printing. I'm thinking about an option for these.
> (3) The keys are not currently being sorted. Sorting would improve
> readability and deterministic output but may affect your performance
> numbers.
They used to be sorted. The problem is that they were sorted with
something like this:
table.sort(o, function(a,b) -- sort keys; numbers first
if tonumber(a) == tonumber(b) then return tostring(a) < tostring(b)
else return (tonumber(a) or 0) < (tonumber(b) or 0) end end)
to get numbers first and sort the rest and it was painfully slow
(probably 30% slower than the current version on the benchmark I
included). It is a huge penalty to pay for sorted keys, although in
many cases it may not matter that much if you care about pretty
printing mostly. I decided to remove it for this reason, although it
may still leave as an option.
> (4) In the attached dump of _G, the string.gmatch is curiously set to
> "nil --[[ref]]". It took me some time before realizing this was
> because the deprecated string.gfind is an alias to string.gmatch, and
> for gfind it outputs "= string.gmatch --[[function: 0x9b3ab28]]".
> Currently, in Lua 5.1, string.gmatch would serialize as string.gfind,
> which will be absent upon unserializing with for a VM's compiled with
> LUA_COMPAT_GFIND undefined. Perhaps [[ref]] should have some more
> information. It would also be ideal if it could prefer non-deprecated
> functions, but that requires special cases.
I can (and used to) point to the actual path it refers to, but the
problem was that I couldn't guarantee that the path will be safe to
deserialize. For example, I had this in one of the tests:
...nil --[[ref to a.b[a[1]]]],
which makes it invalid because of early ]]. I played with ]=+], but
even that is not a guarantee as you may have a string key that
includes that very fragment. I can analyze the string to see if it has
anything that matches my closing bracket, but it was getting too
complex/slow for my taste ;).
> (5) For pretty printing, "1\n2" would be nicer than "1\0102". \n is
> common, but beginning users might not recognize \0102.
Agree. Will do.
> (6) `package.loadlib` dumping shows an `--[[err]]`, which looks
> worrisome as presented as an error not warning. This occurs because
> it's not in globals. Other cases include package.loaders[i].
This is because the value was stringified. I intentionally didn't
serialize those as their values depend on what's loaded locally and
are probably not portable. --[[err]] is simply a visual reminder this
that value is not serialized. The main function include fourth
parameter, which when set to "true" will cause a fatal error on a
values like this.
> (7) The naming of "serialize", "printmult", and "printsing" was
> somewhat jarring to me (think: "print multiple objects" and "prints
> ing"/"print sing (song)"). I thought the latter would internally
> invoke `print` when first running it -- not that I wanted it to
> because it's not general. Also, single- and multi-line forms of
> serialization and pretty printing would suggest 2 x 2 = 4 combinations
> for orthogonality. In my recent lua-mbuild, for example (in which I
> would consider adding your library), I currently use my own trivial
> dumper to serialize a (non-recursive) data structure to disk in a
> multi-line format (to allow easier debugging), and I'd be tempted to
> call printmult even though this is serialization not pretty printing
> as the name implies. Finally, "The library provides three functions
> -- serialize, printmult, and printsing -- with the last two being
> shortcuts for the main serialize function." is not strictly correct:
> All three are wrappers around a "local function serialize" that is not
> exposed, and, as is, it's not possible to pass through a nil name in
> the public version of `serialize`.
That's correct and is my oversight. The intent was to be able to
overwrite the default value of '_' with anything and indeed you can
with anything except, alas, nil (and false). I'm rethinking the
interface in light of this (I have noticed the same thing working on a
couple of changes) and am open to suggestions. printmult does, in fact
pretty printing, rather than serialization, as it doesn't include the
section with all shared/circular references.
It may be easier to provide just one method (as I had originally), but
I didn't like the idea of always providing a variable name when full
serialization is needed (as in `serialize(t, '_')`).
> (8) The coding style is a little too compact for my taste, concerning
> readability. Usually I prefer to define variables on separate lines,
> as opposed to
>
> local n, v, c, d = "serpent", 0.1, -- (C) 2012 Paul Kulchenko; MIT License
> "Paul Kulchenko", "Serialization and pretty printing of Lua data types"
> local keyword, globals, G = {}, {}, (_G or _ENV)
> local ttype, level = type(t), (level or 0)
>
> The above single character top-level variable names aren't so great
> either, but you might just inline those values into the table at the
> bottom. Some may also object to
>
> local function safestr(s) return type(s) == "number" and
> (snum[tostring(s)] or s)
> or type(s) ~= "string" and tostring(s) -- escape NEWLINE/010 and EOF/026
> or ("%q"):format(s):gsub("\010","010"):gsub("\026","\\026") end
Yes, I was trying to keep in intentionally compact, but not too terse.
Maybe went a bit overboard with this.
> (9) Another interesting test is `_ENV = loadstring(require
> "serpent".serialize(_G))(); <test suite code>`. It won't work for all
> cases, but it does still run less than trivial things like life.lua.
>
> (10) Dumps of bytecode are probably not useful for pretty printing
> unless perhaps you were to decompile the bytecode. More useful info
> may be found in debug info (including sometimes a file name with
> source code), in cases debug.* is even permitted. Some such things,
> however, are probably outside the scope of a simple dumping module.
> Bytecode may still have limited uses though in serialization (transfer
> code between Lua states?).
I was thinking about using debug.* to provide original function
name/file in comments and to capture upvalues, but this was going
against my 90% rule (I was targeting 90% of use cases with the
existing code).
> (11) One area I've used dumpers is to dump AST's (dump.lua in
> LuaInspect). In cases like {tag=Op, '+', ...}, I'd prefer the named
> part *before* the positional part, particularly the 'tag' name at the
> very front. This might be outside the scope of this module though.
I'm frequently dumping ASTs myself and would prefer event more
changes, like having lineinfo encoded as one line while the rest may
still be multi-line and so on, but I couldn't find a straightforward
way to fit it into this code without penalty for regular use cases.
> (12) In "nil values are included when expected ({1, nil, 3} instead of
> {1, [3]=3})", I'm not sure what "expected" means given the
> undefinedness of # with holes. I'd actually expect the latter format
> deterministically for sparse arrays.
"Expected" in the sense of being "reported" by #t. if #t return 3,
Serpent will serialize it as {1, nil, 3}, rather than {1, [3]=3}. This
should match the original representation closer (although it probably
doesn't matter in most cases). #{1, nil, 3}==3, but #{1, [3] = 3} ==
1. The array content is the "same", but #t as reported is different,
so, I was trying to match that.
> (13) Maybe use 1/0 rather than math.huge (after all you already rely
> on 0/0) to avoid simple serializations of numerical arrays causing
> lookups into `math`, which won't exist under empty _ENV's sandboxes.
1/0 would be preferred indeed, but it was a tradeoff between
readability of prettyprinted output and portability. Some people may
not recognize 1/0 as math.huge, although those that serialize
math.huge numbers probably would.
I've checked in #5 and #13 and welcome your feedback on everything
else, especially #7. Thanks again.
Paul.