Re: string.pack with bit resolution

- no 128-bit integers (including IPv6 addresses or UUIDs)?

- no 80-bit long doubles (native on the x86 FPU)?

- not all IEEE 754 number types are supported?

- no native vector formats (AVX/SSE/GPUs...), including for OpenCL and similar APIs?

- no support for variable-length length fields (using unary encoding, as in several audio-video codecs) for integers of arbitrary length?

- no support for differential encodings (also used in several audio-video codecs), including for length prefixes? Note that the mantissa can always drop its most significant bit, which is always 1 (zero being encoded with a specific length value instead), and that bit can be replaced by a sign bit: integers then become encoded much like variable-length floating-point numbers.

- And why do you still need space separators between EVERY item? Those spaces take up a considerable share of the generated stream. We can certainly imagine a format that requires no separator at all and uses distinct prefixes only: since you already expect 8-bit-clean bytes with no restriction on their values, you are already building a binary format.

- no support for subtable backward references (which allow non-tree structures, including recursion and shared subbranches)?
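
To make the points about unary-coded lengths and the implicit leading bit concrete, here is a minimal sketch of Elias gamma coding, one well-known scheme of that family. It is written in Python only for brevity; the function names are illustrative, nothing here is Lua API, and bits are kept as text for readability. The length is written in unary, then the mantissa follows without its leading 1 bit.

```python
def gamma_encode(n):
    """Encode a positive integer as an Elias gamma bit string."""
    if n < 1:
        raise ValueError("Elias gamma only encodes integers >= 1")
    mantissa = bin(n)[3:]          # binary digits without the leading '1'
    # Unary length prefix: one '0' per mantissa bit, closed by a '1'.
    # That closing '1' doubles as the stripped most significant bit.
    return "0" * len(mantissa) + "1" + mantissa


def gamma_decode(bits, pos=0):
    """Decode one value starting at pos; return (value, next position)."""
    width = 0
    while bits[pos] == "0":        # count the unary prefix
        width += 1
        pos += 1
    pos += 1                       # skip the terminating '1'
    mantissa = bits[pos:pos + width]
    value = (1 << width) | (int(mantissa, 2) if mantissa else 0)
    return value, pos + width


# Several values concatenated with no separator still decode unambiguously.
stream = "".join(gamma_encode(n) for n in (1, 5, 300))
pos, decoded = 0, []
while pos < len(stream):
    value, pos = gamma_decode(stream, pos)
    decoded.append(value)
print(decoded)                     # [1, 5, 300]
```

Small values cost very few bits (the value 1 takes a single bit), and a sign bit or a zero escape could be layered on top, which this sketch does not attempt.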

Note that this last point is different from classic dictionary-based Lempel-Ziv compression, whose decompression creates a copy and does not imply shared data (and is thus usable for compressing long "strings"). For data compression there is only a very limited repetition prefix, and I think that classic compression (like zlib) would perform better.

Or maybe you assume that the result of your encoding will pass through a downstream compressor, but this adds another layer that is used for everything and requires its own internal buffering. Memory and CPU usage then increase, and the extra working set lowers the hit rate of the internal CPU caches. If the two passes are chained and not parallelized, this requires external buffering, which increases the memory or I/O footprint. That can be a limitation in small embedded systems such as IoT devices: a streamable format working in one pass would be preferable.

For subtable backward references, we have to choose whether to mark the specific items that will be referenced downstream (to avoid maintaining a large backward lookup buffer), and we need a way to remove old backward references that are no longer used downstream. The latter can be implemented by overwriting one of the previous markers. Decoding the references requires a dynamic dictionary; if we don't mark, the dictionary would grow as large as the full stream. In most cases, however, cyclic structures are local to some subtables and are not used later, so a special mark that creates a stack, on which we push and pop lookup-dictionary contexts, could instantly clean up the past marks that are no longer referenced downstream. This allows representing any kind of graph, not just trees.
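
As a rough illustration of the marking idea, the following sketch (Python, with nested lists standing in for Lua tables; the token names "open", "close", "ref" and "val" are invented here) marks only the subtables that are reached more than once, so the decoder's dictionary stays small. It does not implement the push/pop of dictionary contexts or the overwriting of old markers.

```python
def encode(root):
    """Flatten nested lists into tokens, marking only shared/cyclic lists."""
    seen, shared = set(), set()

    def scan(node):                      # pass 1: find lists reached twice
        if not isinstance(node, list):
            return
        if id(node) in seen:
            shared.add(id(node))
            return
        seen.add(id(node))
        for item in node:
            scan(item)

    scan(root)
    marks, tokens = {}, []

    def emit(node):                      # pass 2: write the token stream
        if not isinstance(node, list):
            tokens.append(("val", node))
        elif id(node) in marks:
            tokens.append(("ref", marks[id(node)]))
        else:
            mark = None
            if id(node) in shared:
                mark = marks[id(node)] = len(marks)
            tokens.append(("open", mark))
            for item in node:
                emit(item)
            tokens.append(("close",))

    emit(root)
    return tokens


def decode(tokens):
    """Rebuild the structure; the dictionary holds marked lists only."""
    marked, stack, root = {}, [], None
    for tok in tokens:
        if tok[0] == "open":
            node = []
            if tok[1] is not None:
                marked[tok[1]] = node    # register early so cycles resolve
        elif tok[0] == "ref":
            node = marked[tok[1]]
        elif tok[0] == "val":
            node = tok[1]
        else:                            # "close"
            stack.pop()
            continue
        if stack:
            stack[-1].append(node)
        else:
            root = node
        if tok[0] == "open":
            stack.append(node)
    return root
```

Unlike an LZ-style copy, decoding `[s, s]` this way yields the same object twice, and a list that contains itself round-trips as a genuine cycle. The price is the first pass over the data, which a strictly one-pass encoder would have to replace by marking pessimistically.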

On Sun, Oct 20, 2019 at 09:24, bil til <flyer31@googlemail.com> wrote:

One further proposal, concerning the “n” specifier.

To appease the “native” fans, I would propose keeping “n” and using “N” for
“platform-independent usage”:
“n” a lua_Number (native format, faster packing / unpacking) (*)
“N” a lua_Number (platform-independent format)

(*) Warning: use conversion options marked “native” only locally; they
cannot be used to transfer data between lua_32bit <> lua_64bit.
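
The same native versus platform-independent split exists in Python's struct module, which makes the trade-off easy to demonstrate. This is an analogy only, not the proposed Lua behavior: “<” plays the role of the portable “N”, and “@” the role of the native “n”.

```python
import struct

# "<" selects standard sizes and little-endian order: identical everywhere.
portable = struct.pack("<ld", 42, 1.5)      # 4-byte long + 8-byte double

# "@" selects the C compiler's native sizes, byte order and alignment:
# fine locally, but the layout depends on the platform that produced it.
native = struct.pack("@ld", 42, 1.5)

print(len(portable))                 # always 12
print(struct.calcsize("@l"))         # platform-dependent (commonly 4 or 8)
```

A string packed with the native layout can only be unpacked reliably by a build that shares the same sizes and alignment, which is exactly the restriction stated in the warning above.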

To allow this, I would add 3 further type numbers to the type byte list:
0x48  (4/8/16B)  a lua_Number of type integer, native format
                 (int32 / int64 / int128)
0x49  (4/8/16B)  a lua_Number of type float, native format
                 (float32 / float64 / float128)
0x4A  (4/8B+nB)  a lua_String with n bytes, preceded by its length as a
                 native size_t
And one further improvement to the type table: integers should not only
support 0x04/0x08/0x10, but ideally 0x01…0x10 in “optimum selection”, i.e.
the smallest size that holds the value, for 7, 15, 23, … significant bits
(the MSB always being the sign bit, of course). This would have 2 advantages:
- Tables with many integers would typically produce much shorter strings.
- The “packed string” format for an integer table becomes unique. (Before
unpacking an integer table, you could compare the string against some
reference string; if they match, unpacking is not needed, as you can
then reuse the table of the reference string.)
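
A minimal sketch of this “optimum selection”, in Python for brevity and assuming the type byte is simply the byte count n (0x01..0x10) followed by n little-endian two's-complement bytes; pack_int and unpack_int are hypothetical names, not existing functions:

```python
def pack_int(value):
    """Type byte n (0x01..0x10) followed by n little-endian bytes."""
    for n in range(1, 17):
        bits = 8 * n
        # Smallest n whose signed range contains the value.
        if -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
            return bytes([n]) + value.to_bytes(n, "little", signed=True)
    raise OverflowError("value needs more than 16 bytes")


def unpack_int(data, pos=0):
    """Return (value, next position) for an integer item at pos."""
    n = data[pos]
    if not 1 <= n <= 16:
        raise ValueError("not an integer type byte")
    body = data[pos + 1:pos + 1 + n]
    return int.from_bytes(body, "little", signed=True), pos + 1 + n


print(pack_int(127))     # b'\x01\x7f'
print(pack_int(128))     # b'\x02\x80\x00'
```

Because the encoder always picks the smallest size, every integer has exactly one encoding, so two packed integer tables are byte-for-byte equal exactly when their contents are equal. That is what makes the comparison against a reference string safe.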

After these modifications, the complete type byte list would be the following:
Type byte     Size       Meaning
0x00          0B         Boolean false
0x01..0x10    1-16B      int8, int16 … int63 (7, 15, … 63 significant bits)
                         (warning: unpacking into LUA_32 will use
                         MIN_INT32/MAX_INT32 for 0x05…0x10, and unpacking
                         into LUA_64 will use MIN_INT64/MAX_INT64 for
                         0x09…0x10)
0x20          0B         Boolean true
0x24          4B         float32, normalized
0x28          8B         float64 = double, normalized (warning: unpacking
                         into LUA_32 will cut to float32 and use -INF/+INF
                         if necessary)
0x30          16B        float128 = long double, normalized (warning:
                         unpacking into LUA_32/64 will cut to float32/64)
0x40          0B         start of table (unpack produces " { ")
0x41          0B         end of table (unpack produces " } ")
0x42          0B         hash part of table starts (so unpack has to
                         produce k=v elements)
0x48          (4/8/16B)  a lua_Number of type integer, native format
                         (int32 / int64 / int128)
0x49          (4/8/16B)  a lua_Number of type float, native format
                         (float32 / float64 / float128)
0x4A          (4/8B+nB)  a lua_String with n bytes, preceded by its length
                         as a native size_t
0x81...0xD4   nB         string with 1...100 bytes (n=1..100)
0xE4          (n+4)B     string with n bytes, preceded by a 4-byte signed
                         length (valid range 1...2G)
0xE8          (n+8)B     string with n bytes, preceded by an 8-byte signed
                         length (valid range 1...8GG)
0xF0...0xF8   0B         error "no_number" (0xF0 + _tt info, so 0xF0=nil,
                         0xF2=luserdata, 0xF5=table, 0xF6=function,
                         0xF7=userdata, 0xF8=thread)
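
To show how such a type-byte stream could be walked without any separators, here is a hedged decoder sketch in Python for a subset of the list above (booleans, integers, float64, and table start/end). It assumes little-endian payloads, maps tables to Python lists, and ignores the hash part, strings, native types and error codes; unpack_items is a hypothetical name.

```python
import struct


def unpack_items(data):
    """Decode a byte string using a subset of the proposed type bytes."""
    root = []
    stack = [root]                       # innermost open table is last
    pos = 0
    while pos < len(data):
        t = data[pos]
        pos += 1
        if t == 0x00:
            stack[-1].append(False)
        elif t == 0x20:
            stack[-1].append(True)
        elif 0x01 <= t <= 0x10:          # integer stored in t bytes
            raw = data[pos:pos + t]
            stack[-1].append(int.from_bytes(raw, "little", signed=True))
            pos += t
        elif t == 0x28:                  # float64
            stack[-1].append(struct.unpack_from("<d", data, pos)[0])
            pos += 8
        elif t == 0x40:                  # start of table
            table = []
            stack[-1].append(table)
            stack.append(table)
        elif t == 0x41:                  # end of table
            if len(stack) == 1:
                raise ValueError("unbalanced end of table")
            stack.pop()
        else:
            raise ValueError(f"type byte 0x{t:02X} not handled in this sketch")
    if len(stack) != 1:
        raise ValueError("unterminated table")
    return root
```

Each item announces its own type and size, so the decoder never needs a delimiter between items; the only state it keeps is the stack of currently open tables.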

… and the conversion option table should be changed as follows:
<: sets little endian (this is the default endianness, if no endian option
is specified)
>: sets big endian
=: sets native endian
q[n]: a signed bit field of n bits (Q[n] unsigned), n=1...16 (successive q’s
are bit-packed, anything else is then byte-packed)
b: a signed byte (char) (B: unsigned) (“shortcut” for i1/I1)
h: a signed short (H: unsigned) (“shortcut” for i2/I2)
l: a signed long (L: unsigned) (“shortcut” for i4/I4)
i[n]: a signed int with n bytes (I[n] unsigned), n=1..16 (default is native
size (*))
(warning: unpacking into LUA_32/64 will clamp ints with more than 31/63
significant bits to MIN_INT/MAX_INT)
r: a short float / float16
f: a float / float32
d: a double / float64 (warning: unpacking in LUA_32 will reduce precision to
float32 and, on overflow, will use -INF / +INF)
D: a long double / float128 (warning: unpacking in LUA_32/64 will reduce
precision to float32/float64 and, on overflow, will use -INF / +INF)
n: a lua_Number (faster packing/unpacking) (native size (*))
N: a lua_Number (size-optimized packing, platform-independent format)
cn: a fixed string with n bytes
z: a zero-terminated string
s[n]: a string preceded by its length coded as an unsigned integer with n
bytes (default for n is the native size_t size (*))
t[n]: table (do NOT include sub-tables) (n=1: only index part, n=2: only
hash part, default: first index, then hash) (may be used ONLY as the last
option in a format string)
T[n]: table (do include sub-tables) (n=1: only index part for
table + sub-tables, n=2: only hash part for table + sub-tables, default:
first index, then hash for table + sub-tables) (may be used ONLY as the last
option in a format string)
x: one zero byte of padding
' ' or ',' or ';': empty space (ignored) (use as delimiters in the format
string)
# post-char: specifies repeat count (no byte in data str, only for format
str), then follow-up specifiers can be written #q or #i or #z or #n to use
this repeat count.
# pre-char: specifies repeat count dynamically (previous #post-char
required)
^ pre-char: specifies bit/byte length of numbers marked [n], then follow-up
specifiers can be written q^ or i^ ...
^ post-char: specifies bit/byte length dynamically (previous #post-char
required)

(*): use “native size” options only locally on one system (do NOT use them
to transfer data, e.g. from LUA_32 to LUA_64 or vice versa)
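
The rule that successive q/Q fields are bit-packed can be sketched as follows (Python for brevity). The proposal does not say in which order bits fill a byte or how the last byte is padded, so this sketch assumes unsigned fields only, the first field in the lowest bits, and zero padding up to the next byte boundary; pack_bits and unpack_bits are hypothetical names.

```python
def pack_bits(fields):
    """Pack (value, width) pairs of unsigned bit fields, first field lowest."""
    acc = 0          # accumulated bits
    used = 0         # number of bits currently in acc
    for value, width in fields:
        if not 1 <= width <= 16:
            raise ValueError("width must be 1..16")
        if not 0 <= value < (1 << width):
            raise ValueError("value does not fit in width")
        acc |= value << used
        used += width
    # Pad the final partial byte with zero bits, then emit little-endian.
    return acc.to_bytes((used + 7) // 8, "little")


def unpack_bits(data, widths):
    """Inverse of pack_bits for a known list of field widths."""
    acc = int.from_bytes(data, "little")
    values = []
    for width in widths:
        values.append(acc & ((1 << width) - 1))
        acc >>= width
    return values


packed = pack_bits([(5, 3), (1, 1), (300, 12)])
print(packed)                               # b'\xcd\x12'
print(unpack_bits(packed, [3, 1, 12]))      # [5, 1, 300]
```

Three fields totalling 16 bits fit in 2 bytes here, whereas byte-packing each field separately would take 4.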

(As I already said in my earlier posts, I see no application that needs
alignment larger than 1 byte, and I think this alignment support bloats the
C code quite a bit. But if there are good reasons to support alignment,
then please correct me; it is no problem to reinstate the option ‘![n]’.
I just really do NOT understand what you mean by the option ‘Xop’ in your
existing format table.)

--
Sent from: http://lua.2524044.n2.nabble.com/Lua-l-f2524044.html