Re: parsing improvement

lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]

Subject: Re: parsing improvement
From: Shunsuke Shimizu <grafi@...>
Date: Tue, 02 Jun 2015 22:46:55 +0900

If you can permit tricky code, you can achieve a little better speed by
parsing numbers manually using string.byte(). Parsing of large numbers
can be slowed down by this way, but I suppose this effect is negligible
since the cost of creating a substring is large when a number is large.

Following is benchmark code, decoding data 1000 times. The length of
strings within the data is between 1 to 999.

The result of the benchmark on my machine with Lua 5.1.5 is
  tonumber version:     about 3.5 sec
  string.byte version:  about 3.1 sec.

If you make strings shorter, string.byte version performs better.

----
local tostring, byte, find, sub = tostring, string.byte, string.find,
string.sub
local times = 1000

local nbRows = 999
local cols = { "col1", "col2", "col3", "col4" }

local data = ""
local as = ""
for i = 1, nbRows do
	for j = 1, #cols do
		data = data .. "<" .. tostring(i) .. " " .. as .. tostring(j) .. "/> "
	end
	as = as .. "a"
end

local t1 = os.clock()
for t = 1, times do
	local pos, rs = 0, {}
	for i = 1, nbRows do
		local row = {}

		local _, n
		for j = 1, #cols do
			_, pos, n = find(data, "<(%d+)%s", pos)
			n = tostring(n)
			local endpos = pos + n
			row[cols[j]] = sub(data, pos + 1, endpos)
			pos = endpos + 1
		end

		rs[i] = row
	end
end

local t2 = os.clock()
for t = 1, times do
	local pos, rs = 0, {}
	for i = 1, nbRows do
		local row = {}

		for j = 1, #cols do
			pos = find(data, "<", pos, true)
			local n, c1, c2, c3 = byte(data, pos + 1, pos + 4)
			n = n - 0x30
			if n < 1 or 9 < n then
				error()
			end
			while true do
				if c1 < 0x30 or 0x3A <= c1 then
					if c1 == 0x20 then
						pos = pos + 3
						break
					else
						error()
					end
				end
				n = 10 * n + (c1 - 0x30)
				if c2 < 0x30 or 0x3A <= c2 then
					if c2 == 0x20 then
						pos = pos + 4
						break
					else
						error()
					end
				end
				n = 10 * n + (c2 - 0x30)
				if c3 < 0x30 or 0x3A <= c3 then
					if c3 == 0x20 then
						pos = pos + 5
						break
					else
						error()
					end
				end
				n = 10 * n + (c3 - 0x30)
				pos = pos + 3
				c1, c2, c3 = byte(data, pos + 2, pos + 4)
			end

			local newpos = pos + n
			row[cols[j]] = sub(data, pos, newpos - 1)
			pos = newpos
		end

		rs[i] = row
	end
end

local t3 = os.clock()
print(t2 - t1, t3 - t2)


On 05/30/2015 03:25 AM, Lionel Duboeuf wrote:
> hello you all,
> 
> Just in case i'm doing it not efficiently and to learn best practices:
> I have a character stream that is formated like this one:
> 
> ...<6 orange/> <2 20/> <1 1/> <2 20/> <5 false/> <1 0/> <16 orange
> mechanics/> <2 25/>...
> 
> which correspond to a row column format like this
> t = {
>     {  "col1" = "orange" ,  "col2" = 20  },
>     {  "col1" = 1 ,  "col2" = 20  },
>     {  "col1" = false ,  "col2" = 0  },
>     {  "col1" = "orange mechanics" ,  "col2" = 25  },
> ...
> }
> 
> 
> 
> to do so, i parse it like this:
> 
> pos = current position of the stream
> 
> local rs = { }
> local sNbByte, nbByte, val, _
> local nbRows = 4
> cols = { "col1","col2" }
> for i = 1,  nbRows do
> 
>       local row = {}
> 
>       for j = 1, #cols do
> 
>         _, pos, sNbByte = string.find(data, "<(%d+)%s",pos)
>         nbByte = tonumber(sNbByte)
> 
>         if (nbByte > 0) then
>           val = string.sub(data, pos, pos + nbByte)
>           pos = pos + nbByte
>         end
> 
>         pos = pos + 1 --just after value
> 
>         row[cols[j]] = val
> 
>       end
> 
>     rs[i] = row
>   end
> 
> 
> 
> i did some benchmarks, and found using gmatch and iterating trough
> captures more efficient, but it is not usable when we need to specify a
> starting offset position (like string.find) and i don't want to split my
> string to avoid copies.
> 
> any advices will be very appreciated.
> 
> thanks
> 
> lionel
> 
>

Follow-Ups:
- Re: parsing improvement, Shunsuke Shimizu

Prev by Date: Re: A C++ Pattern Matching Library based on lstrlib.c
Next by Date: Re: parsing improvement
Previous by thread: Re: parsing improvement
Next by thread: Re: parsing improvement
Index(es):
- Date
- Thread