Re: parsing improvement

lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]

Subject: Re: parsing improvement
From: Shunsuke Shimizu <grafi@...>
Date: Tue, 02 Jun 2015 23:22:01 +0900

Sorry, the line `n = tostring(n)` does nothing. This lines is inteneted
to be `n = tonumber(n)` (and tonumber should also be cached as a local
variable). This change has little performance effect.

Interestingly, erasing the line `n = tostring(n)` makes the version
faster than string.byte version. If you do not need error handling, this
seems to be the best way (even though string.byte version wins when
nbRows = 99 instead of 999).

On 06/02/2015 10:46 PM, Shunsuke Shimizu wrote:
> If you can permit tricky code, you can achieve a little better speed by
> parsing numbers manually using string.byte(). Parsing of large numbers
> can be slowed down by this way, but I suppose this effect is negligible
> since the cost of creating a substring is large when a number is large.
> 
> Following is benchmark code, decoding data 1000 times. The length of
> strings within the data is between 1 to 999.
> 
> The result of the benchmark on my machine with Lua 5.1.5 is
>   tonumber version:     about 3.5 sec
>   string.byte version:  about 3.1 sec.
> 
> If you make strings shorter, string.byte version performs better.
> 
> ----
> local tostring, byte, find, sub = tostring, string.byte, string.find,
> string.sub
> local times = 1000
> 
> local nbRows = 999
> local cols = { "col1", "col2", "col3", "col4" }
> 
> local data = ""
> local as = ""
> for i = 1, nbRows do
> 	for j = 1, #cols do
> 		data = data .. "<" .. tostring(i) .. " " .. as .. tostring(j) .. "/> "
> 	end
> 	as = as .. "a"
> end
> 
> local t1 = os.clock()
> for t = 1, times do
> 	local pos, rs = 0, {}
> 	for i = 1, nbRows do
> 		local row = {}
> 
> 		local _, n
> 		for j = 1, #cols do
> 			_, pos, n = find(data, "<(%d+)%s", pos)
> 			n = tostring(n)
> 			local endpos = pos + n
> 			row[cols[j]] = sub(data, pos + 1, endpos)
> 			pos = endpos + 1
> 		end
> 
> 		rs[i] = row
> 	end
> end
> 
> local t2 = os.clock()
> for t = 1, times do
> 	local pos, rs = 0, {}
> 	for i = 1, nbRows do
> 		local row = {}
> 
> 		for j = 1, #cols do
> 			pos = find(data, "<", pos, true)
> 			local n, c1, c2, c3 = byte(data, pos + 1, pos + 4)
> 			n = n - 0x30
> 			if n < 1 or 9 < n then
> 				error()
> 			end
> 			while true do
> 				if c1 < 0x30 or 0x3A <= c1 then
> 					if c1 == 0x20 then
> 						pos = pos + 3
> 						break
> 					else
> 						error()
> 					end
> 				end
> 				n = 10 * n + (c1 - 0x30)
> 				if c2 < 0x30 or 0x3A <= c2 then
> 					if c2 == 0x20 then
> 						pos = pos + 4
> 						break
> 					else
> 						error()
> 					end
> 				end
> 				n = 10 * n + (c2 - 0x30)
> 				if c3 < 0x30 or 0x3A <= c3 then
> 					if c3 == 0x20 then
> 						pos = pos + 5
> 						break
> 					else
> 						error()
> 					end
> 				end
> 				n = 10 * n + (c3 - 0x30)
> 				pos = pos + 3
> 				c1, c2, c3 = byte(data, pos + 2, pos + 4)
> 			end
> 
> 			local newpos = pos + n
> 			row[cols[j]] = sub(data, pos, newpos - 1)
> 			pos = newpos
> 		end
> 
> 		rs[i] = row
> 	end
> end
> 
> local t3 = os.clock()
> print(t2 - t1, t3 - t2)
> 
> 
> On 05/30/2015 03:25 AM, Lionel Duboeuf wrote:
>> hello you all,
>>
>> Just in case i'm doing it not efficiently and to learn best practices:
>> I have a character stream that is formated like this one:
>>
>> ...<6 orange/> <2 20/> <1 1/> <2 20/> <5 false/> <1 0/> <16 orange
>> mechanics/> <2 25/>...
>>
>> which correspond to a row column format like this
>> t = {
>>     {  "col1" = "orange" ,  "col2" = 20  },
>>     {  "col1" = 1 ,  "col2" = 20  },
>>     {  "col1" = false ,  "col2" = 0  },
>>     {  "col1" = "orange mechanics" ,  "col2" = 25  },
>> ...
>> }
>>
>>
>>
>> to do so, i parse it like this:
>>
>> pos = current position of the stream
>>
>> local rs = { }
>> local sNbByte, nbByte, val, _
>> local nbRows = 4
>> cols = { "col1","col2" }
>> for i = 1,  nbRows do
>>
>>       local row = {}
>>
>>       for j = 1, #cols do
>>
>>         _, pos, sNbByte = string.find(data, "<(%d+)%s",pos)
>>         nbByte = tonumber(sNbByte)
>>
>>         if (nbByte > 0) then
>>           val = string.sub(data, pos, pos + nbByte)
>>           pos = pos + nbByte
>>         end
>>
>>         pos = pos + 1 --just after value
>>
>>         row[cols[j]] = val
>>
>>       end
>>
>>     rs[i] = row
>>   end
>>
>>
>>
>> i did some benchmarks, and found using gmatch and iterating trough
>> captures more efficient, but it is not usable when we need to specify a
>> starting offset position (like string.find) and i don't want to split my
>> string to avoid copies.
>>
>> any advices will be very appreciated.
>>
>> thanks
>>
>> lionel
>>
>>
>

Follow-Ups:
- Re: parsing improvement, Lionel Duboeuf

References:
- Re: parsing improvement, Shunsuke Shimizu

Prev by Date: Re: parsing improvement
Next by Date: Re: parsing improvement
Previous by thread: Re: parsing improvement
Next by thread: Re: parsing improvement
Index(es):
- Date
- Thread