- Subject: A tutorial on gsub optimization
- From: Philippe Lhoste <PhiLho@...>
- Date: Thu, 3 Jul 2003 10:00:35 +0200 (MEST)
I will give here a little trick to save memory and time when manipulating strings. It is probably obvious to seasoned Lua programmers, but I believe it can be useful for beginners, as it rests on a few facts about Lua strings that may not be obvious from the manual. And in any case, they are interesting to restate.
Note that I wrote the code below in Lua 5.0 syntax, but the information is valid for Lua 4.0 as well.
Some time ago, I wrote a little routine to convert French accented characters to HTML entities, for use in a demo Lua CGI program. I give it below. No rocket science, simple (simplistic?) and obvious, nothing worth mentioning here, if not for showing how to improve it...
>>>
-- Windows to HTML
-- (accented keys written as decimal escapes for the Windows-1252 bytes)
local entities =
{
    -- ['&'] = "&amp;",
    ['<'] = "&lt;",
    ['>'] = "&gt;",
    -- French entities (the most common ones)
    ['\224'] = "&agrave;", -- à
    ['\226'] = "&acirc;",  -- â
    ['\233'] = "&eacute;", -- é
    ['\232'] = "&egrave;", -- è
    ['\234'] = "&ecirc;",  -- ê
    ['\235'] = "&euml;",   -- ë
    ['\238'] = "&icirc;",  -- î
    ['\239'] = "&iuml;",   -- ï
    ['\244'] = "&ocirc;",  -- ô
    ['\246'] = "&ouml;",   -- ö
    ['\249'] = "&ugrave;", -- ù
    ['\251'] = "&ucirc;",  -- û
    -- ... shortened
}
function EncodeEntities1(toEncode)
    if type(toEncode) ~= "string" then
        return ''
    end
    local EncodeHighAscii = function (char)
        local code = string.byte(char)
        if code > 127 then
            return string.format("&#%d;", code)
        else
            return char
        end
    end
    local encodedString = toEncode
    -- First encode the '&' char, to avoid re-encoding already encoded chars
    encodedString = string.gsub(encodedString, '&', "&amp;")
    -- Encode known entities
    for char, entity in pairs(entities) do
        encodedString = string.gsub(encodedString, char, entity)
    end
    -- Encode unknown high-Ascii characters to numerical entities
    encodedString = string.gsub(encodedString, '(.)', EncodeHighAscii)
    return encodedString
end
<<<
I used a similar routine to convert Windows accents to Mac ones.
>>>
-- PC (Windows) to Mac
-- (byte values as decimal escapes: Windows-1252 key, Mac Roman value)
local chars =
{
    ['\224'] = '\136', -- à
    ['\226'] = '\137', -- â
    ['\231'] = '\141', -- ç
    ['\233'] = '\142', -- é
    ['\232'] = '\143', -- è
    ['\234'] = '\144', -- ê
    ['\235'] = '\145', -- ë
    ['\238'] = '\148', -- î
    ['\239'] = '\149', -- ï
    ['\244'] = '\153', -- ô
    ['\246'] = '\154', -- ö
    ['\249'] = '\157', -- ù
    ['\251'] = '\158', -- û
    -- ... shortened
}
function EncodeChars(toEncode)
    if type(toEncode) ~= "string" then
        return ''
    end
    local encodedString = toEncode
    -- Encode known chars
    for charPC, charMac in pairs(chars) do
        encodedString = string.gsub(encodedString, charPC, charMac)
    end
    -- Read in text mode, EOLs are only \n
    encodedString = string.gsub(encodedString, '\n', '\r')
    return encodedString
end
<<<
I shortened the tables; they will probably be messed up anyway, because I am sending this article from my wife's Mac ;-) That is why I wrote the second routine, by the way...
Now, if you know a few things about the way Lua manages strings, and think a bit about these routines, you will see that they are inefficient.
In Lua, strings are immutable. That means that if you change even a single byte of a string (using string.gsub, or concatenating another string, which are the only ways to alter a string in standard Lua, besides string.upper and string.lower), Lua makes a modified copy of the string. If the new string is assigned to the same variable, the old string is left without a reference, and the garbage collector will take care of it.
This is because Lua computes a hash value for every string and checks whether it already exists in its internal string table. If it does, Lua simply makes a new reference to the existing string, avoiding duplicates. This also speeds up string comparisons. But it means that Lua must compute this hash for each and every string, even "intermediate" ones, and cannot modify an existing string in place, because it may have multiple references.
The downside is that if you make multiple distinct changes to a string, a lot of copies will be generated, eating memory and taking time to collect.
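As a minimal illustration of this immutability (my example, not from the original post): gsub never modifies its argument; it always returns a new string.
```lua
-- Sketch: gsub returns a brand-new string; the original is untouched.
-- Reassigning the variable just drops the reference to the old value,
-- which then becomes garbage.
local s = "hello world"
local t = string.gsub(s, "hello", "goodbye")
assert(s == "hello world")   -- original string unchanged
assert(t == "goodbye world") -- the change lives in a new string
```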
The solution, when possible, is to make as many changes as possible in one call. That is, instead of using multiple gsubs, try to make only one. And instead of looping over the .. concatenation operator, use table.concat if possible. In these calls, however complex, Lua manipulates the string in a C buffer, without rehashing; only the final, resulting string becomes a Lua string.
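A small sketch of the table.concat advice (mine, not the author's): both loops build the same string, but the first creates and interns a new string on every concatenation, while table.concat assembles everything in a C buffer and interns only the final result.
```lua
-- Slow: each '..' copies the whole accumulated string and interns it
local slow = ""
for i = 1, 1000 do
    if i > 1 then slow = slow .. "," end
    slow = slow .. i
end

-- Fast: collect the pieces in a table, then concatenate once
local parts = {}
for i = 1, 1000 do
    parts[i] = tostring(i)
end
local fast = table.concat(parts, ",")

assert(slow == fast)
```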
So, a way to improve this routine is to collapse all the calls to gsub into one. For this purpose, I build a pattern (Lua's flavor of regular expression) that searches for all the accented characters at once:
>>>
function EncodeEntities2(toEncode)
    if type(toEncode) ~= "string" then
        return ''
    end
    local EncodeToEntities = function (char)
        return entities[char] or char
    end
    local EncodeHighAscii = function (char)
        local code = string.byte(char)
        if code > 127 then
            return string.format("&#%d;", code)
        else
            return char
        end
    end
    local encodedString = toEncode
    local encodingString = "(["
    entities['&'] = "&amp;" -- Not needed if directly put in the table
    -- Add all characters to encode to the encodingString
    for char, entity in pairs(entities) do
        encodingString = encodingString .. char
    end
    encodingString = encodingString .. "])"
    -- Encode known characters to entities
    encodedString = string.gsub(encodedString, encodingString, EncodeToEntities)
    -- Encode unknown high-Ascii characters to numerical entities
    encodedString = string.gsub(encodedString, '(.)', EncodeHighAscii)
    return encodedString
end
<<<
Note that this method cannot be used for the reverse conversion (HTML to Windows; it can still be used for Mac to Windows), because the entities are several characters long, and Lua patterns do not allow alternation between strings, only between single characters. One way around this is to use a third-party regular-expression library; another is to search for "&([^;]+);" and check whether the captured string is in the table.
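Here is a sketch of that second approach (my illustration; the reverse table and the name DecodeEntities are made up, not the author's code): capture "&([^;]+);" and resolve the capture in the replacement function, handling numeric entities as a special case.
```lua
-- Hypothetical reverse table: entity name -> character (Latin-1 bytes)
local decode = {
    ["amp"]    = "&",
    ["lt"]     = "<",
    ["gt"]     = ">",
    ["eacute"] = "\233", -- é
    ["agrave"] = "\224", -- à
}

local function DecodeEntities(toDecode)
    -- The extra parentheses drop gsub's second result (the match count)
    return (string.gsub(toDecode, "&([^;]+);", function (name)
        -- Numeric entity, e.g. &#233;
        local code = string.match(name, "^#(%d+)$")
        if code then
            return string.char(tonumber(code))
        end
        -- Named entity if known; otherwise leave the text as it was
        return decode[name] or ("&" .. name .. ";")
    end))
end
```
For example, DecodeEntities("&#65;&amp;") restores "A&", while unknown entities pass through untouched.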
There is still a little inefficiency in the second algorithm, as I do multiple concatenations to build the pattern. table.concat does not help here unless I build a secondary table; since the entity table is quite small, I don't think it is worth it.
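For completeness, the secondary-table variant could look like this (my sketch, with a tiny stand-in table; note that in Lua 5.1 and later, gsub even accepts a table as the replacement, so the helper function becomes optional):
```lua
local entities = { ['<'] = "&lt;", ['>'] = "&gt;", ['&'] = "&amp;" }

-- Collect the characters in a secondary table, then concatenate once
local parts = { "([" }
for char in pairs(entities) do
    parts[#parts + 1] = char -- plain characters only; pattern-magic ones
end                          -- such as '%' or ']' would need escaping
parts[#parts + 1] = "])"
local encodingString = table.concat(parts)

-- With a table as the replacement, one gsub call does the whole encoding
local encoded = string.gsub("x < y & y > z", encodingString, entities)
assert(encoded == "x &lt; y &amp; y &gt; z")
```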
There is another, more annoying inefficiency: there is no way to search for the high-Ascii characters in my pattern, so I have to make another pass, checking each character of the string. That is quite slow and generates another copy of the string.
If we must examine every character anyway, why not check each one and see whether it has to be converted? This gives the ultimate routine:
>>>
function EncodeEntities3(toEncode)
    if type(toEncode) ~= "string" then
        return ''
    end
    local EncodeToEntities = function (char)
        local entity = entities[char]
        if entity == nil then
            local code = string.byte(char)
            if code > 127 then
                entity = string.format("&#%d;", code)
            end
        end
        return entity or char
    end
    entities['&'] = "&amp;"
    local encodedString = string.gsub(toEncode, '(.)', EncodeToEntities)
    return encodedString
end
<<<
It is quite powerful, and it shows that complex processing can be done in the replacement function of gsub.
To test these routines, I put them all in one file (that is why I had to add the '&' entry in the functions instead of putting it directly in the table), and I wrote the following code:
>>>
print("Initial GC: ", gcinfo())
dofile"3EncodeEntities.lua"
local stdin, stdout
if not io.input("BigFile.txt") then
    return -- Actually Lua will choke on input
end
stdin = io.read("*a")
io.input() -- Restore default reading
-- Clean up
collectgarbage()
-- Set limit high enough to avoid having GC during algorithm.
-- This allows checking the maximum memory use.
collectgarbage(30000)
print("GC after reading file: ", gcinfo())
win32.StartFineClock()
if EncodeEntities1 then
    stdout = EncodeEntities1(stdin)
    -- io.write(stdout)
end
print("First algo running time: ", (win32.GetFineClockTime()))
print("GC after first algo: ", gcinfo())
-- Clean up
stdout = nil
collectgarbage()
-- Reset the high limit
collectgarbage(30000)
print("Garbage collected: ", gcinfo())
win32.StartFineClock()
if EncodeEntities2 then
    stdout = EncodeEntities2(stdin)
    -- io.write(stdout)
end
print("Second algo running time: ", (win32.GetFineClockTime()))
print("GC after second algo: ", gcinfo())
-- Clean up
stdout = nil
collectgarbage()
-- Reset the high limit
collectgarbage(30000)
print("Garbage collected: ", gcinfo())
win32.StartFineClock()
if EncodeEntities3 then
    stdout = EncodeEntities3(stdin)
    -- io.write(stdout)
end
print("Third algo running time: ", (win32.GetFineClockTime()))
print("GC after third algo: ", gcinfo())
-- Final clean up
stdout = nil
stdin = nil
collectgarbage()
print("Final GC: ", gcinfo())
<<<
BigFile.txt is a repetition of French text, with accents and some <, > and & characters, plus some complete sets of high-Ascii characters, up to 100,000 characters in total.
The win32 table is from my own Win32 library; you can replace the StartFineClock/GetFineClockTime calls with calls to os.clock(). My functions are a bit more precise, as they use the Pentium performance counter.
To measure memory use, I set the garbage-collection threshold quite high, so the collector does not kick in during the encoding and gcinfo returns accurate values. I collect garbage between the various algorithms, to restart on fresh ground.
The results are (on my slow PII 300 with Win98):
C:\Dev\LuaPhiLhoSoft> LuaWin32 Test3EncodeEntities.lua
Initial GC: 28 56
GC after reading file: 156 30000
First algo running time: 2.8918805209608
GC after first algo: 21230 30000
Garbage collected: 226 30000
Second algo running time: 0.9387368209323
GC after second algo: 1440 30000
Garbage collected: 210 30000
Third algo running time: 0.62521078127357
GC after third algo: 828 30000
Final GC: 232 465
We clearly see an improvement in both speed and memory use.
Note that the advantage of the third algorithm is not absolute. If we remove the high-Ascii encoding (e.g. in the PC-to-Mac conversion, putting the EOL conversion in the table), the second algorithm wins, because it does not have to run a Lua function on each and every character of the file:
C:\Dev\LuaPhiLhoSoft> LuaWin32 Test3EncodeEntities.lua
Initial GC: 28 56
GC after reading file: 156 30000
First algo running time: 2.2509822491158
GC after first algo: 20666 30000
Garbage collected: 224 30000
Second algo running time: 0.31407331668315
GC after second algo: 814 30000
Garbage collected: 209 30000
Third algo running time: 0.35992054845036
GC after third algo: 893 30000
Final GC: 228 457
Note that if we keep the GC threshold at its default value, the memory consumption is not as dramatic as shown, as memory is collected during the loop of gsub calls. And we see no significant increase in time, because the GC is quite fast here.
This article is a bit lengthy, but I hope it will shed some light on the internals of Lua and how to make the best use of this wonderful language.
I will probably put this article on the wiki, if I find time, unless somebody bored and with some free time puts it there first, possibly after translating it into good English...
Remarks, corrections, clarifications are welcome.
--
--=#=--=#=--=#=--=#=--=#=--=#=--=#=--=#=--=#=--
Philippe Lhoste (Paris -- France)
Professional programmer and amateur artist
http://jove.prohosting.com/~philho/
--=#=--=#=--=#=--=#=--=#=--=#=--=#=--=#=--=#=--