Split Join |
|
There are various ways to design and implement these functions in Lua, as described below.
With Lua 5.x you can use
table.concat[3] for
joining: table.concat(tbl, delimiter_str)
.
table.concat({"a", "b", "c"}, ",") --> "a,b,c"
Other interfaces are possible, largely dependent on the choice of split interface since join is often intended to be the inverse operation of split.
First of all, although Lua does not have a split function in its standard library, it does have string.gmatch
[4], which can be used instead of a split function in many cases. Unlike a split function, string.gmatch
takes a pattern to match the non-delimiter text, instead of the delimiters themselves:
local example = "an example string" for i in string.gmatch(example, "%S+") do print(i) end -- output: -- an -- example -- string
A split
[1] function
separates a string into a list of substrings,
breaking the original string on occurrences of some separator (character,
character set, or pattern). There are various ways to design a string split function. A summary of the design decisions is listed below.
Should split return a table array, a list, or an iterator?
split("a,b,c", ",") --> {"a", "b", "c"} split("a,b,c", ",") --> "a","b","c" (not scalable: Lua has a limit of a few thousand return values) for x in split("a,b,c", ",") do ..... end
Should the separator be a string, Lua pattern, LPeg pattern, or regular expression?
split("a +b c", " +") --> {"a ", "b c"} split("a +b c", " +") --> {"a", "+b", "c"} split("a +b c", some_other_object) --> .....
How should empty separators be handled?
split("abc", "") --> {"a", "b", "c"} split("abc", "") --> {"", "a", "b", "c", ""} split("abc", "") --> error split("abc", "%d*") --> what about patterns that can evaluate to empty strings?
split(s,"")
is a convenient idiom for splitting a string into characters. In Lua, we can alternately do for c in s:gmatch"." do ..... end
.
How should empty values be handled?
split(",,a,b,c,", ",") --> {"a", "b", "c"} split(",,a,b,c,", ",") --> {"", "", "a", "b", "c", ""} split(",", ",") --> {} or {""} or {"", ""} ? split("", ",") --> {} or {""} ?
join({"",""}, "")
, join({""}, "")
and join({}, "")
all result in the same string ""
. Therefore, the choice of what the inverse operation split("", "")
should return is not immediately clear.
Should there be an argument to limit the number of splits?
split("a,b,c", ",", 2) --> {"a", "b,c"}
Should the separator be returned? This is more useful when the separator is a pattern, in which case the separator can vary:
split("a b c", " +") --> {"a", " ", "b", " ", "c"}
string.gmatch
[5] is in a way a dual of split
, returning the substrings that match a pattern and discarding strings between them rather than the other way around. A function that returns both is sometimes called partition
[6].
string.gsub
/string.match
Break a string up at occurrences of a single character. If the number of fields is known:
str:match( ("([^"..sep.."]*)"..sep):rep(nsep) )
If the number of fields is not known
fields = {str:match((str:gsub("[^"..sep.."]*"..sep, "([^"..sep.."]*)"..sep)))}
Some might call the above a hack :) sep
will need to be escaped if it is a
pattern metacharacter, and you'd probably be better off precomputing and/or
memorizing the patterns. And it leaves off values after the last separator. E.g. "a,b,c" returns "a" and "b" but not "c"
string.gsub
fields = {} str:gsub("([^"..sep.."]*)"..sep, function(c) table.insert(fields, c) end)
Does not work as expected:
str, sep = "1:2:3", ":" fields = {} str:gsub("([^"..sep.."]*)"..sep, function(c) table.insert(fields, c) end) for i,v in ipairs(fields) do print(i,v) end -- output: -- 1 1 -- 2 2
Fix:
function string:split(sep) local sep, fields = sep or ":", {} local pattern = string.format("([^%s]+)", sep) self:gsub(pattern, function(c) fields[#fields+1] = c end) return fields end
Example: split a string into words, or return nil
function justWords(str) local t = {} local function helper(word) table.insert(t, word) return "" end if not str:gsub("%w+", helper):find"%S" then return t end end
This splits a string using the pattern sep
. It calls func
for each
segment. When func
is called, the first argument is the segment and the
remaining arguments are the captures from sep
, if any. On the last
segment, func
will be called with just one argument. (This could be used
as a flag, or you could use two different functions). sep
must not match
the empty string. Enhancements are left as an exercise :)
func((str:gsub("(.-)("..sep..")", func)))
Example: Split a string into lines separated by either DOS or Unix line endings, creating a table out of the results.
function lines(str) local t = {} local function helper(line) table.insert(t, line) return "" end helper((str:gsub("(.-)\r?\n", helper))) return t end
The problem with using gsub as above is that it can't handle the case when the separator pattern doesn't appear at the end of the string. In that case the final "(.-)" never gets to capture the end of the string, because the overall pattern fails to match. To handle that case you have to do something a little more complicated. The split function below behaves more or less like split in perl or python. In particular, single matches at the beginning and end of the string do not create new elements. Multiple matches in a row create empty string elements.
-- Compatibility: Lua-5.1 function split(str, pat) local t = {} -- NOTE: use {n = 0} in Lua-5.0 local fpat = "(.-)" .. pat local last_end = 1 local s, e, cap = str:find(fpat, 1) while s do if s ~= 1 or cap ~= "" then table.insert(t, cap) end last_end = e+1 s, e, cap = str:find(fpat, last_end) end if last_end <= #str then cap = str:sub(last_end) table.insert(t, cap) end return t end
Example: Split a file path string into components.
function split_path(str) return split(str,'[\\/]+') end parts = split_path("/usr/local/bin") --> {'usr','local','bin'}
Test Cases:
split('foo/bar/baz/test','/') --> {'foo','bar','baz','test'} split('/foo/bar/baz/test','/') --> {'foo','bar','baz','test'} split('/foo/bar/baz/test/','/') --> {'foo','bar','baz','test'} split('/foo/bar//baz/test///','/') --> {'foo','bar','','baz','test','',''} split('//foo////bar/baz///test///','/+') --> {'foo','bar','baz','test'} split('foo','/+') --> {'foo'} split('','/+') --> {} split('foo','') -- opps! infinite loop!
After a discussion on this topic in the mailing list, I made my own function... I took, unknowingly, a way similar to the function above, except I use gfind to iterate, and I see the single matches at beginning and end of string as empty fields. As above, multiple successive delimiters create empty string elements.
-- Compatibility: Lua-5.0 function Split(str, delim, maxNb) -- Eliminate bad cases... if string.find(str, delim) == nil then return { str } end if maxNb == nil or maxNb < 1 then maxNb = 0 -- No limit end local result = {} local pat = "(.-)" .. delim .. "()" local nb = 0 local lastPos for part, pos in string.gfind(str, pat) do nb = nb + 1 result[nb] = part lastPos = pos if nb == maxNb then break end end -- Handle the last field if nb ~= maxNb then result[nb + 1] = string.sub(str, lastPos) end return result end
Test Cases:
ShowSplit("abc", '') --> { [1] = "", [2] = "", [3] = "", [4] = "", [5] = "" } -- No infite loop... but garbage in, garbage out... ShowSplit("", ',') --> { [1] = "" } ShowSplit("abc", ',') --> { [1] = "abc" } ShowSplit("a,b,c", ',') --> { [1] = "a", [2] = "b", [3] = "c" } ShowSplit("a,b,c,", ',') --> { [1] = "a", [2] = "b", [3] = "c", [4] = "" } ShowSplit(",a,b,c,", ',') --> { [1] = "", [2] = "a", [3] = "b", [4] = "c", [5] = "" } ShowSplit("x,,,y", ',') --> { [1] = "x", [2] = "", [3] = "", [4] = "y" } ShowSplit(",,,", ',') --> { [1] = "", [2] = "", [3] = "", [4] = "" } ShowSplit("x!yy!zzz!@", '!', 4) --> { [1] = "x", [2] = "yy", [3] = "zzz", [4] = "@" } ShowSplit("x!yy!zzz!@", '!', 3) --> { [1] = "x", [2] = "yy", [3] = "zzz" } ShowSplit("x!yy!zzz!@", '!', 1) --> { [1] = "x" } ShowSplit("a:b:i:p:u:random:garbage", ":", 5) --> { [1] = "a", [2] = "b", [3] = "i", [4] = "p", [5] = "u" } ShowSplit("hr , br ; p ,span, div", '%s*[;,]%s*') --> { [1] = "hr", [2] = "br", [3] = "p", [4] = "span", [5] = "div" }
Many people miss Perl-like split/join functions in Lua. Here are mine:
-- Concat the contents of the parameter list, -- separated by the string delimiter (just like in perl) -- example: strjoin(", ", {"Anna", "Bob", "Charlie", "Dolores"}) function strjoin(delimiter, list) local len = getn(list) if len == 0 then return "" end local string = list[1] for i = 2, len do string = string .. delimiter .. list[i] end return string end -- Split text into a list consisting of the strings in text, -- separated by strings matching delimiter (which may be a pattern). -- example: strsplit(",%s*", "Anna, Bob, Charlie,Dolores") function strsplit(delimiter, text) local list = {} local pos = 1 if strfind("", delimiter, 1) then -- this would result in endless loops error("delimiter matches empty string!") end while 1 do local first, last = strfind(text, delimiter, pos) if first then -- found? tinsert(list, strsub(text, pos, first-1)) pos = last+1 else tinsert(list, strsub(text, pos)) break end end return list end
Here's my own split function, for comparison. It's largely the same as the above; not quite as DRY but (IMO) slightly cleaner. It doesn't use gfind (as suggested below) because I wanted to be able to specify a pattern for the split string, not a pattern for the data sections. If speed is paramount, it might be made faster by caching string.find as a local 'strfind' variable, as the above does.
--Written for 5.0; could be made slightly cleaner with 5.1 --Splits a string based on a separator string or pattern; --returns an array of pieces of the string. --(May optionally supply a table as the third parameter which will be filled with the results.) function string:split( inSplitPattern, outResults ) if not outResults then outResults = { } end local theStart = 1 local theSplitStart, theSplitEnd = string.find( self, inSplitPattern, theStart ) while theSplitStart do table.insert( outResults, string.sub( self, theStart, theSplitStart-1 ) ) theStart = theSplitEnd + 1 theSplitStart, theSplitEnd = string.find( self, inSplitPattern, theStart ) end table.insert( outResults, string.sub( self, theStart ) ) return outResults end
Explode string into table with seperator (moved from TableUtils):
-- explode(seperator, string) function explode(d,p) local t, ll t={} ll=0 if(#p == 1) then return {p} end while true do l = string.find(p, d, ll, true) -- find the next d in the string if l ~= nil then -- if "not not" found then.. table.insert(t, string.sub(p,ll,l-1)) -- Save it in our array. ll = l + 1 -- save just after where we found it for searching next time. else table.insert(t, string.sub(p,ll)) -- Save what's left in our array. break -- Break at end, as it should be, according to the lua manual. end end return t end
Here's my version of PHP style explode, supporting limit
function explode(sep, str, limit) if not sep or sep == "" then return false end if not str then return false end limit = limit or mhuge if limit == 0 or limit == 1 then return {str}, 1 end local r = {} local n, init = 0, 1 while true do local s,e = strfind(str, sep, init, true) if not s then break end r[#r+1] = strsub(str, init, s - 1) init = e + 1 n = n + 1 if n == limit - 1 then break end end if init <= strlen(str) then r[#r+1] = strsub(str, init) else r[#r+1] = "" end n = n + 1 if limit < 0 then for i=n, n + limit + 1, -1 do r[i] = nil end n = n + limit end return r, n end
This function uses a metatable's __index function to populate the table of split parts. This function does not try to (correctly) invert the pattern, and so really doesn't work as most string split functions do.
--[[ written for Lua 5.1 split a string by a pattern, take care to create the "inverse" pattern yourself. default pattern splits by white space. ]] string.split = function(str, pattern) pattern = pattern or "[^%s]+" if pattern:len() == 0 then pattern = "[^%s]+" end local parts = {__index = table.insert} setmetatable(parts, parts) str:gsub(pattern, parts) setmetatable(parts, nil) parts.__index = nil return parts end -- example 1 str = "no separators in this string" parts = str:split( "[^,]+" ) print( # parts ) table.foreach(parts, print) --[[ output: 1 1 no separators in this string ]] -- example 2 str = " split, comma, separated , , string " parts = str:split( "[^,%s]+" ) print( # parts ) table.foreach(parts, print) --[[ output: 4 1 split 2 comma 3 separated 4 string ]]
This is the Python behavior:
Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:51) [GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2 >>> 'x!yy!zzz!@'.split('!') ['x', 'yy', 'zzz', '@'] >>> 'x!yy!zzz!@'.split('!', 3) ['x', 'yy', 'zzz', '@'] >>> 'x!yy!zzz!@'.split('!', 2) ['x', 'yy', 'zzz!@'] >>> 'x!yy!zzz!@'.split('!', 1) ['x', 'yy!zzz!@']
And IMHO this Lua function implements this semantics:
function string:split(sSeparator, nMax, bRegexp) assert(sSeparator ~= '') assert(nMax == nil or nMax >= 1) local aRecord = {} if self:len() > 0 then local bPlain = not bRegexp nMax = nMax or -1 local nField, nStart = 1, 1 local nFirst,nLast = self:find(sSeparator, nStart, bPlain) while nFirst and nMax ~= 0 do aRecord[nField] = self:sub(nStart, nFirst-1) nField = nField+1 nStart = nLast+1 nFirst,nLast = self:find(sSeparator, nStart, bPlain) nMax = nMax-1 end aRecord[nField] = self:sub(nStart) end return aRecord end
Observe the possibility to use simple strings or regular expressions as delimiters.
Test Cases:
Lua 5.1.4 Copyright (C) 1994-2008 Lua.org, PUC-Rio ... > for k,v in next, string.split('x!yy!zzz!@', '!') do print(v) end x yy zzz @ > for k,v in next, string.split('x!yy!zzz!@', '!', 3) do print(v) end x yy zzz @ > for k,v in next, string.split('x!yy!zzz!@', '!', 2) do print(v) end x yy zzz!@ > for k,v in next, string.split('x!yy!zzz!@', '!', 1) do print(v) end x yy!zzz!@
function gsplit(s,sep) return coroutine.wrap(function() if s == '' or sep == '' then coroutine.yield(s) return end local lasti = 1 for v,i in s:gmatch('(.-)'..sep..'()') do coroutine.yield(v) lasti = i end coroutine.yield(s:sub(lasti)) end) end -- same idea without coroutines function gsplit2(s,sep) local lasti, done, g = 1, false, s:gmatch('(.-)'..sep..'()') return function() if done then return end local v,i = g() if s == '' or sep == '' then done = true return s end if v == nil then done = true return s:sub(lasti) end lasti = i return v end end The {{gsplit()}} above returns an iterator, so other API variants can be easily derived from it: {{{!Lua function iunpack(i,s,v1) local function pass(...) local v1 = i(s,v1) if v1 == nil then return ... end return v1, pass(...) end return pass() end function split(s,sep) return iunpack(gsplit(s,sep)) end function accumulate(t,i,s,v) for v in i,s,v do t[#t+1] = v end return t end function tsplit(s,sep) return accumulate({}, gsplit(s,sep)) end
Note that the above implementation does not allow captures in the separator. To allow for that, another closure must be created to pass along the additional captured strings (see VarargTheSecondClassCitizen). The semantics also get muddy (I suppose one use case would be wanting to know what the actual separator was for each string, eg. for a separator pattern like [%.,;]).
function gsplit(s,sep) local i, done, g = 1, false, s:gmatch('(.-)'..sep..'()') local function pass(...) if ... == nil then done = true return s:sub(i) end i = select(select('#',...),...) return ... end return function() if done then return end if s == '' or sep == '' then done = true return s end return pass(g()) end end
The problem with the above implementation is that however easy to read, the (.-) pattern in Lua has awful performance, hence the following implementation based on only string.find (allows for captures in the separator and adds a third argument "plain", similar to string.find):
function string.gsplit(s, sep, plain) local start = 1 local done = false local function pass(i, j, ...) if i then local seg = s:sub(start, i - 1) start = j + 1 return seg, ... else done = true return s:sub(start) end end return function() if done then return end if sep == '' then done = true return s end return pass(s:find(sep, start, plain)) end end
Unit testing:
local function test(s,sep,expect) local t={} for c in s:gsplit(sep) do table.insert(t,c) end assert(#t == #expect) for i=1,#t do assert(t[i] == expect[i]) end test(t, expect) end test('','',{''}) test('','asdf',{''}) test('asdf','',{'asdf'}) test('', ',', {''}) test(',', ',', {'',''}) test('a', ',', {'a'}) test('a,b', ',', {'a','b'}) test('a,b,', ',', {'a','b',''}) test(',a,b', ',', {'','a','b'}) test(',a,b,', ',', {'','a','b',''}) test(',a,,b,', ',', {'','a','','b',''}) test('a,,b', ',', {'a','','b'}) test('asd , fgh ,; qwe, rty. ,jkl', '%s*[,.;]%s*', {'asd','fgh','','qwe','rty','','jkl'}) test('Spam eggs spam spam and ham', 'spam', {'Spam eggs ',' ',' and ham'})
-- single char string splitter, sep *must* be a single char pattern -- *probably* escaped with % if it has any special pattern meaning, eg "%." not "." -- so good for splitting paths on "/" or "%." which is a common need local function csplit(str,sep) local ret={} local n=1 for w in str:gmatch("([^"..sep.."]*)") do ret[n] = ret[n] or w -- only set once (so the blank after a string is ignored) if w=="" then n = n + 1 end -- step forwards on a blank but not a string end return ret end -- the following is true of any string, csplit will do the reverse of a concat local str="" print(str , assert( table.concat( csplit(str,"/") , "/" ) == str ) ) local str="only" print(str , assert( table.concat( csplit(str,"/") , "/" ) == str ) ) local str="/test//ok/" print(str , assert( table.concat( csplit(str,"/") , "/" ) == str ) ) local str=".test..ok." print(str , assert( table.concat( csplit(str,"%.") , "." ) == str ) )
Up to Lua 5.3.2, splitting was tricky in most situations, because string.gmatch
and string.gsub
introduce spurious extra empty fields (as in Perl). From Lua 5.3.3 onwards, they no longer do that, they now behave as in Python. Therefore, the following minimalistic splitting function now is a true inverse of table.concat
; previously it was not.
-- splits 'str' into pieces matching 'pat', returns them as an array local function split(str,pat) local tbl = {} str:gsub(pat, function(x) tbl[#tbl+1]=x end) return tbl end local str = "a,,b" -- comma-separated list local pat = "[^,]*" -- everything except commas assert (table.concat(split(str, pat), ",") == str)
'.-'
search patterns'.-'
(non-greedy) pattern in the previous sections can be solved by anchoring it to the start position in the search string (the start position is not necessarily the first position in the string, if we use string.find()
with its third parameter), so that the subpattern for matching the separator can be greedy (but note that if the separator is pattern that can match an empty string, an empty match will be found just before the start of the text, with an empty delimited string and an empty separator, so it could produce an infinite loop: don't specify any pattern for the separator that can match an empty string).
So to search the first separator ;
in string str
from a starting position p
, we can use:
q, r = str:find('^.-;', p)
As well we don't need any capture to call string.find()
when the separator is static: if there's a match q will be equal to p (because the pattern is anchored at start), and r will be just on the last character of the separator. As we can determine the length of the separator before the loop scanning the string to split with a simple k = #sep
initialisation, the new delimited word will be str:sub(q, r - k)
. But a static plain separator must first be transformed in the search pattern, by escaping its "magic characters" with a '%'
prefix.
However if the separator must be a pattern, the effective separator found may have variable length, so you need to capture the text before the separator, and the full pattern to search is ('^(.-)' .. sep)
q, r, s = str:find('^(.-);', p)
If there's a match, q
will be equal to the start position p
, r
will be the position of the last character of the separator (to be used for the next loop), and s
will be the first capture, i.e. the word starting at position p
(or q
) but before the non-captured separator.
This gives the following efficent function:
function split(str, sep, plain, max) local result, count, first, found, last, word = {}, 1, 1 if plain then sep = sep:gsub('[$%%()*+%-.?%[%]^]', '%%%0') end sep = '^(.-)' .. sep repeat found, last, word = str:find(sep, first) if q then result[count], count, first = word, count + 1, last + 1 else result[count] = str:sub(first) break end until count == max return result end
Like the previous functions, you can pass an optional parameter plain
set to true for searching a plain separator (that will be converted to a pattern), and a max
parameter to limit the number of items in the returned array (if this limit is reached, the last delimited word returned does not contain any occurence of the separator, so the rest of the text is ignored in this implementation). Also note that empty strings delimited by separators may be returned (up to as many empty strings as there are occurences of separators found)
So:
split(';;A', ';')
will return {"", "", "A"}
split(';;A', ';', true, 2)
will return {"", ""}
The most compact splitting function that can be used with a plain separator (e.g. a '\n'
newline with a single encoding, or a single ';'
semicolon, or a single '\t'
tabulation control, or a longer sequence like '--'
) is this one:
local function splitByPlainSeparator(str, sep, max) local z = #sep; sep = '^.-'..sep:gsub('[$%%()*+%-.?%[%]^]', '%%%0') local t,n,p, q,r = {},1,1, str:find(sep) while q and n~=max do t[n],n,p = s:sub(q,r-z),n+1,r+1 q,r = str:find(sep,p) end t[n] = str:sub(p) return t end
The most compact splitting function that can be used with a pattern separator (e.g. a variable newline like '\r?[\n\v\f]'
or any sequence of whitespace like '%s+'
, or a comma optionally surrounded by greedy whitespaces like '%s*,%s*'
) is this one:
local function splitByPatternSeparator(str, sep, max) sep = '^(.-)'..sep local t,n,p, q,r,s = {},1,1, str:find(sep) while q and n~=max do t[n],n,p = s,n+1,r+1 q,r,s = str:find(sep,p) end t[n] = str:sub(p) return t end
This last function however still does not support separators that can be one of several alternatives (because Lua has no |
in its patterns), but you can circumvent this limitation by using several patterns, and using str:find()
in an innner subloop to locate each possible alternative and take the minimum position found, using a small loop on each alternative pattern (for example to split with the extended pattern '\r?\n|\r|<br%s*/?>'
):
local function splitByExtendedPatternSeparator(str, seps, max) -- Split the extended pattern into a sequence of Lua patterns, using the function defined above. -- Note: '|' cannot be part of any subpattern alternative for the separator (no support here for any escape). -- Alternative: just pass "seps" as a sequence of standard Lua patterns built like below, with a non-greedy -- pattern anchored at start for the contextual text accepted in the returned texts betweeen separators, -- and the empty capture '()' just before the pattern for a single separator. if type(seps) == 'string' then seps = splitByPlainSeparator(sep, '|') -- Adjust patterns for i, sep in ipairs(seps) do seps[i] = '^.-()' .. sep end end -- Now the actual loop to split the first string parameter local t, n, p = {}, 1, 1 while n ~= max do -- locate the nearest subpatterns that match a separator in str:sub(p); -- if two subpatterns match at same nearest position, keep the longest one local first, last = nil for i, sep in ipairs(seps) do local q, r, s = str:find(sep, p) if q then -- A possible separator (not necessarily the neareast) was found in str:sub(s, r) -- Here: q~=nil, r~=nil, s~=nil, q==p <= s <= r) if not first or s < first then first, last = s, r -- this also overrides any longer pattern, but located later elseif r > last then last = r -- prefer the longest pattern at the same position end end end if not first then break end -- the nearest separator (with the longest length) was found in str:sub(first, last) t[n], n, p = str:sub(p, first - 1), n + 1, last + 1 end t[n] = str:sub(p) return t end
These last three functions (nearly equivalent, but not exactly for the same purpose) both allow searching for all occurences of any separator (which is not limited to one character), they also have an optional max
parameter (only used in the condition for a single while
statement).:
max
is set to a positive integer larger than 1, then there will not be more than this number of substrings returned, the last one containing all the rest of the text (including separators), so it stops splitting after finding the (max-1)
first occurences;
max==1
, they both return the original text;
max
is nil, or not an integer, or negative or zero, all separators will be scanned and the returned table will have K+1 string items if the source text contained k occurences of the separator, or will return a single string, identical to the input text if it contains no occurence at all of the separator.
str
and sep
are required and should both be strings not nil; but max
is optional and to have an effect it should be a positive integer.
If you never need the max
parameter (i.e. behave like if it was nil
above and so split the full text to remove all occurences of the plain or pattern separator), just remove the condition and n~=max
in the first line of these while
statements.
Note also that these functions above are stipping all the separators in the return table. You may want to have variable "separators" that you'll want to get a copy for different behavior. The modification is trivial: in the while
loop of the 3 functions above, just append two strings instead of just one: the separated words will be at odd positions (starting from 1) in the returned table (which will have an add number of items), and the separators will be at even positions (starting from 2) if there's an occurence.
This allows creating a simple lexical parser, where "separators" (defined as an "extended pattern" like above or a table of patterns) will be the lexical tokens, and the "non-separators" will be the extra optional whitespaces, not matched by the tokens. In the sample code below, the extended pattern uses the null character ('\000'
in Lua string litterals) instead of the pipe, to separate alternate subpatterns matching individual tokens.
local function splitTokens(str, tokens, max) -- Split the extended pattern into a sequence of Lua patterns, using the function defined above. -- Note: '\000' cannot be part of any subpattern alternative for the separator (no support here for any escape). -- Alternative: just pass "seps" as a sequence of standard Lua patterns built like below, with a non-greedy -- pattern anchored at start for the contextual text accepted in the returned texts betweeen separators, -- and the empty capture '()' just before the pattern for a single separator. if type(tokens) == 'string' then tokens = splitByPlainSeparator(tokens, '\000') -- Adjust patterns for i, token in ipairs(tokens) do tokens[i] = '^.-()' .. token end end -- Now the actual loop to split the first string parameter local t, n, p = {}, 1, 1 while n ~= max do -- locate the nearest subpatterns that match a separator in str:sub(p); -- if two subpatterns match at same nearest position, keep the longest one local first, last = nil for i, token in ipairs(tokens) do local q, r, s = str:find(token, p) if q then -- A possible token (not necessarily the neareast) was found in str:sub(s, r) -- Here: q~=nil, r~=nil, s~=nil, q==p <= s <= r) if not first or s < first then first, last = s, r -- this also overrides any longer pattern, but located later elseif r > last then last = r -- prefer the longest pattern at the same position end end end if not first then break end -- The nearest token (with the longest length) was found in str:sub(first, last). -- Store the non-token part (possibly empty) at odd position, and the token at the next even position t[n], t[n + 1], n, p = str:sub(p, first - 1), str:sub(first, last), n + 2, last + 1 end t[n] = str:sub(p) -- Store the last non-token (possibly empty) at odd position return t end
So you can call this for example to tokenize a text containing identifiers, integers or floating point numbers (like those in the Lua syntax), or isolated non-space symbols (you can add tokens for longer symbols, or to support other literals, by adding more alternatives to the extended pattern):
splitTokens(str, '[%a_][%w_]+' .. '\000' .. '%d+[Ee][%-%+]?%d+' .. '\000' .. '%d+%.?%d*' .. '\000' .. '%.%d+[Ee][%-%+]?%d+' .. '\000' .. '%.%d+' .. '\000' .. '[^%w_%s]')
(verdy_p)
I mean no disrespect, of course, but.. does anyone actually have a working split function without glitches like infinite loops, wrong matches, or error cases? Are all those "takes" of any help here? -- CosminApreutesei
Try Rici Lake's split function: LuaList:2006-12/msg00414.html -- Jörg Richter