Lazy Sort
The basic quick sort algorithm looks like this:
To quicksort(Set):
  If Set is empty, then return an empty vector
  Otherwise:
    Select a value in Set, Pivot
    Make Left from all values in Set less than Pivot
    Make Right from all values in Set greater than Pivot
    Return the concatenation of quicksort(Left), Pivot, quicksort(Right)
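For concreteness, here's how that pseudocode might look in Lua. This is a sketch of my own (function name included), working on plain tables and returning a fresh vector; equal values are grouped with the pivot so the recursion always shrinks:

```lua
-- A direct Lua rendering of the pseudocode: purely functional,
-- returning a new vector rather than sorting in place.
function quicksort(set)
  if #set == 0 then return {} end
  local pivot = set[1]
  local left, mid, right = {}, {}, {}
  for _, v in ipairs(set) do
    if v < pivot then left[#left+1] = v
    elseif v > pivot then right[#right+1] = v
    else mid[#mid+1] = v end
  end
  -- concatenate quicksort(Left), Pivot(s), quicksort(Right)
  local result = quicksort(left)
  for _, v in ipairs(mid) do result[#result+1] = v end
  for _, v in ipairs(quicksort(right)) do result[#result+1] = v end
  return result
end
```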
The binary tree becomes quite obvious if the concatenation in the last step is replaced by making a binary tree node. The final vector can then be created by flattening the binary tree.
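To make that concrete, here is a sketch (the node shape `{left, pivot, right}` is my own choice, not anything from sort.lua) of building the tree explicitly and then flattening it:

```lua
-- Quicksort with the concatenation replaced by a binary tree node.
function qtree(set)
  if #set == 0 then return nil end
  local pivot = set[1]
  local left, right = {}, {}
  for i = 2, #set do
    if set[i] < pivot then left[#left+1] = set[i]
    else right[#right+1] = set[i] end
  end
  return { left = qtree(left), pivot = pivot, right = qtree(right) }
end

-- An in-order traversal of the tree yields the sorted vector.
function flatten(node, out)
  out = out or {}
  if node then
    flatten(node.left, out)
    out[#out+1] = node.pivot
    flatten(node.right, out)
  end
  return out
end
```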
Similarly, we could compute factorial(n) by first making a vector of integers from 1 to n and then multiplying the elements of the vector.
In both cases, we have a kind of "virtual" intermediate structure, which we actually eliminate in practice by incorporating it into program flow. [1] However, it can be useful sometimes to reveal the intermediate structure.
For example, suppose we have a very large vector of family incomes, say of a million families. We want to compute some measure of income inequality; a common one is to find the ratio between the average income of the highest quintile (that is, the 20% of the population with the highest incomes) and the lowest quintile (that is, the 20% of the population with the lowest incomes).
To compute this value, we need to semi-sort the vector of incomes. Ideally, we want to sort it just enough that we can extract the (unordered) first and last quintiles. One general-purpose way of doing this is to run the quick sort algorithm lazily: first, we stop when we reach the node corresponding to the 20% point; then we start again, stopping when we reach the 80% point. To do that (in general), we need to keep at least part of the implicit binary tree around.
One way of doing this is to change quicksort so that the recursion is replaced by a "promise"; that is, instead of actually doing the recursion, we make a function closure which when called will do the next recursion step.
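A promise, in this sense, is nothing more than a closure that performs the deferred work when called. A generic illustration (not code from the lazy sort itself; `delay` is a hypothetical helper of mine):

```lua
local unpack = table.unpack or unpack  -- Lua 5.2+ / 5.1 compatibility

-- delay(f, ...) captures f and its arguments without calling f;
-- the returned closure performs the call when (and if) it is forced.
function delay(f, ...)
  local args, n = {...}, select("#", ...)
  return function() return f(unpack(args, 1, n)) end
end

-- Nothing is computed here...
local promise = delay(function(a, b) return a + b end, 2, 3)
-- ...until the promise is forced:
print(promise())  --> 5
```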
The Lua distribution includes sort.lua, which has a simple implementation of quick sort; slightly simplified, the core is as follows:
function qsort(vec, low, high)
  if low < high then
    local middle = partition(vec, low, high)
    qsort(vec, low, middle-1)
    qsort(vec, middle+1, high)
  end
end
I've separated out the partition function and given variables possibly more meaningful names to make the algorithm a little clearer.
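The partition function itself isn't shown above; any in-place partition that returns the pivot's final index will do. A minimal Lomuto-style sketch of my own (not the partition from sort.lua):

```lua
-- Lomuto-style partition: takes vec[high] as the pivot, moves
-- smaller elements to the front of the range, and returns the
-- index where the pivot ends up.
function partition(vec, low, high)
  local pivot = vec[high]
  local i = low - 1
  for j = low, high - 1 do
    if vec[j] < pivot then
      i = i + 1
      vec[i], vec[j] = vec[j], vec[i]
    end
  end
  vec[i+1], vec[high] = vec[high], vec[i+1]
  return i + 1
end
```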
The lazy version looks something like this:
-- Ensure that the element at i is in the right position,
-- and return a closure which can be used for continuing the sort.
function quicksorter(i, vec, low, high)
  if low >= high then
    return quicksorter
  else -- low < high
    -- partition the vector and initialize the child closures
    local middle = partition(vec, low, high)
    local left, right = quicksorter, quicksorter
    -- Create the promise
    local function self(i, vec, low, high)
      if i < middle then
        left = left(i, vec, low, middle-1)
      elseif i > middle then
        right = right(i, vec, middle+1, high)
      end
      -- if i == middle, the element is already in position
      return self
    end
    -- Force the promise until i is in the right position
    return self(i, vec, low, high)
  end
end

function lazysort(vec, low, high)
  low, high = low or 1, high or #vec
  local sorter = quicksorter
  return function(i)
    sorter = sorter(i, vec, low, high)
    return vec[i]
  end
end
I haven't actually tested the code above, but I did write a real version of lazysort; you can download it from [here]. The real version includes a couple of efficiencies: first, it uses an insertion sort for small ranges; and, second, it merges the binary tree so that it is always complete.
The sample code includes a test suite and benchmark, and some commented-out pieces which make it possible to view the "virtual" binary tree, by recursing down the closures with a sentinel index value. I used that to make the following illustration of how it works:
First, let's test it:
function rangemean(v, low, high)
  local sum = v[low]
  for i = low+1, high do sum = sum + v[i] end
  return sum / (high - low + 1)
end

function disparity(v)
  local getter = lazysort(v)
  local bottom = math.ceil(#v / 5)
  getter(bottom)
  local top = math.floor(#v * 4 / 5)
  getter(top)
  return rangemean(v, top, #v) / rangemean(v, 1, bottom)
end
> math.randomseed(31415926)
> test = {}; for i = 1, 1e6 do test[i] = math.random(1e4) end
> now = os.clock(); print(disparity(test)); print(os.clock() - now)
8.9768793870336
0.6796875
>
> -- By comparison:
>
> math.randomseed(31415926)
> test = {}; for i = 1, 1e6 do test[i] = math.random(1e4) end
> now = os.clock(); table.sort(test); print(rangemean(test,8e5,1e6)/rangemean(test,1,2e5)); print(os.clock() - now)
8.9768793870336
1.8828125
So even though it's written in Lua and table.sort is written in C, it's almost three times as fast for this computation. In other words, it's not a toy.
Here's how it works. Let's take a slightly smaller sample:
> test = {}; for i = 1, 500 do test[i] = math.random(1e4) end
> getter = lazysort(test)
> =getter(100)
1954
>
> -- By uncommenting the viewing code, we can look at the tree structure:
> getter""
Root
+-Sorted [1, 1)
+-Node [1, 501)
| +-Node [1, 249)
| | +-Node [1, 169)
| | | +-Leaf [1, 47) Unsorted
| | | +-Sorted [47, 48)
| | | +-Node [48, 169)
| | | +-Leaf [48, 83) Unsorted
| | | +-Sorted [83, 103)
| | | +-Leaf [103, 169) Unsorted
| | +-Sorted [169, 170)
| | +-Leaf [170, 249) Unsorted
| +-Sorted [249, 250)
| +-Leaf [250, 501) Unsorted
+-Sorted [501, 500)
>
> -- "Partitioned" would be a better word than "Unsorted".
> -- Now, let's ask for another element, some distance away
>
> =getter(400)
7877
> getter""
Root
+-Sorted [1, 1)
+-Node [1, 501)
| +-Node [1, 249)
| | +-Node [1, 169)
| | | +-Leaf [1, 47) Unsorted
| | | +-Sorted [47, 48)
| | | +-Node [48, 169)
| | | +-Leaf [48, 83) Unsorted
| | | +-Sorted [83, 103)
| | | +-Leaf [103, 169) Unsorted
| | +-Sorted [169, 170)
| | +-Leaf [170, 249) Unsorted
| +-Sorted [249, 250)
| +-Node [250, 501)
| +-Leaf [250, 382) Unsorted
| +-Sorted [382, 383)
| +-Node [383, 501)
| +-Node [383, 441)
| | +-Leaf [383, 391) Unsorted
| | +-Sorted [391, 408)
| | +-Leaf [408, 441) Unsorted
| +-Sorted [441, 442)
| +-Leaf [442, 501) Unsorted
+-Sorted [501, 500)
>
> -- Leaf [250, 501) has now been expanded down to element 400.
>
> -- Now, let's watch as it merges tree nodes.
>
> =getter(387)
7546
> getter""
Root
+-Sorted [1, 1)
+-Node [1, 501)
| +-Node [1, 249)
| | +-Node [1, 169)
| | | +-Leaf [1, 47) Unsorted
| | | +-Sorted [47, 48)
| | | +-Node [48, 169)
| | | +-Leaf [48, 83) Unsorted
| | | +-Sorted [83, 103)
| | | +-Leaf [103, 169) Unsorted
| | +-Sorted [169, 170)
| | +-Leaf [170, 249) Unsorted
| +-Sorted [249, 250)
| +-Node [250, 501)
| +-Leaf [250, 382) Unsorted
| +-Sorted [382, 408)
| +-Node [408, 501)
| +-Leaf [408, 441) Unsorted
| +-Sorted [441, 442)
| +-Leaf [442, 501) Unsorted
+-Sorted [501, 500)
>
> -- Leaf [383, 391) has been fully sorted, so it can be merged into its parent, Node [383, 441).
> -- That in turn triggers another merge, leaving the tree as shown above.
[1] For a slightly more academic (but still accessible) discussion of this, read Lex Augusteijn's interesting paper on [Sorting Morphisms].