Index Interval Format Issue
The root of the whole issue lies in the history of programming. An implementation detail of some languages such as C is that array items are referenced through a pointer-plus-offset address. The offset thus exists at the implementation level, which leaves the language designer a choice: either carry that scheme into the language itself, so that the programmer has to adapt to it, or let the compiler or interpreter subtract one from every index so that the (often human) programmer can use ordinals instead. C chose the former, and as a consequence so did the whole C/C++/Java/Python/Ruby family. As C has been highly successful, language designers who want their 'baby' to be widely used have to take this de facto standard into account -- in addition to the fact that they are C programmers themselves. Still, stating that a feature is "natural" because it is a standard is simply a logical error.
Arrays, lists, sequences, and tables are ordered sets. Humans use ordinal numbers -- first, second, third (or #1, #2, #3) -- to point to an element of a set (which one?). In contrast, they use cardinal numbers -- 1, 2, 3 -- to count the number of elements (how many?). Unfortunately, programming is massively English-speaking, and English calls both kinds "numbers" (compare Nummer/Zahl in German or numéro/nombre in French). Still, all languages make a clear distinction between the two. In every field of human life, except for a sub-field of programming, ordinals are used to point at things. Look for year 0, the 0th book on a shelf, or Baker Street, #0 ;-). Moreover, ordinals seem to have appeared first, which is why "zero", a cardinal, is so young in human history. Languages bear witness to this: people say "nobody", "never", and "nothing" instead of using the word "zero". All of this is to say that at the human level, the programmer-friendly choice is to use ordinals, not offsets, to reference items in an ordered set.
Yet everyday ordinal conventions carry their own awkwardness: wrapping a circular index in a 1-based scheme requires a[(i - 1)%n + 1]. The first hour of the day is 12 and the last hour is 11 (it seems they had the presence of mind to make 1 o'clock start one hour after midnight, but then got all tied up in knots when it came to midnight itself). Lists of ten always have one two-digit number; this thoroughly confuses children learning to count and read to 10, and confuses them even more when they have to learn about zero. The first kilometer of a trip starts at zero kilometers. —MarceloCantos?
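A minimal C sketch makes the contrast concrete (the wrap helpers and the clock example here are purely illustrative):

    #include <stdio.h>

    /* Wrap a nonnegative position i onto a dial of n entries. */
    int wrap0(int i, int n) { return i % n; }            /* 0-based: yields 0..n-1 */
    int wrap1(int i, int n) { return (i - 1) % n + 1; }  /* 1-based: yields 1..n   */

    int main(void) {
        /* Hour 24 of the day, shown on a 12-hour clock face. */
        printf("0-based: %d\n", wrap0(24, 12));  /* 0  */
        printf("1-based: %d\n", wrap1(24, 12));  /* 12 -- midnight reads as "12" */
        return 0;
    }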
Similarly, closed intervals happen to be more natural or intuitive. In everyday life, "from the first to the third" has an inclusive meaning in any case. Using half-open intervals thus demands a mental effort at first, which eventually becomes automatic. This means neither that such a syntax is less efficient nor that it is less rational, but simply that it is less human-friendly for non-initiates. Some programmers seem to be proud of such esoteric features that lead newbies into errors ;-).
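For instance, selecting "from the first to the third" element of a plain C array under each convention reads as follows (a small sketch, with the subtraction made explicit where the storage is 0-based):

    #include <stdio.h>

    int main(void) {
        const char *shelf[] = { "A", "B", "C", "D", "E" };

        /* Closed, 1-based reading: the loop bounds match the sentence
           "from the first to the third", at the cost of a "- 1" inside. */
        for (int i = 1; i <= 3; i++)
            printf("%s ", shelf[i - 1]);
        printf("\n");

        /* Half-open, 0-based reading: the convention C itself uses. */
        for (int i = 0; i < 3; i++)
            printf("%s ", shelf[i]);
        printf("\n");
        return 0;
    }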
Here it is necessary to distinguish the process and human semantic levels. On the process level, both schemes are equally functional, and both semantics are consistent. On the human level, the 0-based/half-open scheme needs a kind of 'transcoding' from the process layer. As a consequence, if one agrees that source code (like mathematical expressions, for instance) is to be read by humans first, rather than by machines, then this scheme is the lesser option -- one should choose it only if the alternative proved impossible for some reason. Note that Pascal, while as old as C, took the opposite path -- not a coincidence, as it was designed to be suitable for education.
In everyday work, once one is used to one or the other convention, both schemes prove usable. When programmers discuss this issue, they tend to cling bitterly to their personal comfort, that is, to the mental habits built through everyday use. As the most widely used languages follow the C convention, there are many arguments supporting it. Yet, examined carefully, these arguments turn out not to be logical reasons: behind the rational phrasing, they are opinions.
For instance, the argument that a 0-length range is best expressed with [n,n) is more or less meaningless; the only sensible way to express an empty sequence is []. Both [5,5) and [5,4] are simply semantically absurd. Now, it may happen at run time that a sequence slice is empty -- but this is another level: the programmer does not cope with it at design time.
Using [] to express the empty range is awkward. A computer program will typically use a data structure with two end-points: struct range { int start, end; }. The code to intersect two such ranges is very simple: range intersect(range a, range b) { range r = { max(a.start, b.start), min(a.end, b.end) }; if (r.end < r.start) r.end = r.start; return r; }, as is the cardinality: int count(range r) { return r.end - r.start; }, and the empty function: bool empty(range r) { return r.start == r.end; }. As soon as you require a different representation for empty ranges, everything suddenly gets more complicated and consumes more RAM and CPU. You'll also find that you have to liberally sprinkle your code with + 1 and - 1 to correct for various off-by-one issues (e.g., the count now requires r.end - r.start + 1, as well as an explicit test for the empty state). —MarceloCantos?
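Assembled into a self-contained sketch (the max_int/min_int helpers and the small driver below are additions for completeness), the half-open representation looks like this:

    #include <stdbool.h>
    #include <stdio.h>

    /* Half-open range [start, end). */
    struct range { int start, end; };

    static int max_int(int a, int b) { return a > b ? a : b; }
    static int min_int(int a, int b) { return a < b ? a : b; }

    struct range intersect(struct range a, struct range b) {
        struct range r = { max_int(a.start, b.start), min_int(a.end, b.end) };
        if (r.end < r.start) r.end = r.start;   /* clamp to an empty range */
        return r;
    }

    int  count(struct range r) { return r.end - r.start; }
    bool empty(struct range r) { return r.start == r.end; }

    int main(void) {
        struct range a = { 2, 7 }, b = { 5, 9 }, c = intersect(a, b);
        printf("[%d,%d) count=%d empty=%d\n", c.start, c.end, count(c), empty(c));  /* [5,7) 2 0 */

        struct range d = intersect((struct range){ 0, 3 }, (struct range){ 5, 9 });
        printf("[%d,%d) count=%d empty=%d\n", d.start, d.end, count(d), empty(d));  /* [5,5) 0 1 */
        return 0;
    }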
The length of a half-open range is simply end - start, regardless of whether the range is continuous or discrete, whereas the formula for the "length" of a closed range is different for continuous and discrete ranges. As a practical and quite common example, the number of days in a half-open date range is computed the same way regardless of whether the end-points are dates or timestamps. This is why I tend to prefer date1 <= d AND d < date2 over SQL's BETWEEN operator; I never have to think about whether the variables in question are dates or timestamps, or worry about whether a maintainer will ever switch the types from one to the other. —MarceloCantos?
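Spelled out as code (a sketch only; the function names are illustrative), the half-open length is one formula, while the closed length splits into a discrete and a continuous case:

    #include <stdio.h>

    /* Half-open [start, end): the same subtraction works for discrete
       and continuous end-points. */
    int    len_half_open_days(int start, int end)       { return end - start; }
    double len_half_open_time(double start, double end) { return end - start; }

    /* Closed [start, end]: the discrete case needs an extra "+ 1". */
    int    len_closed_days(int start, int end)          { return end - start + 1; }
    double len_closed_time(double start, double end)    { return end - start; }

    int main(void) {
        printf("%d %d\n", len_half_open_days(10, 13), len_closed_days(10, 12));        /* 3 3 */
        printf("%g %g\n", len_half_open_time(10.0, 13.0), len_closed_time(10.0, 13.0)); /* 3 3 */
        return 0;
    }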
[5,5) is a degenerate case, and such cases are often very useful. What, exactly, makes it absurd? —MarceloCantos?
In the 21st century, should such losses of time and energy still happen? There are at least two paths to get out of this problem.
As some note, BASIC allows the array index scheme to be expressed explicitly. For instance, array(0,10) would define a zero-based range of indexes. From a similar point of view, one could explicitly mark half-open intervals using either of the common mathematical syntaxes: [a,b) or [a,b[.
This option has the additional advantage of getting rid of an error-prone format, since the scheme used is explicitly written. All of this is good for the human reader, yet the problem remains of properly interpreting expressions written in a convention one is not used to. The following proposal addresses this issue.
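An analogous way to make the scheme explicit inside today's languages, rather than in their syntax, is to name it at the point of use: one internal representation, two explicit constructors. This is only a sketch; the names span, half_open, and closed are hypothetical:

    #include <stdio.h>

    /* Always stored internally as the half-open pair [start, end). */
    struct span { int start, end; };

    struct span half_open(int a, int b) { struct span s = { a, b };     return s; }  /* [a, b) */
    struct span closed(int a, int b)    { struct span s = { a, b + 1 }; return s; }  /* [a, b] */

    int length(struct span s) { return s.end - s.start; }

    int main(void) {
        printf("%d\n", length(half_open(0, 10)));  /* 10 elements: 0..9  */
        printf("%d\n", length(closed(1, 10)));     /* 10 elements: 1..10 */
        return 0;
    }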
As programmers, all of us are familiar with nice editor customization features, like indent options that let us say whether we prefer tabs or spaces, or how wide an indent should be. Note that this works both for loading/reading and for editing/saving source code files. Whatever the author's options, we are able to read and edit the code with our own preferred convention. Whatever our options, another developer will be able to read and edit the same code using his own choices. The file itself may be saved according to any standard norm; it does not matter. This is a kind of foreground/background distinction performed through an editor customization layer. Great!
Now, why not extend this principle to *any* language feature? For instance, one may prefer to get rid of the '=' for assignments (which is semantically wrong), replace it with e.g. ':', and use '=' only for logical equality instead. When loading a file, the editor should display the code using these preferences, whatever the standard used for saving.
The same goes for indexes and slices: a C programmer would set 0-based/half-open slices and get the code displayed that way, whatever the background saving standard may be. A programmer used to Delphi would choose the opposite scheme; again, it does not matter.
[What do you think of that? I mainly program in Python these days, and most of the IDEs for this language are written in Python/wxPython, too. So, when I have time for it, I aim to implement this feature in an editor...]
How would such an editor convert a[len - 1 - i] to 1-based indexing? Would it simply wrap the index thus: a[(len - 1 - i) + 1], or would it try to parse the content, looking for a "- 1" to remove or a "+/- <constant>" to alter, thus: a[len - i]? What if the code were unfinished: a[len - ], or simply perverse: a[~i & 0xff] /* len == 256 */? And how would you deal with C++, which lets you overload operator []? Also consider that programmers frequently share a screen while discussing code, which would lead to horrendous confusion. —MarceloCantos?