- Subject: Re: Reading CSV
- From: Tim Channon <tc@...>
- Date: Tue, 03 Dec 2013 22:05:30 +0000
On 03/12/2013 19:18, Geoff Leyland wrote:
> On 4/12/2013, at 7:44 am, Dirk Laurie <dirk.laurie@gmail.com> wrote:
>
>> 2013/12/3 Geoff Leyland <geoff_leyland@fastmail.fm>:
>>
>>> What’s the current best option for a CSV (or tab separated, for that matter) file?
>>
>> Define “best”.
>
> On 28/11/2013, at 7:59 pm, Dirk Laurie <dirk.laurie@gmail.com> wrote:
>
>> 1. It is actively being maintained.
>> 3. A certain amount of quality control is in effect.
>
> Covering the issues I mentioned would be handy too. Maybe “handles corner cases and large files”?
>
>>> As far as I can tell, most solutions either:
>>> - read the whole file in one go (constructing a table of all the values becomes
>>> impractical as files get larger)
>>
>> Is this comment based on gut feel (I'm deliberately avoiding "prejudice"), or
>> do you have an actual example where this is a problem?
>
> Metadata for the set of NZ land parcels is over 3GB.
>
Quite: what is “best”?
If you do this often, maybe find a binary library.
In most cases I keep it simple: use io.lines, and if a separator is
missing from the end of the line, add one and go from there.
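For what it's worth, a minimal sketch of that approach (naive: it
assumes no quoted fields containing commas, and the file name is just
an example):

    -- naive per-line CSV split: ensure a trailing separator, then gmatch
    for line in io.lines("parcels.csv") do
      if line:sub(-1) ~= "," then line = line .. "," end
      local fields = {}
      for value in line:gmatch("([^,]*),") do
        fields[#fields + 1] = value
      end
      -- use fields[1], fields[2], ... here
    end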
Out of curiosity, I found a suitable file here: 157 MB, comma
delimited, on a not particularly fast computer. A lines loop which
increments a line counter and prints the count on exit handled
2.35 million lines in 13 s.
A straight open and read("*a") took 5 s, and that hasn't split
anything into lines yet.
So there isn't much to gain from a block read in this case and on this
machine. There will be many wrinkles, including memory allocation and gc.
Read 3 GB in one hit and you consume 3 GB of machine memory.
Try it; it only takes a few seconds to write something.
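Something along these lines, as a rough sketch (the file name is made
up, os.clock measures CPU time but is close enough here, and the
numbers will of course vary):

    -- time a plain io.lines counting loop
    local t0 = os.clock()
    local n = 0
    for _ in io.lines("big.csv") do n = n + 1 end
    print(("io.lines: %d lines, %.1fs"):format(n, os.clock() - t0))

    -- time reading the whole file in one hit (no line splitting yet)
    t0 = os.clock()
    local f = assert(io.open("big.csv", "r"))
    local data = f:read("*a")
    f:close()
    print(("read('*a'): %d bytes, %.1fs"):format(#data, os.clock() - t0))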