- Subject: Re: Homemade email system using LuaSocket and LuaPOP3
- From: Sean Conner <sean@...>
- Date: Tue, 20 Sep 2011 20:08:20 -0400
It was thus said that the Great clemens fischer once stated:
> Sean Conner wrote:
>
> > ...
> > Now, with that out of the way, a decent method of storing emails is one
> > per file, and there's even a semi-standard for that [3]. My preference is
> > to take the Message-ID (if it doesn't exist, generate one), take a hash
> > (SHA1, MD5, pick your favorite) and use that result as the basis for the
> > directory/filename. I also store two versions of the headers and the body
> > as separate files. For example:
> >
> > Message-ID: <d1b.3e6bbea3.37310e87@aol.com>
> >
> > This (I include the brackets since it's part of the message id) hashes to
> > (I use MD5 since it was handy):
> >
> > fff6c8c5b7ae790d732d6cf50b8a5ff6
>
> According to RFC-5322 a "Message-ID" contains the string including the
> angle brackets. I say that because here (in bash):
>
> $ md5sum <<< '<d1b.3e6bbea3.37310e87@aol.com>'
> 96a8888ea961b869c85919526e0ac48b -
Actually, looking over the code (and it's a mess, what with a dozen
half-finished different versions) I'm not sure what I was exactly using for
the hash, but I do know it was consistent at least.
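One thing worth checking when comparing digests like these: a bash here-string (`<<<`) appends a newline to the string, so `md5sum` sees one more byte than the bare Message-ID. The following is only a sketch in Python (not the code discussed in this thread) that hashes both forms so they can be compared. It makes no claim about which input produced either digest quoted above.

```python
import hashlib

# The Message-ID from the example above, angle brackets included.
msgid = "<d1b.3e6bbea3.37310e87@aol.com>"

def md5_hex(text):
    """Return the hex MD5 digest of an ASCII string."""
    return hashlib.md5(text.encode("ascii")).hexdigest()

bare = md5_hex(msgid)                 # exactly the bytes of the message ID
with_newline = md5_hex(msgid + "\n")  # what a bash here-string hands to md5sum

print("bare:        ", bare)
print("with newline:", with_newline)
```

Whichever form is chosen, the lookup side has to hash exactly the same bytes, or the computed paths will not match.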
> Including the header name "Message-ID:" and any possibly folding white
> space will pose a problem when looking up ID's mentioned in
> "References:" or "In-Reply-To:" headers.
The actual message ID appears in angle brackets---anything else is *not*
the message ID (but the header can contain other stuff, not usually, but the
older the email, the more likely it'll be ... um ... interesting).
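As a rough illustration of "only what is in the angle brackets counts", here is a minimal Python sketch that pulls bracketed IDs out of a header value. The regular expression and the function name `extract_msgids` are assumptions made for this example. It does not handle RFC 5322 comments or quoted strings that themselves contain angle brackets, which is exactly the sort of thing old mail can throw at you.

```python
import re

# Anything between '<' and '>' with no whitespace or nested brackets.
# Deliberately loose: real-world IDs are not always well-formed addr-specs.
MSGID_RE = re.compile(r"<[^<>\s]+>")

def extract_msgids(header_value):
    """Return every <...> token found in a header value, brackets included."""
    return MSGID_RE.findall(header_value)

# A References-style value with folding white space and stray text.
value = "Message from joe <a1@x.example>\n\t<b2@y.example> (junk)"
print(extract_msgids(value))
```

The same function works for "Message-ID:", "In-Reply-To:" and "References:" values, since it ignores everything outside the brackets.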
> > I then break the hash up into three components:
> >
> > fff6 c8c5 b7ae790d732d6cf50b8a5ff6
I found a later version that broke the hash up thusly:
fff 6c8 c5b7ae790d732d6cf50b8a5ff6
I did that because the former (with four hex-digits) could lead to
directories with up to 65,536 entries, whereas with the latter (three
hex-digits) you would end up with directories with only (only!) 4,096
entries, a figure I find much more manageable.
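The split described above can be sketched in a few lines of Python. The helper names `split_digest` and `message_path` and the `root` argument are hypothetical, and MD5 is used only because the example digest above is MD5. This is an illustration, not the actual code from the project.

```python
import hashlib
import os

def split_digest(hex_digest, width=3):
    """Split a hex digest into two directory levels of `width` digits plus the rest."""
    return (hex_digest[:width],
            hex_digest[width:2 * width],
            hex_digest[2 * width:])

def message_path(root, msgid, width=3):
    """Build a storage path for a message ID (brackets included) under `root`."""
    digest = hashlib.md5(msgid.encode("utf-8")).hexdigest()
    return os.path.join(root, *split_digest(digest, width))

print(split_digest("fff6c8c5b7ae790d732d6cf50b8a5ff6"))
# -> ('fff', '6c8', 'c5b7ae790d732d6cf50b8a5ff6')

# Upper bound on entries per directory level: 16 ** width.
print(16 ** 4, 16 ** 3)
# -> 65536 4096
```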
> The "folders" (usenet newsgroups) contain hard-links providing the
> mapping between articles and possibly several newsgroups an article may
> have been crossposted to.
>
> This solves the "an email message can be in multiple "folders" while
> maintaining a single copy" problem while no separate database (your text
> file index) is needed. It is way simpler to make tools handling links
> than to keep a database.
But without a separate database (my text file), there is no way of knowing
which "folders" an individual message resides in. For instance, one can
delete a message from a folder, or one can delete a message from all
folders (and thus, remove it entirely).
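To make the trade-off concrete, here is a small Python sketch of the hard-link layout. The directory names and the function `folders_containing` are made up for the example. With links alone, the file's link count says how many names exist but not where they are, so finding the folders means scanning every folder for an entry that is the same file.

```python
import os
import tempfile

def folders_containing(store_file, folders_root):
    """Scan every folder for a hard link to store_file; return the folder names."""
    found = []
    for folder in sorted(os.listdir(folders_root)):
        folder_path = os.path.join(folders_root, folder)
        for name in os.listdir(folder_path):
            if os.path.samefile(os.path.join(folder_path, name), store_file):
                found.append(folder)
                break
    return found

with tempfile.TemporaryDirectory() as root:
    store = os.path.join(root, "store")
    folders = os.path.join(root, "folders")
    os.makedirs(store)
    for f in ("inbox", "lua-l", "archive"):
        os.makedirs(os.path.join(folders, f))

    # One stored copy, linked into two of the three folders.
    msg = os.path.join(store, "msg1")
    with open(msg, "w") as fh:
        fh.write("Subject: test\n\nbody\n")
    os.link(msg, os.path.join(folders, "inbox", "1"))
    os.link(msg, os.path.join(folders, "lua-l", "7"))

    link_count = os.stat(msg).st_nlink          # how many names, not which ones
    where = folders_containing(msg, folders)    # full scan to find the names
    print(link_count, where)
```

Deleting from every folder then needs that scan first, whereas an index keyed by message can answer the question directly.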
> > ...
> > [4] Except for the header parsing---for that I use C code, and I'm still
> > working on that.
>
> Good luck with [4]. I have frequently tried to catch up with all the
> variations of headers in "conforming" emails/articles and, of course,
> the spammy ones. In addition to the ones you mention, here are some
> tools doing MIME parsing:
I'm close---I just need to finish doing some rewriting as I changed
directions in the actual parsing (first draft---everything had to be in
memory. That doesn't work well for handling email as it comes in over the
network, so I needed to handle a stream-based interface, but I didn't want
to lose the ability to handle a memory-mapped email---much rewriting
ensued), as well as further clarification of the various headers (and just
how messed up they can be).
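The idea of one parser serving both a stream and an in-memory message can be sketched as follows. This is Python rather than the C code mentioned above, and `parse_headers` is a hypothetical name. It only unfolds continuation lines and splits on the first colon, so it is nowhere near a full header parser.

```python
import io

def parse_headers(lines):
    """Read header lines from any iterable of lines, stopping at the blank line.

    Returns a list of (name, value) pairs with folded lines joined by a space.
    """
    headers = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            break                                 # end of the header block
        if line[0] in " \t" and headers:
            name, value = headers[-1]             # continuation of previous header
            headers[-1] = (name, value + " " + line.strip())
        elif ":" in line:
            name, _, value = line.partition(":")
            headers.append((name.strip(), value.strip()))
        # Lines that are neither are silently skipped in this sketch.
    return headers

raw = ("Subject: Re: Homemade\n"
       " email system\n"
       "From: someone@example.invalid\n"
       "\n"
       "body\n")

from_memory = parse_headers(raw.splitlines(keepends=True))  # whole message in memory
from_stream = parse_headers(io.StringIO(raw))               # consumed line by line
print(from_memory == from_stream, from_memory)
```

Because the function only asks for "the next line", a socket wrapper, a file, or a memory-mapped buffer split into lines can all feed it.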
> http://www.ivarch.com/programs/qsf/
> http://bogofilter.sourceforge.net/
>
> They are possibly more "real world" than the strict ones you refer to,
> especially DJB's mess822.
I have personal email going back to 1993; I have even older emails going
back to the mid-80s (from archives), so I have plenty of "real world"
examples to go by.
-spc (also, I found this bug in Lua http://www.lua.org/bugs.html#5.1.4-6
due to my email project).