- Subject: Re: Web crawling in Lua
- From: Michal Kottman <k0mpjut0r@...>
- Date: Sun, 7 Aug 2011 19:04:46 +0200
On Sunday, 7 August 2011, David Hollander <dhllndr@gmail.com> wrote:
>> I use them both in my little web-crawling utility module WDM [1]
>
> I see you are using Roberto's XML parser as a base, which is a strict
> parser that raises errors on improperly formatted XML?
> A problem I ran into last week is that the HTML spec is a bit
> different than XML[1], unless the webpage is specifically using an
> XHTML doctype, and many websites had html errors on top of that.
To deal with that issue, you can optionally use the html-tidy binding through the toTidy() function. It returns the same table format as toXml(), but first runs the source through html-tidy to clean it up. The binding's source is in the 'mk' branch at https://github.com/mkottman/tidy/tree/mk .
WDM stores downloaded pages locally in a cache directory, so you can experiment without downloading things multiple times. The cached pages can be compressed if the bz2 library is available; you can find that library at https://github.com/mkottman/lua-bz2 .
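
For illustration only, here is a rough sketch of the cache-directory idea in plain Lua. It is not WDM's actual code: cache_path, cached_fetch and the fetch callback are hypothetical names, the directory is assumed to exist already, and compression is left out because it would depend on the bz2 binding's API.

```lua
-- Hypothetical sketch of a page cache; not taken from WDM.

-- Map a URL to a file name inside the cache directory.
local function cache_path(dir, url)
  -- Replace every character outside [A-Za-z0-9._-] with "_".
  local name = url:gsub("[^%w%.%-_]", "_")
  return dir .. "/" .. name
end

-- Return the page body and a flag telling whether it came from the cache.
-- `fetch` is any function that takes a URL and returns the body as a string.
local function cached_fetch(dir, url, fetch)
  local path = cache_path(dir, url)
  local f = io.open(path, "rb")
  if f then
    -- Cache hit: read the saved copy instead of downloading again.
    local body = f:read("*a")
    f:close()
    return body, true
  end
  -- Cache miss: download once, then save the body for later runs.
  -- (A compression step could be applied here, before writing.)
  local body = fetch(url)
  f = assert(io.open(path, "wb"))
  f:write(body)
  f:close()
  return body, false
end
```

With something like this, a second call for the same URL reads the file from disk and never calls fetch, which is what makes repeated experiments cheap.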