Re: Web crawling in Lua

lua-l archive


Subject: Re: Web crawling in Lua
From: David Hollander <dhllndr@...>
Date: Sun, 7 Aug 2011 14:52:12 -0500

Hmm, if there are no close tags in the entire page, it would list every element at the top level of the DOM. To reconstruct the nesting I'd need a lookup table of elements whose parent must be X (this would also work for unclosed table rows/cells, which could be a more common instance of this mistake), or do the reverse and keep a table of elements that must be empty. I didn't need such error correction at the time, but it could still be done in one pass if a rule check based on the HTML spec were added.
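A rough sketch of what such a rule check might look like, for the extreme case where the page has no close tags at all. The names VOID, REQUIRED_PARENT and nest_unclosed are made up for illustration and the tables are deliberately incomplete; this is not the code from my parser, just one way the idea could work in a single pass.

```lua
-- Elements the HTML spec defines as void: they never have children.
-- (Partial list, enough for the example.)
local VOID = { br = true, hr = true, img = true, input = true, meta = true, link = true }

-- Hypothetical, incomplete rule table: tag -> set of parents it may appear in.
local REQUIRED_PARENT = {
  tr = { table = true, tbody = true, thead = true, tfoot = true },
  td = { tr = true },
  th = { tr = true },
  li = { ul = true, ol = true },
}

-- Build a tree from a flat list of tag names that contains no close tags.
local function nest_unclosed(tags)
  local root = { tag = "#root", children = {} }
  local stack = { root }  -- currently open elements, innermost last
  for _, tag in ipairs(tags) do
    local allowed = REQUIRED_PARENT[tag]
    if allowed then
      -- Look for the nearest open element that is a legal parent.
      local found
      for i = #stack, 2, -1 do
        if allowed[stack[i].tag] then found = i; break end
      end
      -- Implicitly close everything above it; if none is open, leave the stack alone.
      if found then
        for i = #stack, found + 1, -1 do stack[i] = nil end
      end
    end
    local node = { tag = tag, children = {} }
    local parent = stack[#stack]
    parent.children[#parent.children + 1] = node
    -- Void elements are never pushed, so nothing can nest inside them.
    if not VOID[tag] then stack[#stack + 1] = node end
  end
  return root
end
```

With this, nest_unclosed({ "table", "tr", "td", "td", "tr", "td", "br" }) gives one table holding two rows, the first with two cells and the second with one cell that contains the br. Elements with no rule simply stay open, so { "html", "body" } nests body inside html.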

Sent from my iPhone

On Aug 7, 2011, at 11:06 AM, HyperHacker <hyperhacker@gmail.com> wrote:

> On Sun, Aug 7, 2011 at 09:55, David Hollander <dhllndr@gmail.com> wrote:
>> Cool! My solution was just to simplify Roberto's Lua function, by
>> putting all generated nodes on a single stack, and defer nesting nodes
>> as children until a </close> tag appears, and then iterate from the
>> top of the stack to find the nearest open tag that matches it. So
>> 
>> <div>
>>    <br>
>>   <img class=world src = "hello">
>>   <span id='stuff''>
>>    <div></div>
>>    <input type=checkbox checked>
>> </div>
>> 
>> ..would be valid and put all elements as children of the first <div>.
>> The debate then is if the inner <div> and <input> should instead be
>> children of the unfinished <span>. My current interpretation is that
>> they should not, though I'm not sure which error is more common. Would
>> need to find more poorly programmed websites, googling for *.aspx
>> might do the trick ;)
>> 
>> On Sun, Aug 7, 2011 at 9:13 AM, Michal Kottman <k0mpjut0r@gmail.com> wrote:
>>> On Sunday, 7 August 2011, David Hollander <dhllndr@gmail.com> wrote:
>>>>> I use them both in my little web-crawling utility module WDM [1]
>>>> 
>>>> I see you are using Roberto's XML parser as a base, which is a strict
>>>> parser that raises errors on improperly formatted XML?
>>>> A problem I ran into last week is that the HTML spec is a bit
>>>> different than XML[1], unless the webpage is specifically using an
>>>> XHTML doctype, and many websites had html errors on top of that.
>>> 
>>> To deal with that issue, you can optionally use the html-tidy binding
>>> through the toTidy() function. It returns the same table format as toXml(),
>>> and also tries to clean up the source through html-tidy beforehand. The
>>> source is at https://github.com/mkottman/tidy/tree/mk in the 'mk' branch.
>>> 
>>> WDM stores saved pages locally in a cache directory, so you can experiment
>>> without downloading things multiple times. These can be compressed if the
>>> bz2 library is available. You can find it at
>>> https://github.com/mkottman/lua-bz2 .
>>> 
>> 
>> 
> 
> What would it do if it never found a close tag? Say: <html><body>Hello world!
> 
> -- 
> Sent from my toaster.
>
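
For reference, the single-stack approach quoted above could be sketched roughly like this. It is a simplified illustration, not the actual code: the function name parse is hypothetical, text and attributes are ignored, and the tag pattern assumes no attribute value contains a '>' character.

```lua
-- Parse a simplified HTML string (tags only) using one flat stack.
-- Nodes are nested only when a close tag shows up.
local function parse(html)
  local stack = {}
  for close, tag in html:gmatch("<(/?)(%w+)[^>]*>") do
    tag = tag:lower()
    if close == "" then
      -- Open tag: push it; whether it has children is decided later.
      stack[#stack + 1] = { tag = tag, children = {}, closed = false }
    else
      -- Close tag: walk down from the top to the nearest still-open node
      -- with the same name.
      for i = #stack, 1, -1 do
        local node = stack[i]
        if node.tag == tag and not node.closed then
          -- Everything above it becomes its children, in document order.
          for j = i + 1, #stack do
            node.children[#node.children + 1] = stack[j]
            stack[j] = nil
          end
          node.closed = true
          break
        end
      end
      -- A close tag with no matching open node is silently ignored.
    end
  end
  return stack  -- whatever is left is the top level of the DOM
end
```

Run on the <div> example above, this returns a single top-level div with five children (br, img, span, div, input), and the unfinished span has no children of its own. For <html><body>Hello world! it returns html and body as two top-level siblings, since no close tag ever triggers nesting.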

Follow-Ups:
- Re: Web crawling in Lua, Justin Cormack

References:
- Re: Web crawling in Lua, David Hollander
- Re: Web crawling in Lua, Michal Kottman
- Re: Web crawling in Lua, David Hollander
- Re: Web crawling in Lua, HyperHacker
