- Subject: Re: Any html scraping libraries?
- From: Michal Kottman <k0mpjut0r@...>
- Date: Sat, 09 Apr 2011 11:27:25 +0200
On Sat, 2011-04-09 at 00:35 +0400, Alexander Gladysh wrote:
> Hi, list!
>
> I'm looking for a Lua module to scrape some data from a (possibly
> broken) HTML page.
>
> Any usable ones out there?
Not sure it's very usable, and it's definitely not a module (more like a
tool; it exports everything to _G), but I wrote wdm [1] with web scraping
in mind. It's a mash-up of lua-curl and LuaSQL. To cope with "possibly
broken" pages, it uses the HTML Tidy library to parse the HTML.
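The overall flow is roughly this (a minimal sketch; it uses LuaSocket and
the tidy command-line tool as stand-ins, where wdm itself goes through
lua-curl and a Tidy binding):

    local http = require("socket.http")  -- LuaSocket stand-in for lua-curl

    -- Fetch the raw, possibly broken HTML.
    local body = assert(http.request("http://example.com/"))

    -- Pipe it through the tidy CLI: -q suppresses warnings,
    -- -asxml converts the tag soup to well-formed XHTML.
    local tmp = os.tmpname()
    local f = assert(io.open(tmp, "w"))
    f:write(body)
    f:close()
    local p = assert(io.popen("tidy -q -asxml " .. tmp .. " 2>/dev/null"))
    local xhtml = p:read("*a")
    p:close()
    os.remove(tmp)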
You can either process the page source with Lua's string functions, or
build a table representation of the document, using either Roberto's XML
parser or HTML Tidy. There is also a simple 'quasi-XPath' for selecting
elements with a predicate.
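To make that concrete, here are both approaches on the xhtml string from
above (the find function is only my illustration of the predicate idea,
not wdm's actual API):

    -- Plain string patterns are often enough for quick jobs:
    for href in xhtml:gmatch('href="([^"]+)"') do
      print(href)
    end

    -- Or parse the tidied XHTML into a table with LuaExpat's LOM module:
    local lom = require("lxp.lom")
    local doc = lom.parse(xhtml)

    -- The predicate-selection idea in miniature: walk the tree and
    -- collect every element the predicate accepts.
    local function find(node, pred, out)
      out = out or {}
      if type(node) == "table" then
        if node.tag and pred(node) then out[#out + 1] = node end
        for _, child in ipairs(node) do find(child, pred, out) end
      end
      return out
    end

    -- E.g. every <a> element that carries a class attribute:
    local links = find(doc, function(n)
      return n.tag == "a" and n.attr and n.attr.class ~= nil
    end)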
It uses a cache (optionally compressed, if you have the libbz2 binding),
so it never makes the same request twice. This means you can run and
debug your processing code many times without "abusing" the server.
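The caching idea boils down to something like this (a hypothetical
sketch, not wdm's actual implementation; wdm can additionally
bzip2-compress the entries):

    local http = require("socket.http")

    -- One cache file per URL; assumes a cache/ directory exists.
    local function cached_fetch(url)
      local path = "cache/" .. url:gsub("[^%w]", "_")  -- crude key
      local f = io.open(path, "rb")
      if f then                                -- hit: no network traffic
        local body = f:read("*a")
        f:close()
        return body
      end
      local body = assert(http.request(url))  -- miss: fetch exactly once
      f = assert(io.open(path, "wb"))
      f:write(body)
      f:close()
      return body
    end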
I was thinking of adding a filtering/processing interface, something
like Yahoo Pipes [2], but never got around to actually implementing
it...
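If it helps to picture it: such an interface could be as simple as
composing filter functions into one processing chain (purely a sketch of
the idea; none of this exists in wdm):

    -- Compose a list of filters into a single processing function.
    local function pipe(...)
      local stages = {...}
      return function(x)
        for _, stage in ipairs(stages) do
          x = stage(x)
        end
        return x
      end
    end

    local process = pipe(
      function(s) return (s:lower()) end,
      function(s) return (s:gsub("%s+", " ")) end
    )
    print(process("  SOME   Raw INPUT  "))  --> " some raw input "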
[1] https://github.com/mkottman/wdm
[2] http://pipes.yahoo.com/pipes/