- Subject: help with regexp
- From: Walter Cruz <walter.php@...>
- Date: Wed, 19 Oct 2005 12:03:58 -0300
Hi all.
The site http://www2.camara.gov.br/glossario/ has a lot of definitions that I want to grab into a txt file.
I installed luasocket today, and I thought: "Why not do this in Lua?"
The keywords in the HTML sources are like this:
<H4 class=sessaoPagina>Abertura de crédito adicional</H4>
And the definitions like this:
<TD>some word here</TD>
Well, I have found that in some cases my patterns match more
definitions than keywords, or more keywords than definitions.
I've made a little test: a Lua script that shows the number of keywords and definitions found on each page.
The results are below (the first column is the letter, the second the number of keywords, and the third the number of definitions).
a - 60 - 60
b - 11 - 11
c - 101 - 96
d - 58 - 57
e - 64 - 64
f - 10 - 9
g - 8 - 8
h - 1 - 1
i - 36 - 35
j - 4 - 4
l - 32 - 33
m - 20 - 19
n - 15 - 15
o - 40 - 40
p - 105 - 103
q - 9 - 9
r - 73 - 70
s - 58 - 56
t - 18 - 18
u - 10 - 10
v - 16 - 16
z - 1 - 1
The source is:
____________
http = require("socket.http")
--print(http)
--table.foreach(http, print)
local letters={'a','b','c','d','e','f','g','h','i','j','l','m','n','o','p','q','r','s','t','u','v','z'}
for x, element in letters do
  local words={}
  local definitions={}
  local page = http.request("http://www2.camara.gov.br/glossario/" .. element ..".html")
  for w in string.gfind(page, "<H4 class=sessaoPagina>(.[^H4]+)<\/H4>") do
    table.insert(words,w)
  end
  for q in string.gfind(page, "<TD>(.[^H4]+)</TD>") do
    table.insert(definitions,q)
  end
  print(element .." - " .. table.getn(words) .. " - " .. table.getn(definitions))
end
__________________________________
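One possible cause of the mismatched counts, offered as a guess rather than a confirmed diagnosis: in Lua patterns, [^H4] is a character class meaning "any single character except H or 4", not "anything except the string H4". A keyword or definition that contains an H or a 4 after its first character would then be skipped. The sketch below compares that pattern with a non-greedy capture, (.-), on a small made-up HTML sample; the sample text and the helper name collect are assumptions for illustration, and the real pages may have other causes too (for example, extra <TD> cells). It uses string.gmatch and the # operator from Lua 5.1 and later; on Lua 5.0 the equivalents are string.gfind and table.getn.

```lua
-- Hypothetical sample in the shape described in the post; the real markup may differ.
local page = [[
<H4 class=sessaoPagina>Abertura de crédito adicional</H4>
<TD>some word here</TD>
<H4 class=sessaoPagina>Comissão de Direitos Humanos</H4>
<TD>a committee</TD>
<H4 class=sessaoPagina>Quorum</H4>
<TD>minimum attendance</TD>
]]

-- Collect every capture of `pattern` found in `text` (helper name is illustrative).
local function collect(text, pattern)
  local found = {}
  for capture in string.gmatch(text, pattern) do
    found[#found + 1] = capture
  end
  return found
end

-- Patterns in the style of the script: [^H4]+ stops at any 'H' or '4',
-- so "Comissão de Direitos Humanos" is never matched.
local words_old = collect(page, "<H4 class=sessaoPagina>(.[^H4]+)</H4>")
local defs_old  = collect(page, "<TD>(.[^H4]+)</TD>")

-- Non-greedy alternative: (.-) captures the shortest text up to the closing tag.
local words_new = collect(page, "<H4 class=sessaoPagina>(.-)</H4>")
local defs_new  = collect(page, "<TD>(.-)</TD>")

print("old:", #words_old, #defs_old)  -- 2 keywords, 3 definitions
print("new:", #words_new, #defs_new)  -- 3 keywords, 3 definitions
```

With (.-) the capture ends at the first closing tag that follows, whatever letters or digits the text contains, so the two counts agree on this sample.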
Well, if someone can help me see where the error is, or show me how to write a better regexp, I would be very grateful.
(Sorry for my very poor English :P)
[]'s
- Walter