Scraping from an aspx website using R

【Question title】: Scraping from aspx website using R
【Posted】: 2013-05-30 01:09:26
【Question】:

I am trying to scrape data from a website using R to complete a task.

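I would like to go through every link on the following page: http://capitol.hawaii.gov/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House Bills
Select only the items whose Current Status shows "Transmitted to Governor". For example, http://capitol.hawaii.gov/measure_indiv.aspx?billtype=HB&billnumber=17&year=2013
Then scrape the cells under STATUS TEXT for entries containing the clause "Passed Final Reading". For example: Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0).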

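I have tried adapting earlier examples that use the packages RCurl and XML (in R), but I don't know how to use them correctly for aspx sites. So what I would like is:
1. Some suggestions on how to structure code like this.
2. Recommendations on how to learn what is needed to perform this kind of task.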

Thanks for any help,

Tom

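【Question comments】:

I suggest you look at this thread, where I was trying to learn how to scrape websites. talkstats.com/showthread.php/…
I spent a couple of hours on this and it's not easy :( You can get the contents of the first page, but the second page does not accept the __VIEWSTATE and a few other parameters I pass in, as shown here. I can get as far as resp<-GET( "http://capitol.hawaii.gov/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House Bills");writeBin(content(resp,'raw'),tf);readHTMLTable(tf)$GridViewReports, but the second site kills it :(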

Tags: r web-scraping

【Solution 1】:

require(httr)
require(XML)

basePage <- "http://capitol.hawaii.gov"

h <- handle(basePage)

GET(handle = h)

res <- GET(handle = h, path = "/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House")

# parse content for "Transmitted to Governor" text
resXML <- htmlParse(content(res, as = "text"))
resTable <- getNodeSet(resXML, '//*/table[@id ="GridViewReports"]/tr/td[3]')
appRows <-sapply(resTable, xmlValue)
include <- grepl("Transmitted to Governor", appRows)
resUrls <- xpathSApply(resXML, '//*/table[@id ="GridViewReports"]/tr/td[2]//@href')

appUrls <- resUrls[include]

# look at just the first

res <- GET(handle = h, path = appUrls[1])

resXML <- htmlParse(content(res, as = "text"))


xpathSApply(resXML, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)

[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan,
 Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro,
 Tokioka voting no (4) and none excused (0)."

By setting up a handle, you let the httr package take care of all the background work.
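The filtering step in the code above can be tried offline before touching the live site. The following is only a minimal sketch: the HTML string is a made-up stand-in for the report page that assumes the same table id and column order the answer relies on (link in column 2, status in column 3), and `appUrlsDemo` is a hypothetical name.

```r
library(XML)

# Hypothetical stand-in for the report page (assumption: same table id and
# column order as in the answer; the second row is invented for contrast).
html <- '
<html><body>
<table id="GridViewReports">
  <tr><td>1</td><td><a href="/measure_indiv.aspx?billtype=HB&amp;billnumber=17&amp;year=2013">HB17</a></td><td>Transmitted to Governor</td></tr>
  <tr><td>2</td><td><a href="/measure_indiv.aspx?billtype=HB&amp;billnumber=18&amp;year=2013">HB18</a></td><td>Referred to committee</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Status text from column 3 and link targets from column 2, row by row
status <- xpathSApply(doc, '//table[@id="GridViewReports"]/tr/td[3]', xmlValue)
hrefs  <- xpathSApply(doc, '//table[@id="GridViewReports"]/tr/td[2]//@href')

# Keep only the links whose row status matches
keep <- grepl("Transmitted to Governor", status, fixed = TRUE)
appUrlsDemo <- unname(hrefs[keep])
print(appUrlsDemo)
```

This pairing only works when every row has exactly one link in column 2, so that `status` and `hrefs` have the same length; it is worth checking that on the real page.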

If you want to run over all 92 links:

 # get all the links, returned as a list (this will take some time)
 # print statement included for sanity
 res <- lapply(appUrls, function(x){print(sprintf("Got url no. %d",which(appUrls%in%x)));
                                   GET(handle = h, path = x)})
 resXML <- lapply(res, function(x){htmlParse(content(x, as = "text"))})
 appString <- sapply(resXML, function(x){
                   xpathSApply(x, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)
                      })


 head(appString)

>  head(appString)
$href
[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0)."

$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none."                                                  
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Cullen, Har voting aye with reservations; Representative(s) McDermott voting no (1) and none excused (0)."

$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none."                                 
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; Representative(s) Hashem, McDermott voting no (2) and none excused (0)."

$href
[1] "Passed Final Reading, as amended (CD 1). 24 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  1 Excused: Ige."                    
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and Representative(s) Say excused (1)."

$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none."                        
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Johanson voting aye with reservations; none voting no (0) and none excused (0)."

$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none."  
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and none excused (0)."
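Because some bills return one matching line and others two, the result is a ragged list. One possible way to flatten it into a table is sketched below using base R only. `appStringDemo` holds two entries copied from the output above so the snippet runs on its own; `votes` and `no_votes` are hypothetical names, and the regular expression assumes the "voting no (N)" wording seen in these lines.

```r
# Two entries copied from the output above; in practice this would be appString.
appStringDemo <- list(
  href = "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0).",
  href = c(
    "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none.",
    "Passed Final Reading as amended in CD 1 with Representative(s) Cullen, Har voting aye with reservations; Representative(s) McDermott voting no (1) and none excused (0)."
  )
)

# One row per matched status line, keyed by the bill's position in the list
votes <- data.frame(
  bill = rep(seq_along(appStringDemo), lengths(appStringDemo)),
  text = unlist(appStringDemo, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Extract N from "voting no (N)" where that wording occurs; NA otherwise
m <- regmatches(votes$text, regexec("voting no \\(([0-9]+)\\)", votes$text))
votes$no_votes <- vapply(
  m,
  function(x) if (length(x) == 2) as.integer(x[2]) else NA_integer_,
  integer(1)
)
print(votes[, c("bill", "no_votes")])
```

Lines that use the other wording ("0 No(es): none") get NA here and would need their own pattern.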

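【Comments】:

Thank you, user1609452. This is a great starting point for me to understand how to scrape aspx pages.
Sorry, user1609452. Is it possible to list all the relevant URLs at once instead of just one? Thanks again!
Thanks!! Works perfectly!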