【发布时间】:2019-02-20 02:31:10
【问题描述】:
我想从this link 中抓取数据,为此我在 R 中编写了以下代码。但是,这不起作用,只返回结果的第一页。显然,循环不起作用。有人知道循环有什么问题吗?
library('rvest')
for (i in 1:40) {
webpage <- read_html(paste0(("http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page=, i"))
rank_data_html <- html_nodes(webpage,'tr+ tr td:nth-child(1)')
rank_data <- html_text(rank_data_html)
rank_data<-as.numeric(rank_data)
title_data_html <- html_nodes(webpage,'.censo_list font')
title_data <- html_text(title_data_html)
author_data_html <- html_nodes(webpage,'.censo_list+ td font')
author_data <- html_text(author_data_html)
country_data_html <- html_nodes(webpage,'.censo_list~ td:nth-child(4) font')
rcountry_data <- html_text(country_data_html)
year_data_html <- html_nodes(webpage,'tr+ tr td:nth-child(5) font')
year_data <- html_text(year_data_html)
type_data_html <- html_nodes(webpage,'tr+ tr td:nth-child(6) font')
type_data <- html_text(type_data_html)
}
censorship_df<-data.frame(Rank = rank_data, Title = title_data, Author = author_data, Country = rcountry_data, Type = type_data, Year = year_data)
write.table(censorship_df, file="sample.csv",sep=",",row.names=F)
【问题讨论】:
-
在its robots.txt 上禁止抓取该页面
-
您在循环的每次迭代中都重写了变量。
标签: r loops web-scraping