[Posted]: 2020-09-29 02:38:57
[Problem description]:
I am trying to create a function that scrapes the search results from this webpage. I want the function to search for several foundations automatically, but I don't know how to modify it so that it loops over every row of "mydf" (or uses one of the apply() functions to the same effect) and scrapes the results for each row.
When I run the function below on a single row of "mydf" the results are correct, but when I don't specify a particular row I get the following error: Error in parse_url(url) : length(url) == 1 is not TRUE
Sample data frame:
# packages used below:
library(stringr)  # str_replace_all(), str_starts()
library(httr)     # GET(), content()
library(XML)      # htmlParse(), xpathSApply(), xmlValue()

# construct sample data frame with Name, City, State:
name <- c("johnny carson foundation", "melinda gates foundation", "macarthur foundation")
city <- c("", "", "")
state <- c("", "", "")
mydf <- data.frame(name, city, state, stringsAsFactors = FALSE)

# replace spaces between words with '+' for consistent formatting of the 'url' object:
mydf$name <- str_replace_all(mydf$name, " ", "+")
mydf$city <- str_replace_all(mydf$city, " ", "+")
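For what it's worth, the length(url) == 1 failure can be reproduced without any scraping at all: paste() is vectorized over the columns of the data frame, so the url object built inside the function is a character vector with one element per row, while httr::GET() insists on receiving exactly one URL. A minimal sketch (using a standalone vector of the same three search names):

```r
# paste() recycles over its vector arguments, producing one URL per name --
# GET() then fails its length(url) == 1 check on this 3-element vector:
root <- "http://apps.irs.gov/app/eos/allSearch.do?ein1=&names="
names <- c("johnny+carson+foundation", "melinda+gates+foundation", "macarthur+foundation")
url <- paste(root, names, sep = "")
length(url)  # 3, not 1
```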
And my current attempt at the function:
get_data <- function(df) {
  # root components of url:
  root <- "http://apps.irs.gov/app/eos/allSearch.do?ein1=&names="
  root2 <- "&resultsPerPage=25&indexOfFirstRow=0&dispatchMethod=searchAll&city="
  root3 <- "&country=US&postDateFrom=&postDateTo=&exemptTypeCode=al&deductibility=all&sortColumn=orgName&isDescending=false&submitName=Search"
  # construct url by adding roots and search strings from 'df':
  url <- paste(root, df$name, root2, df$city, '&state=', df$state, root3, sep = "")
  gt <- GET(url)
  content2 <- content(gt)
  parsedHtml <- htmlParse(content2, asText = TRUE)
  # TRUE when the page reports that the search found nothing (reused for every column below):
  no_results <- str_starts(xpathSApply(parsedHtml, "//div[@class='row results-body-row']", xmlValue, trim = TRUE),
                           "Your search did not return any results")
  # Scraped results to be populated into 'df':
  df$result_org <- ifelse(no_results, NA,
                          xpathSApply(parsedHtml, "//h3[@class='result-orgname']", xmlValue, trim = TRUE)) # Name
  df$result_ein <- ifelse(no_results, NA,
                          xpathSApply(parsedHtml, "/html/body/div[3]/div[13]/div/div/div[1]/div[2]/div/ul/li/div[1]/span[1]", xmlValue)) # EIN
  df$result_city <- ifelse(no_results, NA,
                           xpathSApply(parsedHtml, "/html/body/div[3]/div[13]/div/div/div[1]/div[2]/div/ul/li/div[1]/span[2]", xmlValue)) # City
  df$result_state <- ifelse(no_results, NA,
                            xpathSApply(parsedHtml, "/html/body/div[3]/div[13]/div/div/div[1]/div[2]/div/ul/li/div[1]/span[3]", xmlValue, trim = TRUE)) # State
  df$result_country <- ifelse(no_results, NA,
                              xpathSApply(parsedHtml, "/html/body/div[3]/div[13]/div/div/div[1]/div[2]/div/ul/li/div[1]/span[4]", xmlValue)) # Country
  df  # return the augmented data frame
}
mydf <- get_data(mydf)
mydf
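One way to get the per-row behaviour is to build and fetch one URL at a time inside lapply(), then bind the rows back together. The sketch below is untested against the live site and makes two assumptions: the single-row version of get_data() already works as described, and build_url() is simply the paste() call from the function factored out as a helper. Note also that a search returning several matches would need collapsing (as done here with paste(collapse = "; ")) or a list-column:

```r
library(httr)  # GET(), content()
library(XML)   # htmlParse(), xpathSApply(), xmlValue()

# hypothetical helper: the same paste() call as in get_data(), for one row
build_url <- function(row) {
  root <- "http://apps.irs.gov/app/eos/allSearch.do?ein1=&names="
  root2 <- "&resultsPerPage=25&indexOfFirstRow=0&dispatchMethod=searchAll&city="
  root3 <- "&country=US&postDateFrom=&postDateTo=&exemptTypeCode=al&deductibility=all&sortColumn=orgName&isDescending=false&submitName=Search"
  paste(root, row$name, root2, row$city, "&state=", row$state, root3, sep = "")
}

# scrape row by row, so GET() always sees exactly one URL:
scrape_all <- function(df) {
  results <- lapply(seq_len(nrow(df)), function(i) {
    row <- df[i, ]
    parsedHtml <- htmlParse(content(GET(build_url(row)), as = "text"), asText = TRUE)
    # collapse possible multiple matches into one string per search:
    row$result_org <- paste(xpathSApply(parsedHtml, "//h3[@class='result-orgname']",
                                        xmlValue, trim = TRUE), collapse = "; ")
    row
  })
  do.call(rbind, results)  # stack the one-row frames back into one data frame
}
# mydf <- scrape_all(mydf)  # uncomment to run against the live site
```

The same pattern extends to the other result columns (EIN, city, state, country): keep each per-row assignment inside the lapply() body so that every iteration works on a single parsed page.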
Apologies in advance for my messy and inelegant code, and many thanks!
Tags: r xml web-scraping httr