[Posted]: 2021-06-19 16:34:47
[Question]:
`library(rvest)
library(RCurl)
library(XML)
library(stringr)

# Get the number of result pages listed in the pagination block
getPageNumber <- function(URL) {
  parsedDocument <- read_html(URL)
  Sort1 <- html_nodes(parsedDocument, "div")
  Sort2 <- Sort1[which(html_attr(Sort1, "class") == "pagination al-pagination")]
  P <- str_count(html_text(Sort2), pattern = " \\d+\r\n")
  return(ifelse(length(P) == 0, 0, max(P)))
}

# Build the URL of every article on a results page from its DOI
getAllArticles <- function(URL) {
  parsedDocument <- read_html(URL)
  Sort1 <- html_nodes(parsedDocument, "div")
  Sort2 <- Sort1[which(html_attr(Sort1, "class") == "al-citation-list")]
  ArticleDOInumber <- trimws(gsub(".*10.1093/dnares/", "", html_text(Sort2)))
  URL3 <- "https://doi.org/10.1093/dnares/"
  URL4 <- paste(URL3, ArticleDOInumber, sep = "")
  return(URL4)
}

# Return the <h4> nodes of an article page as strings
Title <- function(parsedDocument) {
  Sort1 <- html_nodes(parsedDocument, "h4")
  Title <- gsub("<a>\\n|\\n</a>", "", Sort1)
  return(Title)
}

# Main function; the year to scrape is the only parameter
findURL <- function(year_chosen) {
  if (year_chosen >= 1994) {
    noYearURL <- glue::glue("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F{year_chosen}%20TO%2012%2F31%2F{year_chosen}")
    pagesURl <- "&fl_SiteID=5275&page="
    URL <- paste(noYearURL, pagesURl, sep = "")
    # URL works with the year_chosen parameter
    Page <- getPageNumber(URL)
    if (Page == 5) {
      Page2 <- 0
      while (Page < Page2 | Page != Page2) {
        Page <- Page2
        URL3 <- paste(URL, Page - 1, sep = "")
        Page2 <- getPageNumber(URL3)
      }
    }
    R_Data <- data.frame()
    for (i in 0:max(Page - 1, 0)) {
      URL2 <- getAllArticles(paste(URL, i, sep = ""))
      for (j in seq_along(URL2)) {
        parsedDocument <- read_html(URL2[j])
        print(URL2[j])
        R <- data.frame("Title" = Title(parsedDocument), stringsAsFactors = FALSE)
        R_Data <- rbind(R_Data, R)
      }
    }
    # write.csv always writes comma-separated output, so no sep argument is passed
    write.csv(R_Data, "Group4.csv", row.names = FALSE)
  } else {
    print("The year you provided is out of range; this journal only contains articles from 1994 to present")
  }
}

findURL(2000)`
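I'm trying to scrape the site for a given year, and in my main function I loop through the different result pages, extracting only the title of each article.
I keep getting this error -> Error in open.connection(x, "rb") : HTTP error 404
Some years have only 3 pages of results, so I understand why an error might occur there, but most years of the journal have 5 pages of articles.
After scraping the journal by year, I want to write the scraped titles to a CSV file.
Thanks in advance for your help!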
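To make the failure easier to pin down, here is a minimal sketch of how a failing request could be skipped instead of stopping the whole run. `safe_read_html` and its `reader` parameter are hypothetical names that do not appear in the code above, and this only works around the 404 rather than explaining which URL causes it.

```r
# Hypothetical helper (not part of the code above): returns the parsed page,
# or NULL when the request fails, e.g. with an HTTP 404.
# `reader` defaults to rvest::read_html and is a parameter only so the
# helper can be exercised without network access.
safe_read_html <- function(url, reader = rvest::read_html) {
  tryCatch(
    reader(url),
    error = function(e) {
      # Report which URL failed so the bad page or DOI link can be inspected
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Possible use inside the inner loop (assumption about where it would go):
#   parsedDocument <- safe_read_html(URL2[j])
#   if (is.null(parsedDocument)) next
```

The message printed for each skipped URL should show whether the 404 comes from a results page or from one of the constructed DOI links.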
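The output step I have in mind looks roughly like the sketch below. The two titles are placeholder values, not scraped data, and the file is written to a temporary directory only for illustration. Note that `write.csv` always writes comma-separated output and ignores a `sep` argument with a warning; `write.table` is the function that accepts `sep = "\t"`.

```r
# Placeholder titles standing in for the scraped results (made-up values)
titles <- data.frame(
  Title = c("First example title", "Second example title"),
  stringsAsFactors = FALSE
)

# Write one title per row, without row names
out_file <- file.path(tempdir(), "Group4.csv")
write.csv(titles, out_file, row.names = FALSE)

# Read the file back to confirm the titles survived the round trip
check <- read.csv(out_file, stringsAsFactors = FALSE)
```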
[Comments]:
Tags: r web-scraping data-mining