Parse Google Scholar search results scraped with rvest

【Question title】: Parse Google Scholar search results scraped with rvest
【Posted】: 2020-06-16 08:48:45
【Question description】:

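I am trying to use rvest to scrape one page of Google Scholar search results into a data frame of authors, paper title, year, and journal title.

The simplified, reproducible example below searches Google Scholar for the example term "apex predator conservation".

Note: to stay within the terms of service, I only want to process the first page of search results that I would get from a manual search. I am not asking how to automate scraping of additional pages.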

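The code below already extracts:

authors
paper title
year

but it does not extract:

journal title

I would like to extract the journal title and add it to the output.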

library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(jsonlite)

url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)
# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, stringsAsFactors = FALSE)

df

Source: https://stackoverflow.com/a/58192323/8742237

So the output of that code looks like this:

#>                                                                                                                                                   titles
#> 1                                                                                    [HTML][HTML] Saving large carnivores, but losing the apex predator?
#> 2                               Site fidelity and sex-specific migration in a mobile apex predator: implications for conservation and ecosystem dynamics
#> 3                  Effects of tourism-related provisioning on the trophic signatures and movement patterns of an apex predator, the Caribbean reef shark

#>                                           authors years
#> 1                  A Ordiz, R Bischof, JE Swenson  2013
#> 2  A Barnett, KG Abrantes, JD Stevens, JM Semmens  2011

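Two questions:

How can I add a column with the journal title extracted from the raw data?
Is there a reference I can read to learn how to extract other fields myself, so that I don't have to ask here?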

【Question comments】:

Tags: html r rvest stringr xml2

【Solution 1】:

One way to add them:

library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(jsonlite)
library(purrr)  # provides map() and map_chr(), which are used below

url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)


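# Remove the author and year text already extracted, leaving the remainder of each line
leftovers <- authors_years %>% 
  str_remove_all(authors) %>% 
  str_remove_all(years)

# Take the piece after the first "-", keep only runs of letters, and rejoin them with spaces
journals <- str_split(leftovers, "-") %>% 
  map_chr(2) %>% 
  str_extract_all("[:alpha:]*") %>% 
  map(function(x) x[x != ""]) %>% 
  map(~paste(., collapse = " ")) %>% 
  unlist()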

# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, journals = journals, stringsAsFactors = FALSE)
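
As a possible alternative to the pipeline above, the journal can also be pulled out with base R regular expressions alone. This is only a sketch: the two input strings are illustrative, hand-typed examples, and the layout "authors - journal, year - publisher" is an assumption about what `.gs_a` returns, not something guaranteed by Google Scholar. The non-breaking-space replacement is likewise a precaution, not a confirmed requirement.

```r
# Illustrative strings in the ASSUMED layout "authors - journal, year - publisher"
authors_years <- c(
  "A Ordiz, R Bischof, JE Swenson - Biological Conservation, 2013 - Elsevier",
  "A Barnett, KG Abrantes, JD Stevens, JM Semmens - PLoS One, 2011 - journals.plos.org"
)

# Replace non-breaking spaces with ordinary spaces (precaution for scraped text)
clean <- gsub("\u00a0", " ", authors_years, fixed = TRUE)

# Split on a hyphen that has whitespace on both sides
parts <- strsplit(clean, "\\s+-\\s+", perl = TRUE)

# The second piece is assumed to hold "journal, year"; NA if it is missing
journal_year <- vapply(parts, function(p) p[2], character(1))

# Drop a trailing four-digit year and the comma before it
journals <- sub(",?\\s*\\d{4}\\s*$", "", journal_year, perl = TRUE)

journals
```

Splitting on a hyphen surrounded by whitespace, instead of on every "-", keeps hyphenated author or journal names in one piece. Entries that do not follow the assumed layout would still need manual checking.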

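As for your second question: the CSS SelectorGadget Chrome extension is great for finding the CSS selector of the element you want. In your case, though, all of the elements share the same CSS class, so the only way to separate them is with regular expressions. So I would suggest learning about CSS selectors and regex :)

【Comments】: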
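
To get a feel for how CSS selectors behave in rvest without hitting a live site, one option is to practise on a small inline HTML string. The markup below is hypothetical and hand-written: it reuses only the class names `gs_rt` and `gs_a` from the code in this thread, while the `result` class, the second title, and the second byline are made up for illustration.

```r
library(rvest)
library(xml2)

# Hypothetical HTML; only the class names gs_rt and gs_a come from the thread
html <- '<div class="result">
  <h3 class="gs_rt"><a href="#">Saving large carnivores, but losing the apex predator?</a></h3>
  <div class="gs_a">A Ordiz, R Bischof, JE Swenson - Biological Conservation, 2013 - Elsevier</div>
</div>
<div class="result">
  <h3 class="gs_rt"><a href="#">A second, made-up title</a></h3>
  <div class="gs_a">A Author, B Author - Some Journal, 2011 - example.org</div>
</div>'

page <- xml2::read_html(html)

# ".gs_rt" matches every element whose class list contains gs_rt
titles <- rvest::html_text(rvest::html_nodes(page, ".gs_rt"))

# "h3.gs_rt a" narrows the match to links inside those headings
links <- rvest::html_nodes(page, "h3.gs_rt a")

# ".gs_a" returns the whole byline as one string; splitting it further needs regex
bylines <- rvest::html_text(rvest::html_nodes(page, ".gs_a"))

titles
bylines
```

The selector gets you as far as the byline element. Since authors, journal, and year all sit inside that one string, the remaining separation has to be done with string processing.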