Scrape ssrn.com with R (jsonlite or rvest) - only first 50 results shown

【Question title】: Scrape ssrn.com with R (jsonlite or rvest) - only first 50 results shown
【Posted】: 2023-01-22 19:54:51
【Question】:

I am trying to scrape the first 200 entries (title, authors, URL, etc.) from https://www.ssrn.com/index.cfm/en/arn/?page=1&sort=0. Until now I used rvest (which worked fine on the first 4 pages until this week), and I am now trying to read the JSON directly from https://api.ssrn.com/content/v1/bindings/204/papers. The code below works, but I cannot figure out how to retrieve, or even display, more than the first 50 entries (out of 43602 in total). Is there a solution using jsonlite or rvest?

Any help is appreciated! Thanks in advance.

library(jsonlite)
json_file <- "https://api.ssrn.com/content/v1/bindings/204/papers"
data <- fromJSON(json_file)
data <- as.data.frame(data)

【Comments】:

Tags: r json web-scraping rvest jsonlite

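【Solution 1】:

If you look at the request URL, you can control the output with two query parameters: index (the starting position) and count (the number of entries returned per request). The maximum count per request is 200, so you can map over a sequence of indices in steps of 200 to collect all 43602 entries, as follows (about 2-3 minutes of scraping time):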

library(tidyverse)
library(httr2)

# Fetch one page of up to 200 papers starting at the given index
get_ssrn <- function(index) {
  cat("Scraping index:", index, "\n")
  str_c("https://api.ssrn.com/content/v1/bindings/204/papers?index=",
        index, "&count=200&sort=0") %>%
    request() %>%
    req_perform() %>%
    resp_body_json(simplifyVector = TRUE) %>%
    pluck("papers") %>%
    as_tibble()
}

# Request every index in steps of 200 and row-bind the results
df <- map_dfr(seq(0, 43602, by = 200), get_ssrn)

df

# A tibble: 43,602 × 13
   abstract_…¹ publi…² is_paid refer…³ page_…⁴ title authors affil…⁵     id is_ap…⁶ appro…⁷ downl…⁸
   <chr>       <chr>   <lgl>   <chr>     <int> <chr> <list>  <chr>    <int> <lgl>   <chr>     <int>
 1 Working Pa… UNDER … FALSE   ""           68 "Is … <df>    "Conco… 4.33e6 TRUE    20 Jan…      27
 2 Working Pa… UNDER … FALSE   ""           58 "The… <df>    "Unive… 4.33e6 TRUE    20 Jan…      14
 3 Working Pa… UNDER … FALSE   ""            7 "App… <df>    "Atma … 4.33e6 TRUE    20 Jan…       2
 4 Working Pa… UNDER … FALSE   ""            7 "The… <df>    "Atmaj… 4.33e6 TRUE    20 Jan…       2
 5 Working Pa… UNDER … FALSE   "Afric…       0 "Mer… <df>    "Indep… 4.33e6 TRUE    20 Jan…       0
 6 Working Pa… UNDER … FALSE   ""           22 "Siz… <df>    "Unive… 4.33e6 TRUE    20 Jan…       2
 7 Accepted P… UNDER … FALSE   "Finan…       0 "Bud… <df>    "Norwe… 4.33e6 TRUE    20 Jan…       0
 8 Working Pa… UNDER … FALSE   "Journ…       6 "Fac… <df>    "Open … 4.33e6 TRUE    20 Jan…       2
 9 Working Pa… UNDER … FALSE   ""           34 "Soc… <df>    "Unive… 4.33e6 TRUE    20 Jan…       1
10 Working Pa… UNDER … FALSE   "Manag…       0 "Aud… <df>    "Chu H… 4.33e6 TRUE    20 Jan…       0
# … with 43,592 more rows, 1 more variable: url <chr>, and abbreviated variable names
#   ¹abstract_type, ²publication_status, ³reference, ⁴page_count, ⁵affiliations, ⁶is_approved,
#   ⁷approved_date, ⁸downloads
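
The paging arithmetic used above can be separated from the network calls, which makes it easier to check before sending any requests. The following is a minimal sketch, not part of the original answer: page_offsets and ssrn_url are hypothetical helper names, and it assumes, as the answer's seq(0, ...) call suggests, that index is a zero-based starting offset. The total of 43602 is the figure quoted in the question and will change over time.

```r
# Hypothetical helper: zero-based start offsets needed to page through
# `total` records in chunks of `page_size` (assumption: index is an offset).
page_offsets <- function(total, page_size = 200) {
  if (total <= 0) return(numeric(0))
  seq(0, total - 1, by = page_size)
}

# Hypothetical helper: build the request URL for one page, using the same
# endpoint and query parameters as the answer. sprintf("%d") avoids
# scientific notation (e.g. "1e+05") for large offsets.
ssrn_url <- function(index, count = 200) {
  sprintf(
    "https://api.ssrn.com/content/v1/bindings/204/papers?index=%d&count=%d&sort=0",
    as.integer(index), as.integer(count)
  )
}

offsets <- page_offsets(43602)                    # total taken from the question
urls <- vapply(offsets, ssrn_url, character(1))   # one URL per request
```

With a total of 43602 this yields 219 URLs, the last starting at index 43600. Each URL could then be passed to jsonlite::fromJSON() or to the httr2 pipeline above. Using total - 1 as the upper bound avoids one empty request when the total is an exact multiple of the page size.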

【Comments】: