Scraping soundcloud.com with the rvest package in R

【Question title】: Scraping soundcloud.com with rvest package in R
【Posted】: 2020-12-10 23:40:41
【Question】:

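I am trying to scrape this URL to get the names of the top 50 SoundCloud artists in Canada.

Using SelectorGadget, I selected the artist names, and it told me the path is ".sc-link-light".

My first attempt was as follows: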

library(rvest)
library(stringr)
library(reshape2)

soundcloud <- read_html("https://soundcloud.com/charts/top?genre=all-music&country=CA")

artist_name <- soundcloud %>% html_nodes('.sc-link-light') %>% html_text()

It returned an empty list: zero artist names.

For my second attempt, I changed the last line to:

artist_name <- soundcloud %>% html_node(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", ".sc-link-light", " " ))]') %>% html_text()

This produced the same result.
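A side note on the XPath in the second attempt: in a CSS selector the leading dot means "class", but inside an XPath class test the token should be the bare class name. The sketch below illustrates the difference on made-up inline markup (an assumption for illustration, not real SoundCloud HTML). On the live page, fixing the token alone would likely still return nothing, since the answer further down attributes the empty result to the page being dynamic.

```r
library(rvest)

# Hypothetical markup standing in for rendered chart rows
page <- read_html('<ul>
  <li><a class="sc-link-light" href="/a">Artist A</a></li>
  <li><a class="sc-link-light sc-truncate" href="/b">Artist B</a></li>
  <li><a class="sc-link-dark" href="/c">Track C</a></li>
</ul>')

# CSS selector: the dot is selector syntax, not part of the class name
css_names <- page %>% html_nodes(".sc-link-light") %>% html_text()

# XPath with the dot left inside the token: searches for a class
# literally named ".sc-link-light", which no element has
dotted <- page %>%
  html_nodes(xpath = '//*[contains(concat(" ", @class, " "), " .sc-link-light ")]') %>%
  html_text()

# XPath with the bare class token
bare <- page %>%
  html_nodes(xpath = '//*[contains(concat(" ", @class, " "), " sc-link-light ")]') %>%
  html_text()

length(dotted)              # number of matches for the dotted token
identical(css_names, bare)  # whether CSS and bare-token XPath agree
```

Padding `@class` with spaces and matching `" sc-link-light "` keeps the test from matching longer class names that merely contain the same substring.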

What exactly am I doing wrong? I believe this should give me the artist names in the list. Any help is appreciated, thanks.

【Question comments】:

Tags: r web-scraping rvest

【Solution 1】:

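The page you are trying to scrape is dynamic: its content is filled in by JavaScript after the initial HTML loads, so read_html() on its own never sees the artist links. You will need a library that drives a real browser, such as RSelenium. A sample script follows: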

library(tidyverse)
library(RSelenium)
library(rvest)
library(stringr)

url <- "https://soundcloud.com/charts/top?genre=all-music&country=CA"

# Start a Selenium server and open a Chrome session
rD <- rsDriver(browser = "chrome")
remDr <- rD[["client"]]

# Load the page in the browser so its JavaScript runs,
# then hand the rendered HTML to rvest
remDr$navigate(url)
pg <- read_html(remDr$getPageSource()[[1]])
artist_name <- pg %>% html_nodes('.sc-link-light') %>% html_text()


#### clean up ####
remDr$close()
rD$server$stop()
rm(rD, remDr)
gc()

# Windows only: kill any Java process left behind by the Selenium server
system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)
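One practical caveat with the script above: the page source is read immediately after navigating, so the chart may not have finished rendering yet. A possible workaround is to poll until the elements appear. The helper below is only a sketch; `wait_until` and `fake_check` are hypothetical names introduced here, and the fake check stands in for a live browser so the example runs anywhere.

```r
# Poll `check` until it returns a non-empty result or `timeout` seconds pass.
# Returns whatever the last call to `check` produced.
wait_until <- function(check, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- check()
    if (length(result) > 0 || Sys.time() >= deadline) {
      return(result)
    }
    Sys.sleep(interval)
  }
}

# Stand-in for a page whose content shows up on the third poll
# (hypothetical; replaces a real Selenium session)
polls <- 0
fake_check <- function() {
  polls <<- polls + 1
  if (polls < 3) list() else list("Artist A", "Artist B")
}

found <- wait_until(fake_check, timeout = 5, interval = 0.1)
length(found)
```

With a live session, the check could be something like `function() remDr$findElements(using = "css selector", value = ".sc-link-light")`, called between `remDr$navigate(url)` and `remDr$getPageSource()`. This assumes `findElements` returns an empty list while nothing matches.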

【Comments】:

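Thanks Blue050205. I'm actually working on RStudio Cloud and, as you can imagine, this doesn't work so well there. However, I did some reading on RSelenium, and I'm marking this as answered because I believe your solution will work. I'll save it as a reference for future web scraping projects involving dynamic web pages. Thanks for your help.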