Web scraping in R?

【Question title】: Web scraping in R?
【Posted】: 2017-07-26 08:43:10
【Question description】:

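I want to scrape this website.

In particular, I want to get the information in this table:

Note that I selected a specific date in the top-right corner.

Following this guide,

I wrote the following code: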

library(rvest)
url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'

webpage_nba <- read_html(url_nba)

# Use a CSS selector to scrape the standings table
data_nba <- html_nodes(webpage_nba,'#standings-table')

# Convert the selected node to plain text and write it to a CSV file
data_nba <- html_text(data_nba)
write.csv(data_nba,"web scraping test.csv")

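As I understand it, the numbers I want (for the Warriors, for example, 94%, 79%, 66%, 59%) are "encoded" in some different way. In other words, the content written to web scraping test.csv is unreadable.

Is there any way to convert the "encoded numbers" into "regular numbers"?

【Comments on the question】: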

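First, you can use html_table(webpage_nba) to extract a list of all tables from the HTML; it is a very handy function if you are interested in HTML tables. But are you sure your code actually extracts the table in full? I doubt it, because I see a lot of JavaScript here, which does not bode well for web scraping; for example, your date selection is not reflected in the HTML source. Have you checked github.com/fivethirtyeight/data/tree/master/nba-elo? I'm not an NBA guy, but maybe you can find the data there.
Indeed, html_table(webpage_nba) gives me the tables. But then two problems arise: 1) in the columns of the third table (after running this command) there is <U+2713> instead of ordinary numbers. How can I "translate" them? 2) How can I select a particular date from the top-right corner (April 14, before playoffs)? The NBA is just an example to illustrate my point.
I see. The ordinary numbers are not there because the page is read with the "empty" table, before you make a selection. A quick Google search shows that <U+2713> is indeed the check mark from the first row. I would make the selection on the website, then open the source (right-click, "View page source") and do a quick Ctrl+F for what you see on screen (e.g. 94%). If you can't find it, you can't scrape it easily, and you will need to search Google for 'scrape javascript generated data in R'. I don't think there is an off-the-shelf solution; you will need to do some digging for your specific case and experiment.
Similar to the last point: you can find the 94% after filtering for the desired date. It sits in the same place as the check marks: webpage_nba %>% html_nodes(".pct.div.break"). The key issue is that the URL retrieves the latest date, so it does not reflect the filter. There are some related questions dealing with 'search results scraping with R'.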
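
To illustrate the two points made in these comments, here is a minimal sketch on a made-up HTML fragment. The table contents and the `standings` id are invented for illustration and are not the real FiveThirtyEight markup. It suggests that `html_table()` returns cell contents as text, and that `<U+2713>` is only how some R consoles print the Unicode check mark character U+2713, not an encoded number.

```r
library(rvest)

# Hypothetical HTML fragment; not the real page markup
html <- '<table id="standings">
  <tr><th>Team</th><th>Playoffs</th><th>Title</th></tr>
  <tr><td>Warriors</td><td>&#10003;</td><td>59%</td></tr>
  <tr><td>Spurs</td><td>&#10003;</td><td>11%</td></tr>
</table>'

page <- read_html(html)

# Select the table by its CSS id and parse it into a data frame
tbl <- html_table(html_node(page, "#standings"))

# The "Playoffs" cells hold the check mark character U+2713
is_check <- tbl$Playoffs == "\u2713"
print(is_check)

# Percent strings become numbers once the "%" sign is stripped
title_pct <- as.numeric(sub("%", "", tbl$Title, fixed = TRUE))
print(title_pct)
```

On the live page the percentages only appear after a date has been selected in the browser, so this kind of parsing would have to run on the rendered page source rather than on the result of a plain `read_html(url_nba)`.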

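Tags: html r web-scraping rvest

【Solution 1】:

I tried to parse the data with rvest, but the hard part here is clicking the drop-down menu, which is represented by a <select> tag in the HTML. So I brought out the heavy artillery: RSelenium, a browser automation tool. Thanks to an answer on SO, it made everything easy: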

library(RSelenium)
library(rvest)

url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'


#initiate RSelenium. If it doesn't work, try other browser engines
rD <- rsDriver(port=4444L,browser="firefox")
remDr <- rD$client

#navigate to main page
remDr$navigate(url_nba)

#find the box and click option 10 (April 14 before playoffs)
webElem <- remDr$findElement(using = 'xpath', value = "//*[@id='forecast-selector']/div[2]/select/option[10]")
webElem$clickElement()

# Save html
webpage <- remDr$getPageSource()[[1]]
# Close RSelenium
remDr$close()
rD[["server"]]$stop() 

# Select one of the tables and get it to dataframe
webpage_nba <- read_html(webpage) %>% html_table(fill = TRUE)
df <- webpage_nba[[3]]

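# Clean up the data frame: row 3 holds the real column headers
names(df) <- df[3,]
df <- tail(df,-3)
df <- head(df,-4)
# Drop columns whose header is the string "NA"; unlike -which(...),
# logical indexing keeps all columns when no such header exists
df <- df[ , names(df) != "NA"]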

df

    ELO Carm-ELO 1-Week Change          Team Conf. Conf. Semis Conf. Finals Finals Win Title
4  1770     1792           -14      Warriors  West         94%          79%    66%       59%
5  1661     1660           -43         Spurs  West         90%          62%    15%       11%
6  1600     1603           +33       Raptors  East         77%          47%    25%        5%
7  1636     1640           +33      Clippers  West         58%          11%     7%        5%
8  1587     1589           -22       Celtics  East         70%          42%    24%        4%
9  1587     1584            -9       Wizards  East         79%          38%    21%        4%
10 1617     1609           +16          Jazz  West         42%           7%     5%        3%
11 1602     1606           -18       Rockets  West         70%          27%     5%        3%
12 1545     1541           -22     Cavaliers  East         59%          27%    11%        2%
13 1519     1523           +25         Bulls  East         30%          15%     7%       <1%
14 1526     1520           +37        Pacers  East         41%          17%     6%       <1%
15 1563     1564            +6 Trail Blazers  West          6%           3%     1%       <1%
16 1543     1537           -20       Thunder  West         30%           8%    <1%       <1%
17 1502     1502            -3         Bucks  East         23%           9%     3%       <1%
18 1479     1469           +46         Hawks  East         21%           6%     2%       <1%
19 1482     1480           -41     Grizzlies  West         10%           3%    <1%       <1%
20 1569     1555           +32          Heat  East           —            —      —         —
21 1552     1533           +27       Nuggets  West           —            —      —         —
22 1482     1489           -12      Pelicans  West           —            —      —         —
23 1463     1472           -18  Timberwolves  West           —            —      —         —
24 1463     1462           -40       Hornets  East           —            —      —         —
25 1441     1436           +22       Pistons  East           —            —      —         —
26 1420     1421           -20     Mavericks  West           —            —      —         —
27 1393     1395            -2         Kings  West           —            —      —         —
28 1374     1379           -13        Knicks  East           —            —      —         —
29 1367     1370           +47        Lakers  West           —            —      —         —
30 1372     1370           -14          Nets  East           —            —      —         —
31 1352     1355            -9         Magic  East           —            —      —         —
32 1338     1348           -29         76ers  East           —            —      —         —
33 1340     1337           +26          Suns  West           —            —      —         —
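
The percentage columns in the table above are still character strings. A minimal base R sketch for turning them into numeric probabilities follows; `pct_to_num` is a hypothetical helper name, and mapping "<1%" to `NA` by default is an assumption made here because the exact value is not shown on the page.

```r
# Convert strings such as "94%", "<1%" and a dash placeholder to numbers.
# Assumption: "<1%" becomes NA unless a substitute is passed via below_one.
pct_to_num <- function(x, below_one = NA_real_) {
  x <- trimws(x)
  out <- rep(NA_real_, length(x))

  # Plain percentages like "94%" or "7.5%" are divided by 100
  is_pct <- grepl("^[0-9]+(\\.[0-9]+)?%$", x)
  out[is_pct] <- as.numeric(sub("%", "", x[is_pct], fixed = TRUE)) / 100

  # "<1%" gets the caller-supplied substitute; anything else stays NA
  out[x %in% "<1%"] <- below_one
  out
}

res <- pct_to_num(c("94%", "79%", "<1%", "\u2014"))
print(res)
```

Applied to the scraped data frame this might look like `df[["Win Title"]] <- pct_to_num(df[["Win Title"]])`, assuming the column is named as in the printed output.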

If you want to parse other time periods, use your browser's developer tools to inspect the option values in the page HTML.

【Comments】:
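
As an alternative to reading the options by hand, they can be listed with rvest. The sketch below runs on a made-up fragment: the option values and labels are invented, and only the `forecast-selector` id is taken from the XPath in the answer. In a live session the string returned by `remDr$getPageSource()[[1]]` would be passed to `read_html()` instead.

```r
library(rvest)

# Hypothetical markup standing in for the page's <select>; the real
# option values and labels must be read from the live page source.
html <- '<div id="forecast-selector"><div></div><div>
  <select>
    <option value="a">Latest forecast</option>
    <option value="b">April 14 (before playoffs)</option>
  </select></div></div>'

opts <- html_nodes(read_html(html), "#forecast-selector select option")

# One row per option: position (usable in option[i]), value, visible label
options_df <- data.frame(
  position = seq_along(opts),
  value    = html_attr(opts, "value"),
  label    = html_text(opts),
  stringsAsFactors = FALSE
)
print(options_df)

# Build an XPath that targets an option by value instead of by position
target_value <- options_df$value[grepl("April 14", options_df$label)]
xpath <- sprintf(
  "//*[@id='forecast-selector']//select/option[@value='%s']",
  target_value
)
print(xpath)
```

Selecting by value rather than by position (`option[10]`) should keep working if the site adds or reorders dates, provided the values themselves stay stable.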

I tried browser = "firefox" and browser = "chrome", but in both cases I get the error: [1] "Connecting to remote server" Error in checkError(res) : Couldnt connect to host on http://localhost:4444/wd/hub. Please ensure a Selenium server is running.
@quant Did you install RSelenium and all of its dependencies correctly? Try running RSelenium::rsDriver() and wdman::selenium(port = 4444L). Do they work?
RSelenium::rsDriver() gives the same error. I also tried reinstalling the packages and got the same error.
webElem <- remDr$findElement(using = 'xpath', value = '//*[(@id = "arrow-left")]') FYI: you can use SelectorGadget to get an element's XPath and paste it into your code to work with any page element you like.
ahh ezpz :) just call webElem$clickElement() twice
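
The "Couldnt connect to host" error quoted in these comments means nothing answered on localhost:4444. One way to check that independently of RSelenium is a plain socket probe. This is a minimal base R sketch; `port_is_open` is a hypothetical helper name and is not part of RSelenium or wdman.

```r
# Return TRUE if something accepts TCP connections on host:port.
# This only tests reachability; it does not confirm that the listener
# is actually a Selenium server.
port_is_open <- function(host = "localhost", port = 4444L, timeout = 1) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, open = "r+",
                       blocking = TRUE, timeout = timeout)
    ),
    error = function(e) NULL
  )
  if (is.null(con)) {
    return(FALSE)
  }
  close(con)
  TRUE
}

server_up <- port_is_open("localhost", 4444L)
print(server_up)
```

If this prints `FALSE` right after `rsDriver()` was called, the server process probably never started, which points to an installation or driver download problem rather than to the scraping code.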

【Solution 2】:

Thanks to @Alexey's answer and this, the following code worked for me:

library(RSelenium)
library(rvest)
library(wdman)

url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'


#initiate RSelenium. If it doesn't work, try other browser engines
# rD <- rsDriver()
# remDr <- rD$client

pDrv <- phantomjs(port = 4567L)
remDr <- remoteDriver(browserName = "phantomjs", port = 4567L)
remDr$open()
#navigate to main page
remDr$navigate(url_nba)

#find the box and click option 10 (April 14 before playoffs)
webElem <- remDr$findElement(using = 'xpath', value = "//*[@id='forecast-selector']/div[2]/select/option[10]")
webElem$clickElement()

# Save html
webpage <- remDr$getPageSource()[[1]]
# Close RSelenium
remDr$close()
pDrv$stop()

# rD[["server"]]$stop() 


# Select one of the tables and get it to dataframe
webpage_nba <- read_html(webpage) %>% html_table(fill = TRUE)
df <- webpage_nba[[3]]

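# Clean up the data frame: row 3 holds the real column headers
names(df) <- df[3,]
df <- tail(df,-3)
df <- head(df,-4)
# Drop columns whose header is the string "NA"; unlike -which(...),
# logical indexing keeps all columns when no such header exists
df <- df[ , names(df) != "NA"]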
df

【Comments】: