Scraping an HTML webpage using R: answers

【Question title】: Scraping HTML webpage using R
【Posted】: 2018-09-29 10:33:39
【Question description】:

I am searching JFK's website to get the flight schedule. The link to the flight schedule is here:

http://www.flightview.com/airport/JFK-New_York-NY-(Kennedy)/departures

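To start, I am inspecting one of the fields for any given flight and noting its XPath. The idea is to look at the output and then develop the code from there. This is what I have so far: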

library(rvest)

Departure_url <- read_html('http://www.flightview.com/airport/JFK-New_York-NY-(Kennedy)/departures')

Departures <- Departure_url %>% html_nodes(xpath = '//*[@id="ffAlLbl"]') %>% html_text()

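With the code above, I get an empty character vector as the output for the Departures object.

I am not sure why this happens. I am looking for a node that would let me download the whole schedule.

Any help is appreciated!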

【Question comments】:

Tags: html web-scraping html-table nodes rvest

【Solution 1】:

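Scraping that table is a bit tricky.

First, what you are trying to scrape is live content that the browser loads after the initial page, so read_html() alone never sees it. You need to drive a real browser, for example with RSelenium.

Second, the content sits inside an iframe that is itself inside another iframe, so you need to switch to frame twice.

Finally, the content is not an HTML table, so you need to extract each column as a vector and combine the vectors into a table.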
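To illustrate why the XPath in the question comes back empty, here is a minimal sketch that parses a small inline HTML string instead of the live site. The markup is a hypothetical stand-in for the outer page (only the frame name webfidsBox and the id ffAlLbl come from this thread; inner.html is made up). The outer document holds only the iframe element, not the document loaded inside it.

```r
library(rvest)

# Hypothetical stand-in for the outer page: the flight data is not in this
# document, only an <iframe> that points at another document.
outer_page <- read_html(
  '<html><body>
     <iframe name="webfidsBox" src="inner.html"></iframe>
   </body></html>'
)

# The XPath from the question matches nothing in the outer document,
# so html_text() returns an empty character vector.
departures <- outer_page %>%
  html_nodes(xpath = '//*[@id="ffAlLbl"]') %>%
  html_text()
length(departures)

# The frame element itself is present; its content lives in a separate
# document that rvest does not load.
frame_src <- outer_page %>%
  html_node("iframe[name=webfidsBox]") %>%
  html_attr("src")
frame_src
```

This is why the approach that follows switches into the frames with a browser before reading the page source.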

The code below should do the job:

library(RSelenium)
library(rvest)
library(stringr)
library(glue)
library(tidyverse)


# Start a Selenium server and a Chrome session with RSelenium
rmDr <- rsDriver(browser = "chrome")
myclient <- rmDr$client
myclient$navigate("http://www.flightview.com/airport/JFK-New_York-NY-(Kennedy)/departures")
# Switch frames twice: first into the outer iframe, then into the nested one
webElems <- myclient$findElement(using = "css selector", value = "[name=webfidsBox]")
myclient$switchToFrame(webElems)
webElems <- myclient$findElement(using = "css selector", value = "#coif02")
myclient$switchToFrame(webElems)

# Get the page source of the inner frame and parse it with rvest
myPagesource <- read_html(myclient$getPageSource()[[1]])
# Node that holds the flight data (kept for inspection; not used below)
selected_node <- myPagesource %>% html_node("#fvData")
# Get each column (.c1 to .c7) as a character vector in a list
result_list <- map(1:7, ~ myPagesource %>% html_nodes(str_c(".c", .x)) %>% html_text())
# Also collect the 5th and 6th <td> cells and append them to the .c5 and .c6 vectors
result_list2 <- map(c(5, 6), ~ myPagesource %>% html_nodes(glue::glue("tr>td:nth-child({i})", i = .x)) %>% html_text())
result_list[[5]] <- c(result_list[[5]], result_list2[[1]])
result_list[[6]] <- c(result_list[[6]], result_list2[[2]])
# Bind the vectors column-wise and use the first row as the header
result_df <- do.call("cbind", result_list)
colnames(result_df) <- result_df[1, ]
result_df <- as_tibble(result_df[-1, ])

# Close the browser and stop the Selenium server when done
myclient$close()
rmDr$server$stop()

After that you can do some data cleaning.
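As a rough illustration of the combine-and-clean step, the sketch below uses made-up column vectors rather than real scraped output (the airline names, flight numbers, and statuses are hypothetical). It assumes, as in the code above, that the first element of each vector is the header text, and it trims the stray whitespace that html_text() often returns.

```r
library(tibble)

# Hypothetical scraped columns: the first element of each vector is the
# header, and the values carry stray whitespace.
result_list <- list(
  c("Airline", " JetBlue Airways ", "Delta Air Lines"),
  c("Flight", "B6 123", " DL 456"),
  c("Status", "On Time\n", " Departed ")
)

# cbind only lines up correctly when every vector has the same length.
stopifnot(length(unique(lengths(result_list))) == 1)

# Basic cleaning: trim leading and trailing whitespace in every vector.
clean_list <- lapply(result_list, trimws)

# Bind column-wise, promote the first row to column names, then drop it.
result_mat <- do.call("cbind", clean_list)
colnames(result_mat) <- result_mat[1, ]
result_df <- as_tibble(result_mat[-1, , drop = FALSE])
result_df
```

The length check is worth keeping with real data: if one selector matches more or fewer nodes than the others, cbind recycles the shorter vectors and the rows no longer line up.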

【Comments】: