How to scrape a table with rvest and xpath?

【Question title】: How to scrape a table with rvest and xpath?
【Posted】: 2016-02-29 19:06:55
【Question】:

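Following this documentation, I have been trying to scrape a series of tables from marketwatch.com.

The code below targets one of them. The link and the xpath are already included in the code: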

url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
valuation <- url %>%
  html() %>%
  html_nodes(xpath='//*[@id="maincontent"]/div[2]/div[1]') %>%
  html_table()
valuation <- valuation[[1]]

I get the following error:

Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")

Thanks in advance.

【Comments】:

Remove html() and replace it with read_html().
That is not an error, it is a warning. Your code will still run despite it.

Tags: r xpath web-scraping rvest
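
The replacement suggested in the comments can be sketched as follows. This is a minimal, offline illustration: the inline HTML string is a hypothetical stand-in for a downloaded page (it is not the marketwatch.com markup), and it contains a real table element so that html_table() has something to parse.

```r
library(rvest)

# Hypothetical inline HTML standing in for a downloaded page.
html <- '<div id="maincontent">
  <table>
    <tr><th>Metric</th><th>Value</th></tr>
    <tr><td>P/E</td><td>12.5</td></tr>
  </table>
</div>'

# read_html() is the non-deprecated replacement for html()
tables <- html %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="maincontent"]//table') %>%
  html_table()

# html_table() returns a list with one entry per matched table
valuation <- tables[[1]]
print(valuation)
```

With read_html() in place of html(), the deprecation warning should no longer appear; the rest of the pipeline is unchanged.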

【Solution 1】:

That website does not use HTML tables, so html_table() finds nothing. It actually uses div elements with the classes column and data lastcolumn.

So you can do something like

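library(rvest)

url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
# Download and parse the page once, then query it twice
page <- read_html(url)

# Nodes whose class attribute is exactly "column"
valuation_col <- page %>%
  html_nodes(xpath='//*[@class="column"]')

# Nodes whose class attribute is exactly "data lastcolumn"
valuation_data <- page %>%
  html_nodes(xpath='//*[@class="data lastcolumn"]')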

or even

url %>%
  read_html() %>%
  html_nodes(xpath='//*[@class="section"]')

to get you on your way.

Please also read their terms of use, in particular section 3.4.
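
One possible next step, not part of the original answer, is to pull the text out of the matched nodes and pair labels with values. The sketch below runs offline on hypothetical markup that only imitates the class names mentioned above; the tag names, the metric names, and the numbers are made up, and the real page may be structured differently.

```r
library(rvest)

# Hypothetical markup reusing the class names from the answer;
# the contents are invented for illustration.
html <- '
<div class="section">
  <p class="column">P/E Current</p><p class="data lastcolumn">12.5</p>
</div>
<div class="section">
  <p class="column">Price to Book Ratio</p><p class="data lastcolumn">1.8</p>
</div>'
page <- read_html(html)

# Extract the text of each matched node, trimming surrounding whitespace
labels <- page %>%
  html_nodes(xpath = '//*[@class="column"]') %>%
  html_text(trim = TRUE)
values <- page %>%
  html_nodes(xpath = '//*[@class="data lastcolumn"]') %>%
  html_text(trim = TRUE)

# Pairing by position only makes sense if both vectors have the same length
stopifnot(length(labels) == length(values))
valuation <- data.frame(metric = labels, value = values,
                        stringsAsFactors = FALSE)
print(valuation)
```

Pairing by position assumes every label has exactly one value in the same document order, which is why the length check comes before building the data frame.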

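【Discussion】:

How do you find the xpath? Is there a tool for finding it, and could you add that to the answer?
Right-click the element and select "Inspect". Then just read the HTML.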
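
After reading an xpath off the inspector, it can help to check what it actually matches before building on it. A minimal sketch, again using a hypothetical inline document rather than the real page:

```r
library(rvest)

# Hypothetical document; the xpath below is written against this markup only.
page <- read_html('<div id="maincontent"><div><p class="column">P/E Current</p></div></div>')

xp <- '//*[@id="maincontent"]/div[1]/p'
hits <- html_nodes(page, xpath = xp)

# Zero hits usually means the xpath does not match the HTML that was downloaded
print(length(hits))
print(html_text(hits))
```

Keep in mind that the HTML shown in the browser inspector can differ from what read_html() downloads, for example when the page is modified by JavaScript after loading.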