Using rvest to scrape a website - Selecting html node? (Answers)

【Question title】: Using rvest to scrape a website - Selecting html node?
【Posted】: 2017-03-22 07:17:45
【Question】:

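I have a question about my latest rvest scrape.

I want to scrape this page (and the same page for a few other stocks): http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1

I need a list of market caps, i.e. the first box in the second row. The list should cover about 50-100 stocks.

I am using rvest for this.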

library(rvest)

html = read_html("http://www.finviz.com/quote.ashx?t=A")

cast = html_nodes(html, "table-dark-row")

The problem is that I can't get past html_nodes. Any ideas on how to find the right node to pass to html_nodes?

I am inspecting the page with Firebug/FireFinder.

【Question comments】:

Tags: r quantmod rvest

【Answer 1】:

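Not sure whether this is what you want, since I couldn't find a list with approx. 50-100 stocks.

But for what it's worth, using SelectorGadget I came up with this node, .table-dark-row:nth-child(2) .snapshot-td2:nth-child(2), to select the market cap (the first box in the second row on this page: http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1).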

> library(rvest)
> 
> html = read_html("http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1")
> 
> cast = html_nodes(html, ".table-dark-row:nth-child(2) .snapshot-td2:nth-child(2)")
> cast
{xml_nodeset (1)}
[1] <td width="8%" class="snapshot-td2" align="left">\n  <b>11.58B</b>\n</td>
>

If this is not what you want, just use SelectorGadget to find the node you need.

Hope this helps.
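
As a side note on why the selector in the question returned nothing: "table-dark-row" has no leading dot, so CSS reads it as an element name rather than a class. The sketch below shows the difference on a small inline HTML fragment. The fragment is a made-up, heavily simplified stand-in for the real page, not a copy of it.

```r
library(rvest)

# Hypothetical, simplified stand-in for the real page markup.
page <- read_html(paste0(
  "<html><body><table>",
  "<tr class=\"table-dark-row\">",
  "<td class=\"snapshot-td2\"><b>11.58B</b></td>",
  "</tr>",
  "</table></body></html>"
))

# Without the dot, the selector looks for a <table-dark-row> element.
no_dot <- html_nodes(page, "table-dark-row")

# With the dot, the selector matches elements whose class is table-dark-row.
with_dot <- html_nodes(page, ".table-dark-row")

# Descendant selector: a .snapshot-td2 cell inside a .table-dark-row row.
cell_text <- html_text(html_nodes(page, ".table-dark-row .snapshot-td2"), trim = TRUE)

length(no_dot)
length(with_dot)
cell_text
```

The same rule applies to the SelectorGadget output above: every class name in it is prefixed with a dot.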

Edit:

Here is the full solution:

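library(rvest)

# Download and parse the quote page
html = read_html("http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1")

# Select the market cap cell (first box in the second row)
cast = html_nodes(html, ".table-dark-row:nth-child(2) .snapshot-td2:nth-child(2)")

# Extract the text, drop the "B" suffix, and convert to a number
html_text(cast) %>%
    gsub(pattern = "B", replacement = "") %>%
    as.numeric()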
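
The gsub() call above only strips a "B" and leaves the value in whatever unit the page uses. If other stocks show a different suffix, something along these lines might be more robust. This is only a sketch in base R: parse_market_cap is a hypothetical helper name, and the suffix meanings (K, M, B, T as thousand, million, billion, trillion) are an assumption, not something confirmed by the page.

```r
# Hypothetical helper: turn strings such as "11.58B" into plain numbers.
# Assumed suffix meanings: K = 1e3, M = 1e6, B = 1e9, T = 1e12.
parse_market_cap <- function(x) {
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9, T = 1e12)
  x <- trimws(x)
  last_char <- substr(x, nchar(x), nchar(x))
  has_suffix <- last_char %in% names(multipliers)
  # Drop the suffix where there is one, then parse what remains.
  digits <- ifelse(has_suffix, substr(x, 1, nchar(x) - 1), x)
  value <- suppressWarnings(as.numeric(digits))
  # Scale by the suffix multiplier; values without a suffix stay as they are.
  scale <- ifelse(has_suffix, multipliers[last_char], 1)
  unname(value * scale)
}

# Strings that cannot be parsed (such as "-") come back as NA.
caps <- parse_market_cap(c("11.58B", "850.2M", "-"))
caps
```

With the code above, the call would be parse_market_cap(html_text(cast, trim = TRUE)).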

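【Comments】:

That looks legit. I need to figure out how to extract the number from the string.
Use the html_text() function from the same rvest package. html_text(cast) gives you "12.76B"; then, to convert it to a number, you need to strip the B (I don't know what it stands for). I have edited my answer. See the full solution there.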
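
For the 50-100 stocks mentioned in the question, the same steps could be wrapped in a function and applied to a vector of tickers. The following is a rough sketch, not a tested scraper: quote_url and market_cap_text are hypothetical helper names, the URL pattern is assumed from the t= parameter in the question's links, and the selector from the answer may no longer match if the page layout has changed.

```r
library(rvest)

# Hypothetical helper: build the quote URL for one ticker, assuming the
# t= query parameter seen in the question's URLs.
quote_url <- function(ticker) {
  paste0("http://www.finviz.com/quote.ashx?t=", ticker)
}

# Hypothetical helper: fetch one page and return the market cap text.
# Returns NA when the selector from the answer matches nothing.
market_cap_text <- function(ticker) {
  page <- read_html(quote_url(ticker))
  cell <- html_nodes(page, ".table-dark-row:nth-child(2) .snapshot-td2:nth-child(2)")
  text <- html_text(cell, trim = TRUE)
  if (length(text) == 0) NA_character_ else text[1]
}

# Placeholder tickers taken from the question's URLs.
tickers <- c("AA", "A")

# Uncomment to run against the live site (needs network access):
# caps <- vapply(tickers, market_cap_text, character(1))
```

Adding a short pause between requests, for example with Sys.sleep(), is a common courtesy when looping over many pages.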