Web scraping with R and Selector Gadget: answers

【Question title】: Web scraping with R and selector gadget
【Posted】: 2017-10-27 23:11:54
【Question】:

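I am trying to scrape data from a website using R. I am using rvest and following an example that scrapes the IMDB page for The Lego Movie. The example recommends a tool called Selector Gadget to help identify the html_node associated with the data you want to extract.

Ultimately I want to build a data frame with the following schema/columns: rank, blog_name, facebook_fans, twitter_followers, alexa_rank.

I was able to use Selector Gadget to correctly identify the HTML tags used in the Lego example. However, following the same process and the same code structure as the Lego example, I get NA (... using first; NAs introduced by coercion; [1] NA). My code is below: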

data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
  html_node(".stats") %>%
  html_text() %>%
  as.numeric()

I have also tried html_node(".stats , .stats span"), which seems to work for the "Facebook fans" column, since it reports 714 matches, but it returns only one number.

714 matches for .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')] | .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')]/descendant-or-self::*/span: using first
{xml_node}
<td>
[1] <span>997,669</span>

【Comments】:

Tags: r web-scraping html-parsing rvest

【Solution 1】:

This may help you:

library(rvest)

d1 <- read_html("http://blog.feedspot.com/video_game_news/")

stats <- d1 %>%
    html_nodes(".stats") %>%
    html_text()

blogname <- d1%>%
    html_nodes(".tlink") %>%
    html_text()

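Note that this uses html_nodes (plural), which returns every matching node; html_node (singular) returns only the first match.

Result: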

> head(blogname)
[1] "Kotaku - The Gamer's Guide" "IGN | Video Games"          "Xbox Wire"                  "Official PlayStation Blog" 
[5] "Nintendo Life "             "Game Informer" 

> head(stats,12)
 [1] "997,669"    "1,209,029"  "873"        "4,070,476"  "4,493,805"  "399"        "23,141,452" "10,210,993" "879"       
[10] "38,019,811" "12,059,607" "500"

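blogname returns a list of blog names that is easy to work with. stats, on the other hand, is a mixed bag, because the cells for Facebook fans, Twitter followers, and Alexa rank all share the same stats class and cannot be told apart by selector. In this case the output vector carries the information in repeating groups of three, i.e. stats = c(fb, tw, alx, fb, tw, alx, ...). You should split it into one vector per column by taking every third element: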

FBstats = stats[seq(1,length(stats),3)]

> head(stats[seq(1,length(stats),3)])
[1] "997,669"    "4,070,476"  "23,141,452" "38,019,811" "35,977"     "603,681"
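
The same every-third-element idea can be extended to all three columns at once and combined with the blog names into the data frame the question asks for. The following is only a sketch in base R. It uses the first few values printed above instead of fetching the page, and it assumes that every blog contributes exactly three stats entries and that rank is simply the position on the page; neither assumption is checked against the live site.

```r
# Sample values copied from the head(blogname) and head(stats, 12) output above.
blogname <- c("Kotaku - The Gamer's Guide", "IGN | Video Games",
              "Xbox Wire", "Official PlayStation Blog")
stats <- c("997,669", "1,209,029", "873",
           "4,070,476", "4,493,805", "399",
           "23,141,452", "10,210,993", "879",
           "38,019,811", "12,059,607", "500")

# Assumption: exactly three stats per blog, in the order fb, tw, alx.
stopifnot(length(stats) == 3 * length(blogname))

# Strip the thousands separators, then convert to numeric.
stats_num <- as.numeric(gsub(",", "", stats, fixed = TRUE))

# Fill row by row so each row holds one blog's (fb, tw, alx) triple.
m <- matrix(stats_num, ncol = 3, byrow = TRUE)

blogs <- data.frame(
  rank              = seq_along(blogname),  # assumption: rank = page order
  blog_name         = trimws(blogname),
  facebook_fans     = m[, 1],
  twitter_followers = m[, 2],
  alexa_rank        = m[, 3],
  stringsAsFactors  = FALSE
)
print(blogs)
```

Reshaping with matrix(..., byrow = TRUE) avoids writing three separate seq() calls, but it fails silently if a row on the page is missing a value, which is why the length check comes first.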

【Comments】:

【Solution 2】:

You can use html_table to extract the whole table with minimal work:

library(rvest)
library(tidyverse)

# scrape html
h <- 'http://blog.feedspot.com/video_game_news/' %>% read_html()

game_blogs <- h %>% 
    html_node('table') %>%    # select enclosing table node
    html_table() %>%    # turn table into data.frame
    set_names(make.names) %>%    # make names syntactic
    mutate(Blog.Name = sub('\\s?\\+.*', '', Blog.Name)) %>%    # extract title from name info
    mutate_at(3:5, parse_number) %>%    # make numbers actually numbers
    tbl_df()    # for printing

game_blogs
#> # A tibble: 119 x 5
#>     Rank                  Blog.Name Facebook.Fans Twitter.Followers Alexa.Rank
#>    <int>                      <chr>         <dbl>             <dbl>      <dbl>
#>  1     1 Kotaku - The Gamer's Guide        997669           1209029        873
#>  2     2          IGN | Video Games       4070476           4493805        399
#>  3     3                  Xbox Wire      23141452          10210993        879
#>  4     4  Official PlayStation Blog      38019811          12059607        500
#>  5     5              Nintendo Life         35977             95044      17727
#>  6     6              Game Informer        603681           1770812      10057
#>  7     7            Reddit | Gamers       1003705            430017         25
#>  8     8                    Polygon        623808            485827       1594
#>  9     9   Xbox Live's Major Nelson         65905            993481      23114
#> 10    10                      VG247        397798            202084       3960
#> # ... with 109 more rows

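It is worth checking that everything was parsed the way you intended, but at this point it should be usable.

【Comments】:

This looks cool, but I can't reproduce your result. Error:
game_blogs <- h %>%
    html_node('table') %>%    # select enclosing table node
    html_table() %>%    # turn table into data.frame
    set_names(make.names)
Error: `x` and `nm` must be the same length
Ah! Sorry, that was using the development version of purrr::set_names, which can take a function. You can install it from GitHub, or just use set_names(make.names(names(.))) to do the same thing.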
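
For readers who prefer to avoid the purrr dependency for this step, the renaming can also be done in base R. This is a small sketch, not the answerer's code: the data frame df is a hypothetical stand-in for what html_table() returns, and its column names are inferred from the printed tibble header in the answer.

```r
# Hypothetical stand-in for the scraped table (one row, three columns).
df <- setNames(
  data.frame(3L, "Xbox Wire", "23,141,452", stringsAsFactors = FALSE),
  c("Rank", "Blog Name", "Facebook Fans")
)

# make.names() turns each name into a syntactically valid R name,
# replacing the spaces with dots.
names(df) <- make.names(names(df))
print(names(df))
# [1] "Rank"          "Blog.Name"     "Facebook.Fans"
```

After this step the columns can be referred to as df$Blog.Name and df$Facebook.Fans without backticks, which is what the later mutate calls in the answer rely on.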

【Solution 3】:

This uses html_nodes (plural) and str_replace_all to remove the commas from the numbers. I am not sure whether these are all the stats you need.

library(rvest)
library(stringr)
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
  html_nodes(".stats") %>%
  html_text() %>%
  str_replace_all(',', '') %>%
  as.numeric()
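
The comma removal matters because as.numeric() cannot parse a thousands separator, which is where the "NAs introduced by coercion" warning in the question comes from. A minimal base R illustration, using three values from the scraped output and gsub() in place of stringr's str_replace_all (a substitution made here only to keep the sketch dependency-free):

```r
raw <- c("997,669", "1,209,029", "873")

# Entries containing a comma cannot be parsed and become NA.
# The coercion warning is suppressed here only to keep the output clean.
direct <- suppressWarnings(as.numeric(raw))

# Removing the commas first lets every entry parse.
cleaned <- as.numeric(gsub(",", "", raw, fixed = TRUE))

print(direct)
# [1]  NA  NA 873
print(cleaned)
# [1]  997669 1209029     873
```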

【Comments】: