Webscraping HTML data using R: answers

【Question title】: Webscraping HTML data using R
【Posted】: 2020-02-09 03:08:27
【Question】:

I am trying to use R to scrape some data from a website and put it into a data frame.

library(dplyr)
library(rvest)
library(RCurl)

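website1 <- read_html("https://www.worldatlas.com/cntycont.htm")  # assumed URL, taken from the answer below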
website1 %>%
  html_nodes(".miscTxt a") %>%
  html_text() -> countries_list

countries_list

I have hit a roadblock: I am not sure how to assign the continents to the countries, and I need help.

【Comments】:

Tags: r web-scraping rvest

【Solution 1】:

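Here is a script that scrapes the continents and countries into a data frame.

Iterate over the li nodes of the page source
Use the helper function to extract the continent and country from each li
Store the data in a data frame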

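# Helper function: return the continent and country for a single <li> node
helper <- function(li) {
  # Serialize the node to HTML, then capture the continent folder from its link.
  # gsub() returns its input unchanged when nothing matches; such rows are filtered out below.
  c(continent = gsub('.*href="https://www\\.worldatlas\\.com/webimage/countrys/([A-Za-z .]*?)/.*\\.htm".*',
                     '\\1', as.character(li), perl = TRUE),
    country = rvest::html_text(li))
}

# Scrape the data
u <- 'https://www.worldatlas.com/cntycont.htm'
continents <- c('africa', 'asia', 'europe', 'namerica', 'oceania', 'samerica')
m <- t(vapply(rvest::html_nodes(xml2::read_html(u), 'li'), helper, FUN.VALUE = character(2)))

# Make a clean data frame, keeping only rows whose continent was recognized
df <- data.frame(m, stringsAsFactors = FALSE)
df <- df[df$continent %in% continents, ]
rownames(df) <- NULL  # renumber rows 1..n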

# A glimpse
head(df)
#   continent  country
# 1    africa  Algeria
# 2    africa   Angola
# 3    africa    Benin
# 4    africa Botswana
# 5    africa  Burkina
# 6    africa  Burundi

tail(df)
#     continent   country
# 189  samerica    Guyana
# 190  samerica  Paraguay
# 191  samerica      Peru
# 192  samerica  Suriname
# 193  samerica   Uruguay
# 194  samerica Venezuela
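
To see why the script filters on the continents vector, it may help to try the pattern offline. The following is a minimal base R sketch, not part of the original answer. The three hrefs are hypothetical examples shaped like the links the helper expects, and the pattern is a simplified version of the helper's that is anchored to a bare URL instead of a whole li element.

```r
# Hypothetical hrefs for illustration; only the first two point into a continent folder
hrefs <- c(
  "https://www.worldatlas.com/webimage/countrys/africa/dz.htm",
  "https://www.worldatlas.com/webimage/countrys/asia/jp.htm",
  "https://www.worldatlas.com/aatlas/world.htm"
)

# Simplified pattern modeled on the helper's: capture the folder name after countrys/
pattern <- "^https://www\\.worldatlas\\.com/webimage/countrys/([A-Za-z .]*?)/.*\\.htm$"

# sub() returns its input unchanged when the pattern does not match,
# so the third element comes back as the full URL rather than a continent
raw <- sub(pattern, "\\1", hrefs, perl = TRUE)

# One possible alternative to filtering afterwards: mark non-matching links as NA
continent <- ifelse(grepl(pattern, hrefs, perl = TRUE), raw, NA_character_)
print(continent)
# "africa" "asia" NA
```

In the script above, li elements that are not country links behave like the third href here: the continent field ends up holding the unmodified input, and the df$continent %in% continents filter is what removes those rows.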

【Comments】: