How to scrape data from Wikipedia using R

【Question title】: How to scrape data from Wikipedia using R
【Posted】: 2015-10-27 05:38:39
【Question】:

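I need to build a table in R listing clothing retailers by country, using the page https://en.wikipedia.org/wiki/Category:Clothing_brands_by_country.

I looked through various links but could not find anything that works. The basic requirement for now is to extract the links from the page, then open each one and scrape data from it.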

library(XML)
library(RCurl)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))


path<-"https://en.wikipedia.org/wiki/Category:Clothing_brands_by_country"
webpage <- getURL(path)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE, encoding=FALSE)

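【Comments on the question】:

I have no choice... I know R is meant for standard analytics rather than scraping. This could be done in Python, but unfortunately that is not an option.
Check out the rvest package and its demos; that may help you further. Or just copy and paste the information.
What does your code currently do?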
cran.r-project.org/web/packages/WikipediaR/index.html
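
The rvest suggestion in the comments could look roughly like the sketch below. It is not the asker's or the answerer's code. To keep it runnable offline, it parses a small inline HTML snippet whose markup is invented for illustration; for the live page you would call `read_html()` on the category URL instead, and the CSS selector is an assumption that may need adjusting to the real markup.

```r
library(rvest)

# Inline stand-in for the category page (hypothetical markup, for illustration only)
page <- minimal_html('
  <div class="mw-content-ltr">
    <ul>
      <li><a href="/wiki/Category:Clothing_brands_of_France"
             title="Category:Clothing brands of France">France</a></li>
      <li><a href="/wiki/Category:Clothing_brands_of_Italy"
             title="Category:Clothing brands of Italy">Italy</a></li>
    </ul>
  </div>')

# Select the link nodes with a CSS selector (assumed; adapt to the real page)
links <- html_elements(page, "div.mw-content-ltr ul li a")

# Build one row per link: its title attribute and its absolute URL
result <- data.frame(
  title = html_attr(links, "title"),
  url   = paste0("https://en.wikipedia.org", html_attr(links, "href")),
  stringsAsFactors = FALSE
)
print(result)
```

Each URL in `result$url` could then be fetched the same way to collect the brand links on the per-country pages.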

Tags: r web-scraping wikipedia

【Solution 1】:

Figured it out; I had not realized that the HTML was the main problem:

library(XML)
library(RCurl)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
path<-"http://en.wikipedia.org/wiki/Category:Clothing_brands_by_country"
webpage <- getURL(path)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE, encoding=FALSE) 
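# XPath for the per-country subcategory links (class string matched exactly as written)
q <- '//a[@class="CategoryTreeLabel  CategoryTreeLabelNs14 CategoryTreeLabelCategory"]'

# Collect the relative URL of each subcategory
a <- xpathSApply(pagetree, q, xmlGetAttr, 'href')
t <- gsub('\\s', '', a)
x <- data.frame(t, stringsAsFactors = FALSE)
# Position of the first "of_"; the country name starts 3 characters after it
x$pos <- as.integer(regexpr('of_', x$t, fixed = TRUE))
x$country <- substring(x$t, x$pos + 3)
x$url <- paste0("https://en.wikipedia.org", x$t)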

# Fetch one country's category page and return its link titles tagged with the country
get_brands <- function(url, country) {
  webpage <- getURL(url)
  webpage <- readLines(tc <- textConnection(webpage)); close(tc)
  pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE, encoding=FALSE)
  q <- '//div[@class="mw-content-ltr"]//ul/li/a'
  a <- xpathSApply(pagetree, q, xmlGetAttr, 'title')
  n <- data.frame(a)
  n$country <- country
  n
}

# The original handled row 1 separately and then looped over the hard-coded range 2:25;
# seq_len(nrow(x)) covers every row that was actually found
fin <- do.call(rbind, lapply(seq_len(nrow(x)), function(i) get_brands(x$url[i], x$country[i])))
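
The country-name step in the answer (find `of_` in each link, then keep what follows) can be checked in isolation with base R. The following is a minimal sketch, not the answerer's code: `country_from_href` is a hypothetical helper name, the hrefs are made-up examples in the shape the answer expects, and the per-country data frames are placeholders for the scraped results.

```r
# Hypothetical hrefs in the shape the answer's XPath is expected to return
hrefs <- c("/wiki/Category:Clothing_brands_of_France",
           "/wiki/Category:Clothing_brands_of_the_United_States")

# Keep everything after the first "of_"; return NA when the marker is absent
country_from_href <- function(href) {
  pos <- regexpr("of_", href, fixed = TRUE)
  out <- substring(href, pos + 3)
  out[pos < 0] <- NA_character_
  out
}

countries <- country_from_href(hrefs)

# Placeholder per-country results standing in for the scraped brand lists
per_country <- lapply(countries, function(cn) {
  data.frame(brand = paste("Brand", 1:2), country = cn, stringsAsFactors = FALSE)
})

# Combine all per-country data frames in one call instead of growing `fin` in a loop
fin <- do.call(rbind, per_country)
print(fin)
```

Using `regexpr` rather than `gregexpr` returns a single position per string, so a link that happens to contain `of_` more than once still yields one start index.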

【Comments】: