R - Web Scraping

【Question title】: R - Web Scraping
【Posted】: 2017-02-27 13:31:14
【Question description】:

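First of all, thank you very much for your help and consideration.

I would like to extract some content from this page: http://ieeexplore.ieee.org/document/6875970/keywords. Once you are a member of the website, I am interested in scraping the information under these headings:

IEEE Keywords
INSPEC: Controlled Indexing
INSPEC: Non-Controlled Indexing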

install.packages("rvest")
library(rvest)
keywords

keywords %>% a% html_text(a)

But it doesn't work!

Could you please help me?

Thank you very much!

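【Comments】:

Could you be more specific about "But it doesn't work!"?
Yes, of course! What I want is to extract all the keywords from the website, e.g. geospatial analysis, decision making, ... But when I run my few lines of code, what I get is "Error in tokenize(css): Unexpected character '/' found at position 5". So I am afraid my code does not do what I expect. I am an R beginner and have only gone through some R tutorials (Lego_movies, but that page is written in HTML). If I am not mistaken, my web page is written in JavaScript. Thanks for your help :)
I thought about reformatting this for you, but you have some strange things going on, at least at %>% a <-, that should not be there. Beyond that, the site you wish to scrape has terms of use... "Guest/member users may not do the following: [...] transmit electronically, via email or any other file transfer protocol, any portion of IEEE Xplore." You may want to consider whether you are allowed to use this information once you have obtained it.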

Tags: r web web-scraping

【Solution 1】:

# if loading either package fails, run install.packages('jsonlite') and/or install.packages('stringr') first
library(jsonlite)
library(stringr)

# the ieee doc url we're interested in
url <- 'http://ieeexplore.ieee.org/document/6875970/keywords'

# read the document as text, no HTML parsing at all
doc <- readLines(url)

# after inspecting it, we notice it's built with angular-js,
# and the data we need to extract is defined as a single javascript variable.

# so, we first find the id of the line which defines 
# the javascript variable containing the tags
idx <- which(!is.na(str_match(doc, 'global.document.metadata=')))

# get the line ;-)
line <- doc[idx]

# since it's a javascript variable, we need to massage it
# a little to be able to read it as json.
# step 1: remove the "var global.document.metadata=" part (everything before the actual json)
line <- str_replace(line, '^[^{]*', '')

# step 2: remove the trailing ';' symbol
line <- str_replace(line, ';$', '')

# now we can parse the json data
df <- fromJSON(line)

# and get the information we need
df$keywords[df$keywords$type == 'IEEE Keywords',]$kwd[[1]]

df$keywords[df$keywords$type == 'INSPEC: Controlled Indexing',]$kwd[[1]]

df$keywords[df$keywords$type == 'INSPEC: Non-Controlled Indexing',]$kwd[[1]]

Sample output:

[1] "CTC incident datasets"                             
[2] "proactive spatiotemporal resource allocation"      
[3] "predictive visual analytics"                  
[4] "community policing"                               
[5] "law enforcement"
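
Since the live page may change or require a login, here is a minimal offline sketch of the same steps (find the line, strip the JavaScript assignment, parse the JSON, pick one keyword type). The `doc` vector is a made-up stand-in for what `readLines(url)` returns, and `keywords_of` is a hypothetical helper, not part of jsonlite or of the answer above. It uses base R `grep`/`sub` in place of stringr, which is a substitution on my part.

```r
library(jsonlite)

# Hypothetical stand-in for readLines(url): a few lines of page source,
# one of which assigns a JSON object to a JavaScript variable.
doc <- c(
  '<html><head><script type="text/javascript">',
  'var global = {};',
  'global.document.metadata={"title":"Example","keywords":[{"type":"IEEE Keywords","kwd":["alpha","beta"]},{"type":"INSPEC: Controlled Indexing","kwd":["gamma"]}]};',
  '</script></head></html>'
)

# Locate the line with a literal (non-regex) match and stop early
# if the page layout is not what we expect.
idx <- grep('global.document.metadata=', doc, fixed = TRUE)
stopifnot(length(idx) == 1)

# Drop everything before the first '{', then the trailing ';'.
json_text <- sub('^[^{]*', '', doc[idx])
json_text <- sub(';[[:space:]]*$', '', json_text)

meta <- fromJSON(json_text)

# Hypothetical helper: return the keyword vector for one keyword type,
# or an empty character vector if that type is absent.
keywords_of <- function(meta, type) {
  hit <- meta$keywords$type == type
  if (!any(hit)) return(character(0))
  meta$keywords$kwd[hit][[1]]
}

ieee_kw <- keywords_of(meta, 'IEEE Keywords')
print(ieee_kw)
```

The `stopifnot` check is there so that a change in the page source produces an immediate error instead of an empty result further down.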

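【Discussion】:

It works!!! Thank you so much, Leonardo Foderaro! You are the best!!!! And super clear!
Really, thank you! Very, very helpful! If one day I can help you with finance or statistics, please let me know :)
(I didn't mention finance because it is far beyond my skills, haha)
Leonardo, could you send me an email? I have a question that I think you can answer. rms7272 at gmail.com, just replace the at with @
Hi Ryan, I just sent you an email with the subject "Contact from StackOverflow".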