【Question title】: Missing information when crawling data
【Posted】: 2017-07-18 09:15:33
【Question description】:

I want to use R to crawl all the AlphaGo-related news on XXX (title, url, text). The page url is http://www.xxxxxx.com/search/?q=AlphaGo. Here is my code:

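library(RCurl)    # getCurlHandle(), getURL(), debugGatherer()
library(XML)      # htmlParse(), xpathSApply()
library(stringr)  # str_c()

url <- "http://www.xxxxxx.com/search/?q=AlphaGo"
info <- debugGatherer()
handle <- getCurlHandle(cookiejar = "",
                        # follow HTTP redirects
                        followlocation = TRUE,
                        autoreferer = TRUE,
                        debugfunc = info$update,
                        verbose = TRUE,
                        httpheader = list(
                          from = "eddie@r-datacollection.com",
                          'user-agent' = str_c(R.version$version.string,
                                               ",", R.version$platform)
                        ))
# header = TRUE prepends the response headers to the returned text
html <- getURL(url, curl = handle, header = TRUE)
parsedpage <- htmlParse(html)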

However, when I run

xpathSApply(parsedpage,"//h3//a",xmlGetAttr,"href")

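to check whether the target nodes were found, all the content for the relevant news turns out to be missing. I then noticed that the DOM elements shown after pressing F12 (I use Chrome) contain the information I want, while the page source contains none of it (it is a real mess, with all the elements piled together). So I changed the code to: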

parsed_page <- htmlTreeParse(file = url, asTree = TRUE)

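hoping to get the DOM tree. The information was still missing, though, and I found that the missing parts are exactly the ones that appear collapsed in the DOM elements view (I have never run into this before).

Any idea how this happens and how to solve it?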

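【Question comments】:

  • What is your desired output? A list of the urls, or of the text of each page?
  • Both. Is there something wrong with my code?
  • You are violating item 3 of the CNN ToC. Please make sure you tell others that you are asking them to help you commit an unethical act that could get them fined or jailed.
  • Dear @hrbrmstr, thanks for your advice. I will remove the related information but leave the general question itself. It is also purely for academic and personal use, but I fully understand your concern. Thanks.

Tags: html r dom web-crawler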


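【Solution 1】:

The problem is not your code. The results page is generated dynamically, so the links and text are not present in the plain HTML of the results page (as you saw when viewing the source).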
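A minimal sketch of what that means in practice, using rvest and two made-up HTML strings (both hypothetical, not taken from the real site). The first stands in for the raw source the server sends, where a script fills in the results only once a browser runs it. The second stands in for the DOM you see after pressing F12. An HTML parser never executes the script, so the XPath from the question matches nothing in the first string.

```r
library(rvest)

# Hypothetical raw source: the result list stays empty until a browser runs the script
static_source <- '
<html><body>
  <div id="results"></div>
  <script>
    var h3 = document.createElement("h3");
    var a = document.createElement("a");
    a.href = "/story-1.html";
    a.textContent = "Story 1";
    h3.appendChild(a);
    document.getElementById("results").appendChild(h3);
  </script>
</body></html>'

# Hypothetical rendered DOM: what the browser holds after the script has run
rendered_dom <- '
<html><body>
  <div id="results"><h3><a href="/story-1.html">Story 1</a></h3></div>
</body></html>'

# Same XPath as in the question, applied to both versions
get_links <- function(html) {
  read_html(html) %>%
    html_nodes(xpath = "//h3//a") %>%
    html_attr("href")
}

links_static <- get_links(static_source)
links_rendered <- get_links(rendered_dom)

length(links_static)  # no links in the raw source
links_rendered        # the link only exists in the rendered DOM
```

So the data has to come either from wherever the script loads it, or from a tool that actually runs the page's JavaScript.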

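There are only 10 results, so I suggest you build the list of urls by hand.

I don't know the packages you use in this code, but I suggest you go with rvest, which seems much simpler than the ones you are using.

For example: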

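library(rvest)
library(tidyverse)  # optional here: rvest already re-exports %>%

url <- "http://money.cnn.com/2017/05/25/technology/alphago-china-ai/index.html"

# Download the article and keep the text of each paragraph in the story body
url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="storytext"]/p') %>%
  html_text()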

 [1] " A computer system that Google engineers trained to play the game Go beat the world's best human player Thursday in China. The victory was AlphaGo's second this week over Chinese professional Ke Jie, clinching the best-of-three series at the Future of Go Summit in Wuzhen.  "                                  
 [2] " Afterward, Google engineers said AlphaGo estimated that the first 50 moves -- by both players -- were virtually perfect. And the first 100 moves were the best anyone had ever played against AlphaGo's master version. "                                                                                           
 [3] " Related: Google's man-versus-machine showdown is blocked in China "                                                                                                                                                                                                                                                 
 [4] " \"What an amazing and complex game! Ke Jie pushed AlphaGo right to the limit,\" said DeepMind CEO Demis Hassabis on Twitter. DeepMind is a British artificial intelligence company that developed AlphaGo and was purchased by Google in 2014. "                                                                    
 [5] " DeepMind made a stir in January 2016 when it first announced it had used artificial intelligence to master Go, a 2,500-year-old game. Computer scientists had struggled for years to get computers to excel at the game. "                                                                                          
 [6] " In Go, two players alternate placing white and black stones on a grid. The goal is to claim the most territory. To do so, you surround your opponent's pieces so that they're removed from the board. "                                                                                                             
 [7] " The board's 19-by-19 grid is so vast that it allows a near infinite combination of moves, making it tough for machines to comprehend. Games such as chess have come quicker to machines. "                                                                                                                          
 [8] " Related: Elon Musk's new plan to save humanity from AI "                                                                                                                                                                                                                                                            
 [9] " The Google engineers at DeepMind rely on deep learning, a trendy form of artificial intelligence that's driving remarkable gains in what computers are capable of. World-changing technologies that loom on the horizon, such as autonomous vehicles, rely on deep learning to effectively see and drive on roads. "
[10] " AlphaGo's achievement is also a reminder of the steady improvement of machines' ability to complete tasks once reserved for humans. As machines get smarter, there are concerns about how society will be disrupted, and if all humans will be able to find work. "                                                 
[11] " Historically, mankind's development of tools has always created new jobs that never existed before. But the gains in artificial intelligence are coming at a breakneck pace, which will likely accentuate upheaval in the short term. "                                                                             
[12] " Related: Google uses AI to help diagnose breast cancer "                                                                                                                                                                                                                                                            
[13] " The 19-year-old Ke and AlphaGo will play a third match Saturday morning. The summit will also feature a match Friday in which five human players will team up against AlphaGo. "      
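
The same extraction can be wrapped in a function and applied to a hand-made list of pages. This is only a sketch: `get_story_text` is a hypothetical helper name, and the two inline HTML strings stand in for real article urls so that it runs offline. It also assumes every article keeps its text in the same `storytext` element, which may not hold for every page.

```r
library(rvest)

# Takes anything read_html() accepts (a url, a file path, or an HTML string)
# and returns the story paragraphs joined into one character string.
get_story_text <- function(source) {
  paragraphs <- read_html(source) %>%
    html_nodes(xpath = '//*[@id="storytext"]/p') %>%
    html_text()
  paste(trimws(paragraphs), collapse = "\n")
}

# Stand-in pages (made up); in practice this would be the vector of article urls
pages <- c(
  first  = '<html><body><div id="storytext"><p> One. </p><p> Two. </p></div></body></html>',
  second = '<html><body><div id="storytext"><p> Three. </p></div></body></html>'
)

# One row per page, with the extracted text next to the page name
stories <- data.frame(
  page = names(pages),
  text = vapply(pages, get_story_text, character(1)),
  stringsAsFactors = FALSE,
  row.names = NULL
)

stories
```

With real urls it would be polite to add a pause such as `Sys.sleep(1)` between requests.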

Best,

Colin

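【Comments】:

  • I thought your approach over carefully. Starting from the page I provided, this looks like a huge project even with rvest, because all you do here is parse the HTML of each individual news page, which is certainly the easy part. What if we need to crawl the urls rather than build the list ourselves?

【Solution 2】:

Following the idea from @Colin, I tried to carry on from my original code. The dynamic content comes from a JSON file, so I read it with the RJSONIO package as follows: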

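library(RJSONIO)

url <- "https://search.xxxxxx.io/content?q=AlphaGo"
content <- fromJSON(url)
content1 <- content$result
# One row per result: source, publish date, headline, body, url
content_result <- matrix(NA_character_, length(content1), 5)
for (i in seq_along(content1)) {
  item <- content1[[i]]
  # A missing headline comes back as NULL; store NA so the row keeps 5 values
  headline <- if (is.null(item$headline)) NA_character_ else item$headline
  content_result[i, ] <- c("CNN", item$firstPublishDate, headline,
                           item$body, item$url)
}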
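
As an alternative sketch of the same idea, the jsonlite package (a swap for RJSONIO, not what the code above uses) can turn an array of JSON records into a data frame directly, filling any field that is missing from a record with NA. The JSON string below is invented for illustration. Only the field names are taken from the code above, and the real endpoint's response may be shaped differently.

```r
library(jsonlite)

# Made-up response with the same field names; the second record has no headline
json_text <- '{
  "result": [
    {"firstPublishDate": "2017-01-01", "headline": "Story A",
     "body": "Text of story A", "url": "http://example.com/a"},
    {"firstPublishDate": "2017-01-02",
     "body": "Text of story B", "url": "http://example.com/b"}
  ]
}'

# fromJSON() simplifies an array of records into a data frame by default
results <- fromJSON(json_text)$result

# The missing headline shows up as NA rather than breaking the row
results[, c("firstPublishDate", "headline", "url")]
```

This avoids sizing the matrix in advance and checking each field for NULL by hand.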

【Comments】:
