【Question title】: Web scraping into R multiple links with similar URL using a for loop or lapply
【Posted】: 2016-05-01 06:00:37
【Question】:

This code scrapes, from http://www.bls.gov/schedule/news_release/2015_sched.htm, every date that has "Employment Situation" under the "Release" column.

library(rvest)

pg <- read_html("http://www.bls.gov/schedule/news_release/2015_sched.htm")

# target only the <td> elements under the bodytext div
body <- html_nodes(pg, "div#bodytext")

# we use this new set of nodes and a relative XPath to get the initial <td> elements, then get their siblings
es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")

# clean up the cruft and make our dates!
nfpdates2015 <- as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")

###thanks @hrbrmstr for this###

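I would like to repeat this for other URLs covering other years, which are named the same way with only the year number changing. In particular, for the following URLs: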

#From 2008 to 2015
http://www.bls.gov/schedule/news_release/2015_sched.htm
http://www.bls.gov/schedule/news_release/2014_sched.htm
...
http://www.bls.gov/schedule/news_release/2008_sched.htm

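My knowledge of HTML and XML is almost nonexistent. I tried to apply the same code with a for loop, but my efforts were in vain. Of course, I could repeat the 2015 code eight times to cover all the years; that would take neither much time nor much space. Still, I would very much like to know how to do this in a more efficient way. Thank you.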

【Comments】:

    Tags: html r for-loop web-scraping lapply


    【Solution 1】:

    In a loop, you would change the url string with a paste0 statement:

    for(i in 2008:2015){
    
      url <- paste0("http://www.bls.gov/schedule/news_release/", i, "_sched.htm")
      pg <- read_html(url)
    
      ## all your other code goes here.
    
    }
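Note that a loop of this shape overwrites its variables on every pass, so only the last year survives unless each result is stored somewhere. A minimal sketch of one way to keep everything, assuming a pre-allocated list named `results` (a name chosen here for illustration); the URL is stored as a stand-in for the scraped dates so the snippet runs without network access:

```r
years <- 2008:2015

# Pre-allocate one slot per year and label each slot with its year
results <- vector("list", length(years))
names(results) <- years

for (i in seq_along(years)) {
  url <- paste0("http://www.bls.gov/schedule/news_release/", years[i], "_sched.htm")

  # Placeholder: in real use, read_html(url) and the date extraction go here,
  # and the resulting Date vector is what gets stored
  results[[i]] <- url
}

results[["2015"]]
```

Indexing by `seq_along(years)` rather than by the year itself keeps the list positions at 1 to 8 while the names still carry the years.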
    

    Or use lapply to return a list of results.

    lst <- lapply(2008:2015, function(x){
      url <- paste0("http://www.bls.gov/schedule/news_release/", x, "_sched.htm")
    
      ## all your other code goes here.
      pg <- read_html(url)
    
      # target only the <td> elements under the bodytext div
      body <- html_nodes(pg, "div#bodytext")
    
      # we use this new set of nodes and a relative XPath to get the initial <td> elements, then get their siblings
      es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")
    
      # clean up the cruft and make our dates!
      nfpdates <- as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")
      return(nfpdates)
    })
    

    which returns

     lst
    [[1]]
     [1] "2008-01-04" "2008-02-01" "2008-03-07" "2008-04-04" "2008-05-02" "2008-06-06" "2008-07-03" "2008-08-01" "2008-09-05"
    [10] "2008-10-03" "2008-11-07" "2008-12-05"
    
    [[2]]
     [1] "2009-01-09" "2009-02-06" "2009-03-06" "2009-04-03" "2009-05-08" "2009-06-05" "2009-07-02" "2009-08-07" "2009-09-04"
    [10] "2009-10-02" "2009-11-06" "2009-12-04"
    
    ## etc...
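Once the results are in a list, it is often convenient to label each element with its year or to flatten everything into a single Date vector. A small sketch of both, using only the first two dates of 2008 and 2009 from the output above as stand-in data; the names `all_dates` and `df` are illustrative and not part of the original answer:

```r
# Stand-in for the first two elements of lst, truncated to two dates each
lst <- list(
  as.Date(c("2008-01-04", "2008-02-01")),
  as.Date(c("2009-01-09", "2009-02-06"))
)

# Label each element with its year so it can be accessed by name
names(lst) <- c("2008", "2009")
lst[["2009"]]

# Flatten into one Date vector; unname() avoids auto-generated element names
all_dates <- do.call(c, unname(lst))

# Or keep the year alongside each date in a data frame
df <- data.frame(
  year = rep(names(lst), lengths(lst)),
  date = all_dates
)
```

With the full result, `names(lst) <- 2008:2015` would label all eight elements in the same way.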
    

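    【Comments】:

    • Thanks Symbolix! Both methods work fine, with lapply being far quicker. However, neither of them records the dates for all 8 years in one variable (or in 8 different variables). In both cases, nfpdates stores only the last year (i.e. 2015). How could this be achieved?
    • @Gracos lapply returns a list (of length 8). If you assign the result of lapply to a variable, you can access all the returned results. See my update.
    【Solution 2】:

    This can be done with sprintf (without a loop), because sprintf is vectorised over its arguments: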

    url <- sprintf("http://www.bls.gov/schedule/news_release/%d_sched.htm", 2008:2015)
    url
    #[1] "http://www.bls.gov/schedule/news_release/2008_sched.htm" "http://www.bls.gov/schedule/news_release/2009_sched.htm"
    #[3] "http://www.bls.gov/schedule/news_release/2010_sched.htm" "http://www.bls.gov/schedule/news_release/2011_sched.htm"
    #[5] "http://www.bls.gov/schedule/news_release/2012_sched.htm" "http://www.bls.gov/schedule/news_release/2013_sched.htm"
    #[7] "http://www.bls.gov/schedule/news_release/2014_sched.htm" "http://www.bls.gov/schedule/news_release/2015_sched.htm"
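As a quick check, the vectorised sprintf call and the paste0 approach from the other answer build exactly the same character vector. The variable names `url_sprintf` and `url_paste0` below are illustrative only:

```r
years <- 2008:2015

# Template-based: %d is filled in once per element of years
url_sprintf <- sprintf("http://www.bls.gov/schedule/news_release/%d_sched.htm", years)

# Concatenation-based: paste0 is also vectorised over years
url_paste0 <- paste0("http://www.bls.gov/schedule/news_release/", years, "_sched.htm")

identical(url_sprintf, url_paste0)
```

The choice between the two is therefore mostly a matter of readability: sprintf keeps the URL as one template string, while paste0 splits it into pieces.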
    

    If we need to read the links

    library(rvest)
    lst <-  lapply(url, function(x) {
    
       pg <- read_html(x)
       body <- html_nodes(pg, "div#bodytext")
       es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")
    
       nfpdates <- as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")
       nfpdates
      })
    
    head(lst, 3)
    #[[1]]
    # [1] "2008-01-04" "2008-02-01" "2008-03-07" "2008-04-04" "2008-05-02" "2008-06-06" "2008-07-03" "2008-08-01"
    # [9] "2008-09-05" "2008-10-03" "2008-11-07" "2008-12-05"
    
    #[[2]]
    # [1] "2009-01-09" "2009-02-06" "2009-03-06" "2009-04-03" "2009-05-08" "2009-06-05" "2009-07-02" "2009-08-07"
    # [9] "2009-09-04" "2009-10-02" "2009-11-06" "2009-12-04"
    
    #[[3]]
    # [1] "2010-01-08" "2010-02-05" "2010-03-05" "2010-04-02" "2010-05-07" "2010-06-04" "2010-07-02" "2010-08-06"
    # [9] "2010-09-03" "2010-10-08" "2010-11-05" "2010-12-03"
    

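    【Comments】:

    • Thanks a lot akrun, much appreciated. Your answer is nearly the same as Symbolix's, so I will accept his. I cannot vote for the next 20 hours.
    • @Gracos Yes, it is either paste or sprintf. Otherwise, you had already done almost all of the groundwork.
    • @akrun Off the top of your head, do you know of any benefit to using sprintf versus paste0?
    • @Symbolix I don't think there is any difference in speed, but with sprintf you work on a single string, whereas with paste0 we paste multiple substrings together.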