【Question title】:rvest html_nodes() returns empty character
【Posted】:2020-10-19 22:19:39
【Question】:

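I am trying to scrape a website (https://genelab-data.ndc.nasa.gov/genelab/projects?page=1&paginate_by=281). Specifically, I am trying to scrape all 281 release dates (the first is "30-Oct-2006").

To do this, I used the R package rvest and the SelectorGadget Chrome extension. I am on macOS 10.15.6.

I tried the following code: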

library(rvest)
library(httr)
library(xml2)
library(dplyr)

link = "https://genelab-data.ndc.nasa.gov/genelab/projects?page=1&paginate_by=281"
page = read_html(link)    
year = page %>% html_nodes("td:nth-child(4) ul") %>% html_text()

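However, this returns `character(0)`.

I used the selector `td:nth-child(4) ul` because that is what SelectorGadget produced when I highlighted the 281 release dates. I also tried "View Page Source", but could not find these dates anywhere in the page source.

I understand that rvest does not always work, depending on the type of website. What would be a possible workaround in this case? Thank you.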

【Comments】:

    Tags: html r macos web-scraping rvest


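    【Solution 1】:

    This site loads its data from an API call to https://genelab-data.ndc.nasa.gov/genelab/data/study/all, which returns JSON. That is why the dates do not appear in the page source and read_html() finds nothing to select. You can fetch the data with httr and parse the JSON: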

    library(httr)
    
    url <- "https://genelab-data.ndc.nasa.gov/genelab/data/study/all"
    
    output <- content(GET(url), as = "parsed", type = "application/json")
    
    # sort the studies by glds_id (ascending)
    output <- output[order(sapply(output, `[[`, i = "glds_id"))]
    
    # build one row per study, highest glds_id first
    # (the loop variable is named `study` to avoid shadowing base::t)
    result <- list()
    index <- 1
    for (study in rev(output)) {
        result[[index]] <- study$metadata
        result[[index]]$accession <- study$accession
        result[[index]]$legacy_accession <- study$legacy_accession
        index <- index + 1
    }
    
    df <- do.call(rbind, result)
    options(width = 1200)
    print(df)
    

    Sample output (not all columns shown)

           accession legacy_accession public_release_date title                                                                                                            
      [1,] "GLDS329" "GLDS-329"       "30-Oct-2006"       "Transcription profiling of atm mutant, adm mutant and wild type whole plants and roots of Arabidops" [truncated]
      [2,] "GLDS322" "GLDS-322"       "27-Aug-2020"       "Comparative RNA-Seq transcriptome analyses reveal dynamic time dependent effects of 56Fe, 16O, and " [truncated]
      [3,] "GLDS320" "GLDS-320"       "18-Sep-2014"       "Gamma radiation and HZE treatment of seedlings in Arabidopsis"                                                  
      [4,] "GLDS319" "GLDS-319"       "18-Jul-2018"       "Muscle atrophy, osteoporosis prevention in hibernating mammals"                                                 
      [5,] "GLDS318" "GLDS-318"       "01-Dec-2019"       "RNA seq of tumors derived from irradiated versus sham hosts transplanted with Trp53 null mammary ti" [truncated]
      [6,] "GLDS317" "GLDS-317"       "19-Dec-2017"       "Galactic cosmic radiation induces stable epigenome alterations relevant to human lung cancer"                   
      [7,] "GLDS311" "GLDS-311"       "31-Jul-2020"       "Part two: ISS Enterobacteriales"                                                                                
      [8,] "GLDS309" "GLDS-309"       "12-Aug-2020"       "Comparative Genomic Analysis of Klebsiella Exposed to Various Space Conditions at the International" [truncated]
      [9,] "GLDS308" "GLDS-308"       "07-Aug-2020"       "Differential expression profiles of long non-coding RNAs during the mouse pronucleus stage under no" [truncated]
     [10,] "GLDS305" "GLDS-305"       "27-Aug-2020"       "Transcriptomic responses of Serratia liquefaciens cells grown under simulated Martian conditions of" [truncated]
     [11,] "GLDS304" "GLDS-304"       "28-Aug-2020"       "Global gene expression in response to X rays in mice deficient in Parp1"                                        
     [12,] "GLDS303" "GLDS-303"       "15-Jun-2020"       "ISS Bacillus Genomes"                                                                                           
     [13,] "GLDS302" "GLDS-302"       "31-May-2020"       "ISS Enterobacteriales Genomes"                                                                                  
     [14,] "GLDS301" "GLDS-301"       "30-Apr-2020"       "Eruca sativa Rocket Science RNA-seq"                                                                            
     [15,] "GLDS298" "GLDS-298"       "09-May-2020"       "Draft Genome Sequences of Sphingomonas sp. Isolated from the International Space Station Genome seq" [truncated]
     ...........................................................................
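
Since the question asks specifically for the release dates, here is a minimal sketch of pulling them out of the parsed response and turning them into `Date` values. It assumes each study carries a `metadata$public_release_date` string in the `DD-Mon-YYYY` form seen in the sample output. The `output` list below is a small hand-made stand-in for the real API response, and `parse_release_date` is a hypothetical helper, not part of httr or rvest.

```r
# Stand-in for the parsed API response (assumption: same nesting as the real one).
# The two dates are copied from the sample output above.
output <- list(
  list(accession = "GLDS329", metadata = list(public_release_date = "30-Oct-2006")),
  list(accession = "GLDS322", metadata = list(public_release_date = "27-Aug-2020"))
)

# Pull the raw date string out of every study.
release_raw <- vapply(
  output,
  function(study) study$metadata$public_release_date,
  character(1)
)

# Hypothetical helper: convert "30-Oct-2006" style strings to Date.
# Month names are matched against the built-in month.abb constant instead of
# using "%b", so the result does not depend on the session locale.
parse_release_date <- function(x) {
  parts <- do.call(rbind, strsplit(x, "-", fixed = TRUE))
  month <- match(parts[, 2], month.abb)
  as.Date(sprintf("%s-%02d-%s", parts[, 3], month, parts[, 1]))
}

release_date <- parse_release_date(release_raw)
release_year <- as.integer(format(release_date, "%Y"))
```

With the real response, the same `vapply()` call would run over the full `output` list from the answer, provided every study actually has a `public_release_date` entry. If some studies lack it, `vapply()` will stop with an error, which is usually preferable to silently misaligned rows.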
    

    【Comments】:
