【问题标题】:Unable to scrape website with form using rvest无法使用 rvest 抓取带有表单的网站
【发布时间】:2021-05-22 18:37:31
【问题描述】:

我正在尝试抓取下面列出的以下网站。我尝试通过使用rvest 和下面的代码来做到这一点。

我的尝试是尝试复制我在 Google Chrome 中为下载按钮找到的 PUT。我不确定我做错了什么。我收到了reprex 中列出的错误。

  library(httr)
  library(rvest)
  library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

  
  
  url <- "https://nfc.shgn.com/adp/baseball"
  pgsession <- session(url)
  
  pgform <- html_form(pgsession)[[2]]

  filled_form <- html_form_set(pgform,
                            team_id = "0", from_date = "2020-10-01", to_date = "2021-02-19", num_teams = "0",
                            draft_type = "0", sport = "baseball", position = "",
                            league_teams = "0" )
#> Warning: Setting value of hidden field 'team_id'.
#> Warning: Setting value of hidden field 'from_date'.
#> Warning: Setting value of hidden field 'to_date'.
#> Warning: Setting value of hidden field 'num_teams'.
#> Warning: Setting value of hidden field 'draft_type'.
#> Warning: Setting value of hidden field 'sport'.
#> Warning: Setting value of hidden field 'position'.
#> Warning: Setting value of hidden field 'league_teams'.
  
  session_submit(x = pgsession, form = filled_form)
#> Error: `form` doesn't contain a `action` attribute

【问题讨论】:

    标签: r rvest


    【解决方案1】:

    如果您只想抓取该表,您可以使用“打印”按钮将您带到的 URL 使用 rvestpurrr 轻松完成。

    虽然您不能使用html_table,但使用purrr::map_df 将单元格提取为数据框很简单:

    library(rvest)
    library(dplyr)
    library(purrr)
    library(stringr)
    
    pgtab <- read_html("https://nfc.shgn.com/adp.data.php") %>%  #destination of Print button
      html_nodes("tr") %>%                 #returns a list of row nodes
      map_df(~html_nodes(., "td") %>%      #returns a list of cell nodes for each row
               html_text() %>%             #extract text
               str_trim() %>%              #remove whitespace
               set_names("Rank","Player","Team","Position","ADP","MinPick",
                         "MaxPick","Diff","Picks","Team2","PickBid"))
    
    head(pgtab)
    
    # A tibble: 6 x 11
      Rank  Player             Team  Position ADP   MinPick MaxPick Diff  Picks Team2 PickBid
      <chr> <chr>              <chr> <chr>    <chr> <chr>   <chr>   <chr> <chr> <chr> <chr>  
    1 1     Ronald Acuna Jr.   ATL   OF       1.69  1       6       ""    332   ""    ""     
    2 2     Fernando Tatis Jr. SD    SS       2.57  1       7       ""    332   ""    ""     
    3 3     Mookie Betts       LAD   OF       3.53  1       9       ""    332   ""    ""     
    4 4     Juan Soto          WAS   OF       3.98  1       10      ""    332   ""    ""     
    5 5     Mike Trout         LAA   OF       6.08  1       11      ""    332   ""    ""     
    6 6     Gerrit Cole        NYY   P        6.50  1       15      ""    332   ""    ""     
    

    您也可以设置表单参数并执行此操作,但您必须检查它是否有所作为。这是一种方法...

    url <- "https://nfc.shgn.com/adp/baseball"
    pgsession <- html_session(url)
    
    pgform <- html_form(pgsession)[[2]]
    
    filled_form <-set_values(pgform,
                             team_id = "0", from_date = "2020-10-01", to_date = "2021-02-19", num_teams = "0",
                             draft_type = "0", sport = "baseball", position = "",
                             league_teams = "0" )
    
    filled_form$url <- "https://nfc.shgn.com/adp.data.php" #error if this is left blank
    
    pgsession <- submit_form(pgsession, filled_form, submit = "printerFriendly")
    
    pgtab <- pgsession %>% read_html() %>% #code as per previous answer above
      html_nodes("tr") %>% 
      map_df(~html_nodes(., "td") %>% 
               html_text() %>% 
               str_trim() %>% 
               set_names("Rank","Player","Team","Position","ADP","MinPick",
                         "MaxPick","Diff","Picks","Team2","PickBid"))
    

    【讨论】:

    • 好地方,尤其是因为 javascript 不运行才有效!
    • 谢谢!这行得通!几个问题,允许从我列出的表单中选择查询字段的附加选项是什么样的?为什么你对 td 节点使用 map_df 而不是 tr 节点?我将如何为每一行提取数据播放器数据?
    • @Jazzmatazz 我会看看表单字段是否可以更改,虽然我认为它们只是过滤器,所以可能只是给你整个表格的一个子集。我认为要获得详细的玩家数据,您需要使用 rselenium。 html_nodes 生成一个列表,为了保留列表结构,有必要使用 map 遍历它并为每一行生成一个嵌套的单元格列表。 map_df 将嵌套列表强制转换为数据框(只要您设置了一些名称)。这只是将其放入数据框的最简洁方法。
    • @Jazzmatazz 我在上面的答案中添加了一个设置表单值的示例。
    • 在运行上面的 pgsession 代码后,我仍然收到错误“错误:form 不包含action 属性”。
    【解决方案2】:

    这是一个可能的解决方案,使用rSelenium 将 tsv.file 下载到给定文件夹。 在那之后,简单的peasy...

    library( RSelenium )
    library( rvest )
    library( xml2 )
    library( data.table )
    
    #setup download file + location
    filename <- "ADP.tsv"
    download_location <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
    
    #create extra cpabilities, so the browser(firefox) does not display an save-as dialog 
    # when downloading the tsv file
    eCaps <- makeFirefoxProfile( list( "browser.download.dir" = download_location,
                                       "browser.download.folderList" = 2, 
                                       "browser.helperApps.neverAsk.saveToDisk" = "text/tab-separated-values",
                                       "browser.download.manager.showWhenStarting" = FALSE ) )
    #setup driver (using the firefox profile created before), client and server
    driver <- rsDriver( browser = "firefox", port = 4545L, extraCapabilities = eCaps, verbose = FALSE )
    server <- driver$server
    browser <- driver$client
    
    #goto url in browser
    browser$navigate( "https://nfc.shgn.com/adp/baseball" )
    #get 
    button_dl <- list()
    #while no buttons found (site not loaded), try to load the download-button
    while ( length( button_dl ) == 0 ) {
      button_dl <- browser$findElements(using = "name", "download" )
    }
    #now click the button and wait for the file to show up in the download_location
    button_dl[[1]]$clickElement()
    #wait for download to complete
    Sys.sleep(5)
    #check if file is loaded
    if ( file.exists( paste( download_location, filename, sep = "/" ) ) ) {
      #load the file
      DT <- data.table::fread( paste( download_location, filename, sep = "/" ) )
    }
    #close everything down properly
    browser$close()
    server$stop()
    
    head(DT)
    #    Rank              Player Team Position(s)  ADP Min Pick Max Pick Difference # Picks Team Team Pick
    # 1:    1   Acuna Jr., Ronald  ATL          OF 1.68        1        6         NA     323   NA        NA
    # 2:    2 Tatis Jr., Fernando   SD          SS 2.58        1        7         NA     323   NA        NA
    # 3:    3       Betts, Mookie  LAD          OF 3.50        1        9         NA     323   NA        NA
    # 4:    4          Soto, Juan  WAS          OF 3.98        1       10         NA     323   NA        NA
    # 5:    5         Trout, Mike  LAA          OF 6.06        1       11         NA     323   NA        NA
    # 6:    6        Cole, Gerrit  NYY           P 6.52        1       15         NA     323   NA        NA
    

    【讨论】:

    • 我以前从未使用过 RSelenium。你能给我指出一个可以帮助我设置 Firefox 配置文件的资源吗?运行 rsDriver 时出现错误。
    • 我建议先遵循一些基本的硒教程...
    猜你喜欢
    • 1970-01-01
    • 2017-09-27
    • 1970-01-01
    • 1970-01-01
    • 2016-06-14
    • 2017-10-02
    • 1970-01-01
    • 1970-01-01
    • 2020-01-30
    相关资源
    最近更新 更多