【Question title】: Webscraping in R https & webform
【Posted】: 2018-04-12 23:39:39
【Question description】:

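I am trying to scrape data from this website: https://collegereadiness.collegeboard.org/k-12-school-code-search for each US state. I am fairly new to web scraping, but I know it is possible to scrape data from https pages.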

Here is what I tried:

library(httr)

url <- "https://collegereadiness.collegeboard.org/k-12-school-code-search"

AL <- list(
    submit = "submit",
    state  = "Alabama"
)

Neither "Alabama" nor "AL" works.

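I would like to see whether I can get a data frame for each state. Unfortunately, this website does not have a separate page for each state.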

【Question comments】:

    Tags: r web-scraping


    【Solution 1】:

    On Windows, I can fill in the form as follows:

    library(RDCOMClient)

    # Start Internet Explorer through COM and open the search page
    IEApp <- COMCreate("InternetExplorer.Application")
    IEApp[['Visible']] <- TRUE
    IEApp$Navigate("https://collegereadiness.collegeboard.org/k-12-school-code-search")
    Sys.sleep(2)  # crude wait for the page to load
    
    # Fill in the state field of the form
    web_Obj <- IEApp$Document()$getElementById("edit-state")
    web_Obj$Click()
    web_Obj$Focus()
    web_Obj[["Value"]] <- "AZ"
    Sys.sleep(2)
    
    # Build a click event and dispatch it on the submit button
    doc <- IEApp$Document()
    clickEvent <- doc$createEvent("MouseEvent")
    clickEvent$initEvent("click", TRUE, FALSE)
    obj <- IEApp$Document()$getElementById("edit-submit")
    obj$dispatchEvent(clickEvent)
    Sys.sleep(2)  # crude wait for the results to appear
    
    # Read the visible text of the page
    text <- doc$documentElement()$innerText()
    
    # Now, we have to clean the text, but the values are there ...
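
    The cleaning step mentioned in the last comment is not shown in the answer. A minimal sketch of a first pass could look like the following. The helper name `clean_inner_text` and the sample string are hypothetical, and the real layout of the page text is an assumption, so further parsing into a data frame would depend on what the site actually returns.

```r
# Hypothetical helper (not part of the original answer): turn the page's
# innerText into a character vector of trimmed, non-empty lines.
clean_inner_text <- function(txt) {
  # Split on "\n" or "\r\n"; the line-break style of the real page text
  # is an assumption, so both are handled.
  lines <- unlist(strsplit(txt, "\r?\n"))
  # Collapse runs of spaces/tabs into one space, then trim both ends.
  lines <- trimws(gsub("[ \t]+", " ", lines))
  # Drop lines that are empty after trimming.
  lines[nzchar(lines)]
}

# Made-up sample text, NOT real output from the site.
sample_text <- "School Name\r\n\r\n  EXAMPLE HIGH SCHOOL  \r\nCode:\t000000\r\n"
clean_inner_text(sample_text)
```

    With the `text` variable from the code above, the call would be `clean_inner_text(text)`.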
    

    In addition, here is an example of how to access the other elements of the form:

    # School name field
    web_Obj <- IEApp$Document()$getElementById("edit-school-name")
    web_Obj$Click()
    web_Obj$Focus()
    web_Obj[["Value"]] <- "..."
    
    # Country field
    web_Obj <- IEApp$Document()$getElementById("edit-country")
    web_Obj$Click()
    web_Obj$Focus()
    web_Obj[["Value"]] <- "..."
    
    # City field
    web_Obj <- IEApp$Document()$getElementById("edit-city")
    web_Obj$Click()
    web_Obj$Focus()
    web_Obj[["Value"]] <- "..."
    
    # ZIP code field
    web_Obj <- IEApp$Document()$getElementById("edit-zip")
    web_Obj$Click()
    web_Obj$Focus()
    web_Obj[["Value"]] <- "..."
    
    # Proximity field
    web_Obj <- IEApp$Document()$getElementById("edit-proximity")
    web_Obj$Click()
    web_Obj$Focus()
    web_Obj[["Value"]] <- "..."
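
    Since the question asks for results for every state, the repeated Click / Focus / Value steps could be wrapped in small helpers and called once per state. The sketch below is only an illustration: `set_field` and `scrape_state` are hypothetical names, the 2-second waits are carried over from the code above, and it is an assumption that the state field accepts every two-letter abbreviation the way it accepts "AZ".

```r
# Hypothetical helper: fill one form field by element id.
# `doc` is anything exposing getElementById(), e.g. IEApp$Document().
set_field <- function(doc, id, value) {
  el <- doc$getElementById(id)
  el$Click()
  el$Focus()
  el[["Value"]] <- value
  invisible(el)
}

# Hypothetical helper: load the form, pick a state, submit, return page text.
# `ie` is the Internet Explorer COM object; `wait` is a crude pause in seconds.
scrape_state <- function(ie, url, state, wait = 2) {
  ie$Navigate(url)          # reload the form so each state starts clean
  Sys.sleep(wait)
  doc <- ie$Document()
  set_field(doc, "edit-state", state)
  Sys.sleep(wait)
  # Same click-event approach as in the answer's code
  click <- doc$createEvent("MouseEvent")
  click$initEvent("click", TRUE, FALSE)
  doc$getElementById("edit-submit")$dispatchEvent(click)
  Sys.sleep(wait)
  ie$Document()$documentElement()$innerText()
}

# Usage sketch (Windows with Internet Explorer only, so it is left commented):
# library(RDCOMClient)
# ie  <- COMCreate("InternetExplorer.Application")
# url <- "https://collegereadiness.collegeboard.org/k-12-school-code-search"
# pages <- lapply(setNames(state.abb, state.abb),
#                 function(s) scrape_state(ie, url, s))
```

    `state.abb` is the built-in vector of the 50 US state abbreviations, so `pages` would be a named list with one block of raw page text per state, each still needing to be cleaned.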
    

    【Comments】:
