[Question Title]: Web scraping the required content from a url link in R
[Posted]: 2021-02-06 22:58:11
[Question]:

我对网络抓取非常陌生,并试图从链接中抓取所需的内容。

This is the actual URL of the screenshot above: https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec

I would like the output to look like this:

Sections Found                                    Instructors              email id
Academic Strategies - 10582 - ACAD 1100 - 001    Beverly McPhail  
Academic Strategies - 10586 - ACAD 1100 - 002    Emily K Mann      
Academic Strategies - 10590 - ACAD 1100 - 005    Christopher D Bourque    

I see that the email ids are not visible; I can only see icons. I found the rvest package in R and started as shown below, but I get an error:

library(rvest)
url <- read_html("https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec")

Error in open.connection(x, "rb") : HTTP error 500.

To get to the data in the image:

In this link `https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_disp_dyn_sched` 
Select by term -> Spring Term 2021 (view only) -> Submit
Subject -> select ACAD Academics -> scroll down and click Class Search

This takes you to the link https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec

May I know how to do this type of scraping in R? Thanks.

[Question Comments]:

Tags: html, r, xml, web-scraping, rvest


[Solution 1]:

This one is tricky. The page is only served after the server receives a POST request with the appropriate form body, so it is not a simple case of sending an ordinary GET request to the url, which is what read_html does. You need to build the POST request "by hand" to get the page you want.

    library(rvest)
    #> Loading required package: xml2
    
    url   <- "https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec"
    
    query <- list(term_in = "202110", sel_subj = "dummy", sel_day = "dummy",
                  sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
                  sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
                  sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = "ACAD",
                  sel_crse = "",      sel_title = "",     sel_insm = "%",
                  sel_from_cred = "", sel_to_cred = "",   sel_camp = "%",
                  sel_levl = "%",     sel_ptrm = "%",     sel_instr = "%",
                  sel_attr = "%",     begin_hh =  "0",    begin_mi = "0",
                  begin_ap = "a",     end_hh = "0",       end_mi = "0",
                  end_ap = "a")
    
    html <- read_html(httr::POST(url, body = query))
    
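An aside worth noting about the query above (not part of the original answer): R lists may contain repeated names, and httr::POST encodes each list element as its own form field, which is exactly what this Banner form expects — for example, sel_subj is sent first as "dummy" and then again with the real subject code. A minimal sketch of that duplicate-name behaviour:

```r
# R lists keep duplicate names; each element becomes a separate form field
q <- list(sel_subj = "dummy", sel_subj = "ACAD")

length(q)   # 2 -- both entries are retained
names(q)    # "sel_subj" "sel_subj"
```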

Once you have the html, you can use xpath to get the nodes you want to scrape:

    classes <- html %>% html_nodes(xpath = "//th/a") %>% html_text()
    
    instructor_nodes <- html %>% 
      html_nodes(xpath = "//td[@class='dddefault']/a[contains(@href, 'mailto')]")
      
    instructors <- html_attr(instructor_nodes, "target") 
    
    emails <- html_attr(instructor_nodes, "href") 
    
    df <- data.frame(classes, instructors, emails)
    
    df
    #>                                         classes            instructors
    #> 1 Academic Strategies - 10582 - ACAD 1100 - 001        Beverly McPhail
    #> 2 Academic Strategies - 10586 - ACAD 1100 - 002          Emily K. Mann
    #> 3 Academic Strategies - 10590 - ACAD 1100 - 005 Christopher D. Bourque
    #>                        emails
    #> 1 mailto:blahblah@memphis.edu
    #> 2   mailto:blahbl@memphis.edu
    #> 3 mailto:blahblah@memphis.edu
    

Note that I have obviously obscured the email addresses of the individuals concerned rather than publishing them on a public web page without their consent.
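To get the plain "email id" column the question asks for, rather than the raw mailto: hrefs, you can strip the scheme with sub(). A small sketch using placeholder addresses (standing in for the real ones, which are withheld above):

```r
# hypothetical placeholder hrefs in place of the obscured real addresses
emails <- c("mailto:blahblah@memphis.edu", "mailto:blahbl@memphis.edu")

# drop the leading "mailto:" scheme, leaving just the email id
email_id <- sub("^mailto:", "", emails)
email_id
# "blahblah@memphis.edu" "blahbl@memphis.edu"
```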

[Comments]:
