【Question title】: How to loop to reach each class link and extract out the attribute capacity seats in R
【Posted】: 2023-11-27 22:13:01
【Question】:

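I want to extract the capacity (seats) attribute of every class listed at this link. The actual link is https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec

If the posted link does not work, please follow these steps: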

In this link `https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_disp_dyn_sched` 
Select by term -> Spring Term 2021 (view only) -> Submit
Subject -> select ARCH Architecture -> scroll down and click Class Search

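For example:

For the subject ARCH, the classes look like this:

The image above shows only a few of the ARCH classes; there are many more. If you click on a class, you will see the capacity attribute, which gives the number of seats.

I would like the output to look like this: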

classes                                                          capacity - seats
Fundamentals of Design Studio - 23839 - ARCH 1111 - 002             15
Design Visualization - 11107 - ARCH 1113 - 001                      15
Building Technology 2 - 23840 - ARCH 2412 - 001                     20

How can I write a loop in R that gets the capacity (seats) attribute of every class for each subject?

P.S. This question is a continuation of my earlier post: https://*.com/questions/64515601/problem-with-web-scraping-of-required-content-from-a-url-link-in-r

【Comments】:

    Tags: html r url web-scraping rvest


    【Solution 1】:

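    This solution is very similar to the previous one.
    It is more straightforward because the link to the class size sits in the same node as the class title. Depending on what information you need from the class size table, you may want to clean it up before merging it with the rest of the data.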
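As a rough illustration of that cleanup step, the sketch below reduces a seating table to the single capacity value. The HTML is a made-up stand-in for one class detail page (the numbers are invented), and it assumes the table has a blank first header cell, a `Capacity` column, and a row labelled `Seats`; the real page layout may differ.

```r
library(rvest)

# Made-up stand-in for one class detail page (values are illustrative only).
page <- read_html('
<table class="datadisplaytable"
       summary="This layout table is used to present the seating numbers.">
  <tr><td></td><th>Capacity</th><th>Actual</th><th>Remaining</th></tr>
  <tr><th>Seats</th><td>15</td><td>9</td><td>6</td></tr>
  <tr><th>Waitlist Seats</th><td>0</td><td>0</td><td>0</td></tr>
</table>')

# Parse the table, using its first row as the column names.
seating <- page %>%
  html_node("table.datadisplaytable") %>%
  html_table(header = TRUE)

# Keep only the Capacity value from the row labelled "Seats".
# The first column is addressed by position because its header is blank.
capacity <- seating[["Capacity"]][seating[[1]] == "Seats"]
capacity
```

Selecting by column name and row label, rather than by fixed positions, should keep the extraction working if the site adds or reorders rows.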

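    Also, since you will be querying many pages on the website, add a short pause between requests to be polite and to avoid looking like an attacker.
    Note that there is no error checking to ensure that the correct table is available; I recommend adding some before using this as production code.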

    #https://*.com/questions/64515601/problem-with-web-scraping-of-required-content-from-a-url-link-in-r/64517844#64517844
    library(rvest)
    library(dplyr)
    
    # In this link `https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_disp_dyn_sched` 
    # Select by term -> Spring Term 2021 (view only) -> Submit
    # Subject -> select ARCH Architecture -> scroll down and click Class Search
    
    url   <- "https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec"
    query <- list(term_in = "202110", sel_subj = "dummy", sel_day = "dummy",
                  sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
                  sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
                  sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = "ARCH",
                  sel_crse = "",      sel_title = "",     sel_insm = "%",
                  sel_from_cred = "", sel_to_cred = "",   sel_camp = "%",
                  sel_levl = "%",     sel_ptrm = "%",     sel_instr = "%",
                  sel_attr = "%",     begin_hh =  "0",    begin_mi = "0",
                  begin_ap = "a",     end_hh = "0",       end_mi = "0",
                  end_ap = "a")
    
    html <- read_html(httr::POST(url, body = query))
    classes <- html %>% html_nodes("th.ddtitle") 
    
    dfs<-lapply(classes, function(class) {
       #get class name
       classname <-class %>% html_text()
       print(classname)
       # Pause so the loop does not look like a denial-of-service attack
       Sys.sleep(0.5)
       classlink <- class %>% html_node("a") %>% html_attr("href")
       fulllink <- paste0("https://ssb.bannerprod.memphis.edu", classlink)
       
       newpage <-read_html(fulllink)
       #find the tables 
       tables <- newpage %>% html_nodes("table.datadisplaytable") 
       #find the index to the correct table 
       seatingtable <- which(html_attr(tables, "summary") == "This layout table is used to present the seating numbers.")
       size <-tables[seatingtable] %>% html_table(header=TRUE)
       #may want to clean up table before combining in dataframe
       # i.e  size[[1]][1, -1]
       data.frame(class=classname, size[[1]], link=fulllink)
    })
    
    answer <- bind_rows(dfs)
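
The code above assumes that every class page contains the seating table. One possible way to add the missing error checking is sketched below. `get_capacity()` is a hypothetical helper, not part of the original answer, and the two test pages are made-up HTML; the sketch assumes the seating table has a `Capacity` column and a row labelled `Seats`.

```r
library(rvest)

# Hypothetical helper: return the seat capacity from a parsed class page,
# or NA if the seating table (or the expected row/column) is missing.
get_capacity <- function(page) {
  tables <- html_nodes(page, "table.datadisplaytable")
  # %in% treats a missing summary attribute (NA) as "no match"
  is_seating <- html_attr(tables, "summary") %in%
    "This layout table is used to present the seating numbers."
  if (!any(is_seating)) return(NA_integer_)

  seating <- html_table(tables[is_seating][[1]], header = TRUE)
  value <- seating[["Capacity"]][seating[[1]] == "Seats"]
  if (length(value) == 1) as.integer(value) else NA_integer_
}

# Made-up pages: one with a seating table, one without.
good <- read_html('
<table class="datadisplaytable"
       summary="This layout table is used to present the seating numbers.">
  <tr><td></td><th>Capacity</th><th>Actual</th><th>Remaining</th></tr>
  <tr><th>Seats</th><td>20</td><td>12</td><td>8</td></tr>
</table>')
bad <- read_html("<p>No seating information</p>")

get_capacity(good)
get_capacity(bad)
```

Inside the `lapply()` loop, the download itself could be guarded in the same spirit, for example with `page <- tryCatch(read_html(fulllink), error = function(e) NULL)`, recording `NA` when `page` is `NULL` so that one failed request does not stop the whole run.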
    

    【Comments】:

    • Thank you very much. This was very helpful.