【Question title】: Scrape journal article title from staff web-page
【Posted】: 2021-11-04 08:42:29
【Question description】:

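I would like to scrape the titles and authors of journal articles from the official web pages of all staff members. For example: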

https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah

The specific part I am trying to access is the list of journal articles on that page.

I am following this guide: https://www.datacamp.com/community/tutorials/r-web-scraping-rvest, but it refers to HTML tags that this website does not have. Could you please point me in the right direction?

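【Question discussion】:

  • Oh, this is interesting! To see the publications, you have to click the "Journal articles" tab at the bottom.
  • @stevec There is a publications database called "symplectic" that Leeds uses to hold all the metadata about each staff member's articles, so I guess it has something to do with that.
  • @stevec Yes, it can be done with direct HTTP requests (see below). I don't think that is any smarter, though, and it may be hard to generalize.
  • @AllanCameron Nice work finding those requests! I couldn't spot them at all. Did you use the Chrome DevTools Network tab, or some other tool I'm not familiar with?
  • @stevec I use the Firefox developer panel to show all XHR requests. I have to do this kind of thing quite often, so I'm used to homing in on the right request.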

Tags: html r web-scraping


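【Solution 1】:

The page loads these citations dynamically using an XHR call that returns a JSON object. In this case, we can replicate the query and parse the JSON ourselves to get the list of publications: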

library(httr)
library(rvest)
library(jsonlite)

url <- paste0("https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?",
       "uniqueid=00970757",
       "&tries=0", 
       "&hash=f6a214dc99686895d6bf3de25507356f", 
       "&citationStyle=1")

GET(url) %>% 
  content("text") %>%
  fromJSON() %>%
  `[[`("publications") %>%
  `[[`("journal_article") %>%
  lapply(function(x) paste(x$authors, x$title, x$journal, sep = " ; ")) %>%
  unlist() %>%
  as.character()
#> [1] "Adu-Amankwah S, Zajac M, Skocek J, Nemecek J, Haha MB, Black L ; Combined influence of carbonation and leaching on freeze-thaw resistance of limestone ternary cement concrete ; Construction and Building Materials"                        
#> [2] "Wang H, Hou P, Li Q, Adu-Amankwah S, Chen H, Xie N, Zhao P, Huang Y, Wang S, Cheng X ; Synergistic effects of supplementary cementitious materials in limestone and calcined clay-replaced slag cement ; Construction and Building Materials"
#> [3] "Shamaki M, Adu-Amankwah S, Black L ; Reuse of UK alum water treatment sludge in cement-based materials ; Construction and Building Materials"                                                                                                
#> [4] "Adu-Amankwah S, Bernal Lopez S, Black L ; Influence of component fineness on hydration and strength development in ternary slag-limestone cements ; RILEM Technical Letters"                                                                 
#> [5] "Adu-Amankwah S, Zajac M, Skocek J, Ben Haha M, Black L ; Relationship between cement composition and the freeze-thaw resistance of concretes ; Advances in Cement Research"                                                                  
#> [6] "Zajac M, Skocek J, Adu-Amankwah S, Black L, Ben Haha M ; Impact of microstructure on the performance of composite cements: Why higher total porosity can result in higher strength ; Cement and Concrete Composites"                         
#> [7] "Adu-Amankwah S, Black L, Skocek J, Ben Haha M, Zajac M ; Effect of sulfate additions on hydration and performance of ternary slag-limestone composite cements ; Construction and Building Materials"                                         
#> [8] "Adu-Amankwah S, Zajac M, Stabler C, Lothenbach B, Black L ; Influence of limestone on the hydration of ternary slag cement ; Cement and Concrete Research"                                                                                   
#> [9] "Adu-Amankwah S, Khatib JM, Searle DE, Black L ; Effect of synthesis parameters on the performance of alkali-activated non-conformant EN 450 pulverised fuel ash ; Construction and Building Materials"
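
If separate columns are more convenient than one pasted string per article, the parsed list can be turned into a data frame. The following is only a sketch: the payload is a hand-written stand-in, and its shape (a named list of articles, each with `authors`, `title` and `journal` fields) is an assumption inferred from the fields used in the code above, not a documented format. The helper `field_or_na` is hypothetical.

```r
library(jsonlite)

# Hypothetical payload that mimics the assumed shape of the real response
mock_json <- '{
  "publications": {
    "journal_article": {
      "a1": {"authors": "Author A, Author B", "title": "First title", "journal": "Journal X"},
      "a2": {"authors": "Author C", "title": "Second title", "journal": "Journal Y"}
    }
  }
}'

articles <- fromJSON(mock_json)$publications$journal_article

# Return one field of an article as a single string, or NA if it is missing
field_or_na <- function(x, name) {
  value <- x[[name]]
  if (is.null(value) || length(value) == 0) NA_character_ else as.character(value)[1]
}

# One row per article, one column per field
articles_df <- data.frame(
  authors = vapply(articles, field_or_na, character(1), name = "authors"),
  title   = vapply(articles, field_or_na, character(1), name = "title"),
  journal = vapply(articles, field_or_na, character(1), name = "journal"),
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(articles_df)
```

With the real response, `articles` would come from the `GET(url) %>% content("text") %>% fromJSON()` pipeline instead of `mock_json`.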

Update

With a bit of text parsing, the JSON URL can be extracted from the HTML of the staff member's home page:

get_json_request <- function(url)
{
   carveout <- function(string, start, end)
   {
      string %>% strsplit(start) %>% `[[`(1) %>% `[`(2) %>%
                 strsplit(end)   %>% `[[`(1) %>% `[`(1)
   }
   
   params <- GET(url) %>% 
      content("text") %>% 
      carveout("var dataGetQuery = ", ";")
   
   id <- carveout(params, "uniqueid: '", "'")
   tries <- carveout(params, "tries: ", ",")
   hash <- carveout(params, "hash: '", "'")
   citationStyle <- carveout(params, "citationStyle: ", "\n")

   paste0("https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?",
          "uniqueid=", id,
          "&tries=", tries, 
          "&hash=", hash,
          "&citationStyle=", citationStyle)
}

This allows:

url <- "https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah"

get_json_request(url)
#> [1] "https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?uniqueid=00970757&tries=0&hash=f7266eb42b24715cfdf2851f24b229c6&citationStyle=1"
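
To see what the text carving does without touching the network, it can be run on a small stand-in for the page source. This is a sketch under assumptions: the script snippet below is made up to match the search strings used in the function above, the real page may be laid out differently, and the base URL is a placeholder. This version of `carveout` uses base R only and `fixed = TRUE`, so the delimiters are treated as literal text rather than regular expressions.

```r
# Hypothetical page fragment; the real markup is assumed, not copied
page_text <- paste(
  "<script>",
  "var dataGetQuery = {",
  "  uniqueid: 'ID123',",
  "  tries: 0,",
  "  hash: 'abc123',",
  "  citationStyle: 1",
  "};",
  "</script>",
  sep = "\n"
)

# Return the text between the first `start` and the next `end`
carveout <- function(string, start, end) {
  after_start <- strsplit(string, start, fixed = TRUE)[[1]][2]
  strsplit(after_start, end, fixed = TRUE)[[1]][1]
}

params <- carveout(page_text, "var dataGetQuery = ", ";")

id             <- carveout(params, "uniqueid: '", "'")
tries          <- carveout(params, "tries: ", ",")
hash           <- carveout(params, "hash: '", "'")
citation_style <- carveout(params, "citationStyle: ", "\n")

# Placeholder base URL, used only to show how the pieces are joined
request_url <- paste0("https://example.org/ajax.php?",
                      "uniqueid=", id,
                      "&tries=", tries,
                      "&hash=", hash,
                      "&citationStyle=", citation_style)
print(request_url)
```

If a delimiter is not found, `carveout` returns `NA`, which is a cheap signal that the page layout has changed.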

And, if you want to be able to simply lapply over a vector of home page URLs to get the final publication lists:

publications_from_homepage <- function(url)
{
   get_json_request(url) %>%
   GET() %>% 
     content("text") %>%
     fromJSON() %>%
     `[[`("publications") %>%
     `[[`("journal_article") %>%
     lapply(function(x) paste(x$authors, x$title, x$journal, sep = " ; ")) %>%
     unlist() %>%
     as.character()
}

So you have:

publications_from_homepage(url)
#> [1] "Adu-Amankwah S, Zajac M, Skocek J, Nemecek J, Haha MB, Black L ; Combined influence of carbonation and leaching on freeze-thaw resistance of limestone ternary cement concrete ; Construction and Building Materials"                        
#> [2] "Wang H, Hou P, Li Q, Adu-Amankwah S, Chen H, Xie N, Zhao P, Huang Y, Wang S, Cheng X ; Synergistic effects of supplementary cementitious materials in limestone and calcined clay-replaced slag cement ; Construction and Building Materials"
#> [3] "Shamaki M, Adu-Amankwah S, Black L ; Reuse of UK alum water treatment sludge in cement-based materials ; Construction and Building Materials"                                                                                                
#> [4] "Adu-Amankwah S, Bernal Lopez S, Black L ; Influence of component fineness on hydration and strength development in ternary slag-limestone cements ; RILEM Technical Letters"                                                                 
#> [5] "Adu-Amankwah S, Zajac M, Skocek J, Ben Haha M, Black L ; Relationship between cement composition and the freeze-thaw resistance of concretes ; Advances in Cement Research"                                                                  
#> [6] "Zajac M, Skocek J, Adu-Amankwah S, Black L, Ben Haha M ; Impact of microstructure on the performance of composite cements: Why higher total porosity can result in higher strength ; Cement and Concrete Composites"                         
#> [7] "Adu-Amankwah S, Black L, Skocek J, Ben Haha M, Zajac M ; Effect of sulfate additions on hydration and performance of ternary slag-limestone composite cements ; Construction and Building Materials"                                         
#> [8] "Adu-Amankwah S, Zajac M, Stabler C, Lothenbach B, Black L ; Influence of limestone on the hydration of ternary slag cement ; Cement and Concrete Research"                                                                                   
#> [9] "Adu-Amankwah S, Khatib JM, Searle DE, Black L ; Effect of synthesis parameters on the performance of alkali-activated non-conformant EN 450 pulverised fuel ash ; Construction and Building Materials"

Created on 2021-11-04 by the reprex package (v2.0.0)
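
When running this over many staff pages, one failing page should probably not abort the whole run. Below is a minimal sketch of a wrapper that pauses between requests and catches errors. The names `scrape_many` and `fake_scraper` are hypothetical, and the example URLs are placeholders so that the block runs offline.

```r
# Apply `scraper` to each URL, pausing between calls and skipping failures
scrape_many <- function(urls, scraper, pause = 1) {
  results <- lapply(urls, function(u) {
    Sys.sleep(pause)  # be polite to the server
    tryCatch(
      scraper(u),
      error = function(e) {
        message("Skipping ", u, ": ", conditionMessage(e))
        character(0)  # empty result for pages that could not be scraped
      }
    )
  })
  names(results) <- urls
  results
}

# Stand-in scraper so the sketch can be run without network access
fake_scraper <- function(u) {
  if (grepl("broken", u, fixed = TRUE)) stop("page layout not recognised")
  paste("Publication from", u)
}

demo_urls <- c("https://example.org/staff/1", "https://example.org/staff/broken")
demo <- scrape_many(demo_urls, fake_scraper, pause = 0)
str(demo)
```

In real use the call would be along the lines of `scrape_many(staff_urls, publications_from_homepage)`, where `staff_urls` is your own vector of home page URLs.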

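【Discussion】:

  • Thank you so much for such a quick reply, it works a treat! How did you find all the parts of the URL? I have a long list of different researchers (eps.leeds.ac.uk/civil-engineering/stafflist), so I would put them in a list and use lapply.
  • @HCAI See my update, which lets you build the JSON URL without having to find it in the developer tab or use Selenium. This should allow you to quickly lapply over a vector of staff home page URLs, which should be easy enough to get from eps.leeds.ac.uk/civil-engineering/stafflist using rvest.

【Solution 2】: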

Here is an RSelenium approach:

library(RSelenium)
library(rvest)
library(xml2)

#setup driver, client and server
driver <- rsDriver( browser = "firefox", port = 4545L, verbose = FALSE ) 
server <- driver$server
browser <- driver$client

#goto url in browser
browser$navigate("https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah")

#get all relevant titles
doc <- xml2::read_html(browser$getPageSource()[[1]])
df <- data.frame( title = 
                    xml2::xml_find_all(doc, '//span[@class="title-with-parent"]') %>%
                    xml2::xml_text() )

#close everything down properly
browser$close()
server$stop()
# Windows only: kill the Java process, otherwise port 4545 stays occupied
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
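
The XPath step can be tried on its own, without a browser session, by parsing a small HTML string. This is a sketch only: the markup below is invented for illustration, and the class name `title-with-parent` is taken from the code above rather than verified against the live page.

```r
library(xml2)

# Hypothetical markup standing in for browser$getPageSource()[[1]]
mock_html <- '<html><body>
<ul>
  <li><span class="title-with-parent">First mock title</span> <span class="journal">Journal X</span></li>
  <li><span class="title-with-parent">Second mock title</span></li>
</ul>
</body></html>'

doc <- read_html(mock_html)

# @class="..." matches only when the attribute is exactly this string;
# contains(@class, "...") is the usual fallback for multi-class elements
title_nodes <- xml_find_all(doc, '//span[@class="title-with-parent"]')
titles <- xml_text(title_nodes, trim = TRUE)

print(titles)
```

If the result is empty on the real page source, the content may not have been rendered yet when `getPageSource()` was called.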

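【Discussion】:

  • Thank you very much for your help with this! I accepted Allan's answer because, being very new to scraping, I found it easier to read, but I can see that I could quite easily apply your approach to a long list of URLs... When it becomes available, I would like to award you a bounty for your help.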