【Posted】: 2021-02-16 17:37:42
【Question】:
I have collected the following URLs in a vector:
departments <- c("https://www.jurinst.su.se/english/about-us/contact/researchers-teachers",
"https://www.jurinst.su.se/english/about-us/contact/doctoral-students",
"https://www.buv.su.se/english/research/our-researchers/researchers-child-and-youth-studies",
"https://www.buv.su.se/english/research/our-researchers/researchers-children-s-culture",
"https://www.buv.su.se/english/research/our-researchers/researchers-early-childhood-education",
"https://www.buv.su.se/english/research/our-researchers/researchers-schoolage-educare",
"https://www.edu.su.se/english/about-us/organisation/researchers-faculty-members",
"https://www.edu.su.se/english/about-us/organisation/phd-students",
"https://www.psychology.su.se/english/about-us/contact/staff-a-z",
"https://www.su.se/publichealth/english/about-us/our-staff",
"https://www.sbs.su.se/english/research/research-sections/accounting/faculty",
"https://www.sbs.su.se/english/research/research-sections/finance/people",
"https://www.sbs.su.se/english/research/research-sections/management/faculty",
"https://www.sbs.su.se/english/research/research-sections/marketing/faculty",
"https://www.sofi.su.se/english/staff/all-staff",
"https://www.astro.su.se/english/about-us/contact/2.16629",
"https://www.mnd.su.se/english/research/mathematics-education/researchers",
"https://www.mnd.su.se/english/research/science-education/researchers",
"https://www.mnd.su.se/english/research/mathematics-education/graduate-students",
"https://www.mnd.su.se/english/research/science-education/graduate-students",
"https://www.fysik.su.se/english/about-us/contact/contact-list-alphabetical",
"https://www.dbb.su.se/about-us/contact",
"https://www.mmk.su.se/about-us/units-and-staff/people-at-mmk",
"https://www.su.se/mbw/about-us/staff/all-contacts",
"https://www.aces.su.se/staff/",
"https://www.su.se/geo/english/about-us/contact/staff",
"http://www.bergianska.se/english/about-us/contact-us/staff",
"https://www.nordita.org/people/zebra/index.php")
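In terms of XPath, the pages behind these URLs are similar but not identical. I am trying to use jsonlite to build a loop that downloads every person's name and email address. However, as the example below shows, I already get an error when handling a single URL. Do you have an idea for better code? Thanks.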
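library(rvest)  # provides read_html(), html_node(), html_text() and %>%
url.1 <- departments[1]
# This is the call that errors: the text of <body> is passed to the JSON parser
json.content <- read_html(url.1) %>% html_node('body') %>% html_text() %>%
  jsonlite::fromJSON(simplifyVector = FALSE)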
【Comments】:
-
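But Giulia, I tried the first 3 cases and none of them returns JSON. Instead, they return the plain text of the CSS 'body' node. In fact, at least for the first one, a more suitable path is
xpath = '//div/div/ul/li[@class = "profiles borderboxify"]'
-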
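Thanks @NicolásVelásquez. I thought this might be a way to avoid writing code specific to each URL. Do you have any suggestion for code that works for all the links without specifying a particular XPath? The pages behind the URLs have a similar shape, so I wonder whether this is possible, or whether there is a suitable R package...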
-
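A single piece of code for such different formats and layouts? Not that I can come up with. At most, I think you could extract all the emails in the HTML with a filter, without even restricting it to the staff lists or tables, because they all carry an href attribute, reachable with XPath, that includes "mailto:email@something".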
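The mailto filter described in the comment above could look roughly like the sketch below. It is not code from the thread: `extract_emails` is a hypothetical helper name, and `sample_html` is made-up markup standing in for a staff page so that the example runs without network access. It only recovers addresses, not names, and it would miss pages that obfuscate emails or build them with JavaScript.

```r
library(rvest)  # read_html(), html_nodes(), html_attr() and %>%

# Hypothetical helper: collect every mailto: address found in a parsed page.
extract_emails <- function(page) {
  hrefs <- page %>%
    html_nodes(xpath = '//a[starts-with(@href, "mailto:")]') %>%
    html_attr("href")
  emails <- sub("^mailto:", "", hrefs)  # drop the URI scheme
  emails <- sub("\\?.*$", "", emails)   # drop any ?subject=... suffix
  unique(trimws(emails))
}

# Made-up HTML standing in for a staff page (an assumption, not a real page).
sample_html <- '<html><body>
  <ul>
    <li><a href="mailto:anna.berg@example.org">Anna Berg</a></li>
    <li><a href="/profile/2">No email here</a></li>
    <li><a href="mailto:lars.holm@example.org?subject=Hi">Lars Holm</a></li>
    <li><a href="mailto:anna.berg@example.org">Duplicate link</a></li>
  </ul>
</body></html>'

emails <- extract_emails(read_html(sample_html))
print(emails)

# On the real pages (needs network access), one failing URL should not stop the loop:
# all_emails <- lapply(departments, function(u) {
#   tryCatch(extract_emails(read_html(u)), error = function(e) character(0))
# })
```

Because the XPath matches any link whose href starts with "mailto:", the same function can be applied to every URL in `departments` without a page-specific selector, at the cost of also picking up generic contact addresses in headers or footers.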
Tags: r web-scraping rvest jsonlite