R remove characters surrounding wildcard in string

【Question title】: R remove characters surrounding wildcard in string
【Posted】: 2021-08-29 16:05:27
【Question description】:

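I have a vector listing the various kinds of HTML on a website that contain URLs, each characterized by the wildcard ([^<]*). So far I have been able to extract the links into a data frame as I need, but I am having trouble cleaning them up so that they can be accessed.

How can I remove all of the tags without affecting the URLs?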

# Vector of HTML tags surrounding URL
x <- c('\t\t\t<div><a href=\"([^<]*)\">([^<]*)</a></div>','\t\t</tr><tr><td><a href=\"([^<]*)\">([^<]*)</a></td>','\t\t\t<td><a href=\"([^<]*)\">([^<]*)</a></td>')

Input:

URL <- "https://www.atf.gov/resource-center/data-statistics"
html <- paste(readLines(URL))

Expected output:

Link	Title
"https://www.atf.gov/file/144871/download"	Canada 2014-2019
"https://www.atf.gov/node/79436"	2019

The code I am currently using:

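dl_all <- data.frame()  # accumulator for the matching lines
for (i in x) {
  # grep() with value = TRUE returns the whole matching line, tags included
  datalines <- grep(i, html, value = TRUE)
  dl_all <- rbind(data.frame(datalines), dl_all)
}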

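【Comments】:

What is your input?
Just added it to the OP.
What do you need to get? A list of URLs? If so, from anywhere, or only from <a href="...."> tags?
Yes, I decided to add the expected output and the code I am currently working with. I want to get the URLs in the web page, with their titles in a separate column.
Try ideone.com/9YHFkb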

Tags: html r replace gsub

【Solution 1】:

Similar to Wiktor Stribiżew's suggestion, using R >= 4.1:

library(rvest)
url <- "https://www.atf.gov/resource-center/data-statistics"
df <- read_html(url) |> html_nodes("a") |> 
  {\(x) data.frame(
    Link = x |> html_attr("href"),
    Title = x |> html_text())
  }()

This gives:

tail(df)
                                                        Link                              Title
203    https://www.justice.gov/jmd/eeo-program-status-report                        No Fear Act
204 https://oig.justice.gov/hotline/whistleblower-protection Whistleblower Rights & Protections
205                        https://www.atf.gov/home/site-map                           Site Map
206 https://www.atf.gov/resource-center/accessibility-policy           Accessibility & Plug-Ins
207                              https://www.atf.gov/<front>                            ATF.gov
208                                  https://www.justice.gov         U.S. Department of Justice
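If parsing with rvest is not an option, the capture groups from the question's patterns can also be pulled out with base R's regexec() and regmatches(). This is only a minimal sketch: the sample lines stand in for readLines(URL), their markup is assumed rather than taken from the real page, and the single simplified pattern is an illustration, not the asker's exact vector x. Note that regexec() reports only the first match on each line.

```r
# Sample lines standing in for readLines(URL); the surrounding markup is assumed
html <- c(
  '\t\t\t<div><a href="https://www.atf.gov/file/144871/download">Canada 2014-2019</a></div>',
  '\t\t\t<td><a href="https://www.atf.gov/node/79436">2019</a></td>',
  '<p>no link here</p>'
)

# One pattern with two capture groups: the href value and the link text
pattern <- '<a href="([^"<]*)">([^<]*)</a>'

# regexec() locates the full match and each group; regmatches() extracts them
m <- regmatches(html, regexec(pattern, html))

# Keep only the lines that matched (full match + 2 groups = length 3)
m <- m[lengths(m) == 3]

# Element 2 is the first group (URL), element 3 the second (title)
links <- data.frame(
  Link  = vapply(m, `[`, character(1), 2),
  Title = vapply(m, `[`, character(1), 3)
)
links
```

Because only the captured groups are kept, the tags disappear without gsub() ever touching the URL itself.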

【Discussion】:

\(x) is the same as function(x); it is just the shorthand form introduced in R 4.1.
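To make the shorthand concrete, here is a small sketch (requires R >= 4.1). The names square_long and square_short are made up for illustration.

```r
# The two definitions below are equivalent; \(x) is shorthand for function(x)
square_long  <- function(x) x^2
square_short <- \(x) x^2

identical(square_long(4), square_short(4))  # TRUE

# An anonymous function on the right-hand side of the native pipe must be
# wrapped in parentheses (or braces) and then called with ()
res <- c(1, 2, 3) |> (\(v) data.frame(value = v, doubled = v * 2))()
res
```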