How to keep only the text after a specific tag and insert 0 in other rows

【Question title】: How to keep only the text after a specific tag and insert 0 in other rows
【Posted】: 2020-12-01 08:47:14
【Question】:

Data

data.frame(id = c(1, 2), text = c("something here <h1>my text</h1> also <h1>Keep it</h1>", "<h1>title</h1> another here"))

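How can I keep the text that follows the tag <h1>my text</h1>, up to the start of the next tag, and insert 0 when that tag does not exist in the row?

Example output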

data.frame(id = c(1, 2), text = c("also", 0))

【Question comments】:

I can see you have already asked several questions about manipulating these tags; you should accept the answers you like.

Tags: r quanteda

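【Solution 1】:

In regular expressions you can use lookahead and lookbehind; see this link for more information. With the data named df:

library(stringr)
# Lookbehind/lookahead: keep what lies between a closing </h1> and the next <h1>; NA if no match
df$text <- str_extract(df$text, pattern = "(?<=</h1>)(.*)(?=<h1>)")
ifelse(is.na(df$text), "0", trimws(df$text))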

[1] "also" "0"
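
The lookaround pattern above matches after any closing </h1>, not specifically after <h1>my text</h1>, which happens to be enough for the sample data. As a minimal sketch in base R, assuming the goal is to anchor on that exact tag, something like the following could work. The helper name text_after_tag and its tag and fill arguments are hypothetical and not part of the original answer; it also assumes that a row where the tag is present but no later <h1> follows should get "0" as well.

```r
# Sample data from the question (stringsAsFactors = FALSE keeps text as character on R < 4)
df <- data.frame(
  id = c(1, 2),
  text = c(
    "something here <h1>my text</h1> also <h1>Keep it</h1>",
    "<h1>title</h1> another here"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: text between `tag` and the next "<h1>", or `fill` if there is no match
text_after_tag <- function(x, tag = "<h1>my text</h1>", fill = "0") {
  # \Q...\E quotes the tag literally; (.*?) lazily captures up to the next <h1>
  pattern <- paste0("\\Q", tag, "\\E(.*?)<h1>")
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each hit is c(full match, capture group), or character(0) when nothing matched
  vapply(hits, function(h) if (length(h) == 2) trimws(h[2]) else fill, character(1))
}

df$text <- text_after_tag(df$text)
df$text
```

Because the tag is passed as an argument and quoted literally, the same helper can be pointed at a different heading without rewriting the pattern.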

【Solution 2】:

You can do this in quanteda with multiple corpus_segment() calls:

df <- data.frame(
  id = c(1, 2),
  text = c(
    "something here <h1>my text</h1> also <h1>Keep it</h1>",
    "<h1>title</h1> another here"
  )
)

library("quanteda", warn.conflicts = FALSE)
## Package version: 2.1.1

corp <- df %>%
  corpus(docid_field = "id") %>%
  corpus_segment("<h1>my text</h1>", pattern_position = "before") %>%
  corpus_segment("<h1>", pattern_position = "after")

Now we can get your 0 by joining the result with the full sequence of IDs and converting any unmatched rows (NAs) to 0:

library("dplyr", warn.conflicts = FALSE)
convert(corp, to = "data.frame") %>%
  rename(id = doc_id) %>%
  select(id, text) %>%
  mutate(id = as.integer(id)) %>%
  right_join(data.frame(id = 1:2)) %>%
  tidyr::replace_na(list(text = "0"))  # quote the 0: text is a character column
## Joining, by = "id"
##   id text
## 1  1 also
## 2  2    0
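
The join-then-fill step above does not depend on quanteda or dplyr. As a rough sketch in base R, assuming the segmented result has already been converted to a data frame, the same idea looks like this. The segments data frame below is a hypothetical stand-in for that intermediate result, not output copied from the answer.

```r
# Hypothetical stand-in for the converted corpus: only document 1 produced a segment
segments <- data.frame(id = 1L, text = "also", stringsAsFactors = FALSE)

# Left join against the complete id sequence so ids without a segment come back as NA
all_ids <- data.frame(id = 1:2)
out <- merge(all_ids, segments, by = "id", all.x = TRUE)

# Fill the NA left by unmatched ids with "0" (kept as character, like the text column)
out$text[is.na(out$text)] <- "0"
out
```

Using "0" as a string keeps the column type consistent; a numeric 0 would have to be coerced to character anyway.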
