[Question title]: Looping gsub() replacing elements from a list of multiple words into a corpus
[Posted]: 2019-09-28 12:30:38
[Question]:

I have a corpus of 233 documents (ecb_corpus) and a list of multi-word terms (ecb_final). I want to substitute every bigram and trigram from my multi-word list into my corpus.

Here is my multi-word list:

1   euro_area
2   monetary_policy
3   price_stability
4   interest_rates
5   second_question
6   medium_term
7   first_question
8   central_banks
9   inflation_expectations
10  structural_reforms

So far I have only managed to do this for a single case, using gsub:

ecb_ready <- gsub(pattern = "interest rate", replacement= "interest_rates", ecb_corpus, ignore.case = TRUE, perl = FALSE, fixed = TRUE)

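To get the result I want, pattern should contain any of the words from the corpus (ecb_corpus), and replacement the multi-word terms from my list (ecb_final). I have been trying to write a loop with no success at all (I am very new to R and unfortunately cannot do this yet).

Can anyone help me with the loop?

Many thanks!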

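[Comments on the question]:

I am not sure I clearly understand what you are trying to do. Could you make this short and complete by including a reproducible example and the expected output?
@RonakShah Please see the answer given by DHW below. Thanks.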

Tags: r loops gsub

[Solution 1]:

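stringr::str_replace_all() can do this directly. This is what the help file is trying to say, tersely, with "Vectorised over string, pattern and replacement".

Here I assume your corpus is stored in a character vector, but it could also be a list of character strings. If it is something more complicated (e.g. it is in JSON...), you will probably need to do some preprocessing before feeding it to str_replace_all().

Note that the result drops the names of the input elements, but they are easy to restore.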
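Restoring the names could look something like the sketch below. The two-document corpus and the two replacement pairs are toy stand-ins, not the real data, and whether the names are dropped in the first place may depend on the stringr version installed; copying them back is harmless either way.

```r
library(stringr)

# Toy stand-in for the corpus: a named character vector.
ecb_corpus <- c(
  doc_1 = "lorem ipsum interest rate gobbledygook",
  doc_2 = "lorem dolor central bank foobar"
)

# Named vector: names are the patterns, values are the replacements.
replacement_pairs <- c(
  "interest rate" = "interest_rates",
  "central bank"  = "central_banks"
)

ecb_ready <- str_replace_all(ecb_corpus, replacement_pairs)

# Copy the document names back onto the result.
names(ecb_ready) <- names(ecb_corpus)
ecb_ready
```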

library(tidyverse)

(ecb_corpus <- c(
  doc_1 = c("lorem ipsum interest rate gobbledygook"),
  doc_2 = c("lorem dolor central bank foobar")
))
#>                                    doc_1 
#> "lorem ipsum interest rate gobbledygook" 
#>                                    doc_2 
#>        "lorem dolor central bank foobar"

replacements <- c("euro_area",
                  "monetary_policy",
                  "price_stability",
                  "interest_rates",
                  "second_question",
                  "medium_term",
                  "first_question",
                  "central_banks",
                  "inflation_expectations",
                  "structural_reforms")

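# Derive the text to search for: underscores become spaces, a trailing "s" is dropped
targets <- replacements %>% str_replace_all("_", " ") %>% str_remove("s$")

# Named vector: names are the patterns, values are the replacements
(replacement_pairs <- replacements %>% set_names(targets))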
#>                euro area          monetary policy          price stability 
#>              "euro_area"        "monetary_policy"        "price_stability" 
#>            interest rate          second question              medium term 
#>         "interest_rates"        "second_question"            "medium_term" 
#>           first question             central bank    inflation expectation 
#>         "first_question"          "central_banks" "inflation_expectations" 
#>        structural reform 
#>     "structural_reforms"

(ecb_ready <- ecb_corpus %>% str_replace_all(replacement_pairs))
#> [1] "lorem ipsum interest_rates gobbledygook"
#> [2] "lorem dolor central_banks foobar"

Created on 2019-09-28 by the reprex package (v0.3.0)
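Since the question asked for a loop over gsub(), here is a minimal base R sketch of the same idea. It is not the answer's method, just one way to write it, and it makes a few assumptions: the corpus is a named character vector (the toy ecb_corpus below is a stand-in), the search text can be derived from each term by swapping underscores for spaces and dropping a final "s", and the terms contain no regex metacharacters. The pattern also accepts an optional trailing "s", because a search for the bare singular inside a text that already says "interest rates" would, as far as I can tell, leave a stray "s" behind ("interest_ratess").

```r
# Toy corpus standing in for ecb_corpus (assumed: a named character vector).
ecb_corpus <- c(
  doc_1 = "lorem ipsum Interest Rate gobbledygook",
  doc_2 = "lorem dolor central banks foobar"
)

ecb_final <- c("euro_area", "monetary_policy", "price_stability",
               "interest_rates", "second_question", "medium_term",
               "first_question", "central_banks",
               "inflation_expectations", "structural_reforms")

ecb_ready <- ecb_corpus
for (term in ecb_final) {
  # "interest_rates" -> "interest rate": spaces for underscores, drop a final "s"
  stem <- sub("s$", "", gsub("_", " ", term, fixed = TRUE))
  # \\b anchors at word boundaries; "s?" also matches the plural form
  pattern <- paste0("\\b", stem, "s?\\b")
  # Regex mode (not fixed = TRUE), so ignore.case actually takes effect
  ecb_ready <- gsub(pattern, term, ecb_ready, ignore.case = TRUE, perl = TRUE)
}
ecb_ready
```

Each pass of the loop rewrites the whole corpus for one term, so ten terms mean ten calls to gsub(). gsub() keeps the attributes of its input, so the document names should survive here without extra work.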
