Matching text from columns in a data frame: answers

【Question title】: Matching text from columns in Data Frame
【Posted】: 2018-05-30 20:15:18
【Question description】:

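I'm looking for keywords that appear in a search string (in this case, a research question). I think I'm close, but I'm not sure what is going wrong. My data frame looks like this: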

Q1                                                     keywords
How do you assess strategic deterrence messaging?      Deterrence messaging effects perception assessment
An energy transition for green growth                  Energy transition sustainable
Some other research question here                      research keywords topics etc

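Here Q1 is the question and keywords is a list of words (in this case, a Boolean search string with AND, NOT and OR stripped out). I want to determine whether any of the keywords appear in the Q1 string, find the matches, and count how often this happens (so that I can say keywords appear in column1 n% of the time, in column2 n% of the time, ...).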

Here is where I started, using tidyverse:

df_final <- df %>% 
  mutate(matches = str_extract_all(
    Q1,
    str_c(df$keywords, collapse = "|") %>% regex(ignore_case = T)),
    match = map_chr(matches, str_c, collapse = ", "),
    count = map_int(matches, length)
  )

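But I'm not getting any matches. I assume it may have something to do with my keywords column? Does it need to be converted to a vector or a comma-separated list to work properly? Thanks in advance for any suggestions!

Edit: sample output from dput():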

structure(list(Q1 = c("Assessing the effects of strategic deterrence messaging in the cognitive dimension", 
"How do you assess effects of strategic deterrence messaging?", 
"Determine Strategic Implications of Climate Change to USG/DoD"
), keywords = c("Deterrence messaging effects perception assessment", 
"political philosophy sociology social sciences history marketing power structure government governing class bourgeoisie social class military class ruling class governing class", 
"Climate Change Strategic Global Warming Strategic Climate Change Policy Global Warming Policy"
)), .Names = c("Q1", "keywords"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"))

【Question comments】:

Can you add a quick example of df using dput()?
Done, added an edit above.

Tags: r dplyr stringr

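【Solution 1】:

The code below returns your data.frame together with a count of how many of each row's keywords occur in that row's question. For your example data the counts are 3, 0 and 6. Note that a keyword repeated within a row is counted once per repetition. All functions come from tidyverse packages.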

library(stringr)
library(dplyr)
library(purrr)

df %>%
  mutate(count = map2_int(Q1, keywords, function(x, y) {
    # split this row's keywords on spaces, then count how many are found in its question
    sum(str_detect(str_to_lower(x), str_to_lower(flatten_chr(str_split(y, " ")))))
  }))

# A tibble: 3 x 3
  Q1                                                                                 keywords                                        count
  <chr>                                                                              <chr>                                           <int>
1 Assessing the effects of strategic deterrence messaging in the cognitive dimension Deterrence messaging effects perception assess~     3
2 How do you assess effects of strategic deterrence messaging?                       political philosophy sociology social sciences~     0
3 Determine Strategic Implications of Climate Change to USG/DoD                      Climate Change Strategic Global Warming Strate~     6
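
For context on why the attempt in the question found nothing, here is a minimal base R sketch. It assumes the cause is that collapsing the whole keywords column with "|" makes each alternative an entire keyword string rather than a single word. The vectors q1 and keywords are abbreviated stand-ins for the example data, and whole_phrases and single_words are names introduced only for this illustration.

```r
# Two abbreviated rows in the shape of the example data
q1 <- c("How do you assess effects of strategic deterrence messaging?",
        "Determine Strategic Implications of Climate Change to USG/DoD")
keywords <- c("Deterrence messaging effects perception assessment",
              "Climate Change Strategic Global Warming")

# What the question's code builds: every alternative is a full keyword string
whole_phrases <- paste(keywords, collapse = "|")
phrase_hits <- grepl(whole_phrases, q1, ignore.case = TRUE)

# Splitting on spaces first turns every single word into an alternative
single_words <- paste(unique(unlist(strsplit(keywords, " "))), collapse = "|")
word_hits <- grepl(single_words, q1, ignore.case = TRUE)

phrase_hits  # no question contains a complete keyword string
word_hits    # each question contains at least one individual keyword
```

Unlike the map2_int() call above, this sketch pools the keywords of all rows into one pattern, so it only illustrates the splitting step and is not a per-row count.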

Data:

df <- structure(list(Q1 = c("Assessing the effects of strategic deterrence messaging in the cognitive dimension", 
"How do you assess effects of strategic deterrence messaging?", 
"Determine Strategic Implications of Climate Change to USG/DoD"
), keywords = c("Deterrence messaging effects perception assessment", 
"political philosophy sociology social sciences history marketing power structure government governing class bourgeoisie social class military class ruling class governing class", 
"Climate Change Strategic Global Warming Strategic Climate Change Policy Global Warming Policy"
)), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"
))
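
The question also asks how often keywords appear at all, as a percentage. A small sketch of one way to get that from the count column, assuming "appears" means at least one keyword hit in a row. The counts are copied from the output shown above, and has_match and share are illustrative names.

```r
# Counts for the three example rows, copied from the output above
count <- c(3L, 0L, 6L)

# TRUE where the question contains at least one of its keywords
has_match <- count > 0

# Percentage of questions with at least one keyword hit
share <- mean(has_match) * 100
round(share, 1)
```

Inside a dplyr pipeline the same idea could be written as summarise(pct = mean(count > 0) * 100).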

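【Comments】:

Thanks! I get an error when applying this to the full dataset. Is this a problem with special characters appearing in the text? Error in mutate_impl(.data, dots) : Evaluation error: Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX). In addition: Warning messages: 1: In stri_detect_regex(string, pattern, opts_regex = opts(pattern)) : empty search patterns are not supported 2: In stri_detect_regex(string, pattern, opts_regex = opts(pattern)) : empty search patterns are not supported
In your list of keywords, are there records with more than one space between keywords? Special characters should not be the problem. But if you can narrow it down to an example, I can see whether there is a solution.
I think I stripped out multiple spaces, but I can double-check. There are also some cases without any keywords; would those throw an error? Edit: to be clear, I have not replaced these empty keywords with NA or anything similar. If that is what is causing the error, the easiest fix may be to drop those rows.
OK, solved it. There must have been special characters or something in the strings. Using df$keywords <- str_replace_all(df$keywords, "[[:punct:]]", "") seems to have fixed it.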
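
The two symptoms reported in these comments (a regex syntax error and "empty search patterns") can also be avoided without deleting punctuation, by matching keywords as literal text and dropping empty tokens. The following is a sketch in base R under those assumptions. count_keywords is a hypothetical helper, not part of the original answer.

```r
# Hypothetical helper: count how many of a row's keywords occur in its
# question, treating every keyword as literal text rather than as a regex.
count_keywords <- function(question, keywords) {
  # split on any run of whitespace so repeated spaces do not create empty tokens
  tokens <- strsplit(tolower(trimws(keywords)), "\\s+")[[1]]
  # drop empty or missing keywords, which would otherwise be empty patterns
  tokens <- tokens[!is.na(tokens) & nzchar(tokens)]
  if (length(tokens) == 0L) return(0L)
  # fixed = TRUE means characters such as "(" or "+" are matched literally
  hits <- vapply(tokens, grepl, logical(1), x = tolower(question), fixed = TRUE)
  sum(hits)
}

count_keywords(
  "Assessing the effects of strategic deterrence messaging in the cognitive dimension",
  "Deterrence messaging effects perception assessment"
)
count_keywords("What is C++ (really)?", "c++  (really)")  # metacharacters, double space
count_keywords("Some question", "")                       # row without keywords
```

Because the helper always returns an integer, it should be usable in place of the anonymous function in Solution 1, for example map2_int(Q1, keywords, count_keywords). Like that solution, it counts substring hits, so "assess" would also match inside "assessing".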

【Solution 2】:

Probably not optimal, but it might help. I added tolower() because I assume you don't care whether it is Deterrence or deterrence.

a <-tolower(unique(unlist(strsplit(df$keywords, " "))))

dfcounter <- data.frame(table(tolower(unlist(strsplit(df$Q1, " ")))),stringsAsFactors = F)

dfcounter[match(a,dfcounter$Var1,nomatch = 0),]
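
One thing to watch with this word-table approach is punctuation: splitting on spaces leaves tokens such as "messaging?", which do not equal the keyword "messaging". Below is a minimal variant that strips punctuation from the questions before counting. It uses two of the example questions and one keyword string, and kw, words and freq are names chosen for this sketch.

```r
q1 <- c("Assessing the effects of strategic deterrence messaging in the cognitive dimension",
        "How do you assess effects of strategic deterrence messaging?")
keywords <- "Deterrence messaging effects perception assessment"

# Unique lower-case keywords
kw <- unique(tolower(unlist(strsplit(keywords, " "))))

# Remove punctuation so "messaging?" and "messaging" count as the same word
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", q1)), "\\s+"))

# Frequency of every word across the questions, restricted to the keywords found
freq <- table(words)
freq[intersect(kw, names(freq))]
```

This counts whole-word matches pooled over all questions, so it answers a slightly different question than the per-row count in Solution 1.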

【Comments】: