How to grep from patterns in one dataframe for another? (Answers)

[Question title]: How to grep from patterns in one dataframe for another?
[Posted]: 2020-10-31 04:19:43
[Question description]:

I have a dataset that is just a list of genes:

Genes
Gene1
Gene2
Gene3
Gene4
Gene5

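Whenever any of these genes is mentioned in another dataset, I want to find and extract the rows that mention it.

My other dataset looks like this: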

Study ID   Title                  Drug        ...
1         Study of Gene1         Gene1-drug
2         Study of Gene10        Gene10-drug
3         Study of something     Gene4-drug

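I want to extract a row whenever any of the genes appears in any column of my second dataset.

I have struggled to find a question similar enough to reuse. I know there are many similar questions, but I am missing something, and most of the examples I have found use one specific grep pattern.

So far I have been trying: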

test = df[apply(df, 1, function(i) any(stringr::str_detect(i, fixed(genelist)))),]

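This outputs 0 rows, but I know some rows contain partial matches where a gene is mentioned. How can I modify it so that it takes the genes from the gene list data frame and searches for them?

[Question discussion]:

Tags: r stringr

[Solution 1]: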

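I suggest an approach based on purrr:

Paste each row of the data frame into a single string
For each row string, detect whether any of the words in df_genes$Genes is present
Summarize the result into one TRUE/FALSE value per row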

library(stringr)
library(purrr)

rows <- pmap(df, str_c, sep = " ") %>% 
  map(str_detect, paste0('\\b', df_genes$Genes, '\\b')) %>% 
  map_lgl(any)
df[rows,]
#>   Study_ID              Title       Drug
#> 1        1     Study of Gene1 Gene1-drug
#> 3        3 Study of something Gene4-drug

The paste0 + \\b idea comes from this great answer. The word boundaries (\\b) are what stop Gene1 from also matching Gene10.

Input data:
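For readers who prefer not to load purrr, here is a minimal base R sketch of the same word-boundary idea. It is not the answer's method: it collapses all genes into one alternation pattern and tests each column with grepl(). The names genes, studies, hits_by_column and keep_row are made up for illustration, and the sketch assumes the gene names contain no regex metacharacters (otherwise they would need escaping).

```r
# Example data mirroring the question (object names are illustrative).
genes <- c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5")
studies <- data.frame(
  Study_ID = 1:3,
  Title = c("Study of Gene1", "Study of Gene10", "Study of something"),
  Drug = c("Gene1-drug", "Gene10-drug", "Gene4-drug"),
  stringsAsFactors = FALSE
)

# One pattern for all genes: \b(Gene1|Gene2|...)\b
# The word boundaries keep "Gene1" from matching inside "Gene10".
pattern <- paste0("\\b(", paste(genes, collapse = "|"), ")\\b")

# Test every column, giving one logical vector per column.
hits_by_column <- lapply(
  studies,
  function(col) grepl(pattern, as.character(col), perl = TRUE)
)

# Keep a row if at least one of its columns matched.
keep_row <- Reduce(`|`, hits_by_column)

matched <- studies[keep_row, ]
print(matched)
```

With this example data, rows 1 and 3 are kept and the Gene10 row is dropped, which agrees with the purrr output above.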

df_genes <- data.frame(Genes = c("Gene1",
                                 "Gene2",
                                 "Gene3",
                                 "Gene4",
                                 "Gene5"))

df <- data.frame(Study_ID = 1:3,
                 Title = c("Study of Gene1",
                           "Study of Gene10",
                           "Study of something"),
                 Drug = c("Gene1-drug",
                          "Gene10-drug",
                          "Gene4-drug"))

To check which genes were found in each row:

pmap(df, str_c, sep = " ") %>% 
  map(str_detect, paste0('\\b', df_genes$Genes, '\\b')) %>% 
  map(~keep(df_genes$Genes, .))
#> [[1]]
#> [1] "Gene1"
#> 
#> [[2]]
#> character(0)
#> 
#> [[3]]
#> [1] "Gene4"
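If the matched genes are more useful as a column than as a list, something along these lines might work. This is only a base R sketch, not part of the original answer: the Matched column and the helper names row_text and found are hypothetical, and it assumes gene names without regex metacharacters.

```r
# Example data mirroring the question (object names are illustrative).
genes <- c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5")
studies <- data.frame(
  Study_ID = 1:3,
  Title = c("Study of Gene1", "Study of Gene10", "Study of something"),
  Drug = c("Gene1-drug", "Gene10-drug", "Gene4-drug"),
  stringsAsFactors = FALSE
)

# Single alternation pattern with word boundaries.
pattern <- paste0("\\b(", paste(genes, collapse = "|"), ")\\b")

# Paste each row into one string, as in the first step of the answer.
row_text <- do.call(paste, c(studies, sep = " "))

# Pull out every match per row; rows without a match give character(0).
found <- regmatches(row_text, gregexpr(pattern, row_text, perl = TRUE))

# Collapse the unique matches into one comma-separated string per row.
studies$Matched <- vapply(
  found,
  function(g) paste(unique(g), collapse = ", "),
  character(1)
)
print(studies)
```

Rows with no match end up with an empty string in Matched, so `studies[studies$Matched != "", ]` would give the filtered data again.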

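[Discussion]:

Thanks again for your help! This answers my question. If possible, do you know whether there is a way to find out which genes from my list were identified as matches in the dataset? If not, please don't worry, I will go and try it myself now. Thanks for your help!
I swear I am not answering all of your questions today on purpose, haha. Anyway, take a look at my latest edit.
Thank you, I really appreciate it either way! I have one more, and final, question: I just realized I need it to select only uppercase matches (my real genes are all uppercase). I tried matchedgenestest <- pmap(df, str_c, sep = " ") %>% map(grepl, paste0('\\b', genelist$Gene, '\\b', ignore.case = F)) %>% map(~keep(genelist$Gene, .)) but this does not work
Convert everything to lowercase. That is easier.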
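A note on the last attempt, with a small base R sketch. Regex matching in grepl() (and in stringr::str_detect()) is case-sensitive by default, so uppercase-only matching should not need any extra flag. In the attempted code, ignore.case = F sits inside paste0(), where it is pasted into the pattern as the text "FALSE". Also, map(grepl, ...) passes each row string as the first argument of grepl(), which is the pattern rather than the text. The gene names and strings below are made up for illustration.

```r
# Illustrative uppercase gene names and test strings (not from the question).
genes <- c("GENE1", "GENE4")
text <- c("Study of GENE1", "Study of gene1", "GENE4-drug", "Gene4-drug")

pattern <- paste0("\\b(", paste(genes, collapse = "|"), ")\\b")

# Case-sensitive matching is the default: only uppercase spellings match.
strict <- grepl(pattern, text, perl = TRUE)

# For comparison: ignore.case is an argument of grepl(), not of paste0().
loose <- grepl(pattern, text, ignore.case = TRUE, perl = TRUE)

# What the misplaced argument does: it becomes part of the pattern text.
broken_pattern <- paste0("\\b", "GENE1", "\\b", ignore.case = FALSE)

print(strict)
print(loose)
print(broken_pattern)
```

Lowercasing both the genes and the text, as suggested in the last comment, gives case-insensitive matching instead, so it fits the opposite requirement.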