How to compare a data frame to a list, and return values in the data frame matching the list?

【Question title】: How to compare a data frame to a list, and return values in the data frame matching the list?
【Posted】: 2018-02-06 00:17:31
【Question】:

Total newbie R question. I have a data frame df of IDs and notes:

ID    Notes
1     dogs are friendly
2     dogs and cats are pets
3     cows live on farms
4     cats and cows start with c

I also have another list of values, animals:

cats
cows

I want to add another column, Matches, to my data frame that contains all the animals found in the notes, e.g.:

ID    Notes                        Matches
1     dogs are friendly            
2     dogs and cats are pets       cats
3     cows live on farms           cows
4     cats and cows start with c   cats, cows

So far my only luck has been with grepl, which returns whether there is any match at all:

grepl(paste(animals,collapse="|"),df$Notes,ignore.case = T)

How can I return the matched values instead?

Update
In some rows of my data frame, the notes contain multiple instances of the same animal, for example cats:

ID    Notes                             Matches
1     dogs are friendly            
2     dogs and cats are pets            cats
3     cows live on farms                cows
4     cats and cats cows start with c   cats, cows

I only want to return one instance of each match. @LachlanO got me very close with his solution, but I get:

[1] "NA, NA"                      "cats, NA"                    "NA, cows"                    "c(\"cats\", \"cats\"), cows"

How can I return only the distinct matches?

【Comments】:

Try stringr::str_extract_all instead of grepl.
Or something like: df$Matches <- sapply(strsplit(tolower(df$Notes), " "), function(x) toString(intersect(x, animals)))

【Tags】: r grepl

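【Solution 1】:

Edit: added a unique operation to handle duplicate matches.

I can get you started and then point you in a direction :)

The code below uses stringr::str_extract_all to extract the relevant bits, but unfortunately it also leaves us with bits we don't want, especially where a note has no match and the result is empty. The unique call in the middle of our custom function simply makes sure we get unique matches, element by element.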

ID = seq(1,4)
Notes <- c(
  "dogs are friendly",
  "dogs and cats are pets",
  "cows live on farms",
  "cats and cows start with c "
)
df <- data.frame(ID, Notes)

animals = c("cats", "cows")

matches <- as.data.frame(sapply(animals, function(x){sapply(stringr::str_extract_all(df$Notes, x), unique)}, simplify = TRUE))
matches[matches == "character(0)"] <- NA

apply(matches, 1, paste, collapse = ", ")
[1] "NA, NA"     "cats, NA"   "NA, cows"   "cats, cows"

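You could set this as your extra column, but it isn't great because of those NAs. If there were a paste function that ignored NAs, we'd be set.

Fortunately, another user has already solved this :) Check out this answer here.

Combining the above should give you a suitable solution!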
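
As a minimal sketch of the "paste that ignores NAs" idea (not the linked answer's code, which is not reproduced here), the version below builds a matrix of matches in base R and drops the NAs row by row before pasting. The name `found` and the use of `grepl` in place of `str_extract_all` are assumptions made for illustration; because `grepl` only asks whether a word is present, repeated words collapse to a single match automatically.

```r
# Sample data from the question's update (row 4 mentions "cats" twice)
df <- data.frame(
  ID = 1:4,
  Notes = c(
    "dogs are friendly",
    "dogs and cats are pets",
    "cows live on farms",
    "cats and cats cows start with c"
  ),
  stringsAsFactors = FALSE
)
animals <- c("cats", "cows")

# One column per animal: the animal's name where it occurs in the note, NA otherwise
found <- sapply(animals, function(a) {
  ifelse(grepl(a, df$Notes, fixed = TRUE), a, NA_character_)
})

# Row by row, drop the NAs before pasting, so empty rows become ""
df$Matches <- apply(found, 1, function(x) paste(x[!is.na(x)], collapse = ", "))
df$Matches
```

Note that `sapply` only returns a matrix here because the data frame has more than one row; a single-row data frame would need extra handling.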

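【Comments】:

This gets me very close, except that in my data set a word can be mentioned more than once and I only want to see one match. See my edit for an example. Thanks!
@epr8n Insert unique() before paste().
@Gregor I get "unique() applies only to vectors" when trying: apply(matches, 1, unique(paste), collapse = ", ")
apply(matches, 1, function(x) paste(unique(x), collapse = ","))
@Gregor Unfortunately, that still leaves me with [1] "NA" "cats,NA" "NA,cows" "c(\"cats\", \"cats\"),cows"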

【Solution 2】:

I would do it like this:

animals = c("cats", "cows")
reg = paste(animals, collapse = "|")

library(stringr)
matches = str_extract_all(Notes, reg)
matches = lapply(matches, unique)
matches = sapply(matches, paste, collapse = ",")

df$matches = matches
df
#   ID                       Notes   matches
# 1  1           dogs are friendly          
# 2  2      dogs and cats are pets      cats
# 3  3          cows live on farms      cows
# 4  4 cats and cows start with c  cats,cows

If you want to get fancy, paste word boundaries onto the regex, e.g. reg = paste0("\\b", animals, "\\b", collapse = "|"), to avoid extracting from the middle of words.
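
A small sketch of the word-boundary variant, under a few assumptions: the two notes are made up for illustration, and case-insensitive matching via `stringr::regex(ignore_case = TRUE)` is added to mirror the `ignore.case` flag in the question. `paste0` is used because `paste` with its default separator would insert spaces around each word, giving a pattern like `\b cats \b` that only matches a word surrounded by spaces.

```r
library(stringr)

animals <- c("cats", "cows")

# paste0() adds no separator, so the pattern is "\\bcats\\b|\\bcows\\b"
reg <- paste0("\\b", animals, "\\b", collapse = "|")

# Hypothetical notes: mixed case, and "scows" contains "cows" mid-word
notes <- c("Cats and cats chase cows", "scows are flat-bottomed boats")

hits <- str_extract_all(notes, regex(reg, ignore_case = TRUE))
# Lower-case before unique() so "Cats" and "cats" count as one match
hits <- lapply(hits, function(x) unique(tolower(x)))
result <- sapply(hits, paste, collapse = ",")
result
```

With the boundaries in place, the second note should produce an empty string, since "cows" only appears inside "scows".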

Using the data provided by LachlanO:

ID = seq(1,4)
Notes <- c(
  "dogs are friendly",
  "dogs and cats are pets",
  "cows live on farms",
  "cats and cows start with c "
)
df <- data.frame(ID, Notes)

【Comments】:

Thanks @Gregor, exactly what I needed.

【Solution 3】:

You can use gsub to get all the animals at once:

gsub(".*?(cows|cats )|.*","\\1",do.call(paste,df),perl = T)
[1] ""          "cats "     "cows"      "cats cows"

Written as a one-liner:

transform(df,matches=gsub(".*?(cows|cats )|.*","\\1",do.call(paste,df),perl = T))
  ID                       Notes   matches
1  1           dogs are friendly          
2  2      dogs and cats are pets     cats 
3  3          cows live on farms      cows
4  4 cats and cows start with c  cats cows
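
The gsub version above relies on the trailing space in "cats " to separate the results, and it does not deduplicate repeated words. A base R sketch that avoids both, offered as an alternative rather than as this answer's method, uses `gregexpr` and `regmatches`; the sample data is the question's updated example and the name `found` is only illustrative.

```r
df <- data.frame(
  ID = 1:4,
  Notes = c(
    "dogs are friendly",
    "dogs and cats are pets",
    "cows live on farms",
    "cats and cats cows start with c"
  ),
  stringsAsFactors = FALSE
)
animals <- c("cats", "cows")
reg <- paste(animals, collapse = "|")

# gregexpr() locates every match; regmatches() extracts the matched text,
# giving one character vector per row (character(0) when nothing matches)
found <- regmatches(df$Notes, gregexpr(reg, df$Notes))

# Deduplicate within each row, then join with ", "
df$matches <- vapply(found, function(x) paste(unique(x), collapse = ", "), character(1))
df$matches
```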

【Comments】: