Comparing 2 dataframes in R: Searching a string from df1$V2 in df2$V2 and returning string in df2$V1 [duplicate]

【Question title】: Comparing 2 dataframes in R: Searching a string from df1$V2 in df2$V2 and returning string in df2$V1 [duplicate]
【Posted】: 2026-02-02 01:25:01
【Question】:

I am trying to compare 2 data frames in R:

Keggs <- c("K001", "K002", "K003", "K004", "K005", "K006", "K007", "K008")
names <- c("Acaryochloris", "Proteobacteria", "Parvibaculum", "Alphaproteobacteria", "Rhodospirillum", "Magnetospirillum", "Coraliomargarita", "Bacteria")
family <- c("Proteos", "Cyanobacteria", "Rhizo", "Nostocales", "Bacteroidetes")
species <- c("Alphaproteobacteria", "Purrsia", "Parvibaculum", "Chico", "Rhodospirillum")
res <- data.frame(Keggs, names)
result <- data.frame(family, species)

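Now, what I want to do is compare each string in result$species against res$names.

If there is a match, I want it to return the string in result$family from the same row, together with the string from res$Keggs, as a separate data frame.

So the final result would look like this: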

> df3
Keggs family
K003  Rhizo
K004  Proteos
K005  Bacteroidetes

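I searched for how to compare data.frames in R, and the closest I found was: compare df1 column 1 to all columns in df2 returning the index of df2

But that returns T/F, and the res df has 2 columns.

During my search I also came across the base R functions match() and merge(). My "res" df has 11,000,000 rows, while my "result" df has fewer than 1,000 rows. The match documentation gives the signature match(x, table, ...) and says under table: "long vectors are not supported". So, given the sheer size of my actual df, I did not think match() or merge() would be the most elegant approach. I tried a loop, but my looping skills are limited and I gave up.

Any insight into this puzzle would be greatly appreciated.

Thank you in advance, Purrsia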

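【Comments】:

Did you actually try the match call? 1e7 may seem large, but I think you may have misunderstood what R's "long vectors" are. Type news() at the console, scroll down to "long vectors", and read.
Have you tried merge(res, result, by.x="names", by.y="species")?
r2evens: First of all, thank you for news(). I did not know about it. Great tool. I did read it: 2^31. So I am well within my limits. Sorry, I did try the following command: matched <- data.frame(kegg = res$Keggs, family=result[match(result$species, res$V7), 2]). It initially gave an error because the row counts differ.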
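
As the comments suggest, match() and merge() are worth trying at this size: 11,000,000 rows is far below the 2^31 elements at which R starts treating a vector as a long vector. The following is a minimal base R sketch on the toy data from the question, assuming the columns are character rather than factor. The helper names idx, hit and df3_merge are only illustrative.

```r
# Toy data copied from the question
res <- data.frame(
  Keggs = c("K001", "K002", "K003", "K004", "K005", "K006", "K007", "K008"),
  names = c("Acaryochloris", "Proteobacteria", "Parvibaculum", "Alphaproteobacteria",
            "Rhodospirillum", "Magnetospirillum", "Coraliomargarita", "Bacteria"),
  stringsAsFactors = FALSE
)
result <- data.frame(
  family  = c("Proteos", "Cyanobacteria", "Rhizo", "Nostocales", "Bacteroidetes"),
  species = c("Alphaproteobacteria", "Purrsia", "Parvibaculum", "Chico", "Rhodospirillum"),
  stringsAsFactors = FALSE
)

# For every row of res, the position of its name in result$species (NA if absent)
idx <- match(res$names, result$species)
hit <- !is.na(idx)

# Keep only the matched rows; both columns now have the same length
df3 <- data.frame(
  Keggs  = res$Keggs[hit],
  family = result$family[idx[hit]],
  stringsAsFactors = FALSE
)
df3

# The merge() call from the comment returns the same rows, sorted by the join key
df3_merge <- merge(res, result, by.x = "names", by.y = "species")[, c("Keggs", "family")]
df3_merge
```

The lookup direction matters here: match(res$names, result$species) yields one index per row of res, so Keggs and family line up, whereas match(result$species, ...) yields one index per row of result and leads to the row-count mismatch described above.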

Tags: r string dataframe matching

【Solution 1】:

You can try the tidyverse functions:

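library(dplyr)  # provides %>%, inner_join() and select()

# Join on res$names == result$species, then keep only the two wanted columns
df3 <- res %>% 
  inner_join(result, by = c("names" = "species")) %>%
  select(Keggs, family)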

which gives

  Keggs        family
1  K003         Rhizo
2  K004       Proteos
3  K005 Bacteroidetes

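【Comments】:

At first it kept saying that the %>% function could not be found, but after searching this site I found that I had to attach the dplyr package. It works great. Thank you, Aramis.
:) The piping operator %>% originally comes from the magrittr package, but the tidyverse conveniently includes dplyr, which makes the basic pipe operator available.
This is great information. Lesson learned. Thank you very much! :)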
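
To illustrate the point about %>%: the operator only exists once a package that exports it has been attached, and the same join can also be written with no pipe at all. This is a small sketch that assumes dplyr is installed; df3_nopipe is an illustrative name.

```r
library(dplyr)  # attaching dplyr makes %>%, inner_join() and select() available

# Toy data copied from the question
res <- data.frame(
  Keggs = c("K001", "K002", "K003", "K004", "K005", "K006", "K007", "K008"),
  names = c("Acaryochloris", "Proteobacteria", "Parvibaculum", "Alphaproteobacteria",
            "Rhodospirillum", "Magnetospirillum", "Coraliomargarita", "Bacteria"),
  stringsAsFactors = FALSE
)
result <- data.frame(
  family  = c("Proteos", "Cyanobacteria", "Rhizo", "Nostocales", "Bacteroidetes"),
  species = c("Alphaproteobacteria", "Purrsia", "Parvibaculum", "Chico", "Rhodospirillum"),
  stringsAsFactors = FALSE
)

# Piped form, as in the answer
df3 <- res %>%
  inner_join(result, by = c("names" = "species")) %>%
  select(Keggs, family)

# Equivalent nested form: the pipe only passes the left-hand side as the first argument
df3_nopipe <- select(inner_join(res, result, by = c("names" = "species")), Keggs, family)
```

Both forms should produce the same three-row data frame, in the row order of res.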

【Solution 2】:

We can use data.table

library(data.table)
na.omit(setDT(res)[result, on = c("names" = "species")])[, names := NULL][]
#   Keggs        family
#1:  K004       Proteos
#2:  K003         Rhizo
#3:  K005 Bacteroidetes

【Comments】:

na.omit is a very inefficient function; you can specify nomatch = 0L in the join instead
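
The suggestion in this comment can be sketched as follows, assuming data.table is installed. With nomatch = 0L, rows of result that have no partner in res are dropped during the join, so no NA rows are created and na.omit() is not needed. Recent data.table versions also accept nomatch = NULL for the same behaviour.

```r
library(data.table)

# Toy data copied from the question, built directly as data.tables
res <- data.table(
  Keggs = c("K001", "K002", "K003", "K004", "K005", "K006", "K007", "K008"),
  names = c("Acaryochloris", "Proteobacteria", "Parvibaculum", "Alphaproteobacteria",
            "Rhodospirillum", "Magnetospirillum", "Coraliomargarita", "Bacteria")
)
result <- data.table(
  family  = c("Proteos", "Cyanobacteria", "Rhizo", "Nostocales", "Bacteroidetes"),
  species = c("Alphaproteobacteria", "Purrsia", "Parvibaculum", "Chico", "Rhodospirillum")
)

# Look up each result$species in res$names; unmatched rows of result are dropped
df3 <- res[result, on = c(names = "species"), nomatch = 0L][, .(Keggs, family)]
df3
```

The rows come back in the order of result (K004, K003, K005), as in the answer's output.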