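【Posted】:2020-09-10 01:38:21
【Question】:
I want to compare each column of one data frame against a column of another data frame, and write each set of overlapping rows to a separate file.
I started with two test datasets: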
df1 <- data.frame("x" = c("a_b", "c_d", "e_f/c_f", "g_h"),
"y" = c(9,2,1,4),
"z" = c(7,5,8,5))
df2 <- data.frame("m" = c("c_f", "x_y"),
"n" = c("a_b", "x_y"))
and used a for loop to get the results:
for (i in colnames(df2)){
ccc<-df1[grep(paste(df2[,i], collapse = "|"), df1$x), ]
write.csv(ccc, file = paste(i, ".csv", sep=""))
}
Everything looked fine.
Now I am trying the same loop on my full dataset (reduced versions of df1 and df2 are shown below):
df1<- structure(list(BGC_Accession = structure(c(1L, 1L, 1L, 2L), .Label = c("BGC0000647",
"BGC0000984"), class = "factor"), Genbank_ID = structure(c(1L,
3L, 2L, 4L), .Label = c("GCA_000202835", "GCA_000219295", "GCA_000964345",
"GCA_003029685"), class = "factor"), BGC_Class = structure(c(2L,
2L, 2L, 1L), .Label = c("NRP/Polyketide", "Terpene"), class = "factor"),
BGC_Start = c(2093957L, 1L, 1L, 2656134L), BGC_End = c(2115021L,
4440L, 4186L, 2721658L), Product = structure(c(1L, 1L, 1L,
2L), .Label = c("Carotenoid", "Delftibactin"), class = "factor"),
Similarity = structure(c(1L, 1L, 1L, 1L), .Label = "100%", class = "factor"),
Species_name = structure(c(1L, 4L, 2L, 3L), .Label = c("Acidiphilium_multivorum",
"Acidiphilium_sp_PM", "Acidovorax_avenae/Acidovorax_avene",
"Acinetobacter_baumannii"), class = "factor"), Kingdom = structure(c(1L,
1L, 1L, 1L), .Label = "k__Bacteria", class = "factor"), Phylum = structure(c(1L,
1L, 1L, 1L), .Label = "p__Proteobacteria", class = "factor"),
Class = structure(c(1L, 1L, 1L, 2L), .Label = c("c__Alphaproteobacteria",
"c__Betaproteobacteria"), class = "factor"), Order = structure(c(2L,
2L, 2L, 1L), .Label = c("o__Burkholderiales", "o__Rhodospirillales"
), class = "factor"), Family = structure(c(1L, 1L, 1L, 2L
), .Label = c("f__Acetobacteraceae", "f__Comamonadaceae"), class = "factor"),
Genus = structure(c(1L, 1L, 1L, 2L), .Label = c("g__Acidiphilium",
"g__Acidovorax"), class = "factor"), Species = structure(c(1L,
1L, 2L, 3L), .Label = c("s__Acidiphilium_multivorum", "s__Acidiphilium_sp_PM",
"s__Acidovorax_avenae"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
df2<- structure(list(Gut_SRS011111 = structure(c(2L, 1L, 1L), .Label = c("",
"Actinobaculum_unclassified"), class = "factor"), Gut_SRS011269 = structure(c(3L,
1L, 2L), .Label = c("Acidiphilium_multivorum", "Acinetobacter_baumannii",
"Clostridium_citroniae"), class = "factor"), Gut_SRS011355 = structure(c(2L,
3L, 1L), .Label = c("", "Acidovorax_avene", "Streptococcus_gordonii"
), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
Using the script above:
for (i in colnames(df2)){
overlap_data<-df1[grep(paste(df2[,i], collapse = "|"), df1$Species_name), ]
write.csv(overlap_data, file = paste(i, ".csv", sep=""))
}
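It seems that only one of the three columns in df2 gives the correct result. For example, the first column of df2 has no overlap with df1, so it should produce an empty result file. The output file for the second column looks fine. In the third file I should get one overlapping row, not the four that appear in the output file.
What am I doing wrong?
Thank you for your patience!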
【Comments】:
-
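Check your regular expression. The empty pattern, "", matches anything, and the empty cells in df2 put an empty alternative into your pattern. Here is an MCVE for you to try: grep("a|", letters)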
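To illustrate the comment above, here is a minimal sketch of one possible fix: drop the empty strings from each df2 column before collapsing it into a pattern. The helper name `match_rows` and the shortened `species` vector are hypothetical stand-ins for the question's data, not part of the original post. Note that the species names are still treated as regular expressions, as in the original loop.

```r
# The comment's MCVE: the trailing "|" adds an empty alternative,
# which matches every element of `letters`.
n_all <- length(grep("a|", letters))

# Shortened stand-ins for df1$Species_name and two columns of df2 (assumed data).
species <- c("Acidiphilium_multivorum", "Acinetobacter_baumannii",
             "Acidiphilium_sp_PM", "Acidovorax_avenae/Acidovorax_avene")
df2 <- data.frame(
  Gut_SRS011111 = c("Actinobaculum_unclassified", "", ""),
  Gut_SRS011355 = c("Acidovorax_avene", "Streptococcus_gordonii", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove NA and empty entries, then build the pattern.
# Returns integer(0) when nothing is left to search for.
match_rows <- function(terms, targets) {
  terms <- as.character(terms)
  terms <- terms[!is.na(terms) & nzchar(terms)]
  if (length(terms) == 0) return(integer(0))
  grep(paste(terms, collapse = "|"), targets)
}

# Row indices of `species` matched by each df2 column.
hits <- lapply(df2, match_rows, targets = species)
```

In the question's loop, the same idea would replace the `grep(...)` call with something like `match_rows(df2[, i], df1$Species_name)`. A column with no matches then selects zero rows, so `write.csv` should produce a file containing only the header line.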
Tags: r