For a unique field 1, collapse non-unique entries in another field

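【Question title】: For a unique field 1, collapse non-unique entries in another field
【Posted】: 2013-10-08 11:55:54
【Question】:

I have a data set that is the result of a left outer join of two data sets. Entries from the first data set now appear several times, once for each overlapping entry in the second. Note that Assembly.1000 is repeated 3 times; I would like to collapse it into a single row.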

Assembly.1000 chrX 560000 575000 ABC1   20
Assembly.1000 chrX 560000 575000 IL15RA 3.2
Assembly.1000 chrX 560000 575000 BRCA1  20
Assembly.1038 chrX 780000 829000 .      .
Assembly.1338 chrX 960000 999000 ACTIN  3800
Assembly.1338 chrX 960000 999000 ACTIN  4000

As you can see, the file 1 entry for Assembly.1000 is repeated 3 times, once for each file 2 entry (ABC1, IL15RA, BRCA1).

This is the output I would like to produce:

Assembly.1000 chrX 560000 575000 ABC1;IL15RA;BRCA1   20;3.2;20
Assembly.1038 chrX 780000 829000 .      .
Assembly.1338 chrX 960000 999000 ACTIN;ACTIN 3800;4000

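I can do this with a "while read" loop that looks at the previous entry on each iteration, but for large files (~1e6 entries) that is far too slow. Does anyone have a suggestion for how to do this efficiently?

【Comments】:

Look at aggregate, or at the "data.table" package, and use paste to aggregate the columns. Be aware, though, that this will ultimately make your data much harder to work with later.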
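
To illustrate the caveat in the comment above, here is a minimal base R sketch. The object name "collapsed" and the "; " separator are assumptions chosen for illustration. The point is only that collapsed values are strings, so they have to be split and converted again before any numeric work.

```r
# Hypothetical collapsed result for two groups (values taken from the
# question; the "; " separator is an assumption).
collapsed <- data.frame(
  V1 = c("Assembly.1000", "Assembly.1338"),
  V6 = c("20; 3.2; 20", "3800; 4000"),
  stringsAsFactors = FALSE
)

# The collapsed column is character, so it cannot be summarised directly.
# Each cell has to be split and converted back to numbers first.
values <- lapply(strsplit(collapsed$V6, "; ", fixed = TRUE), as.numeric)
names(values) <- collapsed$V1

# Per-group means computed from the recovered numbers.
group_means <- sapply(values, mean)
print(group_means)
```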

Tags: linux r bash while-loop

【Solution 1】:

Assuming your data.frame is called "mydf" and is defined as follows:

mydf <- structure(list(V1 = c("Assembly.1000", "Assembly.1000", 
    "Assembly.1000", "Assembly.1038", "Assembly.1338", "Assembly.1338"), 
    V2 = c("chrX", "chrX", "chrX", "chrX", "chrX", "chrX"), 
    V3 = c(560000L, 560000L, 560000L, 780000L, 960000L, 960000L), 
    V4 = c(575000L, 575000L, 575000L, 829000L, 999000L, 999000L), 
    V5 = c("ABC1", "IL15RA", "BRCA1", ".", "ACTIN", "ACTIN"), 
    V6 = c("20", "3.2", "20", ".", "3800", "4000")), 
    names = c("V1", "V2", "V3", "V4", "V5", "V6"), 
    class = "data.frame", row.names = c(NA, -6L))
mydf
#              V1   V2     V3     V4     V5   V6
# 1 Assembly.1000 chrX 560000 575000   ABC1   20
# 2 Assembly.1000 chrX 560000 575000 IL15RA  3.2
# 3 Assembly.1000 chrX 560000 575000  BRCA1   20
# 4 Assembly.1038 chrX 780000 829000      .    .
# 5 Assembly.1338 chrX 960000 999000  ACTIN 3800
# 6 Assembly.1338 chrX 960000 999000  ACTIN 4000

Here is the aggregate approach:

aggregate(cbind(V5, V6) ~ ., mydf, paste, collapse = "; ")
#              V1   V2     V3     V4                  V5          V6
# 1 Assembly.1000 chrX 560000 575000 ABC1; IL15RA; BRCA1 20; 3.2; 20
# 2 Assembly.1038 chrX 780000 829000                   .           .
# 3 Assembly.1338 chrX 960000 999000        ACTIN; ACTIN  3800; 4000
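
For readers unfamiliar with the formula interface, the call above can be read like this: the left-hand side, cbind(V5, V6), names the columns to collapse, and the "." on the right-hand side stands for every other column of the data, which become the grouping columns. The sketch below rebuilds the example data so that it runs on its own and compares the "." form with the spelled-out form. The names dot_form and long_form are illustrative only.

```r
# Re-create the example data so the snippet is self-contained.
mydf <- data.frame(
  V1 = c("Assembly.1000", "Assembly.1000", "Assembly.1000",
         "Assembly.1038", "Assembly.1338", "Assembly.1338"),
  V2 = "chrX",
  V3 = c(560000L, 560000L, 560000L, 780000L, 960000L, 960000L),
  V4 = c(575000L, 575000L, 575000L, 829000L, 999000L, 999000L),
  V5 = c("ABC1", "IL15RA", "BRCA1", ".", "ACTIN", "ACTIN"),
  V6 = c("20", "3.2", "20", ".", "3800", "4000"),
  stringsAsFactors = FALSE
)

# Left of "~": the columns to collapse. Right of "~": the grouping columns.
# "." is shorthand for "all columns of mydf not named on the left".
dot_form  <- aggregate(cbind(V5, V6) ~ ., mydf, paste, collapse = "; ")
long_form <- aggregate(cbind(V5, V6) ~ V1 + V2 + V3 + V4, mydf,
                       paste, collapse = "; ")

# Extra arguments such as collapse = "; " are passed on to paste().
print(all.equal(dot_form, long_form))
print(dot_form)
```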

Here is the "data.table" approach, using the same "mydf" as the starting point:

library(data.table)
DT <- data.table(mydf)
DT[, lapply(.SD, paste, collapse = "; "), by = c("V1", "V2", "V3", "V4")]
#               V1   V2     V3     V4                  V5          V6
# 1: Assembly.1000 chrX 560000 575000 ABC1; IL15RA; BRCA1 20; 3.2; 20
# 2: Assembly.1038 chrX 780000 829000                   .           .
# 3: Assembly.1338 chrX 960000 999000        ACTIN; ACTIN  3800; 4000

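【Discussion】:

+1, because this should work no matter how many columns need to be concatenated.
@Codoremifa, I think what we have to specify differs between the base R approach and the data.table approach.
You mean V5 and V6 as opposed to V1 - V4? In that sense, the base R approach looks easier to write when the number of "by" columns is very large, while the data.table approach looks easier when there are many columns to concatenate. But you are right, it is interesting how they work.
@Codoremifa, yes. That said, when a lot of columns are involved, I tend to use setdiff in some way to help out and reduce the chance of typos.
Thank you very much, this works. But could you explain cbind(V5, V6) ~ . ? I don't think I understand it.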
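
The setdiff idea mentioned in the discussion could look roughly like the following in base R. This is a sketch, not the commenter's actual code: the names collapse_cols and by_cols are made up for illustration, and the data is read from the question's lines with read.table, whose default column names V1 to V6 happen to match the answer.

```r
# Read the question's rows; read.table names the columns V1..V6 by default.
# V6 stays character because of the "." placeholder.
mydf <- read.table(text = "
Assembly.1000 chrX 560000 575000 ABC1   20
Assembly.1000 chrX 560000 575000 IL15RA 3.2
Assembly.1000 chrX 560000 575000 BRCA1  20
Assembly.1038 chrX 780000 829000 .      .
Assembly.1338 chrX 960000 999000 ACTIN  3800
Assembly.1338 chrX 960000 999000 ACTIN  4000
", stringsAsFactors = FALSE)

# Name only the columns to collapse and derive the grouping columns with
# setdiff(), so a long list of "by" columns never has to be typed out.
collapse_cols <- c("V5", "V6")
by_cols <- setdiff(names(mydf), collapse_cols)

result <- aggregate(mydf[collapse_cols], by = mydf[by_cols],
                    FUN = paste, collapse = "; ")
print(result)
```

The same by_cols vector could presumably be passed to the by argument of the data.table call as well, since that call already takes a character vector of column names.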

【Solution 2】:

Using data.table as @AnandaMahto suggested, but with slightly simpler syntax.

library(data.table)

dataset <- data.table(
   a1 = c(1,1,3,3,5,5),
   b1 = c(1,1,3,3,5,5),
   c1 = c("a","b","c","d","e","f"),
   d1 = c("a","b","c","d","e","f")
)

dataset2 <- dataset[,
   list(
      c1d1 = paste(c1, d1, sep = "", collapse = ""),
      d1 = paste(d1, collapse = ""),
      c1 = paste(c1, collapse = "")
   ),
   by = c("a1","b1")
]


#> dataset
#   a1 b1 c1 d1
#1:  1  1  a  a
#2:  1  1  b  b
#3:  3  3  c  c
#4:  3  3  d  d
#5:  5  5  e  e
#6:  5  5  f  f
#> dataset2
#   a1 b1 c1d1 d1 c1
#1:  1  1 aabb ab ab
#2:  3  3 ccdd cd cd
#3:  5  5 eeff ef ef
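
The c1d1 values such as aabb in the output above follow from how paste treats sep and collapse. A minimal base R sketch for the first group (a1 = 1), using illustrative variable names:

```r
# Values of c1 and d1 within the first group of the example.
c1 <- c("a", "b")
d1 <- c("a", "b")

# sep joins the vectors element by element: "aa" "bb"
pairwise <- paste(c1, d1, sep = "")

# collapse then joins the elements of that result into one string: "aabb"
both <- paste(c1, d1, sep = "", collapse = "")

# Collapsing each column separately keeps them apart: "ab" and "ab"
only_c1 <- paste(c1, collapse = "")
only_d1 <- paste(d1, collapse = "")

print(list(pairwise = pairwise, both = both, c1 = only_c1, d1 = only_d1))
```

Collapsing each column on its own, as the d1 and c1 entries do, is what matches the layout requested in the question.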

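【Discussion】:

This puts everything from c1 and d1 into one single column. I'm not sure that is what the OP wants.
Added the other options as well.
Haha, you sly dog, you. :)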