Conditionally calculating SUM in a column using data.table

【Question title】: Conditional calculating SUM in column using data.table
【Posted】: 2014-10-23 14:31:04
【Question】:

This is a follow-up to the question "How to output duplicated rows".

I have a table:

x1  x2  x3  x4
34  14  45  53 
2   8   18  17
34  14  45  20
19  78  21  48 
2   8   18  5

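As you can see, rows 1 and 3 are identical except for the last column. How can I sum the values in that last column, x4 (53 + 20), and keep only one of the two similar rows, using data.table? Rows 2 and 5 should be handled the same way, and row 4, which has no duplicate, should be dropped.

The output should be: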

x1  x2  x3  x4
34  14  45  73
2   8   18  22

【Question comments】:

Tags: r merge data.table plyr

【Solution 1】:

Try

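library(data.table)
nm1 <- paste0("x", 1:3)  # the columns that define a "similar" row
# keep rows whose (x1, x2, x3) combination occurs more than once,
# then sum x4 within each combination
setDT(df)[df[, duplicated(.SD) | duplicated(.SD, fromLast = TRUE),
             .SDcols = nm1]][, list(x4 = sum(x4)), by = list(x1, x2, x3)]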
#   x1 x2 x3 x4
#1: 34 14 45 73
#2:  2  8 18 22

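Or

DT <- data.table(df)
setkey(DT, x1, x2, x3)  # sorts DT by the key columns
# pass by = key(DT) explicitly: recent data.table versions compare all
# columns by default, which would find no duplicates here
DT[duplicated(DT, by = key(DT)) | duplicated(DT, by = key(DT), fromLast = TRUE)][,
                 list(x4 = sum(x4)), by = list(x1, x2, x3)]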

#   x1 x2 x3 x4
#1:  2  8 18 22
#2: 34 14 45 73
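
As an alternative sketch (not part of the original answer), the same result can be reached without duplicated() by recording each group's size with .N while summing, and then keeping only the groups that occur more than once. The helper column n is a name introduced here purely for illustration.

```r
library(data.table)

# Same sample data as in the "Data" section of this answer
DT <- data.table(
  x1 = c(34L, 2L, 34L, 19L, 2L),
  x2 = c(14L, 8L, 14L, 78L, 8L),
  x3 = c(45L, 18L, 45L, 21L, 18L),
  x4 = c(53L, 17L, 20L, 48L, 5L)
)

# Sum x4 per (x1, x2, x3) group and record the group size in n
res <- DT[, .(x4 = sum(x4), n = .N), by = .(x1, x2, x3)]

# Keep only groups that occurred more than once and drop the helper column
res <- res[n > 1, .(x1, x2, x3, x4)]
print(res)
```

Groups come back in order of first appearance, so the row order matches the first approach rather than the keyed one.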

Update

If '-' and '' appear in otherwise numeric columns, we can use as.numeric, which coerces them to NA and issues a warning. For example:

 dat <- data.frame(Col1= c(3, '', 2:5), Col2=c(4, 5, '-', 2, 6, 8),
           stringsAsFactors=FALSE)

 dat[] <- lapply(dat, as.numeric)
 #Warning message:
 #In lapply(dat, as.numeric) : NAs introduced by coercion
 dat
 # Col1 Col2
 #1    3    4
 #2   NA    5
 #3    2   NA
 #4    3    2
 #5    4    6
 #6    5    8

Or you can specify this when reading the dataset, via na.strings. Using the same data saved in a file:

read.table('fileNew.txt', sep=',',  header=TRUE, na.strings=c('', '-'))
#   Col1 Col2
#1    3    4
#2   NA    5
#3    2   NA
#4    3    2
#5    4    6
#6    5    8
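
If the data is already loaded as a data.table, a minimal sketch along the same lines is to mark the placeholders as NA first and convert afterwards, which avoids the coercion warning. The data below is hypothetical and mirrors the example above. The sketch assumes every column should end up numeric; otherwise restrict cols to the affected columns.

```r
library(data.table)

# Hypothetical data: numbers stored as text, with "" and "-" as placeholders
dat <- data.table(Col1 = c("3", "", "2", "3", "4", "5"),
                  Col2 = c("4", "5", "-", "2", "6", "8"))

cols <- names(dat)  # assumption: all columns are meant to be numeric

# Replace the placeholders with NA, then convert each column to numeric
dat[, (cols) := lapply(.SD, function(v) {
  v[v %in% c("", "-")] <- NA_character_
  as.numeric(v)
}), .SDcols = cols]
```

Because := replaces each listed column in full, the change from character to numeric happens in place, without copying the whole table.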

Data

df <- structure(list(x1 = c(34L, 2L, 34L, 19L, 2L), x2 = c(14L, 8L, 
14L, 78L, 8L), x3 = c(45L, 18L, 45L, 21L, 18L), x4 = c(53L, 17L, 
20L, 48L, 5L)), .Names = c("x1", "x2", "x3", "x4"), class = "data.frame", row.names = c(NA, 
-5L))

【Comments】:

Could you tell me how to replace "-" and "" with NA in my dataset? Thanks!