【问题标题】:Average two rows if both conditions are met如果两个条件都满足,则平均两行
【发布时间】:2016-06-05 03:17:07
【问题描述】:

我有一个名为 df 的数据框,看起来像这样;

id face value
1   r   15
1   r   11
1   t   16
1   t   17
2   r   13
2   r   25
2   t   12
2   t   18
3   r   30
3   r   20
3   t   19
3   t   10

因此,如果两个条件都满足,我需要对每一行进行平均。条件是;如果idface 相同,则平均value

例如,如果 id=1face=r 则取平均值 15+11 并将计算值 13 放入新列中。我必须对整个数据框执行此操作(2000 行,500 个不同的id)。

PS; 对于每个face,我必须有不同的列。例如,如果id=1face=r 将平均值value 放入名为newr 的新列中,如果id=2face=r 将平均值value 也放入名为newr 的新列中.然后,如果 id=1face=t 将平均值 value 放入名为 newt 的新列中。并且输出会是这样的;

id face newr newt
1   r    13
1   t        16.5
2   r   19
2   t        15

这是我的str(df1)

Classes ‘data.table’ and 'data.frame':  340 obs. of  26 variables:
 $ id         : int  5 5 5 5 5 5 5 5 7 7 ...
 $ nirid      : chr  "bx5xtx1" "ax5xrx2" "bx5xrx2" "bx5xtx2" ...
 $ group      : Factor w/ 3 levels "a","b","r": 2 1 2 2 2 1 1 1 1 1 ...
 $ section    : Factor w/ 3 levels "","r","t": 3 2 2 3 2 3 2 3 2 3 ...
 $ face       : Factor w/ 3 levels "","1","2": 2 3 3 3 2 2 2 3 2 3 ...
 $ sample     : chr  "B3C-3D" "B3C-3D" "B3C-3D" "B3C-3D" ...
 $ treatment  : chr  "control" "control" "control" "control" ...
 $ width      : num  1 1 1 1 1 ...
 $ thick      : num  1.02 1.02 1.02 1.02 1.02 ...
 $ length     : num  16 16 16 16 16 ...
 $ testweight : num  126 126 126 126 126 ...
 $ maxload    : num  418 418 418 418 418 418 418 418 436 436 ...
 $ loadppl    : num  251 251 251 251 251 251 251 251 258 258 ...
 $ ppldistance: num  0.139 0.139 0.139 0.139 0.139 ...
 $ scmor      : num  0.399 0.399 0.399 0.399 0.399 ...
 $ scmoe      : num  5.53e-05 5.53e-05 5.53e-05 5.53e-05 5.53e-05 ...
 $ failure    : int  2 2 2 2 2 2 2 2 2 2 ...
 $ mcweight   : num  107 107 107 107 107 ...
 $ odweight   : num  94.1 94.1 94.1 94.1 94.1 94.1 94.1 94.1 90.3 90.3 ...
 $ mc         : num  13.3 13.3 13.3 13.3 13.3 ...
 $ sgsc       : num  0.415 0.415 0.415 0.415 0.415 ...
 $ scmorpsi   : num  58 58 58 58 58 ...
 $ scmoepsi   : num  8.03 8.03 8.03 8.03 8.03 ...
 $ rows       : chr  "9" "10" "11" "12" ...
 $ value        :Class 'AsIs'  num [1:238000] 0.0147 -0.0169 -0.0152 0.0135 -0.0107 ...
 $ sg42       :Class 'AsIs'  num [1:235280] 1.86e-04 9.39e-05 8.94e-05 1.83e-04 8.86e-05 ...
 - attr(*, ".internal.selfref")=<externalptr> 

更新

这是使用dput(droplevels(head(data, 20)))的实际数据集的一小部分

structure(list(id = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 7L, 
7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L), nirid = c("bx5xtx1", 
"ax5xrx2", "bx5xrx2", "bx5xtx2", "bx5xrx1", "ax5xtx1", "ax5xrx1", 
"ax5xtx2", "ax7xrx1", "ax7xtx2", "ax7xrx2", "ax7xtx1", "ax8xrx2", 
"ax8xtx1", "ax8xrx1", "ax8xtx2", "ax9xtx2", "bx9xtx2", "ax9xrx2", 
"ax9xtx1"), group = c("b", "a", "b", "b", "b", "a", "a", "a", 
"a", "a", "a", "a", "a", "a", "a", "a", "a", "b", "a", "a"), 
    section = c("t", "r", "r", "t", "r", "t", "r", "t", "r", 
    "t", "r", "t", "r", "t", "r", "t", "t", "t", "r", "t"), face = c(1L, 
    2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 
    2L, 2L, 2L, 1L), sample = c("B3C-3D", "B3C-3D", "B3C-3D", 
    "B3C-3D", "B3C-3D", "B3C-3D", "B3C-3D", "B3C-3D", "B3C-1E", 
    "B3C-1E", "B3C-1E", "B3C-1E", "B1C-2D", "B1C-2D", "B1C-2D", 
    "B1C-2D", "A3C-2C", "A3C-2C", "A3C-2C", "A3C-2C"), treatment = c("control", 
    "control", "control", "control", "control", "control", "control", 
    "control", "control", "control", "control", "control", "control", 
    "control", "control", "control", "control", "control", "control", 
    "control"), width = c("1.003", "1.003", "1.003", "1.003", 
    "1.003", "1.003", "1.003", "1.003", "1.021", "1.021", "1.021", 
    "1.021", "1.02", "1.02", "1.02", "1.02", "1.033", "1.033", 
    "1.033", "1.033"), thick = c("1.02", "1.02", "1.02", "1.02", 
    "1.02", "1.02", "1.02", "1.02", "1.043", "1.043", "1.043", 
    "1.043", "1.025", "1.025", "1.025", "1.025", "1.029", "1.029", 
    "1.029", "1.029"), length = c("16", "16", "16", "16", "16", 
    "16", "16", "16", "15.98", "15.98", "15.98", "15.98", "16.016", 
    "16.016", "16.016", "16.016", "16.005", "16.005", "16.005", 
    "16.005"), testweight = c("126", "126", "126", "126", "126", 
    "126", "126", "126", "121.4", "121.4", "121.4", "121.4", 
    "144.1", "144.1", "144.1", "144.1", "119.6", "119.6", "119.6", 
    "119.6"), maxload = c(418L, 418L, 418L, 418L, 418L, 418L, 
    418L, 418L, 436L, 436L, 436L, 436L, 631L, 631L, 631L, 631L, 
    486L, 486L, 486L, 486L), loadppl = c("251", "251", "251", 
    "251", "251", "251", "251", "251", "258", "258", "258", "258", 
    "296", "296", "296", "296", "255", "255", "255", "255"), 
    ppldistance = c("0.1388", "0.1388", "0.1388", "0.1388", "0.1388", 
    "0.1388", "0.1388", "0.1388", "0.155", "0.155", "0.155", 
    "0.155", "0.1412", "0.1412", "0.1412", "0.1412", "0.1488", 
    "0.1488", "0.1488", "0.1488"), scmor = c("0.399330740757585", 
    "0.399330740757585", "0.399330740757585", "0.399330740757585", 
    "0.399330740757585", "0.399330740757585", "0.399330740757585", 
    "0.399330740757585", "0.391336060622532", "0.391336060622532", 
    "0.391336060622532", "0.391336060622532", "0.587001478757759", 
    "0.587001478757759", "0.587001478757759", "0.587001478757759", 
    "0.442958394865818", "0.442958394865818", "0.442958394865818", 
    "0.442958394865818"), scmoe = c("5.5328050375923e-05", "5.5328050375923e-05", 
    "5.5328050375923e-05", "5.5328050375923e-05", "5.5328050375923e-05", 
    "5.5328050375923e-05", "5.5328050375923e-05", "5.5328050375923e-05", 
    "4.6792031310635e-05", "4.6792031310635e-05", "4.6792031310635e-05", 
    "4.6792031310635e-05", "6.2150955161815e-05", "6.2150955161815e-05", 
    "6.2150955161815e-05", "6.2150955161815e-05", "4.9585347590597e-05", 
    "4.9585347590597e-05", "4.9585347590597e-05", "4.9585347590597e-05"
    ), failure = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), mcweight = c("106.6", 
    "106.6", "106.6", "106.6", "106.6", "106.6", "106.6", "106.6", 
    "102.1", "102.1", "102.1", "102.1", "121.9", "121.9", "121.9", 
    "121.9", "100.7", "100.7", "100.7", "100.7"), odweight = c("94.1", 
    "94.1", "94.1", "94.1", "94.1", "94.1", "94.1", "94.1", "90.3", 
    "90.3", "90.3", "90.3", "107.1", "107.1", "107.1", "107.1", 
    "88.3", "88.3", "88.3", "88.3"), mc = c("13.2837407013815", 
    "13.2837407013815", "13.2837407013815", "13.2837407013815", 
    "13.2837407013815", "13.2837407013815", "13.2837407013815", 
    "13.2837407013815", "13.0675526024363", "13.0675526024363", 
    "13.0675526024363", "13.0675526024363", "13.8188608776844", 
    "13.8188608776844", "13.8188608776844", "13.8188608776844", 
    "14.0430351075878", "14.0430351075878", "14.0430351075878", 
    "14.0430351075878"), sgsc = c("0.414649099500969", "0.414649099500969", 
    "0.414649099500969", "0.414649099500969", "0.414649099500969", 
    "0.414649099500969", "0.414649099500969", "0.414649099500969", 
    "0.385028360121945", "0.385028360121945", "0.385028360121945", 
    "0.385028360121945", "0.461392466167132", "0.461392466167132", 
    "0.461392466167132", "0.461392466167132", "0.376174963976185", 
    "0.376174963976185", "0.376174963976185", "0.376174963976185"
    ), scmorpsi = c("57.9580175265", "57.9580175265", "57.9580175265", 
    "57.9580175265", "57.9580175265", "57.9580175265", "57.9580175265", 
    "57.9580175265", "56.79768659253", "56.79768659253", "56.79768659253", 
    "56.79768659253", "85.1961507631", "85.1961507631", "85.1961507631", 
    "85.1961507631", "64.2900427962", "64.2900427962", "64.2900427962", 
    "64.2900427962"), scmoepsi = c("8.0301959907", "8.0301959907", 
    "8.0301959907", "8.0301959907", "8.0301959907", "8.0301959907", 
    "8.0301959907", "8.0301959907", "6.7912962715", "6.7912962715", 
    "6.7912962715", "6.7912962715", "9.0204579335", "9.0204579335", 
    "9.0204579335", "9.0204579335", "7.1967122773", "7.1967122773", 
    "7.1967122773", "7.1967122773"), rows = 9:28, value = c("0.014680833", 
    "-0.0169", "-0.015241563", "0.013507307", "-0.010687351", 
    "0.000479", "-0.0311", "-7.18e-05", "-0.037", "-0.00349", 
    "-0.0395", "-0.000859", "-0.018", "0.000127", "-0.0234", 
    "0.00215", "-0.0165", "-0.0162", "-0.0286", "-0.0214"), sg42 = c("0.000185853584415584", 
    "9.39393939393943e-05", "8.93772943722944e-05", "0.000183087277056277", 
    "8.86156017316018e-05", "0.000180270562770563", "9.02597402597403e-05", 
    "0.0001831779004329", "8.26839826839824e-05", "0.000167605411255411", 
    "8.44155844155841e-05", "0.000175891774891775", "9.1774891774892e-05", 
    "0.000180465367965368", "9.02597402597405e-05", "0.000178874458874459", 
    "0.000160822510822511", "0.000154978354978355", "8.26839826839826e-05", 
    "0.000159090909090909")), .Names = c("id", "nirid", "group", 
"section", "face", "sample", "treatment", "width", "thick", "length", 
"testweight", "maxload", "loadppl", "ppldistance", "scmor", "scmoe", 
"failure", "mcweight", "odweight", "mc", "sgsc", "scmorpsi", 
"scmoepsi", "rows", "value", "sg42"), row.names = c(NA, 20L), class = "data.frame")

预期结果列是newrnewtnewrsg42newtsg42

非常感谢:)

【问题讨论】:

  • 2000 行数据集不是一个大数据集。请检查str 并查看值列是否为数字。
  • @akrun 当我使用str 它说; value :Class 'AsIs' num [1:238000] 0.0147 -0.0169 -0.0152 0.0135 -0.0107
  • 您能否在您的帖子中更新完整的strlist 的情况下可能会出现“AsIs”。
  • @akrun 因为用cmets写太长了,所以我编辑了问题并将str(df1)放在那里
  • @akrun 我改变了我的数据结构并且工作正常。谢谢你:)

标签: r dataframe conditional


【解决方案1】:

这是使用aggregate()reshape() 的解决方案:

df <- data.frame(id=c(1L,1L,1L,1L,2L,2L,2L,2L,3L,3L,3L,3L),face=c('r','r','t','t','r','r','t','t','r','r','t','t'),value=c(15L,11L,16L,17L,13L,25L,12L,18L,30L,20L,19L,10L),stringsAsFactors=F);
reshape(transform(aggregate(value~face+id,df,mean),time=face),dir='w',idvar=c('id','face'));
##   face id value.r value.t
## 1    r  1      13      NA
## 2    t  1      NA    16.5
## 3    r  2      19      NA
## 4    t  2      NA    15.0
## 5    r  3      25      NA
## 6    t  3      NA    14.5

【讨论】:

  • 当我在数据上使用aggregate()reshape() 时,出现以下错误; Error in as.data.frame.default(data, optional = TRUE) : cannot coerce class ""function"" to a data.frame
  • @MaMi 您很可能将df 作为第二个参数传递给aggregate(),就像我在我的解决方案中所做的那样,但是您没有将df 分配给data.frame。如果这是正确的,那么df 将绑定到内置统计包中的df() 函数。您必须将 data.frame 变量作为第二个参数传递给 aggregate(),代码才能正常工作。
  • 感谢问题出在我的数据结构上!我修好了它,它工作了!
【解决方案2】:

如果我们需要“宽”格式的输出,请使用data.table 中的dcast 并将fun.aggregate 指定为mean

library(data.table)
dcast(setDT(df1), id + face ~ paste0("new", face), value.var="value", mean)
#   id face newr newt
#1:  1    r   13  NaN
#2:  1    t  NaN 16.5
#3:  2    r   19  NaN
#4:  2    t  NaN 15.0
#5:  3    r   25  NaN
#6:  3    t  NaN 14.5

或者另一个选项是dplyr/tidyr

library(dplyr)
library(tidyr)
df1 %>% 
  group_by(id, face) %>% 
  summarise(MeanValue = mean(value)) %>% 
  mutate(newface = paste0("new", face)) %>%
  spread(newface, MeanValue)
#    id  face  newr  newt
#  <int> <chr> <dbl> <dbl>
#1     1     r    13    NA
#2     1     t    NA  16.5
#3     2     r    19    NA
#4     2     t    NA  15.0
#5     3     r    25    NA
#6     3     t    NA  14.5

基准测试

set.seed(24)
df1 <- data.frame(id = sample(1:50, 1e7, replace=TRUE), 
                face = sample(letters, 1e7, replace=TRUE),
               value = rnorm(1e7), stringsAsFactors=FALSE)

df2 <- copy(df1)

system.time({
 dcast(setDT(df1), id + face ~ paste0("new", face), value.var="value", mean)    
 })
#  user  system elapsed 
#  1.95    0.01    1.96 
system.time({
 reshape(transform(aggregate(value~face+id,df1,mean),time=face),dir='w',
                     idvar=c('id','face'));
 })
#   user  system elapsed 
#  16.36    1.00   17.38 

数据

df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L), face = c("r", "r", "t", "t", "r", "r", "t", "t", "r", 
"r", "t", "t"), value = c(15L, 11L, 16L, 17L, 13L, 25L, 12L, 
18L, 30L, 20L, 19L, 10L)), .Names = c("id", "face", "value"), 
class = "data.frame", row.names = c(NA, -12L))

【讨论】:

  • 感谢您的快速回复。我错过了我的问题的解释。对于每个“面”,我必须有不同的列。例如,如果id=1face=r 将平均值value 放入名为newr 的新列中,如果id=2face=r 将平均值value 也放入名为newr 的新列中.然后,如果 id=1face=t 将平均值 value 放入名为 newt 的新列中。我编辑了我的问题并将这个解释也放在那里。
  • @MaMi 你能用预期的输出更新你的帖子吗
  • 谢谢,但是当我运行代码时,我得到了这个错误; Error in gmean(nir) : grpn [340] != length(x) [238000] in gsum
  • @MaMi 您展示的示例没有任何错误
  • 反正下午在线。
【解决方案3】:
for( i in unique(df1$id)){
  for(j in unique(df1$face=="r"[df1$id==i])){
      for(l in unique(df1$face == "t"[df1$id==i])){
       df1$newr[df1$id==i & df1$face=="r"] <- mean(df1$value[df1$id==i & df1$face=="r"])
        df1$newt[df1$id==i & df1$face=="t"] <- mean(df1$value[df1$id==i & df1$face=="t"])
     }
   }
}


df1 <- df1[!duplicated(df1[,c("id","face")]),]

> df1
   id face newr newt
1   1    r   13   NA
3   1    t   NA 16.5
5   2    r   19   NA
7   2    t   NA 15.0
9   3    r   25   NA
11  3    t   NA 14.5

【讨论】:

    猜你喜欢
    • 1970-01-01
    • 2017-10-06
    • 2013-08-08
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2019-07-30
    • 1970-01-01
    • 1970-01-01
    相关资源
    最近更新 更多