【Question title】:How to remove rows matching criteria and rows adjacent to them
【Posted】:2016-11-10 20:20:42
【Question】:

I have the following sample data:

library(data.table)

data <- data.table(ID    = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4),
                   date  = c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6),
                   score = c(4,3,2,2,4,1,5,5,5,2,1,4,2,1,5,5,5,3,5,5,5,2,4,5))

   ID date score
 1:  1    1     4
 2:  1    2     3
 3:  1    3     2
 4:  1    4     2
 5:  1    5     4
 6:  1    6     1
 7:  2    1     5
 8:  2    2     5
 9:  2    3     5
10:  2    4     2
11:  2    5     1
12:  2    6     4
13:  3    1     2
14:  3    2     1
15:  3    3     5
16:  3    4     5
17:  3    5     5
18:  3    6     3
19:  4    1     5
20:  4    2     5
21:  4    3     5
22:  4    4     2
23:  4    5     4
24:  4    6     5
    ID date score

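I want to delete certain rows and change others, based partly on their position in the table. I have two criteria, each applied per ID:

  1. If a row has date == 1 and score == 5, I want to delete that row and every immediately following row that also has score == 5, until the score is no longer 5. (So, for example, for ID == 4 I want to keep the data for dates 4, 5 and 6.)

  2. For all other dates where score == 5, I want to replace the score with the mean of the previous two scores (or with just the previous score, if there is only one).

So the table I want to end up with is: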

   ID date score
 1:  1    1   4.0
 2:  1    2   3.0
 3:  1    3   2.0
 4:  1    4   2.0
 5:  1    5   4.0
 6:  1    6   1.0
 7:  2    4   2.0
 8:  2    5   1.0
 9:  2    6   4.0
10:  3    1   2.0
11:  3    2   1.0
12:  3    3   1.5
13:  3    4   1.5
14:  3    5   1.5
15:  3    6   3.0
16:  4    4   2.0
17:  4    5   4.0
18:  4    6   3.0  

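What is the best way to go about this? I imagine it is some combination of shift and .I, but I can't put them together.

【Comments】:

  • For the first part, you could maybe do data[, if(date[1L] == 1L) .SD[which.max(score != 5L):.N], by = ID]

Tags: r data.table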
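
The first criterion can also be written as a logical mask per ID. The following is a minimal sketch, not the commenter's code: it assumes rows are already sorted by date within each ID, and the name kept is only illustrative.

```r
library(data.table)

data <- data.table(ID    = rep(1:4, each = 6),
                   date  = rep(1:6, times = 4),
                   score = c(4,3,2,2,4,1, 5,5,5,2,1,4, 2,1,5,5,5,3, 5,5,5,2,4,5))

# cumsum(score != 5) stays at 0 until the first non-5 score within an ID,
# so the mask is TRUE only for an unbroken run of 5s that starts at date 1
kept <- data[, .SD[!(date[1L] == 1 & cumsum(score != 5) == 0)], by = ID]
kept
```

With the sample data this keeps all six rows for IDs 1 and 3 and only dates 4 to 6 for IDs 2 and 4. An ID whose scores are all 5 would lose every row under this mask, which may or may not be what is wanted.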


【Solution 1】:
library(data.table)
library(caTools) # for runmean()

# find rows satisfying the 1st condition: a run of 5s that starts at date 1
torm = data[, if(score[1] == 5 & date[1] == 1) .I
            , by = .(ID, rleid(score), cumsum(date == 1))]$V1


data[-torm    # remove the extra rows
   # add a running mean
   ][, mn := runmean(score, 2, endrule = 'keep', align = 'right'), by = ID
   # compute the new score - a little care needed here in case we only have 5's in a group
   ][, new.score := ifelse(score == 5, mn[which(score != 5)[1]], score)
     , by = .(ID, cumsum(score != 5))][]
#    ID date score  mn new.score
# 1:  1    1     4 4.0       4.0
# 2:  1    2     3 3.5       3.0
# 3:  1    3     2 2.5       2.0
# 4:  1    4     2 2.0       2.0
# 5:  1    5     4 3.0       4.0
# 6:  1    6     1 2.5       1.0
# 7:  2    4     2 2.0       2.0
# 8:  2    5     1 1.5       1.0
# 9:  2    6     4 2.5       4.0
#10:  3    1     2 2.0       2.0
#11:  3    2     1 1.5       1.0
#12:  3    3     5 3.0       1.5
#13:  3    4     5 5.0       1.5
#14:  3    5     5 5.0       1.5
#15:  3    6     3 4.0       3.0
#16:  4    4     2 2.0       2.0
#17:  4    5     4 3.0       4.0
#18:  4    6     5 4.5       3.0
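
To see why those grouping keys work, it can help to look at them on a single ID. A small sketch (the names run_id and block_id are illustrative and not part of the answer):

```r
library(data.table)

# Scores for ID 4 in the sample data, ordered by date
score <- c(5, 5, 5, 2, 4, 5)

run_id   <- rleid(score)        # one id per run of equal values: 1 1 1 2 3 4
block_id <- cumsum(score != 5)  # increments at each non-5 score:  0 0 0 1 2 2
```

Grouping by rleid(score) isolates the leading run of 5s, so the if() test on score[1] and date[1] picks out exactly those rows. Grouping by cumsum(score != 5) attaches each later run of 5s to the non-5 row just before it, which is the row whose mn value is reused.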

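【Comments】:

  • This does the trick, although I had to use endrule = 'mean' to get what I wanted (in my real, non-example use I want a running-mean window of 5, so when fewer than 5 values are available I want the mean of whatever is there, not just the most recent value). Thanks!
【Solution 2】:

Using na.locf from the zoo package: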
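
The "mean of whatever is available" behaviour described in this comment can also be sketched without caTools. The example below uses data.table's frollmean with an adaptive window as an alternative, not as the commenter's code; the names x, k and w are illustrative.

```r
library(data.table)

x <- c(4, 3, 2, 2, 4, 1)  # scores for ID 1 in the sample data
k <- 5L                   # desired window length

# Window length grows 1, 2, ..., k and then stays at k,
# so early rows average over however many values exist so far
w <- pmin(seq_along(x), k)

partial_mean <- frollmean(x, w, adaptive = TRUE)
partial_mean
```

For a per-ID running mean the same call would go inside j with by = ID.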

library(zoo)

DF <- data.frame(ID = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4), 
                 date = c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6), 
                score = c(4,3,2,2,4,1,5,5,5,2,1,4,2,1,5,5,5,3,5,5,5,2,4,5))



#mark rows for deletion

DF$markForDel=NA

DF$markForDel[DF$date==1 & DF$score==5]=1

DF$markForDel[DF$score!=5]=0

DF$markForDel = zoo::na.locf(DF$markForDel)


newDF = DF[DF$markForDel!=1,]
rownames(newDF)=NULL


#impute mean of previous score where score == 5
newDF$score[newDF$score==5]=NA

newDF$imputedScore = sapply(1:nrow(newDF), function(x) {
  ifelse(x > 3 & is.na(newDF$score[x]),
         mean(c(newDF$score[x-1], newDF$score[x-2])),
         newDF$score[x])
})


newDF$imputedScore = zoo::na.locf(newDF$imputedScore)

Output:

newDF
#   ID date score markForDel imputedScore
#1   1    1     4          0          4.0
#2   1    2     3          0          3.0
#3   1    3     2          0          2.0
#4   1    4     2          0          2.0
#5   1    5     4          0          4.0
#6   1    6     1          0          1.0
#7   2    4     2          0          2.0
#8   2    5     1          0          1.0
#9   2    6     4          0          4.0
#10  3    1     2          0          2.0
#11  3    2     1          0          1.0
#12  3    3    NA          0          1.5
#13  3    4    NA          0          1.5
#14  3    5    NA          0          1.5
#15  3    6     3          0          3.0
#16  4    4     2          0          2.0
#17  4    5     4          0          4.0
#18  4    6    NA          0          3.0

【Comments】:

  • This doesn't satisfy either condition, because neither of them is applied per ID.
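
A per-ID variant could look like the sketch below. It is not the answerer's code and does not use zoo: it applies both rules inside by = ID with data.table. The helper impute_fives is hypothetical, rows are assumed to be sorted by date within each ID, and a run of consecutive 5s is assumed to reuse the value computed for the first 5 in the run, which is how the expected table in the question reads.

```r
library(data.table)

dt <- data.table(ID    = rep(1:4, each = 6),
                 date  = rep(1:6, times = 4),
                 score = c(4,3,2,2,4,1, 5,5,5,2,1,4, 2,1,5,5,5,3, 5,5,5,2,4,5))

# Hypothetical helper: replace each 5 with the mean of up to two preceding
# (already processed) scores; a 5 that follows another 5 copies its value
impute_fives <- function(score) {
  out <- score
  for (i in seq_along(score)) {
    if (i > 1L && score[i] == 5) {
      out[i] <- if (score[i - 1L] == 5) {
        out[i - 1L]
      } else {
        mean(out[max(1L, i - 2L):(i - 1L)])
      }
    }
  }
  out
}

result <- dt[, {
  # rule 1: drop an unbroken run of 5s that starts at date 1
  keep <- !(date[1L] == 1 & cumsum(score != 5) == 0)
  # rule 2: impute the remaining 5s from earlier scores of the same ID
  .(date = date[keep], score = impute_fives(score[keep]))
}, by = ID]
result
```

On the sample data this gives the 18 rows shown in the question. A 5 on the first remaining row of an ID has no earlier score and is left unchanged in this sketch.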