【Question title】: Delete rows between values of a column
【Posted】: 2017-04-19 12:07:15
【Question】:

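I have a very large data frame and I want to remove, by id, the rows that lie between certain values of a column, but only when they are enclosed by those values, not when they sit at the beginning or the end of the id. In the example, I want to delete the or == 'base' rows that lie between or == 'plan' rows.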

id <- c(1,1,1,1,1,1,2,2,2,2,2,2)
fd <- c(101,102,103,104,105,106,101,102,103,104,105,106)
rem <- c(100,120,120,140, 140, 150, 200,220,220,250, 300, 310)
or <- c("base", "base", "plan", "base", "plan", "base", "plan", "base", 
"plan", "base", "plan", "base")
df <- data.frame(id, fd, rem, or)

Expected result:

id1 <- c(rep(1,5), rep(2,4))
fd1 <- c(101,102,103,105,106, 101,103,105,106)
or1 <- c("base", "base", "plan", "plan", "base", "plan", "plan", "plan", "base")

df1 <- data.frame(id1,fd1,or1)

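【Comments】:

  • What should happen if an id has multiple 'base'/'plan' instances?
  • I want to delete every row between the 'plan' rows of the same id. For example, for id 1 I want to keep the first two 'base' rows and the last one (just before id 2 starts).

Tags: r rows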


【Solution 1】:

Two possible solutions:

1) Using base R:

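# flag 'base' rows whose previous and next row (within the same id) are both 'plan';
# ave() returns a character vector here, hence the comparison with 'FALSE'
idx <- ave(df$or, df$id, FUN = function(x) x=='base' & c('base',head(x,-1))=='plan' & c(tail(x,-1),'base')=='plan')=='FALSE'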
df[idx,]

which gives:

   id  fd rem   or
1   1 101 100 base
2   1 102 120 base
3   1 103 120 plan
5   1 105 140 plan
6   1 106 150 base
7   2 101 200 plan
9   2 103 220 plan
11  2 105 300 plan
12  2 106 310 base
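
To see why the one-liner above works, it can help to run the same comparison on the 'or' values of a single id. The following is a minimal sketch; the names x, prev_x, next_x and drop are illustrative only and do not appear in the original answer.

```r
# 'or' values for id 1 in the example data
x <- c("base", "base", "plan", "base", "plan", "base")

# previous and next element; the ends are padded with 'base' so that
# the first and last row can never count as enclosed by 'plan'
prev_x <- c("base", head(x, -1))
next_x <- c(tail(x, -1), "base")

# a row is dropped only if it is 'base' and both neighbours are 'plan'
drop <- x == "base" & prev_x == "plan" & next_x == "plan"

data.frame(x, prev_x, next_x, drop)  # side-by-side view of the comparison
x[!drop]                             # values that remain
```

Note that this neighbour test only catches a single 'base' row between two 'plan' rows; runs of several 'base' rows need a different approach.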

2) Using the data.table package:

library(data.table)
setDT(df)

idx <- df[, .I[!(or=='base' & shift(or, fill = 'base')=='plan' & shift(or, fill = 'base', type = 'lead')=='plan')], id]$V1
df[idx]

which gives:

   id  fd rem   or
1:  1 101 100 base
2:  1 102 120 base
3:  1 103 120 plan
4:  1 105 140 plan
5:  1 106 150 base
6:  2 101 200 plan
7:  2 103 220 plan
8:  2 105 300 plan
9:  2 106 310 base

Or in one go:

library(data.table)
setDT(df)[df[, .I[!(or=='base' & shift(or, fill = 'base')=='plan' & shift(or, fill = 'base', type = 'lead')=='plan')], id]$V1]

In response to the comment: you can use the rle function to detect multiple consecutive 'base' rows between 'plan' rows, as follows (in base R):

# create new example dataset
df2 <- df[c(1:3,4,4,5:7,8,8,9:12),]

# the new example dataset:
> df2
    id  fd rem   or
1    1 101 100 base
2    1 102 120 base
3    1 103 120 plan
4    1 104 140 base
4.1  1 104 140 base
5    1 105 140 plan
6    1 106 150 base
7    2 101 200 plan
8    2 102 220 base
8.1  2 102 220 base
9    2 103 220 plan
10   2 104 250 base
11   2 105 300 plan
12   2 106 310 base

# define function
f <- function(x) {
  rl <- rle(x)
  rl$values <- !(rl$values == 'base' & c('base',head(rl$values,-1))=='plan' & c(tail(rl$values,-1),'base')=='plan')
  inverse.rle(rl)
}

# apply the function to each id-group and create an index
idx2 <- as.logical(ave(df2$or, df2$id, FUN = f))

# finally subset your data with the logical-index
df2[idx2,]

which gives:

> df2[idx2,]
   id  fd rem   or
1   1 101 100 base
2   1 102 120 base
3   1 103 120 plan
5   1 105 140 plan
6   1 106 150 base
7   2 101 200 plan
9   2 103 220 plan
11  2 105 300 plan
12  2 106 310 base
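
As a rough illustration of what f does inside one id group, the sketch below applies the same steps to the 'or' values of id 1 in df2. The variable names (vals, keep_run, keep) are assumptions made for this example.

```r
# 'or' values for id 1 in df2 (two consecutive 'base' rows between the plans)
x <- c("base", "base", "plan", "base", "base", "plan", "base")

rl <- rle(x)
rl$values   # one entry per run of identical values
rl$lengths  # how long each run is

# apply the neighbour test to the runs instead of the single rows
vals <- rl$values
keep_run <- !(vals == "base" &
              c("base", head(vals, -1)) == "plan" &
              c(tail(vals, -1), "base") == "plan")

# expand the per-run decision back to one value per row
rl$values <- keep_run
keep <- inverse.rle(rl)
x[keep]
```

Because consecutive 'base' rows collapse into a single run, the whole run is kept or dropped together, which is what makes this version handle two or more 'base' rows between 'plan' rows.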

Another option in base R (inspired by @Frank's data.table suggestion in the comments):

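f2 <- function(x) {
  i <- seq_along(x)              # positions within the group
  w <- which(x == 'plan')        # positions of the 'plan' rows
  b <- which(x == 'base')        # positions of the 'base' rows
  # 'base' positions strictly between the first and the last 'plan'
  ib <- b[b > head(w,1) & b < tail(w,1)]
  !(i %in% ib)                   # TRUE for rows to keep
}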

idx3 <- unlist(by(df2$or, df2$id, f2))
df2[idx3,]

With data.table, you can follow @Frank's suggestion:

setDT(df2)
df2[, keep := {isp = or == "plan"; wp = which(isp); r = 1:.N; isp | r < first(wp) | r > last(wp)}, by = id
    ][!!keep]
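
The "keep everything that is 'plan', before the first 'plan', or after the last 'plan'" rule can also be written with cumulative sums. This is a different formulation from the one suggested by @Frank, shown here only as a base R sketch; the helper name keep_outside_plans and the demo data are hypothetical. One possible advantage is that it does not index into which(), which returns an empty vector for a group without any 'plan' row.

```r
# hypothetical helper: TRUE for the rows of one group that should be kept
keep_outside_plans <- function(x) {
  isp <- x == "plan"
  before_first <- cumsum(isp) == 0            # no 'plan' seen so far
  after_last   <- rev(cumsum(rev(isp))) == 0  # no 'plan' still to come
  isp | before_first | after_last
}

# small demo data (assumed for this example); id 2 has no 'plan' at all
demo <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2),
                   or = c("base", "plan", "base", "plan", "base", "base", "base"),
                   stringsAsFactors = FALSE)

# ave() coerces the logical result to character, so convert it back
keep <- as.logical(ave(demo$or, demo$id, FUN = keep_outside_plans))
demo[keep, ]
```

In this sketch, a group without 'plan' rows is kept in full, which seems consistent with the question but is an assumption about the desired behaviour.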

Data used:

df <- structure(list(id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), 
                     fd = c(101, 102, 103, 104, 105, 106, 101, 102, 103, 104, 105, 106), 
                     rem = c(100, 120, 120, 140, 140, 150, 200, 220, 220, 250, 300, 310), 
                     or = c("base", "base", "plan", "base", "plan", "base", "plan", "base", "plan", "base", "plan", "base")), 
                .Names = c("id", "fd", "rem", "or"), row.names = c(NA, -12L), class = "data.frame")

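【Comments】:

  • Any idea how to modify the code so that rows are also deleted when, after a 'plan', I have two or more 'base' rows followed by 'plan' again? Thanks.
  • Instead of finding the rows to delete, you could identify the ones to keep (all 'plan' rows plus everything before the first plan or after the last plan), e.g. df[, keep := {isp = or == "plan"; wp = which(isp); r = 1:.N; isp | r < first(wp) | r > last(wp)}, by=id] or something like that.
  • @Frank thx & updated (including a second base R approach inspired by you)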