How to remove a row based on a condition of the row above or below

【Question title】: How to remove row based on condition of row above or below
【Posted】: 2015-06-19 03:24:16
【Question】:

I have a data frame like the following:

     chr   leftPos   ZScore1   ZScore2   ZScore3   ZScore4
      1     24352       34        43        19        43
      1     53534        2         1        -1        -9
      2        34      -15         7        -9       -18
      3      3443     -100        -4         4        -9
      3      3445     -100        -1         6        -1
      3      3667        5        -5         9         5
      3      7882       -8        -9         1         3

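I only want to keep rows that have an adjacent row with the same chr whose ZScore points in the same direction. In other words, a row should be kept if the row before or after it, within the same chr, has the same sign (positive or negative). I want to run this for every column with ZS in its name, so that the final output is simply the number of rows that meet the condition for each column.

For one column, the code should result in: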

     chr   leftPos         ZScore
      1     24352           34
      1     53534           2
      3     3443           -100
      3     3445           -100

But the final output should look like this:

         ZScore1    ZScore2    ZScore3    ZScore4
nrow        4         6          4          4


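I have tried bits of code, but I'm not really sure how to approach this.

I think I would group by chr, then check whether the row above has the same sign as the current row, then check whether the row below has the same sign as the current row, and then move on to the next row for that chr.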

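【Comments】:

Yes. That is intentional.
You just said it is intentional, so why does ZScore2 have seven rows?
Because every row has a row with the same sign above or below it.
I guess ZScore2 should be 6 rows and ZScore1 should be 4 rows.
But you are doing this by chr, and when chr == 2 there is only one row...

Tags: r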

【Solution 1】:

Try the dplyr package

library(dplyr)

Data

df <- data.frame(chr=c(1, 1, 2, 3, 3, 3, 3),
             leftPos=c(24352, 53534, 34, 3443, 3445, 3667, 7882),
             ZScore=c(34, 2, -15, -100, -100, 5, -8))

Code

# keep rows whose previous or next row within the same chr has the same sign
df %>% group_by(chr) %>% 
   filter(sign(ZScore)==sign(lag(ZScore)) | sign(ZScore)==sign(lead(ZScore))) %>% 
   ungroup()
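
The filter above handles a single ZScore column, while the question asks for a row count per ZScore column. One way this might be extended is sketched below. It assumes dplyr 1.0.0 or later (for across()), and the helper name has_same_sign_neighbour is made up for this illustration; it is not part of dplyr or of the original answer.

```r
library(dplyr)

# Same values as the df1 data used elsewhere on this page
df1 <- data.frame(
  chr     = c(1, 1, 2, 3, 3, 3, 3),
  leftPos = c(24352, 53534, 34, 3443, 3445, 3667, 7882),
  ZScore1 = c(34, 2, -15, -100, -100, 5, -8),
  ZScore2 = c(43, 1, 7, -4, -1, -5, -9),
  ZScore3 = c(19, -1, -9, 4, 6, 9, 1),
  ZScore4 = c(43, -9, -18, -9, -1, 5, 3)
)

# Hypothetical helper: TRUE where the previous or next value has the same sign.
# A missing neighbour (first or last row of a group) counts as "no match".
has_same_sign_neighbour <- function(x) {
  s <- sign(x)
  coalesce(s == lag(s), FALSE) | coalesce(s == lead(s), FALSE)
}

counts <- df1 %>%
  group_by(chr) %>%
  # per chr: number of qualifying rows in each ZScore column
  summarise(across(starts_with("ZScore"), ~ sum(has_same_sign_neighbour(.x)))) %>%
  # add the per-chr counts together
  summarise(across(starts_with("ZScore"), sum))

counts
# ZScore1 = 4, ZScore2 = 6, ZScore3 = 4, ZScore4 = 4 for this data
```

Counting inside each chr group first keeps lag() and lead() from looking across a chr boundary, which is the same reason the single-column filter uses group_by(chr).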

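【Comments】:

【Solution 2】:

An option using the development version of data.table (similar to the approach in @dimitris_ps's post). Instructions for installing the devel version are here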

library(data.table)#v1.9.5
na.omit(setDT(df)[, {tmp= sign(ZScore)
  .SD[tmp==shift(tmp) | tmp==shift(tmp, type='lead')] },
             by=chr])
#     chr leftPos ZScore
#1:   1   24352     34
#2:   1   53534      2
#3:   3    3443   -100
#4:   3    3445   -100

Update

We can create a function

 f1 <- function(dat, ZCol){
    na.omit(as.data.table(dat)[, {tmp = sign(eval(as.name(ZCol)))
     .SD[tmp==shift(tmp) | tmp==shift(tmp, type='lead')]},
    by=chr])[, list(.N)]}

 nm1 <- paste0('ZScore', 1:4)
 setnames(do.call(cbind,lapply(nm1, function(x) f1(df1, x))), nm1)[]
 #   ZScore1 ZScore2 ZScore3 ZScore4
 #1:       4       6       4       4

Or we can use set

 res <- as.data.table(matrix(0, ncol=4, nrow=1, 
                  dimnames=list(NULL, nm1)))
 for(j in seq_along(nm1)){
   set(res, i=NULL, j=j, value=f1(df1,nm1[j]))
  }
 res
 #   ZScore1 ZScore2 ZScore3 ZScore4
 #1:       4       6       4       4

Data

df <- structure(list(chr = c(1L, 1L, 2L, 3L, 3L, 3L, 3L),
leftPos = c(24352L, 
53534L, 34L, 3443L, 3445L, 3667L, 7882L), ZScore = c(34L, 2L, 
-15L, -100L, -100L, 5L, -8L)), .Names = c("chr", "leftPos", "ZScore"
), class = "data.frame", row.names = c(NA, -7L))

 df1 <- structure(list(chr = c(1L, 1L, 2L, 3L, 3L, 3L, 3L),
 leftPos = c(24352L, 
 53534L, 34L, 3443L, 3445L, 3667L, 7882L), ZScore1 = c(34L, 2L, 
 -15L, -100L, -100L, 5L, -8L), ZScore2 = c(43L, 1L, 7L, -4L, -1L, 
 -5L, -9L), ZScore3 = c(19L, -1L, -9L, 4L, 6L, 9L, 1L),
 ZScore4 = c(43L, 
 -9L, -18L, -9L, -1L, 5L, 3L)), .Names = c("chr", "leftPos",
  "ZScore1", "ZScore2", "ZScore3", "ZScore4"), class = "data.frame",
  row.names = c(NA, -7L))

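【Comments】:

If I have a series of columns in the data frame, e.g. ZScore1, ZScore2, etc., how would I run this so that it loops over each column?
@user3632206 Could you please update your post with an example and the expected result? The description is a bit unclear.
With the code above, I get the error that 'sign' is not meaningful for factors.
@user3632206 I think you have factor columns. Please change them to numeric with as.numeric(as.character(. Can you try it on the df1 data in my post?
OK, it seems to work fine on your data. If I want to make this work for different column names, presumably I can use nm1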
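
The factor error raised in this thread can be illustrated with a small sketch. The vector z below is a made-up example, not the asker's data. It shows why the conversion goes through as.character(): calling as.numeric() directly on a factor returns the internal level codes rather than the numbers the labels spell out.

```r
# Hypothetical column that was read in as a factor instead of as numbers
z <- factor(c("34", "2", "-15"))

codes  <- as.numeric(z)                 # internal level codes, not the values
values <- as.numeric(as.character(z))   # the numbers the labels spell out

values        # 34 2 -15
sign(values)  # 1 1 -1
```

After the conversion, sign() works as it does on the df1 data in the answer.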

【Solution 3】:

Here is a possible data.table solution that uses rleid from the dev version

setDT(df)[, indx := .N, by = .(chr, rleid(sign(ZScore)))][indx > 1L]
#    chr leftPos ZScore indx
# 1:   1   24352     34    2
# 2:   1   53534      2    2
# 3:   3    3443   -100    2
# 4:   3    3445   -100    2
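
To show the idea behind the run-length approach without data.table, here is a rough base R sketch built on rle(). It is not the answer's code. It assumes the rows are ordered so that each chr forms one contiguous block, and the variable names key and run_len are chosen only for this illustration.

```r
df <- data.frame(chr = c(1, 1, 2, 3, 3, 3, 3),
                 leftPos = c(24352, 53534, 34, 3443, 3445, 3667, 7882),
                 ZScore = c(34, 2, -15, -100, -100, 5, -8))

# One key per row combining chr and sign; a run is a stretch of identical keys
key <- paste(df$chr, sign(df$ZScore))
r <- rle(key)

# Length of the run each row belongs to
run_len <- rep(r$lengths, r$lengths)

# A row has a same-sign neighbour within its chr exactly when its run is longer than 1
kept <- df[run_len > 1, ]
kept
```

For this data the rows with leftPos 24352, 53534, 3443 and 3445 remain, matching the output above.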

Edit (per the new data)

indx <- paste0('ZScore', 1:4)
temp <- setDT(df)[, lapply(.SD, function(x) rleid(sign(x))), .SDcols = indx, by = chr]

Res <- setNames(numeric(length(indx)), indx)
for (i in indx) Res[i] <- length(temp[, .I[.N > 1L], by = c("chr", i)]$V1)
Res
# ZScore1 ZScore2 ZScore3 ZScore4 
#       4       6       4       4

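【Comments】:

@akrun, hmm, yes. I've fixed it. Interesting behavior, though. Thanks for noticing.
@akrun It does, but it makes some unnecessary copies that I was hoping to avoid. I need to think it through.
This worked fine on a few sample data sets I created. So it avoids the copy, right?
@akrun I guess it will be more memory efficient, although it still has the extra [ overhead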

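【Solution 4】:

Here is a loop that does what you want. No fancy packages. Just check whether the row before or the row after matches. If one matches, move on; otherwise, strip the row and check the same position again.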

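chr = c(1,1,2,3,3,3,3,3)
mat = cbind(chr,rnorm(8))

# Note: this loop compares only the chr column (column 1)
i = 1
while(i <= nrow(mat)){
  # does a previous / next row exist and share this row's chr?
  prev_same = i > 1 && mat[i-1,1] == mat[i,1]
  next_same = i < nrow(mat) && mat[i+1,1] == mat[i,1]
  if (!prev_same && !next_same){
    # drop the row and re-check the same position; drop=FALSE keeps mat a matrix
    mat = mat[-i, , drop=FALSE]
  } else {
    i = i+1
  }
}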
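
For comparison, a vectorised base R sketch of the full condition (same chr and same sign as the row above or below) is given here. It assumes rows are already ordered so that neighbours within a chr are adjacent, and the function name same_sign_neighbour is invented for this example.

```r
# Hypothetical helper: TRUE for rows whose previous or next row has the
# same chr and the same sign of z
same_sign_neighbour <- function(chr, z) {
  n <- length(z)
  if (n == 0) return(logical(0))
  s <- sign(z)
  # compare each row with the one before it (the first row has none)
  prev_ok <- c(FALSE, chr[-1] == chr[-n] & s[-1] == s[-n])
  # compare each row with the one after it (the last row has none)
  next_ok <- c(chr[-n] == chr[-1] & s[-n] == s[-1], FALSE)
  prev_ok | next_ok
}

df1 <- data.frame(
  chr     = c(1, 1, 2, 3, 3, 3, 3),
  leftPos = c(24352, 53534, 34, 3443, 3445, 3667, 7882),
  ZScore1 = c(34, 2, -15, -100, -100, 5, -8),
  ZScore2 = c(43, 1, 7, -4, -1, -5, -9),
  ZScore3 = c(19, -1, -9, 4, 6, 9, 1),
  ZScore4 = c(43, -9, -18, -9, -1, 5, 3)
)

# Rows kept for a single column
kept <- df1[same_sign_neighbour(df1$chr, df1$ZScore1), c("chr", "leftPos", "ZScore1")]

# Number of rows kept for every ZScore column
zcols <- grep("^ZScore", names(df1), value = TRUE)
counts <- sapply(df1[zcols], function(z) sum(same_sign_neighbour(df1$chr, z)))
counts
# ZScore1 ZScore2 ZScore3 ZScore4
#       4       6       4       4
```

Because the comparison is done on shifted copies of the vectors, no rows are removed while iterating, so there is no index bookkeeping as in the loop.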

【Comments】: