【Question title】: Assigning a value to each range of consecutive numbers with same sign in R
【Posted】: 2017-03-25 01:30:33
【Question】:

I am trying to create a data frame with a column whose values represent the length of each run of positive or negative numbers, like so:

Time  V  Length
0.5  -2  1.5
1.0  -1  1.5
1.5   0  0.0
2.0   2  1.0
2.5   0  0.0
3.0   1  1.75
3.5   2  1.75
4.0   1  1.75
4.5  -1  0.75
5.0  -3  0.75

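The Length column sums the length of time during which the values are positive or negative. Zeros are given 0 because they are an inflection point. If no zero separates a sign change, the crossing is placed at the average of the Time values on either side of the inflection.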

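I am trying to estimate the amount of time these values spend positive or negative. I have tried this with a for loop, with varying degrees of success, but I would like to avoid looping because I am working with very large datasets.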

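I have spent some time looking at sign and diff as they are used in this question about sign changes. I have also looked at this question, which uses transform and aggregate to sum consecutive duplicate values. I feel I could combine that with sign and/or diff, but I am not sure how to retroactively assign those sums to the ranges that produced them, or how to handle the points where I am averaging across an inflection.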

Any suggestions would be appreciated. Here is the sample dataset:

dat <- data.frame(Time = seq(0.5, 5, 0.5), V = c(-2, -1, 0, 2, 0, 1, 2, 1, -1, -3))
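
As a quick illustration of the rules above (a sketch for readers following along, not part of the original question), sign, diff, and rle can expose the runs in the sample data and the one sign change that has no zero in between. The names runs, skip, and cross_time are made up for this example.

```r
# Sample data from the question.
dat <- data.frame(Time = seq(0.5, 5, 0.5),
                  V = c(-2, -1, 0, 2, 0, 1, 2, 1, -1, -3))

# Runs of equal sign: lengths 2 1 1 1 3 2 for signs -1 0 1 0 1 -1.
runs <- rle(sign(dat$V))

# A jump of 2 in sign() means the series went from positive to negative
# (or the reverse) without an observed zero in between.
skip <- which(abs(diff(sign(dat$V))) == 2)

# Place the crossing halfway between the two neighbouring time stamps.
cross_time <- (dat$Time[skip] + dat$Time[skip + 1]) / 2
cross_time  # 4.25
```

The crossing at 4.25 is what splits the last two runs into 1.75 (from 2.5 to 4.25) and 0.75 (from 4.25 to 5.0) in the desired Length column.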

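【Comments】:

Can you give the solution for the dataset you posted?!
@David I believe the Length column is the solution for the posted dataset.
@David: I made that dataset "by hand" following the rules I listed; it is what I want the final solution to look like, but with sample sizes in the hundreds of thousands I cannot do the whole thing manually.
Do I get this right: 1.5 (rows 1 and 2) is sum(Time[c(0,1)])? Then 0 because V[3] == 0? But why do we get 1 rather than 2 (Time[4])?
@David, no, rows 1 and 2 are not 1.5 because that is the sum of the first two times; it is the duration for which V is negative until V reaches 0 (starting from Time = 0). Everything is the difference between 0s. If you look at plot(dat$Time, dat$V, type = "l"), the OP wants the x distances between the 0s.

Tags: r transform aggregate diff sign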

【Solution 1】:

Here is my attempt, done entirely in base R.

Joseph <- function(df) {
    is.wholenumber <- function(x, tol = .Machine$double.eps^0.5)  abs(x - round(x)) < tol

    v <- df$V
    t <- df$Time
    sv <- sign(v)
    nR <- length(v)
    v0 <- which(v==0)

    id <- which(abs(c(0, diff(sv))) > 1)  ## This line and (t[id] + t[id - 1L])/2 From @Henrik
    myZeros <- sort(c(v0*t[1L], (t[id] + t[id - 1L])/2))
    lenVals <- diff(c(0,myZeros,t[nR]))   ## Actual values that 
                             ## will populate the Length column

    ## remove values that result from repeating zeros from the df$V column
    lenVals <- lenVals[lenVals != t[1L] | c(!is.wholenumber(myZeros/t[1L]), FALSE)]

    ## Below we need to determine how long to replicate
    ## each of the lenVals above, so we need to find
    ## the starting place and length of each run...
    ## rle is a great candidate for both of these
    m <- rle(sv)        
    ml <- m$lengths
    cm <- cumsum(ml)
    zm <- m$values != 0   ## non-zero values i.e. we won't populate anything here
    rl <- m$lengths[zm]   ## non-zero run-lengths
    st <- cm[zm] - rl + 1L    ## starting index
    out <- vector(mode='numeric', length = nR)
    for (i in seq_along(st)) {out[st[i]:(st[i]+rl[i]-1L)] <- lenVals[i]}
    df$Length <- out
    df
}

Here is the output for the given example:

Joseph(dat)
   Time  V Length
1   0.5 -2   1.50
2   1.0 -1   1.50
3   1.5  0   0.00
4   2.0  2   1.00
5   2.5  0   0.00
6   3.0  1   1.75
7   3.5  2   1.75
8   4.0  1   1.75
9   4.5 -1   0.75
10  5.0 -3   0.75

Here is a larger example:

set.seed(142)
datBig <- data.frame(Time=seq(0.5,50000,0.5), V=sample(-3:3, 10^5, replace=TRUE))

library(compiler)
library(data.table)
library(microbenchmark)

c.Joseph <- cmpfun(Joseph)
c.Henrik <- cmpfun(Henrik)
c.Gregor <- cmpfun(Gregor)

    microbenchmark(c.Joseph(datBig), c.Gregor(datBig), c.Henrik(datBig), David(datBig), times = 10)
Unit: milliseconds
            expr        min         lq       mean     median         uq       max neval cld
   David(datBig)    2.20602   2.617742    4.35927   2.788686    3.13630 114.0674    10  a
c.Joseph(datBig)   61.91015   62.62090   95.44083   64.43548   93.20945  225.4576    10   b 
c.Gregor(datBig)   59.25738   63.32861  126.29857   72.65927  214.35961  229.5022    10   b 
 c.Henrik(datBig) 1511.82449 1678.65330 1727.14751 1730.24842 1816.42601 1871.4476    10   c

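As @Gregor pointed out, the goal is to find the x distance between each occurrence of zero. This can be seen visually by plotting (again, as @Gregor pointed out (many kudos, by the way)). For example, if we plot the first 20 values of datBig, we get:

From this, we can see that the x distances over which the graph is positive or negative (i.e. not sitting at zero, which happens when zeros repeat) are approximately: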

2.0, 1.25, 0.5, 0.75, 2.0, 1.0, 0.75, 0.5

t1 <- c.Joseph(datBig)
t2 <- c.Gregor(datBig)
t3 <- c.Henrik(datBig)
t4 <- David(datBig)

 ##  Correct values according to the plot above (x above a value indicates incorrect value)
 ##  2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 2.00 2.00 2.00 0.00 0.00 0.00 1.00 0.00 0.75 0.50

 ## all correct
 t1$Length[1:20]  
 [1] 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 2.00 2.00 2.00 0.00 0.00 0.00 1.00 0.00 0.75 0.50

 ## mostly correct
 t2$Length[1:20]                                         x    x    x                   x             x
 [1] 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 0.75 0.75 0.75 0.00 0.00 0.00 0.50 0.00 0.75 0.25

 ## least correct
 t3$Length[1:20]      x    x         x    x         x    x    x    x    x               x   x    x    x
 [1] 2.00 2.00 2.00 0.50 1.00 1.25 0.75 1.25 0.00 1.75 1.75 0.00 1.50 1.50 0.00 0.00 1.25 1.25 1.25 1.25

 ## all correct
 t4$Length[1:20]  
 [1] 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 2.00 2.00 2.00 0.00 0.00 0.00 1.00 0.00 0.75 0.50

# agreement with David's solution
all.equal(t4$Length, t1$Length)
[1] TRUE

Well, it looks like the Rcpp solution provided by David is not only accurate but also extremely fast.

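【Comments】:

Very nice speed comparison! Thanks for that. I wonder whether you could replace your for loop somehow?! That might speed up your answer even more.
Also, nice trick with the compiler! That usually gives a good boost!
@David The for loop is definitely the bottleneck. If you comment it out, it runs about twice as fast. I tried many alternatives, but they were all very clunky. I bet there is a better way to do it.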
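
One possible way to drop the loop, offered as a sketch rather than as the answer's own method: the non-zero runs appear in the output in the same order as the entries of lenVals, so rep() with the run lengths can fill them in a single assignment. The snippet below rebuilds sv, rl, and st for the sample data and hard-codes lenVals from the sample output above (an assumption made to keep the example short), then checks that both approaches agree.

```r
sv <- sign(c(-2, -1, 0, 2, 0, 1, 2, 1, -1, -3))
lenVals <- c(1.5, 1, 1.75, 0.75)        # one duration per non-zero run

m  <- rle(sv)
zm <- m$values != 0                     # TRUE for the non-zero runs
rl <- m$lengths[zm]                     # lengths of the non-zero runs
st <- cumsum(m$lengths)[zm] - rl + 1L   # start index of each non-zero run

# Loop version, as in the answer: fill one run per iteration.
out_loop <- numeric(length(sv))
for (i in seq_along(st)) out_loop[st[i]:(st[i] + rl[i] - 1L)] <- lenVals[i]

# Loop-free sketch: repeat each duration over the length of its run.
out_vec <- numeric(length(sv))
out_vec[sv != 0] <- rep(lenVals[seq_along(rl)], times = rl)

identical(out_loop, out_vec)  # TRUE
```

Whether this is actually faster on a dataset the size of datBig was not measured here.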

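【Solution 2】:

This took me longer than I care to admit, but here is my solution.

Because you said you want to use it on large datasets (so speed matters), I used Rcpp to write a loop that does all the checking. For a speed comparison I also created another sample dataset with 500,000 data points and checked the speed (I tried to compare against the other solutions, but could not convert them to data.table (otherwise it would be an unfair comparison...)). If they are provided, I am happy to update the speed comparison!

Part 1: My solution

My solution looks like this:

(in length_time.cpp)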

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector length_time(NumericVector time, NumericVector v) {
  double start = 0;
  double time_i, v_i;
  bool last_positive = v[0] > 0;
  bool last_negative = v[0] < 0;
  int length_i = time.length();
  NumericVector ret_vec(length_i);

  for (int i = 0; i < length_i; ++i) {
    time_i = time[i];
    v_i = v[i];

    if (v_i == 0) { // inflection point (an observed zero)
      if (i > 0) { // if this is not the beginning, then a regime has ended!
        ret_vec[i - 1] = time_i - start;
        start = time_i;
      }
    } else if ((v_i > 0 && last_negative) || (v_i < 0 && last_positive)) { 
      ret_vec[i - 1] = (time_i + time[i - 1]) / 2 - start;
      start = (time_i + time[i - 1]) / 2;
    }

    last_positive = v_i > 0;
    last_negative = v_i < 0;
  }
  ret_vec[length_i - 1] = time[length_i - 1] - start;

  // ret_vec now only holds a value at the last observation of each regime;
  // fill the rest backwards (something like a reverse na_locf)...
  double tmp_val = ret_vec[length_i - 1];
  for (int i = length_i - 1; i >= 0; --i) {
    if (v[i] == 0) {
      ret_vec[i] = 0;
    } else if (ret_vec[i] == 0){
      ret_vec[i] = tmp_val;
    } else {
      tmp_val = ret_vec[i];
    }
  }
  return ret_vec;
}

Then in an R file (i.e. length_time.R):

library(Rcpp)
# setwd("...") #to find the .cpp-file
sourceCpp("length_time.cpp")

dat$Length <- length_time(dat$Time, dat$V)
dat
# Time  V Length
# 1   0.5 -2   1.50
# 2   1.0 -1   1.50
# 3   1.5  0   0.00
# 4   2.0  2   1.00
# 5   2.5  0   0.00
# 6   3.0  1   1.75
# 7   3.5  2   1.75
# 8   4.0  1   1.75
# 9   4.5 -1   0.75
# 10  5.0 -3   0.75

This seems to work for the sample dataset.
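
For readers without a C++ toolchain, the same two-pass logic can be sketched in plain R. This is a line-by-line port written for illustration, not code from the answer; the name length_time_r is made up, and it will be far slower than the Rcpp version on large inputs.

```r
# Plain R port of the two-pass idea in length_time() (hypothetical helper).
length_time_r <- function(time, v) {
  n <- length(time)
  ret <- numeric(n)
  start <- 0                      # assumes the series starts at Time = 0
  last_sign <- sign(v[1])

  # Pass 1: write each regime's duration at its last observation.
  for (i in seq_len(n)) {
    s <- sign(v[i])
    if (s == 0) {                 # observed zero ends the current regime
      if (i > 1) {
        ret[i - 1] <- time[i] - start
        start <- time[i]
      }
    } else if (s * last_sign < 0) {  # sign flip with no zero in between
      mid <- (time[i] + time[i - 1]) / 2
      ret[i - 1] <- mid - start
      start <- mid
    }
    last_sign <- s
  }
  ret[n] <- time[n] - start

  # Pass 2: walk backwards, copying each duration over its regime
  # and forcing 0 wherever v is 0.
  tmp <- ret[n]
  for (i in n:1) {
    if (v[i] == 0) {
      ret[i] <- 0
    } else if (ret[i] == 0) {
      ret[i] <- tmp
    } else {
      tmp <- ret[i]
    }
  }
  ret
}

dat <- data.frame(Time = seq(0.5, 5, 0.5),
                  V = c(-2, -1, 0, 2, 0, 1, 2, 1, -1, -3))
len_r <- length_time_r(dat$Time, dat$V)
len_r  # 1.50 1.50 0.00 1.00 0.00 1.75 1.75 1.75 0.75 0.75
```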

Part 2: Speed test

library(data.table)
library(microbenchmark)
n <- 10000
set.seed(1235278)
dt <- data.table(time = seq(from = 0.5, by = 0.5, length.out = n),
                 v = cumsum(round(rnorm(n, sd = 1))))

dt[, chg := v >= 0 & shift(v, 1, fill = 0) <= 0]
plot(dt$time, dt$v, type = "l")
abline(h = 0)
for (i in dt[chg == T, time]) abline(v = i, lty = 2, col = "red")

This produces a dataset with 985 observations (crossings).

Testing the speed with microbenchmark

microbenchmark(dt[, length := length_time(time, v)])
# Unit: milliseconds
# expr      min     lq     mean   median       uq      max neval
# dt[, `:=`(length, length_time(time, v))] 2.625714 2.7184 3.054021 2.817353 3.077489 5.235689   100

Computing 500,000 observations takes about 3 milliseconds.

Does this help you?

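【Comments】:

Wow!!! Incredible Rcpp implementation. I also spent an embarrassing amount of time on my solution. The repeated-zeros case gave me the most trouble. Also, when I compared your solution, I simply wrapped the call to length_time in a function, like this: David <- function(df) {df$Length <- length_time(df$Time, df$V); df}. I hope that is OK.
Works great. It might lose a little compared with data.table, but that should be negligible in this case.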

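【Solution 3】:

First, find the indices of the "Time" values that need interpolating: consecutive "V" values that switch between positive and negative with no zero in between; their abs(diff(sign(V))) equals 2.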

id <- which(abs(c(0, diff(sign(dat$V)))) == 2)

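Add rows to the original data holding the average "Time" between the relevant indices and a corresponding "V" value of zero. Also add "V" = 0 rows at "Time" = 0 and at the last time step (per the assumptions @Gregor mentioned). Order by "Time".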

d2 <- rbind(dat,
            data.frame(Time = (dat$Time[id] + dat$Time[id - 1])/2, V = 0),
            data.frame(Time = c(0, max(dat$Time)), V = c(0, 0))
            )
d2 <- d2[order(d2$Time), ]

Calculate the time differences between the time steps where "V" is zero, and replicate them using a "zero group index".

d2$Length <- diff(d2$Time[d2$V == 0])[cumsum(d2$V == 0)]

Add the values to the original data:

merge(dat, d2)

#    Time  V Length
# 1   0.5 -2   1.50
# 2   1.0 -1   1.50
# 3   1.5  0   1.00
# 4   2.0  2   1.00
# 5   2.5  0   1.75
# 6   3.0  1   1.75
# 7   3.5  2   1.75
# 8   4.0  1   1.75
# 9   4.5 -1   0.75
# 10  5.0 -3   0.75

Set "Length" to 0 where V == 0.
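
Putting the steps above together with that final assignment, a minimal end-to-end sketch for the sample data might look like this. The result name res and the explicit re-ordering by Time after merge() are additions for this example, not part of the answer.

```r
dat <- data.frame(Time = seq(0.5, 5, 0.5),
                  V = c(-2, -1, 0, 2, 0, 1, 2, 1, -1, -3))

# Rows where the sign flips without passing through an observed zero.
id <- which(abs(c(0, diff(sign(dat$V)))) == 2)

# Add interpolated zeros plus zeros at Time = 0 and at the last time step.
d2 <- rbind(dat,
            data.frame(Time = (dat$Time[id] + dat$Time[id - 1]) / 2, V = 0),
            data.frame(Time = c(0, max(dat$Time)), V = c(0, 0)))
d2 <- d2[order(d2$Time), ]

# Duration between consecutive zeros, replicated over each zero group.
d2$Length <- diff(d2$Time[d2$V == 0])[cumsum(d2$V == 0)]

# Keep only the original rows, then apply the final step from the text.
res <- merge(dat, d2)
res <- res[order(res$Time), ]     # make the row order explicit
res$Length[res$V == 0] <- 0
res$Length  # 1.50 1.50 0.00 1.00 0.00 1.75 1.75 1.75 0.75 0.75
```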

【Comments】:

This works flawlessly on both the sample dataset and my real dataset of 400,000 points. It runs in under 4 seconds.

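【Solution 4】:

This works, at least for your test case. It should be quite efficient. It makes some assumptions; I will try to point out the big ones.

First we extract the vectors and stick a 0 at the beginning. We also set the last V to 0. The calculation is based on the time differences between 0s, so we need to start and end on 0s. Your example seems to implicitly assume V = 0 at Time = 0, hence the initial 0, and it stops abruptly at the maximum time, so we set V = 0 there as well: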

Time = c(0, dat$Time)
V = c(0, dat$V)
V[length(V)] = 0

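To fill in the skipped 0s, we use approx to linearly interpolate sign(V). This also assumes that your sampling frequency is regular, so we can pick up all the missing 0s by doubling the frequency.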

ap = approx(Time, sign(V), xout = seq(0, max(Time), by = 0.25))

The values we want to fill in are the durations between 0s, both observed and approximated. In the right order, they are:

dur = diff(ap$x[ap$y == 0])
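
To make the interpolation step concrete, here is a self-contained check on the sample data (zero_times is a name introduced only for this sketch). Halfway between a +1 and a -1, the interpolated sign is exactly 0, which is how the skipped crossing at 4.25 is recovered.

```r
dat <- data.frame(Time = seq(0.5, 5, 0.5),
                  V = c(-2, -1, 0, 2, 0, 1, 2, 1, -1, -3))

Time <- c(0, dat$Time)
V <- c(0, dat$V)
V[length(V)] <- 0

# Interpolate sign(V) on a grid twice as fine as the sampling interval.
ap <- approx(Time, sign(V), xout = seq(0, max(Time), by = 0.25))

zero_times <- ap$x[ap$y == 0]
zero_times        # 0.00 1.50 2.50 4.25 5.00
diff(zero_times)  # 1.50 1.00 1.75 0.75
```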

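Finally, we need the indices into the original data at which to fill in the durations. This is the hackiest part of this answer, but it seems to work. Maybe someone will suggest a nice simplification.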

# first use rleid to get the sign groupings
group = data.table::rleid(sign(dat$V))

# then we need to set the groups corresponding to 0 values to 0
# and reduce any group numbers following 0s correspondingly
# lastly we add 1 to everything so that we can stick 0 at the
# front of our durations and assign those to the 0 V values
ind = (group - cumsum(dat$V == 0)) * (dat$V != 0) + 1

# fill it in
dat$Length = c(0, dur)[ind]
dat
#    Time  V Length
# 1   0.5 -2   1.50
# 2   1.0 -1   1.50
# 3   1.5  0   0.00
# 4   2.0  2   1.00
# 5   2.5  0   0.00
# 6   3.0  1   1.75
# 7   3.5  2   1.75
# 8   4.0  1   1.75
# 9   4.5 -1   0.75
# 10  5.0 -3   0.75

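【Comments】:

All of your assumptions are correct. Your solution works for the sample dataset, but for more complex datasets it seems to go wrong with approx: see data.frame(Time = seq(0.001, 0.01, 0.001), V = c(-0.002, -0.002, -0.002, 0, -0.001, -0.001, -0.002, -0.001, -0.001, -0.001)) for an example where approx does not account for a zero crossing.
But there is no zero crossing in that example, just a brief touch of zero and then back to negative. If you adjust by = 0.25 to by = (Time[2] - Time[1]) / 2, it will generalize to any regular frequency, and it still looks right, at least for this comment's example...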
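
A sketch of that suggestion, wrapping the steps of this answer into one function with the grid step derived from the data. The function name gregor_length is hypothetical, and rle() is used here as a base R stand-in for data.table::rleid() so the example has no package dependency.

```r
gregor_length <- function(dat) {
  Time <- c(0, dat$Time)
  V <- c(0, dat$V)
  V[length(V)] <- 0

  # Half the first interval, as suggested in the comment above; assumes
  # regular sampling whose first observation is one step after Time = 0.
  step <- (Time[2] - Time[1]) / 2
  ap <- approx(Time, sign(V), xout = seq(0, max(Time), by = step))
  dur <- diff(ap$x[ap$y == 0])

  # Run id for every row (base R equivalent of rleid(sign(dat$V))).
  r <- rle(sign(dat$V))
  group <- rep(seq_along(r$lengths), r$lengths)

  ind <- (group - cumsum(dat$V == 0)) * (dat$V != 0) + 1
  dat$Length <- c(0, dur)[ind]
  dat
}

dat <- data.frame(Time = seq(0.5, 5, 0.5),
                  V = c(-2, -1, 0, 2, 0, 1, 2, 1, -1, -3))
out <- gregor_length(dat)
out$Length  # 1.50 1.50 0.00 1.00 0.00 1.75 1.75 1.75 0.75 0.75
```

One caveat worth keeping in mind: the test ap$y == 0 is an exact comparison, so it relies on the grid points coinciding exactly with the observed time stamps. That holds for steps such as 0.5, but with steps such as 0.001 floating-point rounding may make it fragile.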