Grouping rows on the basis of row differences in R [duplicate]

【Question title】: Grouping rows on the basis of row differences in R [duplicate]
【Posted】: 2016-03-15 22:35:18
【Question description】:

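I have a set of animal locations with varying sampling intervals. What I want to do is group the sequences of rows whose sampling interval meets a certain criterion (e.g. is below a certain value). Let me illustrate with some dummy data: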

start <- Sys.time()
timediff <- c(rep(5,3),20,rep(5,2))
timediff <- cumsum(timediff)

# Set up a dataframe with a couple of time values
df <- data.frame(TimeDate = start + timediff)

# Calculate the time differences between the rows
df$TimeDiff <- c(as.integer(tail(df$TimeDate,-1) - head(df$TimeDate,-1)),NA)

# Define a criterion in order to form groups
df$TimeDiffSmall <- df$TimeDiff <= 5

             TimeDate TimeDiff TimeDiffSmall
1 2016-03-15 23:11:49        5          TRUE
2 2016-03-15 23:11:54        5          TRUE
3 2016-03-15 23:11:59       20         FALSE
4 2016-03-15 23:12:19        5          TRUE
5 2016-03-15 23:12:24        5          TRUE
6 2016-03-15 23:12:29       NA            NA

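In this dummy data, rows 1:3 belong to one group because the time differences between them are <= 5 seconds. Rows 4:6 form a second group, separated from the first by the 20-second gap (where TimeDiffSmall equals FALSE).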

Combining information from multiple SO answers (e.g. part 1), I created a function that solves this problem.

number.groups <- function(input){
  # part 1: numbering successive TRUE values (NA is treated as FALSE)
  input[is.na(input)] <- FALSE
  x.gr <- ifelse(x <- input == TRUE, cumsum(c(head(x, 1), tail(x, -1) - head(x, -1) == 1)),NA)
  # part 2: including last value into group
  # (the row right after each run of TRUE inherits that run's group number)
  items <- which(!is.na(x.gr))
  items.plus <- c(1,items+1)
  sel <- !(items.plus %in% items)
  sel.idx <- items.plus[sel]
  x.gr[sel.idx] <- x.gr[sel.idx-1]
  return(x.gr)
}

# Apply the function to create groups
df$Group <- number.groups(df$TimeDiffSmall)

             TimeDate TimeDiff TimeDiffSmall Group
1 2016-03-15 23:11:49        5          TRUE     1
2 2016-03-15 23:11:54        5          TRUE     1
3 2016-03-15 23:11:59       20         FALSE     1
4 2016-03-15 23:12:19        5          TRUE     2
5 2016-03-15 23:12:24        5          TRUE     2
6 2016-03-15 23:12:29       NA            NA     2

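This function does solve my problem. The thing is, it seems like a crazy, novice way of doing it. Is there a function that would solve my problem more professionally?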

【Question comments】:

Does cumsum(c(TRUE, diff(df$TimeDate) > 5)) do this for your larger example?

Tags: r grouping

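【Solution 1】:

Like @thelatemail, I would use the following to get the group IDs. It works because cumsum() increments the group count each time it reaches an element that is preceded by a time gap greater than 5 seconds.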

df$Group <- cumsum(c(TRUE, diff(df$TimeDate) > 5))
df$Group
# [1] 1 1 1 2 2 2
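
To make the mechanics easier to follow, here is a step-by-step sketch of the same idea. It assumes fixed, whole-second timestamps instead of Sys.time() and converts the gaps to seconds explicitly; the names times, gaps and new_group are illustrative and do not appear in the original post.

```r
# Assumption: a fixed start time, so the example is reproducible
times <- as.POSIXct("2016-03-15 23:11:49", tz = "UTC") +
  cumsum(c(rep(5, 3), 20, rep(5, 2)))

# Gap to the previous row, in seconds (explicit units avoid difftime
# silently switching to minutes or hours for larger gaps)
gaps <- as.numeric(diff(times), units = "secs")   # 5 5 20 5 5

# The first row always opens a group; afterwards a gap > 5 s opens a new one
new_group <- c(TRUE, gaps > 5)   # TRUE FALSE FALSE TRUE FALSE FALSE

# The running count of "group starts" is the group ID
group <- cumsum(new_group)       # 1 1 1 2 2 2
```

Each TRUE in new_group marks the first row of a group, so the cumulative sum stays constant within a group and steps up by one at every break.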

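【Comments】:

Or cumsum(c(FALSE,!(diff(df$TimeDate) <= 5))), if you want to keep the selection criterion framed as it was originally rather than as its inverse.
@thelatemail That is actually what I started with. When I saw that I needed to add one to the result (or change the initial FALSE to TRUE) to get the group numbers to start at 1, I flipped the whole thing around into what seemed a simpler incantation.
Fair enough - I guess it depends on whether the selection criterion is complex. In that case, negating it is easier than trying to invert it by hand and making sure every & and | is correct.
@thelatemail You are right. Looking back at the times I have used this in the past (e.g. here), it looks like that is the form I have used more often, and I think you have identified why.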
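
The point about negation can be illustrated with a compound criterion. The sketch below is hypothetical: the dist vector and its 10-unit threshold are invented for the example and are not part of the original data.

```r
# Hypothetical inputs: gaps in seconds and a second measure per gap
gaps <- c(5, 5, 20, 5, 5)
dist <- c(2, 15, 3, 4, 1)

# "Stay in the same group" criterion, written the way one thinks about it
ok <- gaps <= 5 & dist <= 10

# Negating the whole criterion marks the breaks
group_a <- cumsum(c(FALSE, !ok)) + 1   # starts at 0, so add 1
group_b <- cumsum(c(TRUE, !ok))        # same IDs, starting at 1 directly

# Inverting by hand needs De Morgan: & becomes |, <= becomes >
by_hand <- gaps > 5 | dist > 10
```

Both variants give the groups 1 1 2 3 3 3 here, and !ok matches the hand-inverted by_hand; the negated form simply leaves less room for a mistake as the criterion grows.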
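Thanks for your answer, and sorry about the duplicate post. Now that the discussion is under way, let me raise one small issue with your answer (although I am blown away by its simple beauty): if the time lag between my rows is > 5 seconds, I would want those rows not to belong to any group (NA). I have slightly updated my dummy data to reflect this. With your function, rows 4 and 5 now each belong to their own group (2 and 3). Is there a similarly elegant way to solve this?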
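
One possible way to handle this follow-up, offered as a sketch rather than as the answerer's method: compute the cumsum() groups as before, set every group that contains only one row to NA, then renumber the remaining groups. The updated dummy data is not shown in the thread, so the timestamps below are an assumption chosen so that rows 4 and 5 end up isolated.

```r
# Assumed data: rows 4 and 5 are more than 5 s away from both neighbours
secs  <- c(0, 5, 10, 30, 50, 70, 75, 80)
times <- as.POSIXct("2016-03-15 23:11:49", tz = "UTC") + secs

gaps <- as.numeric(diff(times), units = "secs")
grp  <- cumsum(c(TRUE, gaps > 5))          # 1 1 1 2 3 4 4 4

# Number of rows in each row's group
size <- ave(grp, grp, FUN = length)        # 3 3 3 1 1 3 3 3

# Single-row groups belong to no group
grp[size == 1] <- NA

# Renumber the surviving groups consecutively; NA stays NA
grp <- match(grp, unique(grp[!is.na(grp)]))   # 1 1 1 NA NA 2 2 2
```

This treats any row with no neighbour within 5 seconds as ungrouped. Whether that is the intended rule depends on the real data, for example on whether a minimum group size larger than one is wanted.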