Function to calculate values comparing sequential time periods: answers

【Question title】: Function to calculate values comparing sequential time periods
【Posted】: 2015-08-07 04:35:01
【Question】:

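I have been unable to find a solution to my query on Stack Overflow. This post is similar, but my dataset is slightly (and importantly) different, in that I have multiple measures of "time" within my grouping variable.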

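I have observations of organisms at different sites over time. The sites are further aggregated into larger areas, so I eventually want a function that I can call in ddply to summarize the dataset for each time period within each geographic area. However, I am having trouble getting the function I need.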

Question

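How can I loop through the time periods and compare each one with the previous period, calculating the intersection (i.e. the number of "sites" occurring in both time periods) and the sum of the number of sites occurring in each of the two periods?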

Toy dataset:

time = c(1,1,1,1,2,2,2,3,3,3,3,3)
site = c("A","B","C","D","A","B","C","A","B","C","D","E")
df <- as.data.frame(cbind(time,site))
df$time = as.numeric(df$time)

My function

dist2 <- function(df){
  for(i in unique(df$time))
  {
    intersection <- length(which(df[df$time==i,"site"] %in% df[df$time==i-1,"site"]))
    both <- length(unique(df[df$time==i,"site"])) + length(unique(df[df$time==i-1,"site"]))
  }
  return(as.data.frame(cbind(time,intersection,both)))
  }

dist2(df)

What I get:

dist2(df)
   time intersection both
1     1            3    8
2     1            3    8
3     1            3    8
4     1            3    8
5     2            3    8
6     2            3    8
7     2            3    8
8     3            3    8
9     3            3    8
10    3            3    8
11    3            3    8
12    3            3    8

What I expected (hoped!) to achieve:

time intersection both
1    1           NA    4
2    2            3    7
3    3            3    8
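
As a hand check of the expected row for time 2, the two quantities can be computed directly for a single pair of periods. This is only an illustrative sketch: the names `sites_t1` and `sites_t2` are made up here and simply retype the toy data for times 1 and 2.

```r
# Sites observed in the toy data at time 1 and at time 2
sites_t1 <- c("A", "B", "C", "D")
sites_t2 <- c("A", "B", "C")

# Number of sites present in both periods
intersection <- length(intersect(sites_t1, sites_t2))

# Number of distinct sites in each period, added together
both <- length(unique(sites_t1)) + length(unique(sites_t2))

c(intersection = intersection, both = both)
```

This gives 3 and 7, matching the second row of the table above.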

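Once I have a working function, I would like to use it with ddply on the whole dataset to calculate these values for each area.

Many thanks for any pointers, tips, and advice!

I am running: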

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)


Tags: r function time plyr

【Solution 1】:

Here is my memory-intensive suggestion:

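library(plyr)
# Stack the data on a copy of itself shifted forward by one period, so that each
# time group holds its own sites plus the sites of the previous period
stacked <- rbind(df, within(df, {time = time + 1}))
res <- ddply(stacked, ~time, summarize, intersect = sum(duplicated(site)), both = length(site))
res <- res[-nrow(res), ]  # drop the extra period created by the shift
res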

Output:

  time intersect both
1    1         0    4
2    2         3    7
3    3         3    8

Change the 0 to NA and you're done.
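
One possible way to make that substitution, as a sketch: the earliest period has no predecessor, so its count can be set to NA directly. The `res` data frame below is retyped from the output above so that the snippet runs on its own.

```r
# Result table as printed above (retyped so the snippet is self-contained)
res <- data.frame(time = 1:3, intersect = c(0, 3, 3), both = c(4, 7, 8))

# The earliest period has nothing to compare against, so mark it as missing
res$intersect[res$time == min(res$time)] <- NA
res
```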

【Solution 2】:

You can use the table function to determine the number of times each site appears in each time period:

(tab <- table(df$time, df$site))
#     A B C D E
#   1 1 1 1 1 0
#   2 1 1 1 0 0
#   3 1 1 1 1 1

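With some simple manipulation, you can build a table of the same size containing the number of times each site appeared in the previous time period: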

(prev.tab <- head(rbind(NA, tab), -1))
#    A  B  C  D  E
#   NA NA NA NA NA
# 1  1  1  1  1  0
# 2  1  1  1  0  0

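Determining the number of sites in common with the previous period, or the number of unique sites in the previous period plus the number of unique sites in the current period, is now a simple vectorized operation: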

data.frame(time=unique(df$time),
           intersection=rowSums(tab * (prev.tab >= 1)),
           both=rowSums(tab >= 1) + rowSums(prev.tab >= 1, na.rm=TRUE))
#   time intersection both
# 1    1           NA    4
# 2    2            3    7
# 3    3            3    8
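
One caveat worth illustrating: `tab` holds observation counts, so when a site is recorded more than once in a period, `tab * (prev.tab >= 1)` counts observations rather than distinct sites. The sketch below uses a small made-up data frame (`obs` is hypothetical, not part of the question's data) to show the difference, and a variant that counts each shared site once.

```r
# Hypothetical data in which site "A" is recorded twice at time 2
obs <- data.frame(time = c(1, 1, 2, 2, 2),
                  site = c("A", "B", "A", "A", "C"))

tab <- table(obs$time, obs$site)       # observations per time period and site
prev.tab <- head(rbind(NA, tab), -1)   # the same table shifted down one period

# Counts observations whose site was also seen in the previous period
obs_count <- rowSums(tab * (prev.tab >= 1))

# Counts distinct sites seen in both periods
site_count <- rowSums((tab >= 1) * (prev.tab >= 1))

obs_count
site_count
```

For time 2 the first version gives 2 (both "A" records) and the second gives 1 (site "A" once). On the question's toy data every site occurs at most once per period, so the two agree there.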

Because this doesn't involve making a bunch of intersection or unique calls on pairs of time values, it should be more efficient than the looping solution:

# Slightly larger dataset with 100000 observations
set.seed(144)
df <- data.frame(time=sample(1:50, 100000, replace=TRUE),
                 site=sample(letters, 100000, replace=TRUE))
df <- df[order(df$time),]
josilber <- function(df) {
  tab <- table(df$time, df$site)
  prev.tab <- head(rbind(NA, tab), -1)
  data.frame(time=unique(df$time),
             intersection=rowSums(tab * (prev.tab >= 1)),
             both=rowSums(tab >= 1) + rowSums(prev.tab >= 1, na.rm=TRUE))
}
# dist2 from @akrun's solution
library(microbenchmark)
microbenchmark(josilber(df), dist2(df))
# Unit: milliseconds
#          expr       min        lq      mean    median         uq       max neval
#  josilber(df)  28.74353  32.78146  52.73928  40.89203   62.04933  237.7774   100
#     dist2(df) 540.78422 574.28319 829.04174 825.99418 1018.76561 1607.9460   100

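【Discussion】:

Nice use of table, really fast code. I ran the benchmark above on my solution and it was more than 10 times slower than yours, mainly because of rbind/make.unique.

【Solution 3】:

You can modify the function so that it stores one result per time period, instead of overwriting a single value on every iteration: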

dist2 <- function(df){
  Un1 <- unique(df$time)
  # Preallocate one result slot per time period
  intersection <- numeric(length(Un1))
  both <- numeric(length(Un1))

  for(i in seq_along(Un1)){
    # For i == 1, Un1[i-1] is empty, so the previous period contributes no sites
    intersection[i] <- length(which(df[df$time==Un1[i],"site"] %in%
                                      df[df$time==Un1[i-1],"site"]))
    both[i] <- length(unique(df[df$time==Un1[i],"site"])) +
               length(unique(df[df$time==Un1[i-1],"site"]))
  }
  return(data.frame(time=Un1, intersection, both))
}

dist2(df)
#    time intersection both
#1    1            0    4
#2    2            3    7
#3    3            3    8
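
The question also asks for these values per area. A minimal base-R sketch of that step follows, under stated assumptions: the `region` column, the `obs` data, and the `period_summary` helper are all hypothetical names introduced for illustration, and the helper reports NA rather than 0 for the first period. It uses `split` and `lapply` in place of ddply so that it runs without extra packages.

```r
# Hypothetical data: observations for two made-up regions
obs <- data.frame(
  region = rep(c("north", "south"), times = c(7, 5)),
  time   = c(1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2),
  site   = c("A", "B", "C", "D", "A", "B", "C", "E", "F", "E", "F", "G"),
  stringsAsFactors = FALSE
)

# Per-period summary for one region; the first period has no predecessor,
# so its intersection is left as NA
period_summary <- function(d) {
  times <- sort(unique(d$time))
  out <- data.frame(time = times, intersection = NA_integer_, both = NA_integer_)
  for (i in seq_along(times)) {
    cur  <- unique(d$site[d$time == times[i]])
    prev <- if (i > 1) unique(d$site[d$time == times[i - 1]]) else character(0)
    if (i > 1) out$intersection[i] <- length(intersect(cur, prev))
    out$both[i] <- length(cur) + length(prev)
  }
  out
}

# Split by region, summarise each piece, and stack the results
by_region <- do.call(rbind, lapply(split(obs, obs$region), function(d) {
  cbind(region = d$region[1], period_summary(d))
}))
rownames(by_region) <- NULL
by_region
```

With plyr loaded, `ddply(obs, ~region, period_summary)` should produce an equivalent table.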
