R - Filter Data from a data frame: Answers

【Question title】: R - Filter Data from a data frame
【Posted】: 2012-02-12 21:36:56
【Question】:

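I am new to R and really not sure how to filter data in a data frame.

I have created a data frame with two columns: a monthly date and the corresponding temperature. It has 324 rows.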

> head(Nino3.4_1974_2000)
  Month_common               Nino3.4_degree_1974_2000_plain
1   1974-01-15                       -1.93025
2   1974-02-15                       -1.73535
3   1974-03-15                       -1.20040
4   1974-04-15                       -1.00390
5   1974-05-15                       -0.62550
6   1974-06-15                       -0.36915

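The filtering rule is to select temperatures greater than or equal to 0.5 degrees. In addition, the condition must hold for at least 5 consecutive months.

I have already removed the rows with temperatures below 0.5 degrees (see below).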

el_nino <- Nino3.4_1974_2000[which(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain >= 0.5), ]

> head(el_nino)
   Month_common               Nino3.4_degree_1974_2000_plain
32   1976-08-15                      0.5192000
33   1976-09-15                      0.8740000
34   1976-10-15                      0.8864501
35   1976-11-15                      0.8229501
36   1976-12-15                      0.7336500
37   1977-01-15                      0.9276500

However, I still need to extract the runs of at least 5 consecutive months. I hope someone can help me.

【Comments】:

Is the difference between consecutive rows of your Month_common always one month?
Yes, the interval is one month.

Tags: r filter dataframe

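【Solution 1】:

Here is one way that takes advantage of the fact that the rows are always exactly one month apart. The problem then reduces to finding runs of 5 or more consecutive rows with temperatures >= 0.5 degrees: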

# Some sample data
d <- data.frame(Month=1:20, Temp=c(rep(1,6),0,rep(1,4),0,rep(1,5),0, rep(1,2)))
d

# Use rle to find runs of temps >= 0.5 degrees
x <- rle(d$Temp >= 0.5)

# Then find the last row in each run of 5 or more
y <- x$lengths>=5 # BUG HERE: See update below!
lastRow <- cumsum(x$lengths)[y]

# Finally, deduce the first row and make a result matrix
firstRow <- lastRow - x$lengths[y] + 1L
res <- cbind(firstRow, lastRow) 
res
#     firstRow lastRow
#[1,]        1       6
#[2,]       13      17

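Update: I had a bug too, in that the code also detected runs of 5 or more values that are less than 0.5. Here is the updated code (and test data):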

d <- data.frame(Month=1:20, Temp=c(rep(0,6),1,0,rep(1,4),0,rep(1,5),0, 1))
x <- rle(d$Temp >= 0.5)
y <- x$lengths>=5 & x$values
lastRow <- cumsum(x$lengths)[y]
firstRow <- lastRow - x$lengths[y] + 1L
res <- cbind(firstRow, lastRow) 
res
#     firstRow lastRow
#[1,]       14      18
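
To go from the first/last row numbers to the actual rows of each qualifying run, one option is to expand the per-run flag back to one value per row with rep(). This is a minimal sketch on the same test data as the update above; the names inRun and el_nino_runs are introduced here for illustration only.

```r
# Same test data and run detection as in the updated answer
d <- data.frame(Month = 1:20,
                Temp = c(rep(0, 6), 1, 0, rep(1, 4), 0, rep(1, 5), 0, 1))
x <- rle(d$Temp >= 0.5)
y <- x$lengths >= 5 & x$values

# Repeat each run's flag once per row in that run, giving one
# TRUE/FALSE per row of d (TRUE only inside a qualifying run)
inRun <- rep(y, x$lengths)

# Keep only the rows that belong to a run of 5 or more warm months
el_nino_runs <- d[inRun, ]
el_nino_runs$Month
# [1] 14 15 16 17 18
```

If no run qualifies, inRun is all FALSE and the result is a data frame with zero rows, so no special case is needed.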

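【Comments】:

I don't know, it doesn't work properly. Especially when the data contains values less than 0.5.
@YuDeng - Oops, small bug. Updated the answer.

【Solution 2】:

If you can always rely on the one-month interval, then let's set the time information aside for now: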

temps <- Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain

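Since consecutive temperatures in that vector are always one month apart, we just need to look for runs where temps[i] >= 0.5, and each run must be at least 5 long.

If we do the following: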
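
Before dropping the dates, the one-month spacing can be checked rather than assumed. The sketch below uses a made-up dates vector as a stand-in for Nino3.4_1974_2000$Month_common and assumes that column is of class Date (or has been converted with as.Date); monthIndex is a name introduced here.

```r
# Hypothetical stand-in for Nino3.4_1974_2000$Month_common
dates <- seq(as.Date("1974-01-15"), by = "month", length.out = 12)

# Number each month as year * 12 + month, so that consecutive
# calendar months differ by exactly 1, even across a year boundary
monthIndex <- as.integer(format(dates, "%Y")) * 12L +
  as.integer(format(dates, "%m"))

# TRUE when no month is missing or repeated
regular <- all(diff(monthIndex) == 1L)
regular

# Dropping one month from the middle makes the check fail
gapIndex <- monthIndex[-5]
all(diff(gapIndex) == 1L)
```

If the check fails, a run of rows no longer corresponds to a run of consecutive months, and the row-based approach would need the gaps filled in first.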

ofinterest <- temps >= 0.5

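we will have a vector ofinterest with values TRUE FALSE FALSE TRUE TRUE ... etc., which is TRUE when temps[i] is >= 0.5 and FALSE otherwise.

To rephrase your problem, we just need to look for at least five TRUEs in a row.

To do this, we can use the function rle. ?rle gives: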

> ?rle
Description
     Compute the lengths and values of runs of equal values in a vector
     - or the reverse operation.
Value:
     ‘rle()’ returns an object of class ‘"rle"’ which is a list with
     components:    
 lengths: an integer vector containing the length of each run.
  values: a vector of the same length as ‘lengths’ with the
          corresponding values.
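
As a quick illustration of what rle returns, here is a small hand-made logical vector (v and r are names made up for this example):

```r
v <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)
r <- rle(v)

r$lengths   # 2 1 3 2
r$values    # TRUE FALSE TRUE FALSE

# inverse.rle() rebuilds the original vector from the encoding
identical(inverse.rle(r), v)   # TRUE
```

Each position in lengths pairs with the same position in values, so the third run here is three TRUEs in a row.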

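So we use rle to count up all the streaks of consecutive TRUEs and consecutive FALSEs, and look for streaks of at least 5 TRUEs in a row.

I'll just make up some data to demonstrate: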

# for you, temps <- Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain
temps <- runif(1000) 

# make a vector that is TRUE when temperature is >= 0.5 and FALSE otherwise
ofinterest <- temps >= 0.5

# count up the runs of TRUEs and FALSEs using rle:
runs <- rle(ofinterest) 

# we need to find points where runs$lengths >= 5 (i.e. at least 5 in a row),
# AND runs$values is TRUE (so at least 5 'TRUE's in a row).
streakIs <- which(runs$lengths>=5 & runs$values)

# these are all the el_nino occurrences.
# We need to convert `streakIs` into indices into our original `temps` vector.
# To do this we add up all the `runs$lengths` before `streakIs[i]` and add 1;
#  that gives the index into `temps` where the streak starts.
# that is:
# startMonths <- c()
# for ( n in streakIs ) {
#     startMonths <- c(startMonths, sum(runs$lengths[seq_len(n-1)]) + 1)
# }
#
# seq_len(n-1) is empty when n is 1, so a streak that starts at row 1
# gets start index 1.
# However, since this is R we can vectorise with sapply:
startMonths <- sapply(streakIs, function(n) sum(runs$lengths[seq_len(n-1)])+1)

Now, if you type Nino3.4_1974_2000$Month_common[startMonths], you will get all the months in which an El Niño started.

Boiled down to a few lines:

runs <- rle(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain>=0.5) 
streakIs <- which(runs$lengths>=5 & runs$values)
startMonths <- sapply(streakIs, function(n) sum(runs$lengths[seq_len(n-1)])+1)
Nino3.4_1974_2000$Month_common[startMonths]
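
The start of each streak can also be computed without sapply, using cumsum as in Solution 1, which yields the end of each streak at the same time. This is a sketch on made-up data: the data frame nino and its columns Month and Temp are placeholders, not the asker's data.

```r
# Made-up data: 12 months with one warm spell of 6 months (rows 3 to 8)
nino <- data.frame(
  Month = seq(as.Date("1990-01-15"), by = "month", length.out = 12),
  Temp  = c(0.1, 0.2, 0.6, 0.7, 0.9, 1.1, 0.8, 0.5, 0.3, 0.6, 0.7, 0.2)
)

runs <- rle(nino$Temp >= 0.5)
streakIs <- which(runs$lengths >= 5 & runs$values)

# cumsum() gives the last row of every run; subtracting the run length
# and adding 1 gives its first row, including for a run starting at row 1
ends <- cumsum(runs$lengths)[streakIs]
starts <- ends - runs$lengths[streakIs] + 1L

# One row per streak: first month, last month, and length in months
streaks <- data.frame(start = nino$Month[starts],
                      end = nino$Month[ends],
                      months = runs$lengths[streakIs])
streaks
```

On this data the two-month warm spell in rows 10 and 11 is dropped because it is shorter than 5 months, leaving a single streak from 1990-03-15 to 1990-08-15.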

【Comments】: