Answers: Find all subsequences with specific length in sequence of numbers in R

【Question title】: Find all subsequences with specific length in sequence of numbers in R
【Posted】: 2019-04-04 23:16:24
【Question】:

I want to find all subsequences of (minimum) length n within a sequence. Let's assume I have this sequence

sequence <- c(1,2,3,2,5,3,2,6,7,9)

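I want to find the increasing subsequences with a minimum length of 3. The output should be a data frame with the start and end position of each subsequence found.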

df =data.frame(c(1,7),c(3,10))
colnames(df) <- c("start", "end")

Can someone give me a hint on how to solve my problem?

Thanks in advance!

【Question comments】:

Tags: r sequence

【Solution 1】:

One way using only base R

n <- 3

do.call(rbind, sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)), 
        function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)])))

#  start end
#1     1   3
#4     7  10

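We split the indices of sequence into groups of consecutive increasing values and, if the length of a group is greater than or equal to n, return the start and end index of that group.

To understand it, let's break it down step by step.

Using diff we can find the difference between consecutive elements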

diff(sequence)
#[1]  1  1 -1  3 -2 -1  4  1  2

We check which of these differences do not continue an increasing subsequence

diff(sequence) < 1
#[1] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE

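and take the cumulative sum over them (with a 0 prepended so the result has one entry per element) to create groups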

cumsum(c(0, diff(sequence)) < 1)
#[1] 1 1 1 2 2 3 4 4 4 4
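The grouping step can be spelled out with named intermediates. This is only a small sketch of the same idea; the variable names `breaks` and `group_id` are made up for illustration. Prepending 0 keeps the logical vector the same length as `sequence` and, because 0 < 1, the first element always opens group 1.

```r
sequence <- c(1, 2, 3, 2, 5, 3, 2, 6, 7, 9)

# TRUE wherever the value did not increase, i.e. where a new group starts.
# The prepended 0 stands in for the (missing) difference of the first element.
breaks <- c(0, diff(sequence)) < 1

# Every TRUE bumps the counter, so each element gets the id of its group.
group_id <- cumsum(breaks)

# Number of elements per group; groups with at least n elements qualify.
group_sizes <- as.vector(table(group_id))
```

Here `group_sizes` has one entry per group, so checking it against n already tells which groups will survive the later filtering step.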

Based on these groups, we split the indices from 1:length(sequence)

split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1))
#$`1`
#[1] 1 2 3

#$`2`
#[1] 4 5

#$`3`
#[1] 6

#$`4`
#[1]  7  8  9 10

Using sapply, we loop over this list and return the start and end index of each list element if its length is >= n (3 in this case)

sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)), 
       function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)]))

#$`1`
#start   end 
#    1     3 

#$`2`
#NULL

#$`3`
#NULL

#$`4`
#start   end 
#    7    10

Finally, rbind all of them together using do.call. NULL elements are ignored automatically.

do.call(rbind, sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)), 
       function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)])))

#  start end
#1     1   3
#4     7  10
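The steps above can be wrapped into a small function. This is a sketch rather than the answer's own code: the name `find_increasing` is hypothetical, it uses lapply instead of sapply (when every group qualifies, sapply simplifies its result to a matrix, which is no longer the list that do.call expects), and it breaks groups on `<= 0` instead of `< 1` so that fractional increases also count. For integer input the two conditions are equivalent.

```r
# Hypothetical helper: start/end positions of strictly increasing runs
# that contain at least n elements.
find_increasing <- function(x, n) {
  empty <- data.frame(start = integer(0), end = integer(0))
  if (length(x) == 0) return(empty)

  # Group consecutive indices; a new group starts wherever x does not increase.
  groups <- split(seq_along(x), cumsum(c(0, diff(x)) <= 0))

  # lapply always returns a list, so do.call(rbind, ...) works in every case.
  ranges <- lapply(groups, function(idx) {
    if (length(idx) >= n) c(start = idx[1], end = idx[length(idx)])
  })
  ranges <- Filter(Negate(is.null), ranges)
  if (length(ranges) == 0) return(empty)

  as.data.frame(do.call(rbind, ranges))
}

result <- find_increasing(c(1, 2, 3, 2, 5, 3, 2, 6, 7, 9), n = 3)
```

For the example sequence, `result` should contain the same two ranges as the output above (1 to 3 and 7 to 10), returned as a data frame as the question asked.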

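【Comments】:

Great answer, but a bit hard to understand. Can you tell me how to get the decreasing subsequences? I was not able to modify your code for decreasing subsequences.
@MatthiasHab1986 I have added some explanation to the code.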
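For the decreasing case asked about in the comment, one possible adaptation (a sketch, not something posted by the answerer) is to flip the break condition: a new group starts wherever the value does not decrease, that is, where the difference is >= 0. The prepended 0 still opens the first group because 0 >= 0.

```r
sequence <- c(1, 2, 3, 2, 5, 3, 2, 6, 7, 9)
n <- 3

# A new group starts wherever the value stays the same or goes up.
groups <- split(seq_along(sequence), cumsum(c(0, diff(sequence)) >= 0))

# Keep start/end of the groups with at least n elements (lapply keeps a list).
ranges <- lapply(groups, function(x) {
  if (length(x) >= n) c(start = x[1], end = x[length(x)])
})
decreasing <- do.call(rbind, ranges)
```

In this sequence the only strictly decreasing run of at least three elements is 5, 3, 2 at positions 5 to 7, so `decreasing` should be a one-row matrix with start 5 and end 7.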

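【Solution 2】:

Here is another solution using base R. I tried to comment it well, but it may still be hard to follow. It seems like you want guidance and to learn rather than an outright answer, so please follow up with questions if anything is unclear (or does not work for your actual application).

Also, for your data, I added a 12 on the end to make sure it returns the correct positions for runs of increases longer than n (3 in this case):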

# Data (I added 12 on the end)
sequence <- c(1,2,3,2,5,3,2,6,7,9, 12)

# Create indices for whether or not the numbers in the sequence increased
indices <- c(1, diff(sequence) >= 1)
indices
#[1] 1 1 1 0 1 0 0 1 1 1 1

Now that we have the indices, we need to get the start and end positions of the runs that repeat >= 3 times

# Finding increasing sequences of n length using rle
n <- 3
# n increasing values are linked by n - 1 increasing steps
n <- n - 1

# Examples 
rle(indices)$lengths
#[1] 3 1 1 2 4

rle(indices)$values
#[1] 1 0 1 0 1

# Finding repeated TRUE (1) in our indices vector
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
reps
#[1]  TRUE FALSE FALSE FALSE  TRUE

# Creating a vector of positions for the end of a sequence
# Because our indices are true false, we can use cumsum along
# with rle to create the positions of the end of the sequences
rle_positions <- cumsum(rle(indices)$lengths)
rle_positions
#[1]  3  4  5  7 11

# Creating start sequence vector and subsetting start / end using reps
start <- c(1, head(rle_positions, -1))[reps]

end <- rle_positions[reps]

data.frame(start, end)
#  start end
#1     1   3
#2     7  11

Or, in short:

n <- 3
n <- n-1
indices <- c(1, diff(sequence) >= 1)
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
rle_positions <- cumsum(rle(indices)$lengths)
data.frame(start = c(1, head(rle_positions, -1))[reps], 
           end = rle_positions[reps])
#  start end
#1     1   3
#2     7  11
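The same rle idea can also be written as a function that runs rle directly on the differences. This is a sketch under a few assumptions: the name `find_increasing_runs` is hypothetical, and it tests `diff(x) > 0` rather than `>= 1` so that fractional increases count as well. Without the prepended 1, every run of k increases covers k + 1 elements, so a single threshold of n - 1 applies to all runs. With the prepended 1, as far as I can tell, the first run is one entry longer than later runs of the same subsequence length, so an input such as c(1, 2, 1) would be reported as 1 to 2 even for n = 3.

```r
# Hypothetical helper: start/end positions of strictly increasing runs
# that contain at least n elements, based on rle() of the differences.
find_increasing_runs <- function(x, n) {
  if (length(x) < 2) {
    return(data.frame(start = integer(0), end = integer(0)))
  }
  runs <- rle(diff(x) > 0)

  # A run of k consecutive increases ends at element cumsum(lengths) + 1
  # and starts k elements earlier.
  ends <- cumsum(runs$lengths) + 1L
  starts <- ends - runs$lengths

  # Keep only runs of increases (TRUE) with at least n - 1 steps.
  keep <- runs$values & runs$lengths >= n - 1
  data.frame(start = starts[keep], end = ends[keep])
}

runs_found <- find_increasing_runs(c(1, 2, 3, 2, 5, 3, 2, 6, 7, 9, 12), n = 3)
```

For the extended example sequence, `runs_found` should match the table above (1 to 3 and 7 to 11), and a short input such as c(1, 2, 1) yields no rows for n = 3.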

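EDIT: The update by @Ronak made me realize I should be using diff instead of sapply with an anonymous function in my first step. I also updated the answer because it was not catching an increase at the end of the vector (e.g., sequence <- c(1,2,3,2,5,3,2,6,7,9,12, 11, 11, 20, 100)), which required one more line under n <- 3. It should work as intended now.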

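【Comments】:

Thanks for the great explanation. I tried the code with the extended sequence sequence <- c(1,2,3,2,5,3,2,6,7,9, 12,11,10,9), and the decreasing subsequence at the end is matched as well (from 11 to 14). What am I doing wrong?
Great catch! If you add this to the reps line, it will fix it: & rle(indices)$values == 1. I was accidentally matching repeated 0s (FALSEs) as well as 1s (TRUEs) when I only wanted the TRUEs. I have updated the answer as well.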
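The effect of that extra condition can be checked on the commenter's sequence. The sketch below follows the code from the answer as it stands now (threshold n - 1) and only adds the illustrative names `runs`, `long_enough`, `unfiltered` and `filtered`; the exact rows the commenter saw may have differed, since the answer was edited in between.

```r
sequence <- c(1, 2, 3, 2, 5, 3, 2, 6, 7, 9, 12, 11, 10, 9)
n <- 3

indices <- c(1, diff(sequence) >= 1)
runs <- rle(indices)
rle_positions <- cumsum(runs$lengths)
starts <- c(1, head(rle_positions, -1))

# Length check only: runs of 0s (no increase) slip through as well.
long_enough <- runs$lengths >= n - 1
unfiltered <- data.frame(start = starts[long_enough],
                         end = rle_positions[long_enough])

# Length check plus values == 1: only runs of increases remain.
keep <- long_enough & runs$values == 1
filtered <- data.frame(start = starts[keep], end = rle_positions[keep])
```

Without the `values == 1` condition the non-increasing stretches at positions 5 to 7 and 11 to 14 are reported too; with it, only 1 to 3 and 7 to 11 remain.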