Segment a dataframe based on a column: answers

【Question title】: Segment a dataframe based on a column
【Posted】: 2025-11-27 11:45:02
【Question description】:

I have a data frame with two columns. One holds numbers and the other holds labels. Example:

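Basically, I want to segment this data frame and turn the second column into a vector of words, with the condition that the difference between any two values in the first column is less than 1000.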

Expected Result is

C("ABC","ADK")

In this example we get a vector C containing ABC and ADK as words, because the difference between row4 and row3 is > 1000.

Any idea how to do this without consuming a lot of computation?

【Question discussion】:

It is not entirely clear what you want. Please post the expected result.

Tags: r

【Solution 1】:

I have not tested this on a larger dataset, but the following should work:

df <- data.frame(Col1=c(200, 300, 350, 2000, 2200, 2300), 
                 Col2=c("A", "B", "C", "A", "D", "K"))

sapply(split(df$Col2, 
             cumsum(c(1, (diff(df$Col1) > 1000)))), 
       paste, collapse="")
#     1     2 
# "ABC" "ADK"

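In the above:

diff(df$Col1) > 1000 returns a vector of TRUE and FALSE values.
c(1, (diff(df$Col1) > 1000)) coerces that logical vector to numeric and prepends a 1 as the starting point of the first group. So we now have a vector that looks like 1 0 0 1 0 0.
We can now use cumsum() on that vector to create the "groups" by which we want to split the data.
sapply and friends then paste together the relevant values from Col2 to give your (named) vector.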
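The steps above can be inspected one at a time. This is only a sketch on the sample data from this answer; the intermediate names gaps, starts and groups are introduced here for illustration and are not part of the original code.

```r
df <- data.frame(Col1 = c(200, 300, 350, 2000, 2200, 2300),
                 Col2 = c("A", "B", "C", "A", "D", "K"),
                 stringsAsFactors = FALSE)

# Step 1: flag every consecutive gap larger than the threshold
gaps <- diff(df$Col1) > 1000
# FALSE FALSE TRUE FALSE FALSE

# Step 2: prepend a 1 so the first row opens the first group;
# c() coerces the logicals to 0/1
starts <- c(1, gaps)
# 1 0 0 1 0 0

# Step 3: the running total turns "group starts" into group ids
groups <- cumsum(starts)
# 1 1 1 2 2 2

# Step 4: split the labels by group id and paste each group together
result <- sapply(split(df$Col2, groups), paste, collapse = "")
```

Each row receives the id of the most recent "start", which is why a single TRUE in gaps is enough to open a new group.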

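【Discussion】:

The requirement is that, within a group, the distance between any two points is less than 1000. For example, replace 2300 with 3100: your code will still put 2000, 2200 and 3100 together, even though the distance between 2000 and 3100 is greater than 1000. Clustering to the rescue!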
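The objection in this comment can be checked directly. The sketch below reuses the cumsum() grouping from Solution 1 on the modified data (2300 replaced by 3100) and measures the spread inside each group; the name spread is introduced here for illustration only.

```r
df <- data.frame(Col1 = c(200, 300, 350, 2000, 2200, 3100),
                 Col2 = c("A", "B", "C", "A", "D", "K"),
                 stringsAsFactors = FALSE)

# Grouping based only on consecutive differences, as in Solution 1
groups <- cumsum(c(1, diff(df$Col1) > 1000))
# 1 1 1 2 2 2  (the 2200 -> 3100 gap is only 900, so no new group starts)

# Largest pairwise distance inside each group = max - min of the group
spread <- tapply(df$Col1, groups, function(x) max(x) - min(x))
# group 1: 150, group 2: 1100

# TRUE wherever a group violates the "any two values" condition
spread > 1000
```

Group 2 spans 2000 to 3100, so it breaks the pairwise condition even though no single consecutive gap exceeds 1000.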

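【Solution 2】:

Yet another answer, only because nobody has mentioned that your problem is a classic case of Cluster Analysis. Also because all the other answers are wrong: they only compare distances between consecutive points, whereas you need to compare all pairwise distances.

Groups of points in which any two points are closer than a threshold can be found with hierarchical clustering using complete linkage. It is easy in R: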

df <- data.frame(Col1 = c(200, 300, 350, 2000, 2200, 2300), 
                 Col2 = c("A", "B", "C", "A", "D", "K"))

tree <- hclust(dist(df$Col1), method = "complete")
groups <- cutree(tree, h = 1000)
# [1] 1 1 1 2 2 2
sapply(split(df$Col2, groups), paste, collapse = "")
#     1     2 
# "ABC" "ADK"
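To see where complete linkage differs from the consecutive-difference approach, the same code can be run on the modified data from the earlier comment (2300 replaced by 3100). This is a sketch on that hypothetical input, not part of the original answer.

```r
df <- data.frame(Col1 = c(200, 300, 350, 2000, 2200, 3100),
                 Col2 = c("A", "B", "C", "A", "D", "K"),
                 stringsAsFactors = FALSE)

# Complete linkage merges two clusters at the LARGEST pairwise distance
# between their members, so cutting at h = 1000 keeps every pair within
# a group at most 1000 apart
tree <- hclust(dist(df$Col1), method = "complete")
groups <- cutree(tree, h = 1000)
# [1] 1 1 1 2 2 3

result <- sapply(split(df$Col2, groups), paste, collapse = "")
#    1    2    3
# "ABC" "AD"  "K"
```

Here 3100 ends up in a group of its own, because joining it to 2000 and 2200 would require a merge at height 1100. Note that dist() builds all pairwise distances, so memory use grows quadratically with the number of rows.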

【Discussion】:

Learn something new every day! +1

【Solution 3】:

Edited following your clarification

# SAMPLE DATA
df <- data.frame(Col1=c(200, 300, 350, 2000, 2200, 2300, 4500), Col2=c("A", "B", "C", "A", "D", "K", "M"))
df

# Make sure they are the correct mode
df$Col1 <- as.numeric(as.character(df$Col1))
df$Col2 <- as.character(df$Col2)

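# Row indices after which the gap to the next value exceeds 1000
# (despite its name, lessThan marks the gaps GREATER than the threshold)
lessThan <- which(abs(df$Col1[-length(df$Col1)] - df$Col1[-1]) > 1000 )

lapply(lessThan, function(ind)
  c( paste(df$Col2[1:ind], collapse=""),
      # parentheses are required: ind+1:n parses as ind + (1:n)
      paste(df$Col2[(ind+1):length(df$Col2)], collapse="") )
)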

Results:

  [[1]]
  [1] "ABC"   "ADKM"

  [[2]]
  [1] "ABCADK" "M"

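【Discussion】:

Thanks @Ricardo, but it just splits at the fixed value 1000; the aim here is to split the words based on the difference between rows in the second column, not on a fixed value of 1000.
This can't be right. I don't even understand what the output means. Why a list of two vectors? Would you care to explain, you or @Dar?
Also, you seem to have made the same mistake I pointed out to @AnandaMahto: you only look at distances between consecutive points, when all pairwise distances should be considered.
@flodel, you are probably right. The method above certainly only looks at consecutive points.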

【Solution 4】:

Here is one option:

extractGroups <- function(data, threshold){
    #calculate which differences are greater than threshold between values in the first column
    dif <- diff(data[,1]) > threshold

    #edit: as @Ananda suggests, `cumsum` accomplishes these three lines more concisely.

    #identify where the gaps of > threshold are
    dif <- c(which(dif), nrow(data))        
    #identify the length of each of these runs
    dif <- c(dif[1], diff(dif))     
    #create groupings based on the lengths of the above runs
    groups <- inverse.rle(list(lengths=dif, values=1:length(dif)))

    #aggregate by group and paste the characters in the second column together
    aggregate(data[,2], by=list(groups), FUN=paste, collapse="")[,2]
}

And an example with your data:

extractGroups(read.table(text="1 200 A
2 300 B
3 350 C
4 2000 A
5 2200 D
6 2300 K", row.names=1), 1000)

[1] "ABC" "ADK"
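The comment inside the function notes that cumsum can replace the three run-length lines. A minimal sketch of that variant follows; the name extractGroupsCumsum is hypothetical and not from the original answer. Like the original, it only looks at consecutive differences.

```r
# Hypothetical cumsum-based variant of extractGroups
extractGroupsCumsum <- function(data, threshold) {
  # A new group starts wherever the consecutive gap exceeds the threshold;
  # the leading 1 opens the first group
  groups <- cumsum(c(1, diff(data[, 1]) > threshold))
  # Paste the labels of each group together and drop the group names
  unname(sapply(split(as.character(data[, 2]), groups),
                paste, collapse = ""))
}

df <- data.frame(Col1 = c(200, 300, 350, 2000, 2200, 2300),
                 Col2 = c("A", "B", "C", "A", "D", "K"),
                 stringsAsFactors = FALSE)

out <- extractGroupsCumsum(df, 1000)
# [1] "ABC" "ADK"
```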

【Discussion】: