cbind 1:nrows of same ID variable value to original data.frame

【Question title】: cbind 1:nrows of same ID variable value to original data.frame
【Posted】: 2016-03-19 20:12:29
【Question】:

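I have a large data frame in which the id variable (first column) recurs with different values in the second column. My idea is to order the data frame, split it into a list, and then lapply a function that cbinds the sequence 1:nrows(variable id) to each group. My code so far: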

DF <- DF[order(DF[,1]),]
DF <- split(DF,DF[,1])
DF <- lapply(1:length(DF), function(i) cbind(DF[[i]], 1:length(DF[[i]])))

But this gives me an error: arguments imply differing number of rows.
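
A likely cause of the error, sketched on made-up toy data (the values below are illustrative, not the real file): length() of a data.frame returns its number of columns, so 1:length(DF[[i]]) is always 1:2 here and cannot be bound to a group whose row count is not a multiple of 2. Assuming that is the problem, counting rows with nrow() makes the split/lapply approach work:

```r
# Toy data shaped like the question's DF (values are made up); cell 1 has 3 rows.
DF <- data.frame(
  cell = c(1, 2, 3, 1, 2, 3, 1),
  area = c(121.2, 81.4, 81.6, 121.3, 80.8, 81.2, 122.2)
)

length(DF)  # 2: length() of a data.frame counts columns
nrow(DF)    # 7: nrow() counts rows

DF <- DF[order(DF$cell), ]
groups <- split(DF, DF$cell)

# Bind a per-group row counter; seq_len(nrow(g)) always matches the group size.
groups <- lapply(groups, function(g) cbind(g, counter = seq_len(nrow(g))))

# Stack the pieces back into a single data.frame.
result <- do.call(rbind, groups)
```

Iterating over the list itself rather than over 1:length(DF) also keeps the group names on the result.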

Can you elaborate?

> head(DF, n=50)
   cell     area
1     1 121.2130
2     2  81.3555
3     3  81.5862
4     4  83.6345
...
33    1 121.3270
34    2  80.7832
35    3  81.1816
36    4  83.3340

DF <- DF[order(DF$cell),]

What I want is:

> head(DF, n=50)
     cell    area counter
1       1 121.213 1
33      1 121.327 2
65      1 122.171 3
97      1 122.913 4
129     1 123.697 5
161     1 124.474 6

...and so on.

Here is my code:

cell.areas.t <- function(file) {

    dat = paste(file)

    DF <- read.table(dat, col.names = c("cell","area"))
    DF <- splitstackshape::getanID(DF, "cell")[]  # thanks to akrun's answer


    ggplot2::ggplot(data = DF, aes(x = .id , y = area, color = cell)) +       
        geom_line(aes(group = cell)) + geom_point(size=0.1)
}

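The plot looks like this:

Most cells increase in area and only some decrease. This is just a first attempt at visualizing my data, so you cannot see very well that the areas drop periodically because of cell division.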

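Additional question:

There is a problem I did not consider beforehand: after a cell division, a new cell is added to the data.frame and is given the initial index 1 (in the image you can see that all cells start at .id = 1, not later). That is not what I want; the new cell needs to inherit the index of its creation time. My first thought was to use some kind of parsing mechanism that does this for newly added cell values: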

DF$.id[DF$cell != temporary.cellindex] <- max(DF$.id[DF$cell != temporary.cellindex])

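Do you have a better idea? Thanks.

There is a boundary condition that may ease the problem: a fixed number of cells (32) at the start. Another solution would be to delete all data before the last daughter cell is created.

Update: the additional question is solved; here is the code: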

cell.areas.t <- function(file) {
    dat = paste(file)
    DF <- read.table(dat, col.names = c("cell","area"))
    DF$.id <- c(0, cumsum(diff(DF$cell) < 0)) + 1L # Indexing

    title <- getwd()

    myplot <- ggplot2::ggplot(data = DF, aes(x = .id , y = area, color = factor(cell))) +
        geom_line(aes(group = cell)) + geom_line(size=0.1) + theme(legend.position="none") + ggtitle(title)

    #save the plot
    image=myplot
    ggsave(file="cell_areas_time.svg", plot=image, width=10, height=8)

}
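
How the indexing line works, sketched on a made-up cell vector: it assumes the file is written one time step after another with cell numbers ascending inside each step, so a drop in the cell number marks the start of a new time step. Under that assumption a cell created later gets the index of the step in which it first appears rather than 1:

```r
# Made-up cell column: three time steps, cell 4 first appears in step 2.
cell <- c(1, 2, 3,  1, 2, 3, 4,  1, 2, 3, 4)

# diff(cell) < 0 is TRUE wherever the cell number drops (a new time step starts);
# cumsum() counts those drops, and the leading 0 covers the first row.
id <- c(0, cumsum(diff(cell) < 0)) + 1L

# Pair each row with its time-step index to inspect the result.
data.frame(cell = cell, .id = id)
```

Here cell 4 starts at .id = 2, the step in which it was created.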

【Question comments】:

Tags: r ggplot2 dataframe lapply splitstackshape

【Solution 1】:

We can use getanID from splitstackshape:

library(splitstackshape)
getanID(DF, "cell")[]

【Comments】:

Also very nice! DF <- splitstackshape::getanID(DF, "cell")[] gave me an additional column named .id

【Solution 2】:

There is a simpler way to achieve this: use ave together with seq.int.

 DF$group_seq <- ave(DF, DF[,1], FUN=function(x){ seq.int(nrow(x)) } )

【Comments】:

This is awesome! Thanks! It actually gave me two additional columns (group_seq.cell, group_seq.area), but that is not a big deal.
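
On the two extra columns: they most likely appear because ave() was handed the whole data.frame, so the sequence is written into every column of a copy and that copy is then stored as a nested column. A minimal sketch, on made-up toy data, that passes a plain vector instead and yields a single counter column (the column name counter is an arbitrary choice):

```r
# Toy data shaped like the question's DF (values are made up).
DF <- data.frame(
  cell = c(1, 2, 3, 1, 2, 3, 1),
  area = c(121.2, 81.4, 81.6, 121.3, 80.8, 81.2, 122.2)
)
DF <- DF[order(DF$cell), ]

# ave() splits the first argument by the grouping variable and applies FUN to each
# piece; only the group sizes matter, so any vector with one entry per row works.
DF$counter <- ave(seq_along(DF$cell), DF$cell, FUN = seq_along)
```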