R function to partition a data set

【Question title】: R function to partition data set
【Posted】: 2015-03-09 09:07:25
【Question】:

Can anyone help me debug a function? It is supposed to do the following:

dat3 <- c(4,7,5,7,8,4,4,4,4,4,4,7,4,4,8,8,5,5,5,5)

myfunc(dat3, chunksize = 8)
##  [1] 4 7 5 8 4 4 4 4   4 7 5 8 4 4 5 5   4

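That is, it partitions the data into chunks of size chunksize and makes sure that every level is present in every chunk. The function works for the toy example: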

myfunc <- function(x, chunksize = 8) {
    numChunks <- ceiling(length(x) / chunksize)
    uniqx <- unique(x)
    lastChunkSize <- chunksize * (1 - numChunks) + length(x)
    ## check to see if it is mathematically possible
    if (length(uniqx) > chunksize)
        stop('more factors than can fit in one chunk')
    if (any(table(x) < numChunks))
        stop('not enough of at least one factor to cover all chunks')
    if (lastChunkSize < length(uniqx))
        stop('last chunk will not have all factors')
    ## actually arrange things in one feasible permutation
    allIndices <- sapply(uniqx, function(z) which(z == x))
    ## fill one of each unique x into chunks
    chunks <- lapply(1:numChunks, function(i) sapply(allIndices, `[`, i))
    remainder <- unlist(sapply(allIndices, tail, n = -3))
    remainderCut <- split(remainder, ceiling(seq_along(remainder)/4))
    ## combine them all together, wary of empty lists
    finalIndices <- sapply(1:numChunks,
           function(i) {
               if (i <= length(remainderCut))
                   c(chunks[[i]], remainderCut[[i]])
               else
                   chunks[[i]]
           })
    save(finalIndices, file = "finalIndices")
    x[unlist(finalIndices)]

}

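What I want from the function is the rearranged indices (called finalIndices here). The problem is that the function does not work for my real data set (https://www.dropbox.com/s/n3wc5qxaoavr4ta/j.RData?dl=0).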

The data as a factor: https://www.dropbox.com/s/0ue2xzv5e6h858q/t.RData?dl=0

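I changed the chunksize argument (in the first line of the function) to 9847, based on the number of levels present. When I load finalIndices from the saved file, I get a matrix with dim 137 60. It does not give indices for all of my observations (nearly 600k). Can someone tell me what I am doing wrong? I know that 60 is the number of chunks (nrows/chunksize), but 137 does not seem to fit.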

【Comments】:

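I think you are working too hard. But first things first: why do you want your data to be factors, when the sample you provided is not a factor? If you want all values to appear in all chunks, I suggest sorting the input first and then distributing each unique value across the output "chunks".
So let's back up. What is the actual problem at hand? I'm willing to bet an internet or two that there is a more direct way to do the sieving/sorting you are aiming for. Can you explain the context of this problem?
I have also updated the question and included the factor as t.RData.
@Carl The data is read into memory chunk by chunk by a statistical model that requires every chunk to contain all levels of the factor. That is why I came up with this function.
That sounds like an ill-posed statistical model, because regrouping by factor can change the apparent distribution.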
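
The sorting suggestion in the comments above can be sketched roughly as follows. This is only an illustration under assumptions: the helper name deal_round_robin and its num_chunks argument are made up here, and the chunks come out balanced (sizes differ by at most one) rather than exactly chunksize long.

```r
# Hypothetical helper (not from the thread): sort the observations and deal
# them out round-robin, so a level with at least num_chunks observations
# lands in every chunk.
deal_round_robin <- function(x, num_chunks) {
  counts <- table(x)
  counts <- counts[counts > 0]          # ignore unused factor levels
  if (any(counts < num_chunks))
    stop("at least one level has fewer observations than there are chunks")
  ord <- order(x)                       # positions of x, grouped by value
  chunk_id <- (seq_along(ord) - 1) %% num_chunks + 1
  split(ord, chunk_id)                  # one vector of positions per chunk
}

dat3 <- c(4,7,5,7,8,4,4,4,4,4,4,7,4,4,8,8,5,5,5,5)
idx <- deal_round_robin(dat3, num_chunks = 3)

# Does every chunk contain every distinct value?
ok <- all(sapply(idx, function(i) setequal(dat3[i], unique(dat3))))
```

Because equal values sit next to each other after sorting, any run of num_chunks or more equal values touches every chunk id, which is why the count check at the top is sufficient.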

Tags: r function partition

【Solution 1】:

The line remainderCut <- split(remainder, ceiling(seq_along(remainder)/4)) is hard-coded for the toy data set: it simply adds four elements to each chunk, which gives wrong results for other data sets.
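
As a rough sketch of how the hard-coded 4 (and the n = -3 on the line before it, which matches the toy example's three chunks) could be derived from the data instead, assuming the rest of myfunc stays as posted. The helper name split_remainder and its arguments are hypothetical.

```r
# Hypothetical replacement for the two hard-coded lines in myfunc().
# allIndices: list with one vector of positions per unique value
# numChunks, chunksize: as in myfunc(); n: length of the input vector
split_remainder <- function(allIndices, numChunks, chunksize, n) {
  numLevels <- length(allIndices)
  # the first numChunks positions of each level are already placed: drop them
  remainder <- unlist(lapply(allIndices, tail, n = -numChunks),
                      use.names = FALSE)
  # total size of every chunk; the last one may be shorter
  sizes <- rep(chunksize, numChunks)
  sizes[numChunks] <- n - chunksize * (numChunks - 1)
  # free slots per chunk once one of each level has been placed
  free <- sizes - numLevels
  groups <- factor(rep(seq_len(numChunks), times = free),
                   levels = seq_len(numChunks))
  split(remainder, groups)              # empty chunks are kept as integer(0)
}

dat3 <- c(4,7,5,7,8,4,4,4,4,4,4,7,4,4,8,8,5,5,5,5)
allIndices <- lapply(unique(dat3), function(z) which(z == dat3))
chunks <- lapply(1:3, function(i) sapply(allIndices, `[`, i))
extra <- split_remainder(allIndices, numChunks = 3, chunksize = 8,
                         n = length(dat3))
final <- Map(c, chunks, extra)          # a list, one index vector per chunk
```

The sketch relies on the feasibility checks at the top of myfunc having passed, otherwise free can go negative. It also uses lapply and Map rather than sapply: sapply simplifies equal-length results to a matrix, which is likely why the saved finalIndices appeared as a 137 x 60 matrix (60 chunks with 137 indices each).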

Although the problem could be fixed by modifying your code, I came up with a slightly different approach:

library(data.table)

generate.chunks <- function(dat3, chunksize = 8) {
    # get number of unique values
    freqs <- table(dat3)

    # get chunk sizes
    chunk.sizes <- rep(chunksize,length(dat3) %/% chunksize)    
    last.chunk.size <-  length(dat3) %% chunksize
    if (last.chunk.size > 0) chunk.sizes <- c(chunk.sizes,last.chunk.size)

    # few checks
    if (chunksize < length(freqs)) 
        stop(sprintf('Chunk size is smaller than the number of factors: %i elements in a chunk, %i factors. Increase the chunk size',chunksize,length(freqs)))
    if (chunk.sizes[length(chunk.sizes)] < length(freqs)) 
        stop(sprintf('Last chunk size is smaller than the number of factors: %i elements in the chunk, %i factors. Use a different chunk size',chunk.sizes[length(chunk.sizes)],length(freqs)))
    if (min(freqs) < length(chunk.sizes))
        stop(sprintf('Not enough values in a factor to populate every chunk: %i < %i. Increase the chunk size',min(freqs),length(chunk.sizes)))

    # make sure that each chunk has at least one value of every factor level
    d.predefined <- data.frame(
            chunk = rep(1:length(chunk.sizes),each=length(freqs)),
            i     = rep(1:length(freqs),length(chunk.sizes))
    )

    # randomly distribute the remaining values
    d.sampled <- data.frame(
        chunk = unlist(mapply(rep,1:length(chunk.sizes),chunk.sizes - length(freqs),SIMPLIFY=FALSE)),
        i     = sample(unlist(mapply(rep,1:length(freqs),freqs - length(chunk.sizes),SIMPLIFY=FALSE)))
    )

    # put the predefined and sampled results together and split
    d.result <- rbind(d.predefined,d.sampled)

    # calculate indices
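    # lapply always returns a list; sapply would simplify to a matrix
    # when every level has the same count, which breaks indices[[i]] below
    indices <- lapply(names(freqs),function(s) which(dat3==s))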
    dt <- as.data.table(d.result)
    dt[,ind:=indices[[i]],by=i]
    finalIndices <- split(dt$ind,dt$chunk)
    save(finalIndices,file="finalIndices")

    names(freqs)[d.result$i]
}
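
To sanity-check a list of index vectors such as the saved finalIndices, something along these lines could be used. The function check_chunks is a hypothetical helper, not part of the answer, and the two index lists are hand-built for the toy data.

```r
# Hypothetical checker for a list of index vectors such as finalIndices.
check_chunks <- function(x, index_list) {
  flat <- unlist(index_list, use.names = FALSE)
  # every observation is used exactly once
  covers_all <- length(flat) == length(x) && !anyDuplicated(flat) &&
    all(flat %in% seq_along(x))
  # every chunk contains every distinct value of x
  lev <- unique(x)
  all_levels <- all(vapply(index_list,
                           function(i) all(lev %in% x[i]), logical(1)))
  c(covers_all = covers_all, all_levels = all_levels)
}

dat3 <- c(4,7,5,7,8,4,4,4,4,4,4,7,4,4,8,8,5,5,5,5)

# a hand-built valid partition of the toy data
good <- list(c(1, 2, 3, 5, 6, 7, 8, 9),
             c(10, 4, 17, 15, 11, 13, 18, 19),
             c(14, 12, 16, 20))
# plain consecutive chunks: complete, but not every chunk has every value
bad <- split(seq_along(dat3), ceiling(seq_along(dat3) / 8))

res_good <- check_chunks(dat3, good)
res_bad  <- check_chunks(dat3, bad)
```

For the real data the same call would be check_chunks(dat3, finalIndices) after load("finalIndices"), with dat3 standing for the full factor.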

【Comments】:

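Oh, OK... but could you tell me how to get, from your approach, the indices that describe how the elements were rearranged? I need the indices so that I can apply them to the full table rather than to a single column...
Oh, you do need them... I have modified the code to compute them.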
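
For the full-table use case raised in the comments, a minimal sketch of applying such an index list to a data frame might look like this. The data frame full_table, its columns, and the hand-built finalIndices are stand-ins invented for illustration; the real list would come from load("finalIndices").

```r
# Toy stand-ins: full_table and its columns are made up for illustration.
full_table <- data.frame(
  level = c(4,7,5,7,8,4,4,4,4,4,4,7,4,4,8,8,5,5,5,5),
  value = seq(0.5, 10, by = 0.5)
)

# A list shaped like finalIndices: one vector of row positions per chunk.
finalIndices <- list(c(1, 2, 3, 5, 6, 7, 8, 9),
                     c(10, 4, 17, 15, 11, 13, 18, 19),
                     c(14, 12, 16, 20))

# Reorder the whole table so the chunks follow one another ...
rearranged <- full_table[unlist(finalIndices, use.names = FALSE), ]

# ... or keep one data frame per chunk.
chunk_tables <- lapply(finalIndices,
                       function(i) full_table[i, , drop = FALSE])
```

Indexing rows by position keeps all columns aligned, so the per-chunk guarantee established on the factor column carries over to the whole table.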