R中的高效子集和采样答案

【问题标题】：Efficient subsetting and sampling in RR中的高效子集和采样
【发布时间】：2017-12-26 05:21:39
【问题描述】：

下面是我一直在开发的模拟的 R 代码的 sn-p。

hap.plot <- pop[sample(1:nrow(pop), size = 1, replace = TRUE), 
                ind.index, 
                sample(i, size = 1, replace = TRUE)]

上述 sn-p 中的ind.index 包含对sample(...) 的单个调用

我在 RStudio 中分析了我的模拟，这条线确实是运行时和内存方面的瓶颈（运行时约 30000 毫秒，运行时约 7000 MB）。

有没有更有效的方式来表达下面的sn-p，这样会更快？

在完全进入 Rcpp 之前，我想完全用尽我的基本 R/包选项。

一个选项可能是 plyr/dplyr 包（dplyr 本质上依赖于 Rcpp）。因为pop是一个数组，所以为了使用dplyr，需要转换成数据框。然后我可以用来自dplyr 的sample_n(...) 替换sample(...)。

目标是最终编写一个包，因此调用 .Internal(sample(...)) 虽然可能更快，但不允许 CRAN 提交。

以下是完整代码：

## Set up container(s) to hold the identity of each individual from each permutation ##

num.specs <- ceiling(N / K)

pop <- array(dim = c(c(perms, num.specs), K))

## Create an ID for each haplotype ##

haps <- as.character(1:Hstar)

## Assign individuals (N) to each subpopulation (K) ##

specs <- 1:num.specs

## Generate permutations, assume each permutation has N individuals, and sample those individuals' haplotypes from the probabilities ##

for (j in 1:perms) {
    for (i in 1:K) {
            pop[j, specs, i] <- sample(haps, size = num.specs, replace = TRUE, prob = probs)
        }
}

## Make a matrix to hold individuals from each permutation ##

HAC.mat <- array(dim = c(c(perms, num.specs), K))

## Perform haplotype accumulation ##

for (k in specs) {
    for (j in 1:perms) {
        for (i in 1:K) {
            ind.index <- sample(specs, size = k, replace = FALSE) # which individuals are sampled
            hap.plot <- pop[sample(1:nrow(pop), size = 1, replace = TRUE), ind.index, sample(i, size = 1, replace = TRUE)] # extract those individuals from a permutation
            HAC.mat[j, k, i] <- length(unique(hap.plot)) # how many haplotypes recovered a given sampling intensity (k) from each permutation (j)
        }
    }
}

运行：

K <- 1 # number of subpopulations
N <- 100 # number of individuals
Hstar <- 10 # number of haplotypes
probs <- rep(1/Hstar, Hstar) # haplotype frequency distribution 
perms <- 10000 # number of permutations

这是一个小例子，速度很快。然而，我的模拟的强大之处在于调查更大的输入参数值，但这会导致代码相当慢。

非常感谢并热烈欢迎任何帮助。

【问题讨论】：

上述代码 sn-p 是嵌套 for 循环的一部分。 i 是引用“pop”变量中的子数组的索引。所有子数组都具有相等（相同）的维度
有没有办法可以在我的代码 sn-p 中对 sample(...) 进行一次调用而不是多次调用？这可能会加快速度。
我刚刚按要求发布了完整的代码。现在应该更清楚了。感谢您的观看！
回答您的问题...是的，pop 数据包含在维度为 c(c(perms, num.specs), K) 的 3 维数组中

标签： r performance subset

【解决方案1】：

K <- 1 # number of subpopulations
N <- 100 # number of individuals
Hstar <- 10 # number of haplotypes
probs <- 1/Hstar # haplotype frequency distribution 
perms <- 10000    
num.specs <- ceiling(N / K)    

## Create an ID for each haplotype ##
haps <- seq_len(Hstar)

## Generate permutations, assume each permutation has N individuals, and sample those individuals' haplotypes from the probabilities ##
sim_fun <- function()
{
  return(sample( x = haps, 
                 size = num.specs, 
                 replace = TRUE, 
                 prob = rep(0.1, Hstar)))
}

set.seed(2L)
pop <- array(dim = c(num.specs, perms, K))
for (i in 1:K) {
  pop[, , i] <- replicate(perms, sim_fun())
}

嵌套的 for 循环减少了一级，这将显着提高效率，因为外部循环表示子群体的数量，与个体数量和排列数量相比，这很可能是一个很小的数字。您无法避免在三个场合取样，因为三个不同的维度具有不同的长度。

# n_ind = number of individuals
# n_perm = number of permutations
# n_subpop = number of subpopulations
# prob = sampling probability
# FUN = summary statistics function

# summary statistics
extract_stats <- function(n_ind, n_perm, n_subpop, prob, FUN, ... )
{

  ijk <- dim(pop)
  sapply( seq_len(n_subpop), function( y ){
    pop_dat <- pop[sample( x = seq_len(ijk[1]), size = n_ind, replace = TRUE, prob = rep( prob, ijk[1] ) ),
                   sample( x = seq_len(ijk[2]), size = n_perm, replace = TRUE, prob = rep( prob, ijk[2] ) ),
                   sample( x = seq_len(ijk[3]), size = y, replace = TRUE, prob = rep( prob, ijk[3] ) )]
    ifelse( test = is.matrix(pop_dat), 
            yes = apply( pop_dat, MARGIN = 2, FUN = FUN ),
            no = do.call(FUN, c( list(pop_dat), ...) ))
  })
}

# median of haplotype id
replicate(10, extract_stats( n_ind = 100, n_perm = 2, n_subpop = 2, prob = 0.1, FUN = median))
# minimum of haplotype id
replicate(10, extract_stats( 100, 2, 2, 0.1, min))
# maximum of haplotype id
replicate(10, extract_stats( 100, 2, 2, 0.1, max))
# histogram of haplotype id distribution
replicate(1, extract_stats( n_ind = 100, n_perm = 1, n_subpop = 1, prob = 0.1, FUN = hist, xlab = "haplotype_id", main = "title"))

【讨论】：

概率不能大于 1。在你的问题中，你有 0.1 和 10。
所以我只使用了 0.1。如果您想要不同概率分布的单倍型 ID，请再创建一个维度来保存此数据。例如pop <- array(dim = c(num.specs, perms, K, 2))
使用set.seed(number) 获得可重现的结果。否则在不设置伪随机数生成器的情况下多次运行sample()会得到不同的结果
注意：我切换了行数和列数，因此您的代码必须考虑到这一点。现在行表示个人的数量。在您的问题中，列表示人数
你想用代码做什么：执行单倍型积累？