大数据框上的 R 中的 T 检验答案

【问题标题】：T test in R over large data frame大数据框上的 R 中的 T 检验
【发布时间】：2017-02-17 19:28:08
【问题描述】：

我正在尝试对大型数据框进行 t 检验。数据框在列中包含 CpG 位点，在行中包含病例/对照组。

数据样本：

        Type cg00000029 cg00000108 cg00000109 cg00000165 cg00000236 cg00000289
1  Normal.01    0.32605    0.89785    0.73910    0.30960    0.80654    0.60874
2  Normal.05    0.28981    0.89931    0.72506    0.29963    0.81649    0.62527
3  Normal.11    0.25767    0.90689    0.77163    0.27489    0.83556    0.66264
4  Normal.15    0.26599    0.89893    0.75909    0.30317    0.81778    0.71451
5  Normal.18    0.29924    0.89284    0.75974    0.33740    0.83017    0.69799
6  Normal.20    0.27242    0.90849    0.76260    0.27898    0.84248    0.68689
7  Normal.21    0.22222    0.89940    0.72887    0.25004    0.80569    0.69102
8  Normal.22    0.28861    0.89895    0.80707    0.42462    0.86252    0.61141
9  Normal.24    0.43764    0.89720    0.82701    0.35888    0.78328    0.65301
10 Normal.57    0.26827    0.91092    0.73839    0.30372    0.81349    0.66338

有 10 种“正常”类型和 62 种“案例”类型（正常 = 第 1-10 行，案例 = 第 11-62 行）。

我尝试在 16384 个 CpG 站点上运行以下 t 检验，但它只返回 72 个 p 值：

t.result <- apply(data[1:72,], 2, function (x) t.test(x[1:10],x[11:72],paired=FALSE))

data$p_value <- unlist(lapply(t.result, function(x) x$p.value))

data$fdr <- p.adjust(data$p_value, method = "fdr")

任何帮助将不胜感激。

【问题讨论】：

我想你的意思是说 normal = rows 1-10, case = 11-62
所以会有 10*62=620 t-tests 对吧？
@SandipanDey 是的，很好。已编辑以进行更正。

标签： r statistics apply

【解决方案1】：

可能你想要这样的东西：

set.seed(1)
data <- matrix(runif(72*16384), nrow=72) # some random data as surrogate for your original data
indices <- expand.grid(1:10, 11:72) # generate all indices of pairs for t-test
t.result <- apply(indices, 1, function (x) t.test(data[x[1],],data[x[2],],paired=FALSE))
p_values <- unlist(lapply(t.result, function(x) x$p.value))
p_fdr <- p.adjust(p_values, method = "fdr")
hist(p_fdr, col='red', xlim=c(0,1), xlab='p-value', main='Histogram of p-values')
hist(p_values, add=TRUE, col=rgb(0, 1, 0, 0.5))
legend('topleft', legend=c('unadjusted', 'fdr-adjusted'), col=c('red', rgb(0, 1, 0, 0.5)), lwd=2)

正如预期的那样，通过 FDR 调整 p 值几乎消除了所有误报。

【讨论】：