标准化样本之间的深度覆盖答案

【问题标题】：Normalizing Depth Coverage among samples标准化样本之间的深度覆盖
【发布时间】：2016-02-26 19:56:01
【问题描述】：

这是一个开放式问题，旨在定义基因组中每个位置（对应于“CpG”位点）的状态，这些位置因样本而异。这个问题的原因是可用的工具定义了 CpG 窗口的状态，而不是单个 CpG。

我有一张这样的桌子：列是：染色体编号（chr）、兴趣基的初始（开始）和最终（结束）位置、预期覆盖范围（深度）、观察到的对不同 6 种动物的覆盖范围（深度1-深度6）。

data <- "chr start end depth depth1 depth2 depth3 depth4 depth5 depth6
chr1 3273 3273 7 200 35 1 200 850 0
chr1 3274 3274 3 50 25 5 300 1500 2
chr1 3275 3275 8 600 15 8 100 300 5
chr1 3276 3276 4 30 2 10 59 20 0
chr1 3277 3277 25 20 7 4 600 45 0"
data <- read.table(text=data, header=T)

我需要用每一行的状态定义一列，状态是：未覆盖区域，交替覆盖和经常覆盖。

为此，首先，我需要对样本之间的深度进行归一化，以获得可以在个体之间进行比较的值。其次，我必须定义状态之间的范围（到目前为止，任何范围都是可以接受的）；

我发现这个脚本做的事情类似于我需要标准化深度，但我还不能应用于我的案例（这个脚本被设计为“CpG 窗口”并显示“C”和“G”的频率在每个窗口中。

setMethod("normalizeCoverage", "methylRawList",

                        function(obj,method){

                          if(method=="median"){
                            x=sapply(obj,function(x) median(x$coverage) )
                          }else if(method=="mean"){
                            x=sapply(obj,function(x) mean(x$coverage) )
                          }else{
                            stop("method option should be either 'mean' or 'median'\n")
                          }
                          sc.fac=max(x)/x #get scaling factor
                          for(i in 1:length(obj)){
                            all.cov=obj[[i]]$coverage
                            fCs    =obj[[i]]$numCs/all.cov
                            fTs    =obj[[i]]$numT/all.cov
                            obj[[i]]$coverage=round(sc.fac[i]*obj[[i]]$coverage)
                            obj[[i]]$numCs   =round(obj[[i]]$coverage*fCs)
                            obj[[i]]$numTs   =round(obj[[i]]$coverage*fTs)
                          }
                          obj

    })

我还检查了 R 的这个“边缘包”函数，它用于 RNAseq 标准化数据，看起来像这样：

calcNormFactors(object, method=c("TMM","RLE","upperquartile","none"), refColumn = NULL,
      logratioTrim = .3, sumTrim = 0.05, doWeighting=TRUE, Acutoff=-1e10, p=0.75)

但我还不能应用到我的数据。

我希望我的最终结果是这样的：

chr start State
chr1 3273 Often
chr1 3274 alternatively
chr1 3275 no
chr1 3276 often
chr1 3277 no

但我真的只对每个样本覆盖的标准化深度感到满意。

【问题讨论】：

标签： r normalization sapply bioconductor

【解决方案1】：

到问题的第一部分（归一化）

“计算深度调整的覆盖率值，将 cpm 函数直接应用于计数矩阵可能就足够了。这会将您的计数转换为每百万计数值，然后您可以在样本之间进行非正式比较”。（Aaron Lun，英国剑桥）

R中的“edgeR”包通过“cpm”标准化

datamatrix <- data [, c(4:10)]
library (edgeR)

#grouping factor
group <- c(1, 2, 2, 2, 2, 2, 2) #groups of each sample)

#create a DGEList
y <- DGEList(counts=datamatrix,group=group)

#########
$counts
  depth depth1 depth2 depth3 depth4 depth5 depth6
1     7    200     35      1    200    850      0
2     3     50     25      5    300   1500      2
3     8    600     15      8    100    300      5
4     4     30      2     10     59     20      0
5    25     20      7      4    600     45      0

$samples
       group lib.size norm.factors
depth      1       47            1
depth1     2      900            1
depth2     2       84            1
depth3     2       28            1
depth4     2     1259            1
depth5     2     2715            1
depth6     2        7            1
##################
#normalize
y <- calcNormFactors(y)

########
> y
An object of class "DGEList"
$counts
  depth depth1 depth2 depth3 depth4 depth5 depth6
1     7    200     35      1    200    850      0
2     3     50     25      5    300   1500      2
3     8    600     15      8    100    300      5
4     4     30      2     10     59     20      0
5    25     20      7      4    600     45      0

$samples
       group lib.size norm.factors
depth      1       47    0.7423735
depth1     2      900    0.5526927
depth2     2       84    0.9534847
depth3     2       28    0.8652676
depth4     2     1259    0.6588115
depth5     2     2715    1.0358307
depth6     2        7    4.3289213
##########################################

> cpm(y)
      depth     depth1    depth2    depth3    depth4     depth5    depth6
1 200621.61  402071.90 436993.56  41275.42 241125.49 302245.841      0.00
2  85980.69  100517.97 312138.26 206377.10 361688.24 533375.014  66001.27
3 229281.84 1206215.69 187282.96 330203.36 120562.75 106675.003 165003.16
4 114640.92   60310.78  24971.06 412754.21  71132.02   7111.667      0.00
5 716505.76   40207.19  87398.71 165101.68 723376.48  16001.250      0.00

标准化！

即使使用归一化，我也有 3 个样本的很多值为零，因为它们的覆盖率很低。我想我必须从分析中删除它们。我想过做一个 PCA 测试来看看这些样本是如何分组的。

我想要一些关于标准化方法的反馈对于我的问题的第二部分

【讨论】：