Bin rows and for each bin compute dispersion and return outliers

【Question title】: Bin rows and for each bin compute dispersion and return outliers
【Posted】: 2016-06-26 00:25:55
【Question description】:

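I have a fairly large data.frame with 12374 rows (genes) and 785 columns (cells). I want to group the rows into 20 bins based on their rowMeans. Within each bin, I want to z-normalize the dispersion measure (variance/mean) across all genes in that bin, so that I can identify outlier genes whose expression is highly variable even when compared with genes of similar mean expression. I then want to extract the genes that exceed a z-score threshold of 1.7, to identify the significantly variable genes in each bin.

My data looks like this: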

> head(temp[,1:5])
                         Cell1                Cell2                 Cell3              Cell4                 Cell5
0610007P14RIK            0.1439444            0.0000000             0.000000            0.8759335            0.0000000
0610009B22RIK            0.0000000            0.6776718             0.000000            0.0000000            0.0000000
0610009O20RIK            0.1439444            0.0000000             0.000000            0.2735741            0.0000000
0610010B08RIK            1.4769893            1.1369215             1.124842            0.8759335            1.9544187
0610010F05RIK            0.7944809            0.0000000             0.000000            0.7016789            0.9144108
0610010K14RIK            0.1439444            0.0000000             1.124842            0.7016789            0.0000000

I tried to do this with dplyr, but I ran into an error that is (I think) related to the number of bins. Here is my attempt:

library(dplyr)
library(genefilter)
n_bins = 20
temp = data
temp$dispersion = rowMeans(temp)/rowVars(temp)
outscore = temp %>% mutate(bin=ntile(dispersion,n_bins)) %>% 
  group_by(bin) %>% mutate(zscore=scale(dispersion),outlier=abs(zscore)>1.7)

The error returned is: Error: dims [product 619] do not match the length of object [618]
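
One plausible reading of this error, offered as an assumption rather than a confirmed diagnosis: `scale()` returns a one-column matrix, not a plain vector, and a grouped `mutate()` can fail when groups of different sizes (here 619 and 618 rows) each hand back a matrix. The minimal base R sketch below only demonstrates the shape of what `scale()` returns and how `as.numeric()` flattens it; the vector `x` is made-up illustration data.

```r
# Made-up values, only to show what scale() returns
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

z_matrix <- scale(x)              # centered and scaled, but as an 8 x 1 matrix
dim(z_matrix)

z_vector <- as.numeric(z_matrix)  # drops the dim and scaling attributes
is.null(dim(z_vector))
```

In the pipeline above, the corresponding change would be to write `zscore = as.numeric(scale(dispersion))` inside `mutate()`.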

【Comments on the question】:

Tags: r

【Solution 1】:

Revised: here is a base R solution, with a little help from dplyr:

library(dplyr)

# I called the data set 'mydata'; the first column holds the gene names
colnames(mydata)[1] <- "ID"
a <- which(colnames(mydata) == "ID")

## from: http://www.inside-r.org/packages/cran/metaMA/docs/rowVars
# Row-wise sample variance, ignoring NAs by default
rowVars <- function(x, na.rm = TRUE) {
  sqr <- function(x) x * x
  n <- rowSums(!is.na(x))
  n[n <= 1] <- NA
  rowSums(sqr(x - rowMeans(x, na.rm = na.rm)), na.rm = na.rm) / (n - 1)
}

# Dispersion as computed in the question's code (mean / variance), cell columns only
mydata$dispersion <- rowMeans(mydata[, -a]) / rowVars(mydata[, -a])
nbins <- 2 # for you, use 20, or however many you want.
mydata$bin <- ntile(mydata$dispersion, nbins)

# z-normalize the dispersion within each bin
mydata$Z <- 0
for (i in unique(mydata$bin)) {
  in_bin <- mydata$bin == i
  d <- mydata$dispersion[in_bin]
  mydata$Z[in_bin] <- (d - mean(d)) / sd(d)
}

# Flag genes whose |Z| exceeds the threshold
mydata$outlier <- ifelse(abs(mydata$Z) > 1.7, 1, 0)
mydata.small <- mydata[, c(1, 7:10)] ## for display purposes
mydata.small

           ID dispersion bin          Z outlier
0610007P14RIK   1.406851   1 -0.9370254       0
0610009B22RIK   1.475641   1 -0.1158566       0
0610009O20RIK   5.502857   2  0.1333542       0
0610010B08RIK   7.553503   2  0.9266318       0
0610010F05RIK   2.418036   2 -1.0599860       0
0610010K14RIK   1.573546   1  1.0528820       0
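
Note that the code above bins on the dispersion itself and computes it as mean/variance, following the question's code, whereas the question's text asks for bins based on rowMeans and a dispersion of variance/mean. A minimal base R sketch of that reading is below. It is an illustration under stated assumptions, not the answerer's method: `expr` is simulated stand-in data, the rank-based binning is one of several ways to get equal-sized bins, and the one-sided cut-off `z > 1.7` is my interpretation of "exceed a z-score threshold of 1.7".

```r
# Simulated stand-in for the expression matrix (hypothetical data, not the asker's)
set.seed(1)
expr <- matrix(rexp(200 * 30), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("Cell", 1:30)))

n_bins <- 20
mean_expr  <- rowMeans(expr)
dispersion <- apply(expr, 1, var) / mean_expr   # variance / mean, per gene

# Equal-sized bins by rank of mean expression (bin 1 = lowest means)
bin <- ceiling(rank(mean_expr, ties.method = "first") * n_bins / length(mean_expr))

# z-score of the dispersion within each bin; ave() returns a plain vector
z <- ave(dispersion, bin, FUN = function(v) (v - mean(v)) / sd(v))

result <- data.frame(gene = rownames(expr), mean_expr, dispersion, bin, z,
                     outlier = z > 1.7)

# Names of the genes flagged as highly variable within their bin
variable_genes <- result$gene[result$outlier]
```

With real data, genes whose mean expression is exactly zero would give an undefined dispersion and would need to be filtered out first.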

【Comments】: