R: subsetting data based on whether a condition is met by a specific number of columns

【Question title】: R: subsetting data based on whether a condition is met by a specific number of columns
【Posted】: 2014-04-17 15:11:16
【Question】:

I have a data frame of log2(expression-values) for genes, with the following dimensions:

>dim(vst.df)
34215 rows and 64 cols

The 64 columns correspond to 22 controls and 42 cases. The rows correspond to 34215 genes.

The data frame looks like this:

>head(vst.df)[,1:5]
                        sam1      sam2      sam3      sam4      sam5
 ENSG00000000003.10 8.246215  8.671092  8.529269  8.621316  8.415544
 ENSG00000000005.5  5.187977  6.323024  6.022986  5.376513  4.810042
 ENSG00000000419.8  9.654394 10.130017 10.495403 10.209688 10.137285
 ENSG00000000457.9  8.637566  8.604159  8.681583  8.668491  8.874946
 ENSG00000000460.12 7.071433  7.302448  7.499133  7.441582  7.439453
 ENSG00000000938.8  8.713285  8.584996  8.982816  9.787420  8.823927

The colnames are the sampleNames (sam1...sam64) and the rownames are the geneIDs. Which sampleNames are cases and which are controls is given by:

 >head(pData)
 sample_name status  
        sam1   case   
        sam2 contrl  
        sam3 contrl    
        sam4   case  
        sam5   case

The minimum value in the data frame vst.df is:

 >min(vst.df)
 4.10438

I need to filter the data frame vst.df so that, for each gene, EITHER 80% or more of all controls have a value >4.10438 OR 80% or more of the cases have a value >4.10438.

My approach:

#separate the controls and cases in different dataframes
vst.controls <- vst.df[,which(colnames(vst.df) %in% as.character(pData[which(pData$status=="contrl"),1]))]
vst.cases    <- vst.df[,which(colnames(vst.df) %in% as.character(pData[which(pData$status=="case"),1]))]

#80% of controls is approx. 18
#if 80% or more controls have a value >4.10438, then the rowSums must be > round(4.10438*18)=74
vst.controls <- vst.controls[which(rowSums(vst.controls)>74),]

#similarly for cases
#80% of cases is approx. 34
#if 80% or more cases have a value >4.10438, then the rowSums must be > round(4.10438*34)=140
vst.cases <- vst.cases[which(rowSums(vst.cases)>140),]

I actually know that my approach is incorrect; I just wanted to show that I tried something before posting the question here. How can I solve this?

Update 1: I am showing rows from the controls data frame because it is smaller than the cases data frame.

#row where 12 columns (<18 columns) meet the condition
vst.controls[6144,]

                    C00060  C00079   C00135  C00150   C00154  C00176  C00182   P01121  P01160  P01165   P01183   P01200   P01202  P01215   P01226   P01248   P01259
ENSG00000129824.11 4.10438 4.10438 4.903374 4.10438 5.051641 4.10438 4.10438 12.64946 4.10438 4.10438 12.14679 12.45381 12.36571 4.10438 12.05378 12.37071 12.22021
                    P01270   P01273  P01277   P01294   P01325
ENSG00000129824.11 4.10438 12.30081 4.10438 13.38687 12.07337

#row where 20 columns (>18 columns) meet the condition
vst.controls[94,]
                   C00060   C00079  C00135   C00150   C00154   C00176  C00182   P01121  P01160  P01165   P01183   P01200   P01202  P01215   P01226   P01248   P01259
ENSG00000005421.4 4.10438 5.439795 5.25585 6.207467 4.810042 5.459054 5.83844 5.573587 4.93365 4.10438 5.660449 5.075977 5.367907 4.74712 5.016934 5.350099 5.098586
                    P01270  P01273   P01277   P01294  P01325
ENSG00000005421.4 5.719316 4.80001 5.431398 5.553477 4.76463

Update 2:

When I use this:

class(vst.controls)
[1] "data.frame"

class(vst.controls[1644,])
[1] "data.frame"

class(vst.controls[94,])
[1] "data.frame"

rowMeans(vst.controls[1644,] > 4.10438) #it returns me the below
ENSG00000084774.9 
                1 

rowMeans(vst.controls[94,] > 4.10438) #it returns me the below
ENSG00000005421.4 
                1

Thanks,

【Question comments】:

Tags: r bioinformatics

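【Solution 1】:

One way is to count how many columns have a value above the threshold. You can do this with rowSums(vst.controls > 4.10438), assuming there are no columns other than the ones used for subsetting (i.e. vst.controls has exactly 22 columns). The condition then becomes that the number of TRUEs is at least 80% of the total number of samples. For vst.controls, the condition becomes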

which(rowSums(vst.controls > 4.10438) >= 18)    ## 80% of 22 is 17.6, so at least 18 columns
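As a small check of the counting step, the sketch below rebuilds the two control rows shown in Update 1 as a matrix (the name dd is borrowed from the comments; thr, n_above, min_needed and keep are illustrative names, not from the original post) and counts the columns above the threshold. It assumes the values are exactly as printed in the question.

```r
thr <- 4.10438  # threshold as typed in the question

# The two rows printed in Update 1 (22 control columns each)
dd <- rbind(
  "ENSG00000129824.11" = c(4.10438, 4.10438, 4.903374, 4.10438, 5.051641, 4.10438,
                           4.10438, 12.64946, 4.10438, 4.10438, 12.14679, 12.45381,
                           12.36571, 4.10438, 12.05378, 12.37071, 12.22021, 4.10438,
                           12.30081, 4.10438, 13.38687, 12.07337),
  "ENSG00000005421.4"  = c(4.10438, 5.439795, 5.25585, 6.207467, 4.810042, 5.459054,
                           5.83844, 5.573587, 4.93365, 4.10438, 5.660449, 5.075977,
                           5.367907, 4.74712, 5.016934, 5.350099, 5.098586, 5.719316,
                           4.80001, 5.431398, 5.553477, 4.76463)
)

n_above    <- rowSums(dd > thr)           # columns above the threshold, per gene
min_needed <- ceiling(0.8 * ncol(dd))     # smallest count that is >= 80% of 22
keep       <- n_above >= min_needed       # TRUE for genes that pass

print(n_above)
print(keep)
```

With these printed values the first gene has 12 columns above the threshold and the second has 20, so only the second reaches the 18 columns required.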

Or better, use rowMeans to compute the fraction of columns that pass directly (independent of the number of columns):

valid.controls <- which(rowMeans(vst.controls > 4.10438) > 0.8)
valid.cases <- which(rowMeans(vst.cases > 4.10438) > 0.8)
valid <- union(valid.controls, valid.cases)

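This gives you a vector of the row indices that satisfy your condition.

【Comments】:

I tried your code and it returned all 34215 rows. However, there is at least one row for which I know the condition is not met, and it is still being returned.
Could you show two rows from your data, one that meets the condition and one that does not?
When I run it with your data, only the second case clears the condition: rowMeans(dd > 4.10438) returns 0.5454545, where dd is the first row.
How is that possible? When I do the same thing, the first row returns ENSG00000084774.9 1 and the second row returns ENSG00000005421.4 1
Try having it return the rowMeans value instead of the comparison with 0.8. I get 0.545, which predictably returns FALSE under the condition.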
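To show how the index vector can be used to subset the original data frame, here is a minimal end-to-end sketch on made-up data. The toy values, the gene names g1...g4, and the helper names thr, ctrl.names, case.names and vst.filtered are assumptions for illustration; only vst.df, pData, vst.controls, vst.cases and the valid* names come from the thread. The sketch uses >= 0.8 to match the "80% or more" wording of the question.

```r
thr <- 4.10438  # threshold from the question

# Toy expression data (hypothetical): 4 genes x 7 samples
vst.df <- data.frame(
  sam1 = c(8.2, 4.10438, 6.0,     4.10438),
  sam2 = c(8.6, 5.5,     4.10438, 5.0),
  sam3 = c(8.5, 6.3,     4.10438, 4.10438),
  sam4 = c(8.6, 4.10438, 5.3,     5.2),
  sam5 = c(8.4, 4.8,     5.1,     4.10438),
  sam6 = c(8.7, 5.9,     5.0,     6.1),
  sam7 = c(8.3, 4.10438, 5.8,     5.5),
  row.names = c("g1", "g2", "g3", "g4")
)

# Toy phenotype table (hypothetical), same layout as pData in the question
pData <- data.frame(
  sample_name = paste0("sam", 1:7),
  status      = c("case", "contrl", "contrl", "case", "case", "contrl", "case"),
  stringsAsFactors = FALSE
)

# Select columns by sample name for each group
ctrl.names   <- pData$sample_name[pData$status == "contrl"]
case.names   <- pData$sample_name[pData$status == "case"]
vst.controls <- vst.df[, ctrl.names, drop = FALSE]
vst.cases    <- vst.df[, case.names, drop = FALSE]

# Fraction of samples above the threshold, per gene and per group
valid.controls <- which(rowMeans(vst.controls > thr) >= 0.8)
valid.cases    <- which(rowMeans(vst.cases > thr) >= 0.8)
valid          <- sort(union(valid.controls, valid.cases))

# Keep genes that pass in either group
vst.filtered <- vst.df[valid, , drop = FALSE]
print(rownames(vst.filtered))
```

In this toy data g1 passes in both groups, g2 passes only in the controls, g3 passes only in the cases, and g4 passes in neither, so it is dropped.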
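The thread does not settle why the two users see different results. One possible explanation, offered here only as an assumption: R prints 7 significant digits by default, so the 4.10438 shown by min(vst.df) may be a rounded display of a slightly larger stored value. In that case every entry is greater than the typed literal and rowMeans returns 1 for every row, while someone who copies the printed values gets 0.545. The sketch below illustrates this with a made-up stored value (floor.value) and made-up data (x); none of these numbers come from the original data set.

```r
floor.value <- 4.1043801   # hypothetical stored minimum; prints as 4.10438
typed.thr   <- 4.10438     # literal typed from the printed output

# Toy matrix (hypothetical): geneLow is mostly at the floor, geneHigh mostly above it
x <- matrix(c(floor.value, floor.value, 5.2, floor.value, 6.1,
              5.3,         5.9,         6.4, floor.value, 5.7),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("geneLow", "geneHigh"), paste0("s", 1:5)))

shown <- format(floor.value, digits = 7)   # what the default display looks like
exact <- sprintf("%.10f", min(x))          # more digits of the stored minimum

# Comparing against the typed literal counts the floor entries as "above"
frac.typed <- rowMeans(x > typed.thr)

# Comparing against the stored minimum itself does not
frac.min <- rowMeans(x > min(x))

print(shown)
print(exact)
print(frac.typed)
print(frac.min)
```

If this is what is happening, comparing against min(vst.df) itself (or inspecting it with more digits, as above) rather than the typed 4.10438 would be one way to check.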