【Posted on】: 2014-04-17 15:11:16
【Question】:
I have a data frame of log2 expression values for genes, with the following dimensions:
>dim(vst.df)
34215 rows and 64 cols
The 64 columns correspond to 22 controls and 42 cases. The rows correspond to 34215 genes.
The data frame looks like this:
>head(vst.df)[,1:5]
sam1 sam2 sam3 sam4 sam5
ENSG00000000003.10 8.246215 8.671092 8.529269 8.621316 8.415544
ENSG00000000005.5 5.187977 6.323024 6.022986 5.376513 4.810042
ENSG00000000419.8 9.654394 10.130017 10.495403 10.209688 10.137285
ENSG00000000457.9 8.637566 8.604159 8.681583 8.668491 8.874946
ENSG00000000460.12 7.071433 7.302448 7.499133 7.441582 7.439453
ENSG00000000938.8 8.713285 8.584996 8.982816 9.787420 8.823927
The colnames are the sample names (sam1 ... sam64) and the rownames are the gene IDs. Which samples are cases and which are controls is given by:
>head(pData)
sample_name status
sam1 case
sam2 contrl
sam3 contrl
sam4 case
sam5 case
The minimum value in the data frame vst.df is:
>min(vst.df)
4.10438
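I need to filter the data frame vst.df so that, for each gene, EITHER 80% or more of the controls have a value > 4.10438 OR 80% or more of the cases have a value > 4.10438.
My approach: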
#separate the controls and cases in different dataframes
vst.controls <- vst.df[,which(colnames(vst.df) %in% as.character(pData[which(pData$status=="contrl"),1]))]
vst.cases <- vst.df[,which(colnames(vst.df) %in% as.character(pData[which(pData$status=="case"),1]))]
#80% of controls is approx. 18
#if 80% or more controls have a value >4.10438, then the rowSums must be > round(4.10438*18)=74
vst.controls <- vst.controls[which(rowSums(vst.controls)>74),]
#similarly for cases
#80% of cases is approx. 34
#if 80% or more cases have a value >4.10438, then the rowSums must be > round(4.10438*34)=140
vst.cases <- vst.cases[which(rowSums(vst.cases)>140),]
I know my approach is actually incorrect; I just wanted to show that I tried something before posting the question here. How can I solve this?
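One way to express the 80% rule directly is to compute, per gene, the fraction of samples in each group that lie above the floor, rather than summing the values. The sketch below is a minimal illustration on toy data: `toy.df` and `toy.pData` are hypothetical stand-ins for `vst.df` and `pData`, and the 5-controls/5-cases layout is an assumption made only to keep the example small.

```r
# Hypothetical toy data standing in for vst.df and pData (not the real data).
floor_val <- 4.10438
m <- matrix(floor_val, nrow = 4, ncol = 10,
            dimnames = list(paste0("gene", 1:4), paste0("sam", 1:10)))
m["gene1", 1:5]          <- 8  # above the floor in all 5 controls only
m["gene2", 6:10]         <- 9  # above the floor in all 5 cases only
m["gene3", c(1:3, 6:8)]  <- 7  # above the floor in 3 of 5 per group (60%)
# gene4 stays at the floor in every sample
toy.df <- as.data.frame(m)

toy.pData <- data.frame(
  sample_name = paste0("sam", 1:10),
  status = rep(c("contrl", "case"), each = 5),
  stringsAsFactors = FALSE
)

# Logical column selectors for each group, driven by the phenotype table
is.ctrl <- colnames(toy.df) %in% toy.pData$sample_name[toy.pData$status == "contrl"]
is.case <- colnames(toy.df) %in% toy.pData$sample_name[toy.pData$status == "case"]

# Comparing a data frame with a number yields a logical matrix;
# rowMeans() of that matrix is the fraction of TRUE values per gene.
above <- toy.df > floor_val
frac.ctrl <- rowMeans(above[, is.ctrl, drop = FALSE])
frac.case <- rowMeans(above[, is.case, drop = FALSE])

# Keep a gene if either group reaches the 80% mark
keep <- frac.ctrl >= 0.8 | frac.case >= 0.8
toy.filtered <- toy.df[keep, , drop = FALSE]
print(rownames(toy.filtered))
```

On the real data the same pattern would presumably apply with `vst.df` and `pData` in place of the toy objects, and with the threshold taken from `min(vst.df)` rather than typed by hand.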
Update 1: I am showing rows from the controls data frame, because it is smaller than the cases one.
#row where 12 columns (<18 columns) meet the condition
vst.controls[6144,]
C00060 C00079 C00135 C00150 C00154 C00176 C00182 P01121 P01160 P01165 P01183 P01200 P01202 P01215 P01226 P01248 P01259
ENSG00000129824.11 4.10438 4.10438 4.903374 4.10438 5.051641 4.10438 4.10438 12.64946 4.10438 4.10438 12.14679 12.45381 12.36571 4.10438 12.05378 12.37071 12.22021
P01270 P01273 P01277 P01294 P01325
ENSG00000129824.11 4.10438 12.30081 4.10438 13.38687 12.07337
#row where 20 columns (>18 columns) meet the condition
vst.controls[94,]
C00060 C00079 C00135 C00150 C00154 C00176 C00182 P01121 P01160 P01165 P01183 P01200 P01202 P01215 P01226 P01248 P01259
ENSG00000005421.4 4.10438 5.439795 5.25585 6.207467 4.810042 5.459054 5.83844 5.573587 4.93365 4.10438 5.660449 5.075977 5.367907 4.74712 5.016934 5.350099 5.098586
P01270 P01273 P01277 P01294 P01325
ENSG00000005421.4 5.719316 4.80001 5.431398 5.553477 4.76463
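The two rows above can be checked by counting the values above the floor instead of summing them. This sketch retypes the printed (rounded) values, so it only approximates the stored data; the names `ctrl`, `n.above`, `sum.test` and `count.test` are introduced here purely for illustration. It also suggests why the sum-based test cannot discriminate: every value is at least 4.10438, so any row of 22 controls sums to at least 22 * 4.10438 = 90.29636, which already exceeds 74.

```r
thr <- 4.10438

# Values copied from the printed output of vst.controls[6144,] and vst.controls[94,]
row6144 <- c(4.10438, 4.10438, 4.903374, 4.10438, 5.051641, 4.10438, 4.10438,
             12.64946, 4.10438, 4.10438, 12.14679, 12.45381, 12.36571, 4.10438,
             12.05378, 12.37071, 12.22021, 4.10438, 12.30081, 4.10438,
             13.38687, 12.07337)
row94 <- c(4.10438, 5.439795, 5.25585, 6.207467, 4.810042, 5.459054, 5.83844,
           5.573587, 4.93365, 4.10438, 5.660449, 5.075977, 5.367907, 4.74712,
           5.016934, 5.350099, 5.098586, 5.719316, 4.80001, 5.431398,
           5.553477, 4.76463)

ctrl <- rbind(row6144, row94)
rownames(ctrl) <- c("ENSG00000129824.11", "ENSG00000005421.4")

# Number of controls strictly above the floor, per gene
n.above <- rowSums(ctrl > thr)
print(n.above)               # 12 and 20

# The sum-based test from the question passes both rows ...
sum.test <- rowSums(ctrl) > 74
print(sum.test)

# ... whereas counting (at least 18 of 22 controls) separates them
count.test <- n.above >= 18
print(count.test)
```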
Update 2:
When I run this:
class(vst.controls)
[1] "data.frame"
class(vst.controls[1644,])
[1] "data.frame"
class(vst.controls[94,])
[1] "data.frame"
rowMeans(vst.controls[1644,] > 4.10438) #it returns me the below
ENSG00000084774.9
1
rowMeans(vst.controls[94,] > 4.10438) #it returns me the below
ENSG00000005421.4
1
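A result of 1 for row 94 looks inconsistent with the two entries that print as 4.10438 in that row. One possible explanation, offered here only as an assumption, is that R prints 7 significant digits by default, so the stored floor may be slightly larger than the typed literal 4.10438. The sketch below uses a made-up full-precision value (`stored_min`) and a made-up one-row data frame `x` to show the effect.

```r
# Hypothetical full-precision floor; the real stored value is not known here.
stored_min <- 4.1043804

# Hypothetical one-row data frame: two entries sit exactly at the floor.
x <- data.frame(a = stored_min, b = 5.439795, c = stored_min, d = 6.207467,
                row.names = "geneA")

# With the default of 7 significant digits the floor is displayed rounded
print(format(stored_min, digits = 7))

# Comparing against the typed, rounded literal: every entry counts as "above"
frac.literal <- rowMeans(x > 4.10438)
print(frac.literal)

# Comparing against the stored minimum: entries at the floor are excluded
frac.stored <- rowMeans(x > min(x))
print(frac.stored)
```

If this is what is happening in the real data, taking the threshold from `min(vst.df)` instead of typing the printed number would avoid the mismatch.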
Thanks,
【Comments】:
Tags: r bioinformatics