[Question title]: Match rows, create ranks and sum ranks by group in R
[Posted]: 2017-09-28 13:36:15
[Question]:

I have a huge dataset with ~30,000 rows and ~17,000 columns, plus a vector of character elements.

Here is a dummy set that recreates my dataset:

### Example

df <- data.frame(Gene=paste0("gene", 1:60), replicate(60, runif(60, min=0, max=100)))
colnames(df) <- c("GeneName", paste0("TisA.", 1:20), paste0("TisB.", 1:20), paste0("TisC.", 1:20))

genes <- sample(df$GeneName, 5)

head(df)
#      GeneName    TisA.1    TisA.2    TisA.3   TisA.4
#1    gene1  1.987621 17.936562 18.145417 59.43023
#2    gene2 60.031713 73.822846 93.946769 72.27633
#3    gene3 44.833748 47.890719 77.100497 39.45719
#4    gene4 44.662776 26.285659 30.087606 49.50682
#5    gene5 63.770411  6.469006  3.797708 68.17532

I need to match the elements of the vector against the data frame, which is easily done:

 df.new <- df[df$GeneName %in% genes,]
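A side note on the subsetting step, as a small sketch on made-up toy data (not the question's dataset): `%in%` returns the matching rows in the data frame's own order, while `match()` returns them in the order of the lookup vector. If the output should follow the order of `genes`, the second form may be preferable. Note that `match()` yields `NA` for any name not found, which would produce an all-`NA` row.

```r
# Toy data, for illustration only
df <- data.frame(GeneName = paste0("gene", 1:6),
                 value = c(10, 20, 30, 40, 50, 60))
genes <- c("gene4", "gene1", "gene5")

# %in% keeps the row order of df
by_in <- df[df$GeneName %in% genes, ]

# match() follows the order of the genes vector
by_match <- df[match(genes, df$GeneName), ]

print(by_in$GeneName)
print(by_match$GeneName)
```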

What I want next is, for each gene in genes, to create rank values for that gene and then sum the ranks by Tis (A, B, C).

For example, I can sort the values for one gene:

genes.ord <- sort(df.new[1,], decreasing = TRUE)

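However, I am stuck here: what would be the fastest way to assign ranks for each gene and sum those ranks by group, i.e. TisA, TisB, TisC?

To clarify, each group has 20 samples: TisA.1, TisA.2, ..., TisA.20

The desired output is: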

 GeneName   TisA TisB TisC
    gene4     24   32   10 ## these are random values to show sum of ranks for each of genes in the vector
    gene1     14   12   20 ## these are random values to show sum of ranks for each of genes in the vector
   gene40      4   92   12 ## these are random values to show sum of ranks for each of genes in the vector
    gene2     64    2   40 ## these are random values to show sum of ranks for each of genes in the vector
   gene15     84   32    9 ## these are random values to show sum of ranks for each of genes in the vector

P.S. Some values in my real dataset can be 0 and can be repeated across different columns.
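Because zeros and repeated values produce ties, the choice of `ties.method` in base R's `rank()` affects the rank sums. The sketch below uses a made-up vector (not the question's data) to show three of the available methods. With `"average"` or `"first"` the ranks of n values always add up to n(n+1)/2; with `"min"` they generally do not.

```r
# Made-up values with ties, including zeros
x <- c(0, 0, 5, 3, 3)

r_avg   <- rank(x, ties.method = "average")  # tied values share the mean rank
r_min   <- rank(x, ties.method = "min")      # tied values all get the lowest rank
r_first <- rank(x, ties.method = "first")    # ties broken by position

print(r_avg)    # 1.5 1.5 5.0 3.5 3.5
print(r_min)    # 1 1 5 3 3
print(r_first)  # 1 2 5 3 4
```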

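[Comments]:

  • What kind of "group" are you talking about? Your genes are labelled 1-60 and you have 60 rows.
  • The "group" would be "TisA", "TisB" or "TisC", each with 20 elements, e.g. "TisA.1", "TisA.2", ..., "TisA.20"

Tags: r dataframe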


[Solution 1]:

Straightforward with the tidyverse:

# your data. Including seed to make it reproducible
set.seed(123)
df <- data.frame(Gene=paste0("gene", 1:60), replicate(60, runif(60, min=0, max=100)))
colnames(df) <- c("GeneName", paste0("TisA.", 1:20), paste0("TisB.", 1:20), paste0("TisC.", 1:20))

library(tidyverse)
as.tbl(df) %>% 
    gather(key, value, -GeneName) %>% 
    group_by(GeneName) %>% 
    mutate(Ranks = rank(value, ties.method = "first"))  %>% 
    separate(key, into = c("key1", "key2"), sep = "[.]") %>% 
    group_by(GeneName,key1) %>% 
    summarise(Sum=sum(Ranks)) %>% 
    spread(key1, Sum)
# A tibble: 60 x 4
# Groups:   GeneName [60]
GeneName  TisA  TisB  TisC
*   <fctr> <int> <int> <int>
1    gene1   698   620   512
2   gene10   525   653   652
3   gene11   631   598   601
4   gene12   487   679   664
5   gene13   688   579   563
6   gene14   674   581   575
7   gene15   618   647   565
8   gene16   696   552   582
9   gene17   656   560   614
10  gene18   543   649   638 

Or try a base R solution... a bit more complicated:

df1 <- apply(df[-1], 1, rank, ties.method= "first")
df2 <- apply(df1, 2, function(x){
  aggregate(x, list(sapply(strsplit(colnames(df), "[.]"), "[", 1)[-1]), sum)
  })
df3 <- cbind.data.frame(df$GeneName, t(Reduce(cbind, lapply(df2, "[", 2))))
colnames(df3) <- c("GeneName",  "TisA", "TisB", "TisC")
head(df3[order(df3$GeneName),])
GeneName TisA TisB TisC
   gene1  698  620  512
  gene10  525  653  652
  gene11  631  598  601
  gene12  487  679  664
  gene13  688  579  563
  gene14  674  581  575
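For comparison, here is a more compact base R sketch of the same idea (rank within each gene across all samples, then sum the ranks per tissue), restricted to the genes in the vector. It is not the answerer's code: it swaps `aggregate()` for `rowsum()`, and the names `sub`, `ranks`, `tissue` and `result` are introduced here for illustration. It assumes the tissue label is everything before the first dot in the column name.

```r
set.seed(123)
df <- data.frame(Gene = paste0("gene", 1:60),
                 replicate(60, runif(60, min = 0, max = 100)))
colnames(df) <- c("GeneName", paste0("TisA.", 1:20),
                  paste0("TisB.", 1:20), paste0("TisC.", 1:20))
genes <- sample(df$GeneName, 5)

# Keep only the genes of interest before ranking
sub <- df[df$GeneName %in% genes, ]

# Rank each row (gene) across its 60 samples; apply() returns samples x genes
ranks <- apply(as.matrix(sub[-1]), 1, rank, ties.method = "first")

# Tissue label per sample column: the part of the name before the first dot
tissue <- sub("[.].*$", "", colnames(sub)[-1])

# rowsum() adds up the rows that share a tissue label; transpose to genes x tissues
result <- data.frame(GeneName = sub$GeneName,
                     t(rowsum(ranks, group = tissue)),
                     row.names = NULL)
print(result)
```

Subsetting first means only the selected rows are ranked, which may matter for a dataset with tens of thousands of rows.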

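[Comments]:

  • Thanks Jimbou. Would there be a simpler way if the group information for the column names (i.e. "TisA.1", "TisA.2") were stored in a data.frame, given that the real column names in my dataset are combinations of letters and numbers?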
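One possible way to handle the case raised in this comment, sketched under assumptions: the group of each sample is looked up in a separate table with `match()` instead of being parsed from the column name. The data, the sample names, and the lookup table `sample_info` with its columns `sample` and `tissue` are all hypothetical.

```r
set.seed(1)
# Hypothetical data: sample names mix letters and digits, with no tissue prefix
expr <- data.frame(GeneName = paste0("gene", 1:4),
                   matrix(runif(4 * 6, min = 0, max = 100), nrow = 4))
colnames(expr)[-1] <- c("S01a", "X7", "Q3b", "S02a", "K9", "Z4c")

# Hypothetical lookup table: one row per sample column, in any order
sample_info <- data.frame(
  sample = c("X7", "S01a", "K9", "Q3b", "Z4c", "S02a"),
  tissue = c("TisA", "TisA", "TisB", "TisB", "TisC", "TisC")
)

# Rank each gene across its samples; apply() returns samples x genes
ranks <- apply(as.matrix(expr[-1]), 1, rank, ties.method = "first")

# Look up the tissue of every sample column, in column order
tissue <- sample_info$tissue[match(colnames(expr)[-1], sample_info$sample)]
stopifnot(!anyNA(tissue))  # every column must appear in the lookup table

# Sum the ranks per tissue and transpose to genes x tissues
result <- data.frame(GeneName = expr$GeneName,
                     t(rowsum(ranks, group = tissue)),
                     row.names = NULL)
print(result)
```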