How to perform pairwise division based on row grouping: answers

【Question title】: How to perform pairwise division based on row grouping
【Posted】: 2016-08-24 10:11:56
【Question】:

I have a data frame built as follows:

df <- structure(list(celltype = structure(c(1L, 1L, 2L, 2L, 3L, 3L,
4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L), .Label = c("Bcells",
"DendriticCells", "Macrophages", "Monocytes", "NKCells", "Neutrophils",
"StemCells", "StromalCells", "abTcells", "gdTCells"), class = "factor"),
    sample = c("SP ID control", "SP ID treated", "SP ID control",
    "SP ID treated", "SP ID control", "SP ID treated", "SP ID control",
    "SP ID treated", "SP ID control", "SP ID treated", "SP ID control",
    "SP ID treated", "SP ID control", "SP ID treated", "SP ID control",
    "SP ID treated", "SP ID control", "SP ID treated", "SP ID control",
    "SP ID treated"), `mean(score)` = c(0.160953535029424, 0.155743474395545,
    0.104788051104575, 0.125247035158472, -0.159665650045289,
    -0.134662049979712, 0.196249441751866, 0.212256889027029,
    0.0532668251890109, 0.0738264693971133, 0.151828478029596,
    0.159941552142933, -0.14128323638966, -0.120556640790534,
    0.196518649474078, 0.185264282171863, 0.0654641151966543,
    0.0837989059507186, 0.145111577618456, 0.145448549866796)), .Names = c("celltype",
"sample", "mean(score)"), row.names = c(7L, 8L, 17L, 18L, 27L,
28L, 37L, 38L, 47L, 48L, 57L, 58L, 67L, 68L, 77L, 78L, 87L, 88L,
97L, 98L), class = "data.frame")

It looks like this:

> df
         celltype        sample mean(score)
7          Bcells SP ID control  0.16095354
8          Bcells SP ID treated  0.15574347
17 DendriticCells SP ID control  0.10478805
18 DendriticCells SP ID treated  0.12524704
27    Macrophages SP ID control -0.15966565
28    Macrophages SP ID treated -0.13466205
37      Monocytes SP ID control  0.19624944
38      Monocytes SP ID treated  0.21225689
47        NKCells SP ID control  0.05326683
48        NKCells SP ID treated  0.07382647
57    Neutrophils SP ID control  0.15182848
58    Neutrophils SP ID treated  0.15994155
67      StemCells SP ID control -0.14128324
68      StemCells SP ID treated -0.12055664
77   StromalCells SP ID control  0.19651865
78   StromalCells SP ID treated  0.18526428
87       abTcells SP ID control  0.06546412
88       abTcells SP ID treated  0.08379891
97       gdTCells SP ID control  0.14511158
98       gdTCells SP ID treated  0.14544855

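What I want to do is divide the treated score by the control score within each celltype group.

The value we are after is the rightmost column of the table below. For Bcells, for example, 0.155 / 0.161 = 0.967.

In the end, I would like to get a df that looks like this: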

celltype            sample          Pairwise division
Bcells              SP ID treated   0.967630031
DendriticCells      SP ID treated   1.195241574
Macrophages         SP ID treated   0.843400255
Monocytes           SP ID treated   1.081566841
NKCells             SP ID treated   1.385974647
Neutrophils         SP ID treated   1.053435786
StemCells           SP ID treated   0.853297563
StromalCells        SP ID treated   0.942731303
abTcells            SP ID treated   1.280073915
gdTCells            SP ID treated   1.002322158

How can I do this in R?

【Question comments】:

Tags: r aggregate

【Solution 1】:

If your data is ordered and completely paired (each control row immediately followed by its treated row):

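# Indices of the treated rows, i.e. every second row: 2, 4, ..., nrow(df)
pair_index <- seq(2, nrow(df), by = 2)
# Divide each treated score by the control score in the row just above it
df[pair_index, 'pairwise-division'] <- df[pair_index, 3] / df[pair_index - 1, 3]
df[pair_index, c(1, 2, 4)]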
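
The positional indexing above only works if the ordering assumption holds, so it may be worth checking it first. The following is a minimal sketch, not part of the original answer: it uses a small, deliberately shuffled subset of the question's data, and the column name `score` is a hypothetical stand-in for `mean(score)`.

```r
# Illustrative subset of the question's data, shuffled on purpose.
# `score` is a hypothetical, syntactic replacement for `mean(score)`.
df <- data.frame(
  celltype = c("Bcells", "DendriticCells", "Bcells", "DendriticCells"),
  sample   = c("SP ID treated", "SP ID control", "SP ID control", "SP ID treated"),
  score    = c(0.155743474395545, 0.104788051104575,
               0.160953535029424, 0.125247035158472),
  stringsAsFactors = FALSE
)

# Sort so that each celltype has its control row directly above its treated row
df <- df[order(df$celltype, df$sample), ]

odd  <- seq(1, nrow(df), by = 2)   # expected control rows
even <- seq(2, nrow(df), by = 2)   # expected treated rows

# Stop early if the data is not ordered and completely paired
stopifnot(
  nrow(df) %% 2 == 0,
  all(df$sample[odd]  == "SP ID control"),
  all(df$sample[even] == "SP ID treated"),
  all(df$celltype[odd] == df$celltype[even])
)

# Positional division is now safe
result <- data.frame(
  celltype          = df$celltype[even],
  sample            = df$sample[even],
  pairwise_division = df$score[even] / df$score[odd]
)
result
```

If any cell type is missing one of its two rows, `stopifnot()` raises an error instead of silently dividing mismatched rows.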

【Comments】:

【Solution 2】:

If you spread the data to wide form, it is straightforward:

library(tidyr)
library(dplyr)

df %>% spread(sample, `mean(score)`) %>% 
    mutate(pairwise_division = `SP ID treated` / `SP ID control`)

##          celltype SP ID control SP ID treated pairwise_division
## 1          Bcells    0.16095354    0.15574347         0.9676300
## 2  DendriticCells    0.10478805    0.12524704         1.1952416
## 3     Macrophages   -0.15966565   -0.13466205         0.8434003
## 4       Monocytes    0.19624944    0.21225689         1.0815668
## 5         NKCells    0.05326683    0.07382647         1.3859746
## 6     Neutrophils    0.15182848    0.15994155         1.0534358
## 7       StemCells   -0.14128324   -0.12055664         0.8532976
## 8    StromalCells    0.19651865    0.18526428         0.9427313
## 9        abTcells    0.06546412    0.08379891         1.2800739
## 10       gdTCells    0.14511158    0.14544855         1.0023222

Note that you should probably fix your column names so you don't have to keep using backticks.
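
As an illustration of that advice, here is a minimal base R sketch of renaming the non-syntactic column. The new name `mean_score` and the shortened sample labels are hypothetical choices, not something the original answer prescribes.

```r
# One pair of rows from the question's data; check.names = FALSE keeps
# the original non-syntactic column name `mean(score)` intact.
df <- data.frame(
  celltype      = c("Bcells", "Bcells"),
  sample        = c("SP ID control", "SP ID treated"),
  `mean(score)` = c(0.160953535029424, 0.155743474395545),
  check.names      = FALSE,
  stringsAsFactors = FALSE
)

# Rename the column to something that needs no backticks (hypothetical name)
names(df)[names(df) == "mean(score)"] <- "mean_score"

# Optionally drop the common "SP ID " prefix so that wide-form column
# names would also be syntactic
df$sample <- sub("^SP ID ", "", df$sample)

df$mean_score   # plain `$` access now works without backticks
```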

To get exactly the desired result, gather back to long form, filter to just the treated rows, and select the columns you need:

df %>% spread(sample, `mean(score)`) %>% 
    mutate(pairwise_division = `SP ID treated` / `SP ID control`) %>% 
    gather(sample, `mean(score)`, starts_with('SP')) %>% 
    filter(sample == 'SP ID treated') %>% 
    select(celltype, sample, pairwise_division)

##          celltype        sample pairwise_division
## 1          Bcells SP ID treated         0.9676300
## 2  DendriticCells SP ID treated         1.1952416
## 3     Macrophages SP ID treated         0.8434003
## 4       Monocytes SP ID treated         1.0815668
## 5         NKCells SP ID treated         1.3859746
## 6     Neutrophils SP ID treated         1.0534358
## 7       StemCells SP ID treated         0.8532976
## 8    StromalCells SP ID treated         0.9427313
## 9        abTcells SP ID treated         1.2800739
## 10       gdTCells SP ID treated         1.0023222

If you prefer, equivalent versions are possible in base R and data.table. Or go directly with aggregate:

aggregate(cbind(pairwise_division = `mean(score)`) ~ celltype, 
          df[order(df$celltype, df$sample), ], 
          FUN = function(x){x[2]/x[1]})

##          celltype pairwise_division
## 1          Bcells         0.9676300
## 2  DendriticCells         1.1952416
## 3     Macrophages         0.8434003
## 4       Monocytes         1.0815668
## 5         NKCells         1.3859746
## 6     Neutrophils         1.0534358
## 7       StemCells         0.8532976
## 8    StromalCells         0.9427313
## 9        abTcells         1.2800739
## 10       gdTCells         1.0023222
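
The `aggregate` call relies on sorting so that `x[1]` is the control value and `x[2]` the treated value. One possible base R alternative, sketched here under the assumption that the sample labels are exactly `"SP ID control"` and `"SP ID treated"`, matches each treated row to its control by `celltype` instead of by position. It is an illustration, not the answer author's code, and it uses a shuffled subset of the question's data.

```r
# Shuffled subset of the question's data
df <- data.frame(
  celltype      = c("Macrophages", "Bcells", "Bcells", "Macrophages"),
  sample        = c("SP ID treated", "SP ID control",
                    "SP ID treated", "SP ID control"),
  `mean(score)` = c(-0.134662049979712, 0.160953535029424,
                    0.155743474395545, -0.159665650045289),
  check.names      = FALSE,
  stringsAsFactors = FALSE
)

control <- df[df$sample == "SP ID control", ]
treated <- df[df$sample == "SP ID treated", ]

# For each treated row, find the position of the control row with the
# same celltype (NA if that celltype has no control row)
idx <- match(treated$celltype, control$celltype)

result <- data.frame(
  celltype          = treated$celltype,
  sample            = treated$sample,
  pairwise_division = treated[["mean(score)"]] / control[["mean(score)"]][idx],
  stringsAsFactors  = FALSE
)
result
```

Because the lookup is by name, row order does not matter, and a cell type without a control row yields `NA` rather than a ratio computed from the wrong pair.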

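【Comments】:

Thanks, but why isn't the value in the first row of the result 0.967630031?
Oops, I divided the wrong way round and posted the wrong version. Fixed.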