【Question title】: How to reorganize a dataframe in R
【Posted】: 2014-03-28 20:24:57
【Question】:

I used read.table() to import a CSV file into a data.frame. The data.frame looks like:

X1        X2   X3
Sample    A  
Lot      new
Name     Vol   %
Data     0.1   10
Data     0.2   20
Data     0.3   30
Sample    B  
Lot      old
Name     Vol   %
Data     0.1   50
Data     0.2   60
Data     0.3   70

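I want to reorganize this data.frame so that the first three data points are associated with Sample "A" and Lot "new", and the last three with Sample "B" and Lot "old". I'm trying to come up with an elegant way to do this without a for loop and without manually carving up the data.frame row by row with subsetting commands (i.e. dataA = mydataframe[4:6, ]).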

The data.frame I ultimately want might look like:

A_new_Vol  A_new_%   B_old_Vol   B_old_%
  0.1        10         0.1        50
  0.2        20         0.2        60
  0.3        30         0.3        70

where the Sample, Lot, Vol and % information is merged into the column names themselves.

Another possibility would be a data.frame like this:

Sample   Lot   Vol   %
  A      new   0.1   10
  A      new   0.2   20
  A      new   0.3   30
  B      old   0.1   50
  B      old   0.2   60
  B      old   0.3   70

Any pointers would be greatly appreciated. Thanks!

【Comments】:

Tags: r csv dataframe

【Solution 1】:

Assuming your data is in df:

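# Drop the first row and rename the columns. This appears to assume the first row of df
# holds the "X1 X2 X3" header text; if X1..X3 are already column names, use df, not df[-1, ].
df <- setNames(df[-1, ], c("type", "Vol", "%"))
# Each "Sample" row starts a new block; cumsum() turns that into a group id
df.lst <- split(df, cumsum(df[, 1] == "Sample"))
# For each block, take Sample and Lot from rows 1-2 and keep only the data rows
df.new <- do.call(
  rbind,
  lapply(df.lst, function(x) cbind(Sample=x[1, 2], Lot=x[2, 2], x[-(1:3), -1]))
)
df.new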

This produces (a dput of the result is provided at the end):

     Sample Lot Vol  %
1.5       A new 0.1 10
1.6       A new 0.2 20
1.7       A new 0.3 30
2.11      B old 0.1 50
2.12      B old 0.2 60
2.13      B old 0.3 70
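
To see why the split lines up with the blocks, it may help to look at the grouping vector on its own. The sketch below uses a hypothetical vector first_col standing in for the first column of the data (an assumption for illustration, not an object from the answer); every "Sample" row bumps the running count, so all rows of one block share a group id.

```r
# Hypothetical stand-in for the first column of the input data (illustration only)
first_col <- c("Sample", "Lot", "Name", "Data", "Data", "Data",
               "Sample", "Lot", "Name", "Data", "Data", "Data")

# TRUE counts as 1, so the running sum increases at every "Sample" row
group_id <- cumsum(first_col == "Sample")
print(group_id)

# split() then yields one piece per block, named after the group id
blocks <- split(first_col, group_id)
print(names(blocks))
print(lengths(blocks))
```

Because the grouping is derived from the data itself, the same idea should carry over to blocks of unequal length, as long as each block starts with a "Sample" row.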

If you want the other format, reshape2 is one option:

library(reshape2)
df.new$id2 <- ave(1:nrow(df.new), df.new$Sample, df.new$Lot, FUN=seq_along)
dcast(
  melt(df.new, id.vars=c("Sample", "Lot", "id2")), 
  id2 ~ Sample + Lot + variable
)

This produces:

  id2 A_new_Vol A_new_% B_old_Vol B_old_%
1   1       0.1      10       0.1      50
2   2       0.2      20       0.2      60
3   3       0.3      30       0.3      70

Basically, you need to add an id column, melt once more so that the data is truly in "long" format, and then dcast back to wide format.
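
The id column is what lets the first, second and third observation of each group land on the same row of the wide result. A minimal base R sketch of how ave() with seq_along builds such a within-group counter, using hypothetical vectors sample_col and lot_col that mirror the Sample and Lot columns:

```r
# Hypothetical grouping columns mirroring Sample and Lot (illustration only)
sample_col <- c("A", "A", "A", "B", "B", "B")
lot_col    <- c("new", "new", "new", "old", "old", "old")

# ave() applies FUN within each Sample/Lot combination;
# seq_along() numbers the rows of each group 1, 2, 3, ...
id2 <- ave(seq_along(sample_col), sample_col, lot_col, FUN = seq_along)
print(id2)
```

The values passed as the first argument only set the length and type of the result; seq_along ignores them and simply counts positions within each group.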

Or, if you prefer base R, you can also use (courtesy of Ananda):

df.new <- within(df.new, {
  ID <- ave(rep(1, nrow(df.new)), Sample, FUN = seq_along)
  Time <- paste(Sample, Lot, sep = "_")
})
reshape(df.new, direction = "wide", idvar="ID", timevar="Time", drop=c("Sample", "Lot"))

which results in:

    ID Vol.A_new %.A_new Vol.B_old %.B_old
1.4  1       0.1      10       0.1      50
1.5  2       0.2      20       0.2      60
1.6  3       0.3      30       0.3      70

where df.new starts out as:

structure(list(Sample = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), Lot = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("new", "old"), class = "factor"), Vol = c(0.1, 0.2, 0.3, 0.1, 0.2, 0.3), "%" = c(10L, 20L, 30L, 50L, 60L, 70L), id2 = c(1L, 2L, 3L, 1L, 2L, 3L)), .Names = c("Sample", "Lot", "Vol", "%", "id2"), row.names = c("1.5", "1.6", "1.7", "2.11", "2.12", "2.13"), class = "data.frame")

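【Comments】:

Nice use of cumsum.
Your base reshape solution would be simpler if you just pasted "Sample" and "Lot" together directly in "df.new" to create the timevar, instead of creating the "df.new.mlt" object.
@AnandaMahto, I'm not sure; variable doesn't exist until after the "melt". I've added the dput of df.new to the answer. Note that I still subscribe to the "try-every-permutation-of-every-parameter-in-reshape-until-something-approaching-desired-output-emerges" school, so I'd be happy if you could show me how to do it. I tried your suggestion but got something that wasn't quite right.
@BrodieG, I'm a fan of reshape because it's what I started with, so it doesn't seem too obscure to me. Perhaps the author made a mistake in trying to provide a single function for both the "wide" and "long" variants. Anyway, here is how I would approach it: r-fiddle.org/#/fiddle?id=8yhSfBTL
@AnandaMahto, great, thanks. My problem with reshape is mainly that I've never really sat down to figure out how it works, because I find reshape2 much more intuitive.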

【Solution 2】:

prev_sample_indices <- which(df[[1]] == 'Sample')
sample_indices <- c(prev_sample_indices[-1], nrow(df) + 1)

df <- Reduce(cbind, lapply(seq_along(sample_indices), function(index) {
  sample_index <- prev_sample_indices[index]
  label <- df[sample_index, 2] # A or B
  lot <- df[sample_index + 1, 2] # old or new
  data.frame(structure(lapply(2:3, function(i)
    df[seq(sample_index + 3, sample_indices[index] - 1), i]
  ), .Names = paste0(label, "_", lot, "_", c("Vol", "pct"))))                       
}))

Example:

 df <- data.frame(c("Sample", "Lot", "Name", "Data", "Data", "Data", "Sample", "Lot", "Name", "Data", "Data", "Data"), c("A", "new", "Vol", (1:3)/10, "B", "old", "Vol", (1:3)/10), c("", "", "%", (1:3)*10, "", "", "%", (5:7)*10))
 colnames(df) <- paste0("X", 1:3)
 # Run above code
 print(df)
 #   A_new_Vol A_new_pct B_old_Vol B_old_pct
 # 1       0.1        10       0.1        50
 # 2       0.2        20       0.2        60
 # 3       0.3        30       0.3        70

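Note that data.frame() does not keep % in column names by default: the character is converted to . (unless check.names = FALSE is passed), which is why pct is used in the names above.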
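
As a hedged illustration of that naming behaviour: by default data.frame() runs column names through make.names(), while check.names = FALSE keeps them verbatim. The objects d1 and d2 below are hypothetical and exist only for this sketch.

```r
# Default: names are made syntactically valid, so "%" becomes "."
d1 <- data.frame(`A_new_%` = c(10, 20, 30))
print(names(d1))

# With check.names = FALSE the name is kept as written
d2 <- data.frame(`A_new_%` = c(10, 20, 30), check.names = FALSE)
print(names(d2))

# Non-syntactic names need backticks (or [[ ]]) when accessed
print(d2$`A_new_%`)
```

Keeping the literal % is possible, then, but every later reference to the column needs backticks, so a plain name such as pct is often the more convenient choice.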

【Comments】: