How to arrange data from wide format to long format and specify relationships: answers

【Question title】: How can I arrange data from wide format to long format, and specify relationships
【Posted】: 2015-11-09 06:08:56
【Question description】:

I currently have a file that needs to be converted from wide format to long format. A sample of the data:

Subject,Cat1_Weight,Cat2_Weight,Cat3_Weight,Cat1_Sick,Cat2_Sick,Cat3_Sick
1,10,11,12,1,0,0
2,7,8,9,1,0,0

However, I need it in long format, as follows:

Subject,CatNumber,Weight,Sickness
1,1,10,1
1,2,11,0
1,3,12,0
2,1,7,1
2,2,8,0
2,3,9,0

So far I have tried using the melt function in R:

datalong <- melt(exp2_simon_shortform, id ="Subject")

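But it treats each column name as a unique variable, each with its own value. Does anyone know how I can go from wide to long in a specified way, based on the column header names?

Cheers.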

Edit: I realised I made a mistake. My final output needs to look as follows, so from the Cat1_ part I actually need to extract both "Cat" and "1":

Subject Animal  CatNumber   Weight  Sickness
1   Cat 1   10  1
1   Cat 2   11  0
1   Cat 3   12  0
2   Cat 1   7   1
2   Cat 2   8   0
2   Cat 3   9   0

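Any updated solutions would be much appreciated.

【Question comments】:

Yes, to be honest, it is a relatively minor tweak to the question. I imagine people looking for a solution to this problem won't be hindered by my update. But apologies for that, I'm new to Stack norms.
No problem, I have updated the solution.

Tags: r reshape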

【Solution 1】:

A "dplyr" + "tidyr" approach might look something like this:

library(dplyr)
library(tidyr)
mydf %>%
  gather(var, val, -Subject) %>%
  separate(var, into = c("CatNumber", "variable")) %>%
  spread(variable, val) 
#   Subject CatNumber Sick Weight
# 1       1      Cat1    1     10
# 2       1      Cat2    0     11
# 3       1      Cat3    0     12
# 4       2      Cat1    1      7
# 5       2      Cat2    0      8
# 6       2      Cat3    0      9

Add a mutate with gsub to that pipeline to strip the "Cat" part from the "CatNumber" column.
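As a minimal sketch of that mutate + gsub step, the pipeline below rebuilds the question's sample data as `mydf` and also derives the `Animal` column requested in the question's edit. The result name `long` and the rename of `Sick` to `Sickness` are assumptions made for illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, rebuilt by hand
mydf <- data.frame(
  Subject = 1:2,
  Cat1_Weight = c(10, 7), Cat2_Weight = c(11, 8), Cat3_Weight = c(12, 9),
  Cat1_Sick = c(1, 1), Cat2_Sick = c(0, 0), Cat3_Sick = c(0, 0)
)

long <- mydf %>%
  gather(var, val, -Subject) %>%                          # wide to long
  separate(var, into = c("CatNumber", "variable")) %>%    # "Cat1_Weight" -> "Cat1", "Weight"
  spread(variable, val) %>%                               # one column per measure
  mutate(
    Animal = gsub("[0-9]+$", "", CatNumber),              # "Cat1" -> "Cat"
    CatNumber = as.integer(gsub("^[^0-9]*", "", CatNumber))  # "Cat1" -> 1
  ) %>%
  select(Subject, Animal, CatNumber, Weight, Sickness = Sick)

print(long)
```

Because mutate evaluates its arguments in order, `Animal` is computed from the original "Cat1"-style value before `CatNumber` is overwritten with the number.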

Update

Based on the discussions in chat, your data actually looks more like this:

A = c("ATCint", "Blank", "None"); B = 1:5; C = c("ResumptionTime", "ResumptionMisses")

colNames <- expand.grid(A, B, C)
colNames <- sprintf("%s%d_%s", colNames[[1]], colNames[[2]], colNames[[3]])

subject = 1:60

set.seed(1)
M <- matrix(sample(10, length(subject) * length(colNames), TRUE), 
            nrow = length(subject), dimnames = list(NULL, colNames))

mydf <- data.frame(Subject = subject, M)

As such, you need a few extra steps to get the desired output. Try:

library(dplyr)
library(tidyr)
mydf %>% 
  group_by(Subject) %>%                    ## Your ID variable
  gather(var, val, -Subject) %>%           ## Make long data. Everything except your IDs
  separate(var, into = c("partA", "partB")) %>%  ## Split new column into two parts
  mutate(partA = gsub("(.*)([0-9]+)", "\\1_\\2", partA)) %>% ## Make new col easy to split
  separate(partA, into = c("A1", "A2")) %>%                  ## Split this new column
  spread(partB, val)                                         ## Transform to wide form

This yields:

Source: local data frame [900 x 5]

   Subject     A1    A2 ResumptionMisses ResumptionTime
     (int)  (chr) (chr)            (int)          (int)
1        1 ATCint     1                9              3
2        1 ATCint     2                4              3
3        1 ATCint     3                2              2
4        1 ATCint     4                7              4
5        1 ATCint     5                7              1
6        1  Blank     1                4             10
7        1  Blank     2                2              4
8        1  Blank     3                7              5
9        1  Blank     4                1              9
10       1  Blank     5               10             10
..     ...    ...   ...              ...            ...

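【Comments】:

Wow.. 100K! That came up quickly!
Google tells me the party will last until next week :-)
Thanks very much for the response. Just trying to decipher it now. What does "\\2_\\1" do in merged.stack?
@MichaelAnderson, in the gsub step, this basically changes your column names from "Cat1_Weight" to "Weight_Cat1". "(.*)_(.*)" means look for a group of anything up to an underscore, followed by another group of anything. We then replace that with the second group first (\\2), then an underscore, then the first group (\\1). Hope that makes sense :-)
@ananda-mahto Thanks, and sorry to ask, but could you take a look at the edit to my question and see whether you can adjust the solution?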
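
The gsub back-reference swap described in the comments above can be checked on its own with base R. The vector name `old_names` is only illustrative.

```r
# A few of the question's column names
old_names <- c("Subject", "Cat1_Weight", "Cat2_Sick")

# Swap the part before the underscore with the part after it.
# "Subject" has no underscore, so the pattern does not match and it is left alone.
new_names <- gsub("(.*)_(.*)", "\\2_\\1", old_names)

print(new_names)
# "Subject"     "Weight_Cat1" "Sick_Cat2"
```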

【Solution 2】:

You can do this with base reshape, for example:

reshape(dat, idvar="Subject", direction="long", varying=list(2:4,5:7),
        v.names=c("Weight","Sick"), timevar="CatNumber")

#    Subject CatNumber Weight Sick
#1.1       1         1     10    1
#2.1       2         1      7    1
#1.2       1         2     11    0
#2.2       2         2      8    0
#1.3       1         3     12    0
#2.3       2         3      9    0

Alternatively, since reshape expects names like variablename_groupname, you can change the names and then let reshape do the hard work:

names(dat) <- gsub("Cat(.+)_(.+)", "\\2_\\1", names(dat))
reshape(dat, idvar="Subject", direction="long", varying=-1, 
        sep="_", timevar="CatNumber")

#    Subject CatNumber Weight Sick
#1.1       1         1     10    1
#2.1       2         1      7    1
#1.2       1         2     11    0
#2.2       2         2      8    0
#1.3       1         3     12    0
#2.3       2         3      9    0

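【Comments】:

You forced me to write a "dplyr" answer. +1
Cheers for the response. In terms of changing the names, what should I do for my real variable names: "ATCint[1:5]_ResumptionTime" "Blank[1:5]_ResumptionTime" "None[1:5]_ResumptionTime" "ATCint[1:5]_ResumptionMisses" "Blank[1:5]_ResumptionMisses" "None[1:5]_ResumptionMisses"? I just realised my mistake in labelling all the variables as cat.
@MichaelAnderson, are those your actual column names, or does [1:5] mean you have five such columns?
Yes, there are 5 such columns. Sorry about that.
@MichaelAnderson, so you start with 30 variable columns plus 1 or more id columns, and you want to end up with the id columns + "ResumptionTime" + "ResumptionMisses" + whether it is "ATCint", "Blank" or "None", and then the number and the values? I just want to clarify what you are actually dealing with.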
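
For column names of the kind mentioned in the comments above (for example "ATCint1_ResumptionTime"), the same rename-then-reshape idea might be adapted roughly as follows. This is only a sketch: the data values, the reduced set of four columns, and the names `Condition` and `Number` are made up for illustration.

```r
# Small made-up data set with names shaped like the real ones
dat <- data.frame(
  Subject = 1:2,
  ATCint1_ResumptionTime = c(3, 5), Blank1_ResumptionTime = c(4, 6),
  ATCint1_ResumptionMisses = c(0, 1), Blank1_ResumptionMisses = c(2, 0)
)

# Put the measure name first so reshape can split on "_":
# "ATCint1_ResumptionTime" -> "ResumptionTime_ATCint1"
names(dat) <- gsub("(.*)_(.*)", "\\2_\\1", names(dat))

long <- reshape(dat, idvar = "Subject", direction = "long",
                varying = -1, sep = "_", timevar = "Condition")

# Split "ATCint1" into its trailing number and its label
long$Number <- as.integer(sub("^[^0-9]*", "", long$Condition))
long$Condition <- sub("[0-9]+$", "", long$Condition)

print(long)
```

Here reshape derives the value columns (ResumptionTime, ResumptionMisses) from the text before the underscore and the `Condition` values from the text after it, so the two sub calls only have to separate the label from the number.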

【Solution 3】:

We can use melt from library(data.table), which can take multiple patterns for the measure variables.

library(data.table)#v1.9.6+
DT <- melt(setDT(df1), measure=patterns('Weight$', 'Sick$'), 
            variable.name='CatNumber', value.name=c('Weight', 'Sick'))[order(Subject)]
DT 
#   Subject CatNumber Weight Sick
#1:       1         1     10    1
#2:       1         2     11    0
#3:       1         3     12    0
#4:       2         1      7    1
#5:       2         2      8    0
#6:       2         3      9    0

If we need the 'Animal' column, we can grep for the 'Cat' columns, remove the suffix substring with sub, and assign (:=) the result to create the 'Animal' column.

DT[, Animal := sub('\\d+\\_.*', '', grep('Cat', colnames(df1), value=TRUE))]

DT
#   Subject CatNumber Weight Sick Animal
#1:       1         1     10    1    Cat
#2:       1         2     11    0    Cat
#3:       1         3     12    0    Cat
#4:       2         1      7    1    Cat
#5:       2         2      8    0    Cat
#6:       2         3      9    0    Cat
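
The `Animal` assignment above appears to line up because the sample has six 'Cat' columns and the melted table also has six rows. As a hedged alternative that does not depend on that coincidence, the sketch below splits each original column name into its three parts before casting back. The names `long`, `res` and `measure` are illustrative, and this is a different route from the patterns() approach shown above.

```r
library(data.table)

# Sample data from the question, rebuilt by hand
df1 <- data.frame(
  Subject = 1:2,
  Cat1_Weight = c(10, 7), Cat2_Weight = c(11, 8), Cat3_Weight = c(12, 9),
  Cat1_Sick = c(1, 1), Cat2_Sick = c(0, 0), Cat3_Sick = c(0, 0)
)

# Fully long: one row per Subject and original column name
long <- melt(as.data.table(df1), id.vars = "Subject", variable.factor = FALSE)

# "Cat1_Weight" -> "Cat 1 Weight", then split on the spaces into three columns
long[, c("Animal", "CatNumber", "measure") :=
       tstrsplit(sub("^([A-Za-z]+)([0-9]+)_(.*)$", "\\1 \\2 \\3", variable),
                 " ", fixed = TRUE)]

# Back to one column per measure (Sick, Weight)
res <- dcast(long, Subject + Animal + CatNumber ~ measure, value.var = "value")

print(res)
```

Note that `CatNumber` stays a character column here; wrap it in as.integer() if a numeric column is needed.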

【Comments】: