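Reshaping a dataframe for prediction

【Question title】: reshaping a dataframe for prediction
【Posted】: 2016-03-17 00:31:48
【Question】:

I just installed the reshape package today, but I am having trouble understanding how it works.

I have the following dataframe: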

name  workoutnum  time  weight   raceid     final position
tommy      1       12     140       1             2
tommy      2       14     140       1             2 
tommy      3       11     140       1             2
sarah      1       10     115       1             1
sarah      2       10     115       1             1
sarah      3       11     115       1             1
sarah      4       15     115       1             1

How can I put all of this in a single row per person? The dataframe would then look like:

    name  workoutnum1 workoutnum2 workoutnum3 workoutnum4 time1 time2 time3 time4 weight raceid final_position
   tommy     1            1           1           0        12     14   11    NA     140     1           2  
   sarah     1            1           1           1        10     10   11    15     115     1           1

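So each column would have the workout number appended to its name.

Is this even the right approach?

【Comments】:

I don't see why you would want the workoutnum columns like that, but this gets close: reshape(dd, dir = 'wide', idvar = c('name','weight','final.position', 'raceid'), timevar = 'workoutnum', v.names = 'time', sep = '')
Hi @rawr - maybe I am thinking about this the wrong way, so I also welcome suggestions for a better dataframe that keeps all the data from the original but still shows each person, per position, per race, in a single row. I will try your solution shortly!

Tags: r reshape reshape2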

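【Solution 1】:

reshape seems like a natural part of what you want to do, but it won't get you all the way there.

Here is a reshape2 approach that fully melts the data and then casts it back to a wide data.frame, with a few adjustments along the way to get the desired output.

Note that in the call to melt(), the variables in the id.vars argument stay wide. Then in dcast(), the variable that gets spread into wide columns goes on the RHS of the ~.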

library(reshape2)
library(dplyr)

# fully melt the data
d_melt <- melt(d, id.vars = c("name", "raceid", "position", "weight"))
# index the variables within name and variable
d_melt <- d_melt %>%
  group_by(name, variable) %>%
  mutate(i = row_number(),
         wide_variable = paste0(variable, i))

# cast as wide
d_wide <- dcast(d_melt, name + raceid + position + weight ~ wide_variable, value.var = "value")
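# replace the workoutnum values with 0/1 indicators (1 = workout present, 0 = missing)
# across() needs dplyr >= 1.0.0; it replaces the deprecated mutate_each()/funs()
d_wide %>% mutate(across(matches("workoutnum\\d"), ~ ifelse(!is.na(.x), 1L, 0L)))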
#    name raceid position weight time1 time2 time3 time4 workoutnum1 workoutnum2
# 1 sarah      1        1    115    10    10    11    15           1           1
# 2 tommy      1        2    140    12    14    11    NA           1           1
#   workoutnum3 workoutnum4
# 1           1           1
# 2           1           0

Data:

d <- structure(list(
  name = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("sarah", "tommy"), class = "factor"),
  workoutnum = c(1L, 2L, 3L, 1L, 2L, 3L, 4L),
  time = c(12L, 14L, 11L, 10L, 10L, 11L, 15L),
  weight = c(140L, 140L, 140L, 115L, 115L, 115L, 115L),
  raceid = c(1L, 1L, 1L, 1L, 1L, 1L, 1L),
  position = c(2L, 2L, 2L, 1L, 1L, 1L, 1L)),
  .Names = c("name", "workoutnum", "time", "weight", "raceid", "position"),
  class = "data.frame", row.names = c(NA, -7L))
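
For comparison, here is a rough sketch of how far the base reshape() call from the question comments gets, finished off with the 0/1 workout indicators. The data frame is rebuilt inline so the snippet runs on its own. The names w, time_cols and ind are just illustrative, and the indicator step assumes that a missing time always means the workout did not happen.

```r
# Same toy data as above, rebuilt inline so this snippet is self-contained
d <- data.frame(
  name       = c("tommy", "tommy", "tommy", "sarah", "sarah", "sarah", "sarah"),
  workoutnum = c(1L, 2L, 3L, 1L, 2L, 3L, 4L),
  time       = c(12L, 14L, 11L, 10L, 10L, 11L, 15L),
  weight     = c(140L, 140L, 140L, 115L, 115L, 115L, 115L),
  raceid     = 1L,
  position   = c(2L, 2L, 2L, 1L, 1L, 1L, 1L)
)

# Base reshape: one row per person, one timeN column per workout number
w <- reshape(d, direction = "wide",
             idvar = c("name", "weight", "position", "raceid"),
             timevar = "workoutnum", v.names = "time", sep = "")

# Derive workoutnumN indicators from the timeN columns
# (assumption: NA in timeN means workout N was not done)
time_cols <- grep("^time[0-9]+$", names(w), value = TRUE)
ind <- as.data.frame(lapply(w[time_cols], function(x) as.integer(!is.na(x))))
names(ind) <- sub("^time", "workoutnum", time_cols)
w <- cbind(w, ind)
w
```

This gives the same columns as the reshape2 version, only in a different order, so which one to use is mostly a matter of taste.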

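【Comments】:

Hi @effel, I can't try this right now, but I read through the code and it makes sense. I'll try it tomorrow and let you know how it goes.
Hi @effel - this seems to have worked, but when I make predictions I get the following error: Error in object$levels[apply(L, 2, which.max)] : invalid subscript type 'list'. This did not happen before. Do you know what might be causing it?
I see. If the data is in the format you need but you are hitting an error in a later step of your analysis, I suggest posting a new question with a reproducible example of that problem.

【Solution 2】:

Here is an approach using dcast from "data.table", which reshapes the data more like the reshape function in base R does.

The only change I made to the data is to add another "time" variable; as @rawr pointed out in the comments, it almost looks like your "workoutnum" is the time variable.

I used getanID from the "splitstackshape" package to generate the "time" variable, but you could create it in many different ways.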

library(data.table)      # dcast() with several value.var columns
library(splitstackshape) # getanID()
dcast(getanID(mydf, c("name", "raceid", "final_position")), 
      name + raceid + final_position ~ .id, 
      value.var = c("workoutnum", "time", "weight"))

##     name raceid final_position workoutnum_1 workoutnum_2 workoutnum_3
## 1: sarah      1              1            1            2            3
## 2: tommy      1              2            1            2            3
##    workoutnum_4 time_1 time_2 time_3 time_4 weight_1 weight_2 weight_3 weight_4
## 1:            4     10     10     11     15      115      115      115      115
## 2:           NA     12     14     11     NA      140      140      140       NA

If you use getanID, you can also use reshape like this:

reshape(getanID(mydf, c("name", "raceid", "final_position")), 
        idvar = c("name", "raceid", "final_position"), timevar = ".id", 
        direction = "wide")
##     name raceid final_position workoutnum.1 time.1 weight.1 workoutnum.2 time.2
## 1: tommy      1              2            1     12      140            2     14
## 2: sarah      1              1            1     10      115            2     10
##    weight.2 workoutnum.3 time.3 weight.3 workoutnum.4 time.4 weight.4
## 1:      140            3     11      140           NA     NA       NA
## 2:      115            3     11      115            4     15      115

But dcast will generally be more efficient.
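
As mentioned above, the ".id" variable does not have to come from getanID. A minimal base R sketch using ave() to number the rows within each group is shown here. The mydf below is a stand-in built from the question's data, with the last column assumed to be named final_position.

```r
# Stand-in for mydf, built from the question's data (column names are assumed)
mydf <- data.frame(
  name           = c("tommy", "tommy", "tommy", "sarah", "sarah", "sarah", "sarah"),
  workoutnum     = c(1L, 2L, 3L, 1L, 2L, 3L, 4L),
  time           = c(12L, 14L, 11L, 10L, 10L, 11L, 15L),
  weight         = c(140L, 140L, 140L, 115L, 115L, 115L, 115L),
  raceid         = 1L,
  final_position = c(2L, 2L, 2L, 1L, 1L, 1L, 1L)
)

# Running counter within each name / raceid / final_position group
mydf$.id <- ave(seq_len(nrow(mydf)),
                mydf$name, mydf$raceid, mydf$final_position,
                FUN = seq_along)
mydf
```

The resulting data frame can be passed to dcast or reshape in place of the getanID(...) call, with ".id" as the time variable.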
