Using two value columns in the spread() function in R [duplicate]

【Question title】: Using two value columns in the spread() function in R [duplicate]
【Posted】: 2015-09-16 12:10:42
【Question】:

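I recently posted a question asking how to reshape data from long to wide format, and I learned that spread() is a very handy function for this. Now I need to take my previous post one step further.

Suppose we have a table like this: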

id1   |  id2 |  info  | action_time | action_comment  |
 1    | a    |  info1 |    time1    |        comment1 |
 1    | a    |  info1 |    time2    |        comment2 |
 1    | a    |  info1 |    time3    |        comment3 |
 2    | b    |  info2 |    time4    |        comment4 |
 2    | b    |  info2 |    time5    |        comment5 |

I want to reshape it into this:

id1   |  id2 |  info  |action_time 1|action_comment1 |action_time 2|action_comment2 |action_time 3|action_comment3  |
 1    | a    |  info1 |    time1    |      comment1  |    time2    |      comment2  |    time3    |      comment3   |
 2    | b    |  info2 |    time4    |      comment4  |    time5    |      comment5  |             |                 |

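So the difference between this question and my previous one is that I have added another column, and it needs to be reshaped as well.

I was thinking of using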

library(dplyr)
library(tidyr)

df %>% 
  group_by(id1) %>% 
  mutate(action_no = paste("action_time", row_number())) %>%
  spread(action_no, value = c(action_time, action_comment))

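But when I pass two columns to the value argument, I get an error message: Invalid column specification.

I really like the idea of manipulating data with the %>% operator, so I would love to know how to correct my code to achieve this.

Thank you very much for your help.

【Question comments】:

Tags: r reshape2 tidyr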

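【Solution 1】:

We can do this with the development version of data.table, whose dcast can take multiple value.var columns. Instructions for installing the devel version are here.

We convert the 'data.frame' to a 'data.table' (setDT(df)), create a sequence variable ('ind') within each combination of the grouping variables ('id1', 'id2', 'info'), and then dcast from 'long' to 'wide' format, specifying value.var as 'action_time' and 'action_comment'.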

library(data.table)#v1.9.5+
setDT(df)[, ind:= 1:.N, .(id1, id2, info)]
dcast(df, id1 + id2 + info ~ ind,
      value.var=c('action_time', 'action_comment'), fill='')
 #    id1 id2  info 1_action_time 2_action_time 3_action_time 1_action_comment
 #1:   1   a info1         time1         time2         time3         comment1
 #2:   2   b info2         time4         time5                       comment4
 #   2_action_comment 3_action_comment
 #1:         comment2         comment3
 #2:         comment5

Alternatively, use reshape from base R. We create the sequence variable ('ind') with ave, then call reshape to change the 'long' format to 'wide'.

df$ind <- with(df, ave(seq_along(id1), id1, id2, info, FUN=seq_along))
reshape(df, idvar=c('id1', 'id2', 'info'),timevar='ind', direction='wide')
#  id1 id2  info action_time.1 action_comment.1 action_time.2 action_comment.2
#1   1   a info1         time1         comment1         time2         comment2
#4   2   b info2         time4         comment4         time5         comment5
#  action_time.3 action_comment.3
#1         time3         comment3
#4          <NA>             <NA>

Data

df <- structure(list(id1 = c(1L, 1L, 1L, 2L, 2L), id2 = c("a", "a", 
"a", "b", "b"), info = c("info1", "info1", "info1", "info2", 
"info2"), action_time = c("time1", "time2", "time3", "time4", 
"time5"), action_comment = c("comment1", "comment2", "comment3", 
"comment4", "comment5")), .Names = c("id1", "id2", "info", "action_time", 
"action_comment"), class = "data.frame", row.names = c(NA, -5L))

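【Comments】:

Would you mind explaining with(df, ave(seq_along(id1), id1, id2, info, FUN=seq_along))? Why does seq_along appear twice?
@user2540309 It isn't needed here, but if the column passed as the first argument is character/factor, the ave output can be a 'character' vector/NA. For example, v1 <- rep(letters[1:3],3); ave(v1, v1, FUN=seq_along); v2 <- factor(v1); ave(v2, v2, FUN=seq_along). In both cases, using seq_along for the first argument gives a numeric sequence.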
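The behaviour described in the reply above can be checked with a short sketch. It reuses the v1 and v2 vectors from the comment; the res_* names are only illustrative. ave() writes the result of FUN back into a copy of its first argument, so the type of that argument decides the type of the output.

```r
v1 <- rep(letters[1:3], 3)   # character vector: a b c a b c a b c
v2 <- factor(v1)

# First argument is character, so the integer counters are coerced to character
res_chr <- ave(v1, v1, FUN = seq_along)

# First argument is a factor, so the counters are not valid levels and become NA
# (R also emits "invalid factor level" warnings, suppressed here)
res_fct <- suppressWarnings(ave(v2, v2, FUN = seq_along))

# First argument is an integer vector, so the counters stay integers
res_int <- ave(seq_along(v1), v1, FUN = seq_along)

str(res_chr)
str(res_fct)
str(res_int)
```

This is why the answer wraps the first argument in seq_along: it guarantees an integer container regardless of the type of id1.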

【Solution 2】:

Try:

library(dplyr)
library(tidyr)

df %>%
  group_by(id1) %>%
  mutate(id = row_number()) %>%
  gather(key, value, -(id1:info), -id) %>%
  unite(id_key, id, key) %>%
  spread(id_key, value)

Which gives:

#Source: local data frame [2 x 9]

#  id1 id2  info 1_action_comment 1_action_time 2_action_comment 2_action_time 3_action_comment 3_action_time
#1   1   a info1         comment1         time1         comment2         time2         comment3         time3
#2   2   b info2         comment4         time4         comment5         time5               NA            NA
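For readers on a newer tidyr, the same reshape can be sketched with pivot_wider(), which is not part of the original answer. This assumes tidyr 1.0.0 or later, where values_from accepts several columns at once. The action_no column and the final reordering step are illustrative choices, not something the thread prescribes.

```r
library(dplyr)
library(tidyr)  # assumption: tidyr >= 1.0.0 for pivot_wider()

# Same example data as in the question
df <- data.frame(
  id1 = c(1L, 1L, 1L, 2L, 2L),
  id2 = c("a", "a", "a", "b", "b"),
  info = c("info1", "info1", "info1", "info2", "info2"),
  action_time = paste0("time", 1:5),
  action_comment = paste0("comment", 1:5),
  stringsAsFactors = FALSE
)

id_cols <- c("id1", "id2", "info")

wide <- df %>%
  group_by(id1, id2, info) %>%
  mutate(action_no = row_number()) %>%   # hypothetical helper: 1, 2, 3 within each group
  ungroup() %>%
  pivot_wider(
    names_from = action_no,
    values_from = c(action_time, action_comment)  # both value columns in one call
  )

# Optional: place each action's time and comment next to each other
value_cols <- setdiff(names(wide), id_cols)
suffix <- as.integer(sub(".*_", "", value_cols))
wide <- wide[, c(id_cols, value_cols[order(suffix)])]

print(names(wide))
```

With the default naming, the new columns are called action_time_1, action_comment_1, and so on, and groups with fewer actions are padded with NA.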

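【Comments】:

Do you think there is a way to put action_comment1 and action_time1 side by side, so that they are easier for me to compare? Otherwise I can still work with this format. Thanks.
@Lambo Would you mind having the columns named like 1_action_comment?
That's fine. I will probably rename the headers anyway.
Thanks Steven, this is great!!!
I get an error when running the code above: Error in match.names(clabs, names(xi)) : names do not match previous names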

【Solution 3】:

Not a direct solution, but it works.

library(tidyr)
a = spread(df, action_comment, action_time)
b = spread(df, action_time, action_comment)

# dropping NAs and shifting the values to left row wise 
a[] = t(apply(a, 1, function(x) `length<-`(na.omit(x), length(x))))
b[] = t(apply(b, 1, function(x) `length<-`(na.omit(x), length(x))))

out = merge(a,b, by = c('id1','id2','info'))
out[, colSums(is.na(out)) != nrow(out)]

#  id1 id2  info comment1 comment2 comment3    time1    time2    time3
#1   1   a info1    time1    time2    time3 comment1 comment2 comment3
#2   2   b info2    time4    time5     <NA> comment4 comment5     <NA>

【Comments】: