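The problem here is that melt() does not know how to name the variables when there is more than one measure variable. So it simply numbers them.
David has pointed out that there is a feature request. Nevertheless, I will show two workarounds and compare them in terms of speed (together with the tidyr answer).
- The first approach is to melt() all measure variables (keeping the variable names), create new variable names, and then dcast() the temporary result again to end up with two value columns. This recast approach is also used by austensen.
- The second approach is what the OP asked for (melting two value columns simultaneously), but it includes a simple way to rename the variables afterwards.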
Recast
library(data.table) # CRAN version 1.10.4 used
# melt all measure variables
long <- melt(df, id.vars = "id")
# split variables names
long[, c("CapitalChargeType", "age") :=
tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)]
dcast(long, id + CapitalChargeType ~ age)
id CapitalChargeType New old
1: 1 Credit_risk_Capital 204.85227 327.57606
2: 1 NameConcentration 34.20043 104.14524
3: 2 Credit_risk_Capital 206.96769 416.64575
4: 2 NameConcentration 30.46721 95.25282
5: 3 Credit_risk_Capital 201.85514 465.06647
---
196: 98 NameConcentration 45.38833 90.34097
197: 99 Credit_risk_Capital 203.53625 458.37501
198: 99 NameConcentration 40.14643 101.62655
199: 100 Credit_risk_Capital 203.19156 527.26703
200: 100 NameConcentration 30.83511 79.21762
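Note that each variable name is split at the last _, the one directly in front of the trailing old or New. This is achieved with a regular expression that uses a positive look-ahead: "_(?=(New|old)$)"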
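To see what the look-ahead does in isolation, here is a minimal sketch that applies the same pattern with base R's strsplit() to the four column names used in this answer. The object names vars, parts and split_names are only illustrative.

```r
# Column names as they appear in df
vars <- c("Credit_risk_Capital_old", "NameConcentration_old",
          "Credit_risk_Capital_New", "NameConcentration_New")

# Split at the "_" that is directly followed by "New" or "old" at the end of
# the string. The look-ahead is not consumed, so the suffix itself survives.
parts <- strsplit(vars, "_(?=(New|old)$)", perl = TRUE)

# Bind the pieces into a two-column character matrix: prefix and suffix
split_names <- do.call(rbind, parts)
colnames(split_names) <- c("CapitalChargeType", "age")
split_names
```

The underscores inside Credit_risk_Capital are left alone because none of them is followed by New or old at the end of the string.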
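Melting two columns and renaming the variables
Here, we pick up David's suggestion to use the patterns() function, which is equivalent to specifying a list of measure variables.
As a side note: the order of the list (or of the patterns) determines the order of the value columns: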
melt(df, measure.vars = patterns("New$", "old$"))
id variable value1 value2
1: 1 1 204.85227 327.57606
2: 2 1 206.96769 416.64575
3: 3 1 201.85514 465.06647
...
melt(df, measure.vars = patterns("old$", "New$"))
id variable value1 value2
1: 1 1 327.57606 204.85227
2: 2 1 416.64575 206.96769
3: 3 1 465.06647 201.85514
...
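The equivalence between patterns() and an explicit list of measure variables can be checked on a small table. This is a sketch on made-up data: toy and its column names A_old, B_old, A_New, B_New are hypothetical stand-ins for df.

```r
library(data.table)

# Hypothetical toy data with the same naming scheme as df
toy <- data.table(id = 1:2,
                  A_old = c(1, 2), B_old = c(3, 4),
                  A_New = c(5, 6), B_New = c(7, 8))

# Measure variables selected by regular expressions ...
by_pattern <- melt(toy, measure.vars = patterns("old$", "New$"))

# ... and the same selection written out as a list of column names
by_list <- melt(toy, measure.vars = list(c("A_old", "B_old"),
                                         c("A_New", "B_New")))

# Both calls are expected to yield the same long table
isTRUE(all.equal(by_pattern, by_list))
```

In both forms, the first list element (or pattern) feeds value1 and the second feeds value2.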
As the OP has already pointed out, melting with multiple measure variables
long <- melt(df, measure.vars = patterns("old$", "New$"),
variable.name = "CapitalChargeType",
value.name = c("old", "New"))
returns numbers instead of the variable names:
str(long)
Classes ‘data.table’ and 'data.frame': 200 obs. of 4 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ CapitalChargeType: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ old : num 328 417 465 259 426 ...
$ New : num 205 207 202 207 203 ...
- attr(*, ".internal.selfref")=<externalptr>
Fortunately, these can easily be changed by replacing the factor levels with the help of the forcats package:
long[, CapitalChargeType := forcats::lvls_revalue(
CapitalChargeType,
c("Credit_risk_Capital", "NameConcentration"))]
long[order(id)]
id CapitalChargeType old New
1: 1 Credit_risk_Capital 327.57606 204.85227
2: 1 NameConcentration 104.14524 34.20043
3: 2 Credit_risk_Capital 416.64575 206.96769
4: 2 NameConcentration 95.25282 30.46721
5: 3 Credit_risk_Capital 465.06647 201.85514
---
196: 98 NameConcentration 90.34097 45.38833
197: 99 Credit_risk_Capital 458.37501 203.53625
198: 99 NameConcentration 101.62655 40.14643
199: 100 Credit_risk_Capital 527.26703 203.19156
200: 100 NameConcentration 79.21762 30.83511
Note that melt() numbers the variables in the order in which the columns appear in df.
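Because the numbering follows the column order, the labels can also be derived from the column names instead of being typed by hand. The following is a sketch on hypothetical toy data that swaps in base R's factor() for forcats::lvls_revalue(); the names toy, type and labels are assumptions made for the example.

```r
library(data.table)

# Hypothetical toy data with the same naming scheme as df
toy <- data.table(id = 1:2,
                  A_old = c(1, 2), B_old = c(3, 4),
                  A_New = c(5, 6), B_New = c(7, 8))

long <- melt(toy, measure.vars = patterns("old$", "New$"),
             variable.name = "type", value.name = c("old", "New"))

# Take the "_old" columns in their original order and strip the suffix;
# this gives one label per level "1", "2", ... in the matching order
labels <- sub("_old$", "", grep("_old$", names(toy), value = TRUE))

# Relabel the factor levels positionally
long[, type := factor(type, labels = labels)]
long[order(id)]
```

This only works if the _old and _New columns appear in the same relative order, which is the case in df.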
reshape()
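The stats package of base R has a reshape() function. Unfortunately, it does not accept regular expressions with positive look-ahead, so the automatic guessing of variable names cannot be used. Instead, all relevant parameters have to be specified explicitly: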
old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
reshape(df, varying = list(old, new), direction = "long",
timevar = "CapitalChargeType",
times = c("Credit_risk_Capital", "NameConcentration"),
v.names = c("old", "New"))
id CapitalChargeType old New
1: 1 Credit_risk_Capital 367.95567 194.93598
2: 2 Credit_risk_Capital 467.98061 215.39663
3: 3 Credit_risk_Capital 363.75586 201.72794
4: 4 Credit_risk_Capital 433.45070 191.64176
5: 5 Credit_risk_Capital 408.55776 193.44071
---
196: 96 NameConcentration 93.67931 47.85263
197: 97 NameConcentration 101.32361 46.94047
198: 98 NameConcentration 104.80926 33.67270
199: 99 NameConcentration 101.33178 32.28041
200: 100 NameConcentration 85.37136 63.57817
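If renaming the columns beforehand is acceptable, the automatic guessing can be made to work. This is a different technique from the explicit call above and only a sketch: it assumes hypothetical toy data in a plain data.frame and moves the old/New suffix to the front, so that the default separator "." yields exactly two pieces per name.

```r
# Hypothetical toy data with the same naming scheme as df
toy <- data.frame(id = 1:2,
                  A_old = c(1, 2), B_old = c(3, 4),
                  A_New = c(5, 6), B_New = c(7, 8))

# Rewrite "<type>_<age>" as "<age>.<type>", e.g. "A_old" becomes "old.A";
# "id" does not match the pattern and is left unchanged
names(toy) <- sub("^(.*)_(old|New)$", "\\2.\\1", names(toy))

# reshape() now splits each name at "." into value column and time value
long <- reshape(toy, varying = names(toy)[-1], direction = "long",
                timevar = "type", idvar = "id")
long
```

The part before the separator becomes the value column (old, New) and the part after it becomes the entry in the type column.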
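Benchmark
The benchmark includes all 4 approaches discussed so far:
- tidyr, modified to use a regular expression with positive look-ahead,
- recast,
- melt() of multiple value variables, and
- reshape().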
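The benchmark data are stated to consist of 100 k rows. Note that the snippet below sets n_rows to 100L, which reproduces the 200-row results shown above; use 100000L for 100 k rows: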
n_rows <- 100L
set.seed(1234L)
df <- data.table(
id = c(1:n_rows),
Credit_risk_Capital_old = rnorm(n_rows, mean = 400, sd = 60),
NameConcentration_old = rnorm(n_rows, mean = 100, sd = 10),
Credit_risk_Capital_New = rnorm(n_rows, mean = 200, sd = 10),
NameConcentration_New = rnorm(n_rows, mean = 40, sd = 10))
For benchmarking, the microbenchmark package is used:
library(magrittr)
old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
microbenchmark::microbenchmark(
tidyr = {
r_tidyr <- df %>%
dplyr::as_data_frame() %>%
tidyr::gather("key", "value", -id) %>%
tidyr::separate(key, c("CapitalChargeType", "age"), sep = "_(?=(New|old)$)") %>%
tidyr::spread(age, value)
},
recast = {
r_recast <- dcast(
melt(df, id.vars = "id")[
, c("CapitalChargeType", "age") :=
tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)],
id + CapitalChargeType ~ age)
},
m2col = {
r_m2col <- melt(df, measure.vars = patterns("New$", "old$"),
variable.name = "CapitalChargeType",
value.name = c("New", "old"))[
, CapitalChargeType := forcats::lvls_revalue(
CapitalChargeType,
c("Credit_risk_Capital", "NameConcentration"))][order(id)]
},
reshape = {
r_reshape <- reshape(df, varying = list(new, old), direction = "long",
timevar = "CapitalChargeType",
times = c("Credit_risk_Capital", "NameConcentration"),
v.names = c("New", "old")
)
},
times = 10L
)
Unit: milliseconds
expr min lq mean median uq max neval
tidyr 705.20364 789.63010 832.11391 813.08830 825.15259 1091.3188 10
recast 215.35813 223.60715 287.28034 261.23333 338.36813 477.3355 10
m2col 10.28721 11.35237 38.72393 14.46307 23.64113 154.3357 10
reshape 143.75546 171.68592 379.05752 224.13671 269.95301 1730.5892 10
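The timings show that melt() of two columns simultaneously is about 15 times faster (by median) than the second fastest approach, reshape(). Both recast variants fall behind because each of them requires two reshaping operations. The tidyr solution is particularly slow.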
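Timings are only comparable if the approaches return the same table. A quick sanity check for the two data.table variants might look like the sketch below. It runs on hypothetical toy data, uses base R's factor() in place of forcats, and converts the type column to character so that both results have the same column types.

```r
library(data.table)

# Hypothetical toy data with the same naming scheme as df
toy <- data.table(id = 1:3,
                  A_old = c(1, 2, 3), B_old = c(4, 5, 6),
                  A_New = c(7, 8, 9), B_New = c(10, 11, 12))

# Recast: melt everything, split the names, cast back to two value columns
r_recast <- dcast(
  melt(toy, id.vars = "id")[
    , c("type", "age") := tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)],
  id + type ~ age)

# Melt two value columns at once, then relabel the numbered levels
r_m2col <- melt(toy, measure.vars = patterns("New$", "old$"),
                variable.name = "type", value.name = c("New", "old"))[
  , type := as.character(factor(type, labels = c("A", "B")))][order(id)]

# Compare column by column; key attributes differ and are ignored this way
same <- identical(r_recast$id, r_m2col$id) &&
  identical(r_recast$type, r_m2col$type) &&
  identical(r_recast$New, r_m2col$New) &&
  identical(r_recast$old, r_m2col$old)
same
```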