到目前为止,Len Greski 和 InfiniteFlashChess 首先发布的所有答案都建议对每个年龄段使用重复子集语句或重复调用 ifelse()。
恕我直言,这不能被认为是干净的,因为它不能很好地适应年龄范围的数量。 Onyambu in his comment 建议的唯一数据驱动解决方案是使用基础 R 中的 cut() 函数。
在这里,我建议另一种数据驱动的解决方案,它使用具有年龄范围和相关标签的上下界的查找表,并且在非等值连接中更新。这将允许我们指定任意数量的范围,而无需对代码进行任何更改:
library(data.table)
# define lookup table
lookup <- data.table(
lower = c(25L, 51L),
upper = c(50L, 100L)
)
lookup[, label := sprintf("%i-%i Years", lower, upper)][]
lower upper label
1: 25 50 25-50 Years
2: 51 100 51-100 Years
# create sample data set
Data <- data.frame(Age = c(24:26, 49:52, 100:102))
# update in non-equi join
setDT(Data)[lookup, on =.(Age >= lower, Age <= upper), Age2 := label][]
Age Age2
1: 24 NA
2: 25 25-50 Years
3: 26 25-50 Years
4: 49 25-50 Years
5: 50 25-50 Years
6: 51 51-100 Years
7: 52 51-100 Years
8: 100 51-100 Years
9: 101 NA
10: 102 NA
请注意,NA 表示查找表中定义的年龄范围之间的差距。
基准测试
InfiniteFlashChess 询问了基准测试结果。
任何基准测试都取决于Data 中的行数以及组数,即年龄范围。因此,我们将对 100 和 1 M 行以及 2 个组(由 OP 指定)和 8 个组进行基准测试。
2组的基准代码:
library(data.table)
library(dplyr)
n_row <- 1E2L
set.seed(123L)
Data0 <- data.frame(Age = sample.int(105L, n_row, TRUE))
lookup <- data.table(
lower = c(25L, 51L),
upper = c(50L, 100L)
)
lookup[, label := sprintf("%i-%i Years", lower, upper)][]
microbenchmark::microbenchmark(
ifelse = {
copy(Data0) %>%
mutate(Age2 = ifelse(between(Age, 25, 50), "25 - 50 Years",
ifelse(between(Age, 51, 100), "51 - 100 Years",
"")))
},
cut = {
copy(Data0) %>%
mutate(Age2 = cut(Age, c(24,50,100), c("25-50 years","51-100 Years")))
},
baseR = {
data <- copy(Data0)
data$age2 <- ""
data$age2[data$Age %in% 51:100] <- "51 - 100 years"
data$age2[data$Age %in% 25:50] <- "25 - 50 years"
},
join_dt = {
Data <- copy(Data0)
setDT(Data)[lookup, on =.(Age >= lower, Age <= upper), Age2 := label]
},
times = 100L
)
100 行的基准测试结果:
Unit: microseconds
expr min lq mean median uq max neval cld
ifelse 2280.588 2415.006 2994.83792 2501.8495 2827.513 20545.672 100 c
cut 2272.280 2407.455 2716.67432 2537.3425 2827.135 7351.495 100 c
baseR 57.016 83.446 94.80729 91.1865 106.667 164.248 100 a
join_dt 1165.970 1318.889 1560.19394 1485.4025 1691.939 2803.159 100 b
1 M 行的基准测试结果:
Unit: milliseconds
expr min lq mean median uq max neval cld
ifelse 618.08286 626.72757 672.28875 639.04973 758.83435 773.25566 10 c
cut 197.16467 200.53571 219.58635 203.77460 214.24227 343.56061 10 b
baseR 52.96059 59.36964 76.09431 62.19055 66.32506 198.73654 10 a
join_dt 66.89256 67.61147 73.33428 72.55457 78.18675 81.69146 10 a
8组基准测试需要编写嵌套ifelse()或重复子集操作:
breaks <- seq(20, 100, 10)
lookup <- data.table(
lower = head(breaks, -1L),
upper = tail(breaks, -1L)
)
lookup[, label := sprintf("%i-%i Years", lower + 1L, upper)][]
microbenchmark::microbenchmark(
ifelse = {
copy(Data0) %>%
mutate(
Age2 = ifelse(
between(Age, 21, 30), "21 - 20 Years", ifelse(
between(Age, 31, 40), "31 - 40 Years", ifelse(
between(Age, 41, 50), "41 - 50 Years", ifelse(
between(Age, 51, 60), "51 - 60 Years", ifelse(
between(Age, 61, 70), "61 - 70 Years", ifelse(
between(Age, 71, 80), "71 - 80 Years", ifelse(
between(Age, 81, 90), "81 - 90 Years", ifelse(
between(Age, 91, 100), "91 - 100 Years", "")))))))))
},
cut = {
copy(Data0) %>%
mutate(Age2 = cut(Age, breaks))
},
subset = {
data <- copy(Data0)
data$age2 <- ""
data$age2[data$Age %in% 21:30] <- "21 - 30 years"
data$age2[data$Age %in% 31:40] <- "31 - 40 years"
data$age2[data$Age %in% 41:50] <- "41 - 50 years"
data$age2[data$Age %in% 51:60] <- "51 - 60 years"
data$age2[data$Age %in% 61:70] <- "61 - 70 years"
data$age2[data$Age %in% 71:80] <- "71 - 80 years"
data$age2[data$Age %in% 81:90] <- "81 - 90 years"
data$age2[data$Age %in% 91:100] <- "91 - 100 years"
},
join_dt = {
Data <- copy(Data0)
setDT(Data)[lookup, on =.(Age > lower, Age <= upper), Age2 := label]
},
times = 100L
)
100 行的基准测试结果:
Unit: microseconds
expr min lq mean median uq max neval cld
ifelse 2522.617 2663.832 2955.2448 2740.1030 2886.4155 7717.748 100 d
cut 2340.622 2470.699 2664.9381 2538.6635 2646.6520 7608.627 100 c
subset 174.820 199.741 219.6505 210.5015 231.4575 402.501 100 a
join_dt 1198.819 1290.949 1406.2354 1399.1255 1488.4240 1810.500 100 b
1 M 行的基准测试结果:
Unit: milliseconds
expr min lq mean median uq max neval cld
ifelse 2427.0599 2429.42131 2539.88611 2457.06191 2565.14682 2992.68891 10 c
cut 220.3553 221.53939 243.49476 222.82165 230.06289 406.57277 10 b
subset 176.0096 177.92958 199.13398 184.26878 192.60274 323.90338 10 b
join_dt 62.7471 64.26875 67.94099 65.07508 75.03169 75.38813 10 a