根据 R 中的条件创建新的列变量答案

【问题标题】：Create new column variables based upon condition in R根据 R 中的条件创建新的列变量
【发布时间】：2017-12-29 01:18:09
【问题描述】：

我不知道为什么这不起作用。我已经尝试了各种方法，但它不起作用。使用我自己制作的 if 语句并不是说我出错了，但它并不适用。

基本上，有一列Data$Age和一列Data$Age2。

如果 Data$Age 的值是 50 - 100，我希望 Data$Age2 为该特定行说“50-100 年”。

同样，如果 Data$Age 是 25-50，我希望 Data$Age2 对它适用的行说“25-50 年”。

在 R 中最干净的方法是什么？

【问题讨论】：

试试paste(Data$Age, "Years")
@Onyambu 我认为 OP 的意思是 A %in% 50:100 -> "50-100 Years"
@PoGibas 我明白你的意思，cut(data$Age,c(25,50,100),c("25-50 years","50-100 Years"))
@Onyambu -- 需要cut(data$Age,c(24,50,100),c("25-50 years","50-100 years")) 才能使age = 25 属于25 - 50 years 类别。
@Onyambu - 也就是说，我同意cut() 是最干净的解决方案，它不需要额外的包。

标签： r if-statement

【解决方案1】：

dplyr 可能对此有最干净的解决方案

使用下面 Len Greski 的示例数据...

data <- data.frame(Age1 = round(runif(100)*100,0))

data%>%
mutate(Age2 = ifelse(between(Age1, 25, 50), "25 - 50 Years", 
              ifelse(between(Age1, 51, 100),"51 - 100 Years", "Less than 25 years old")))

假设您只需要该列的两个值。 ifelse() 对于两个以上的匹配（比如 100 个）效率不高。如果不是，我将不得不考虑另一种方法。

编辑： 或者正如 Len 在下面的评论中所建议的那样。

data%>% 
mutate(Age2 = cut(Age1,c(24,50,100),c("25-50 years","51-100 Years")))

【讨论】：

这是替代方法，使用 Onyambu 解决方案的更正版本：data%>% mutate(Age2 = cut(Age1,c(24,50,100),c("25-50 years","51-100 Years")))
编辑后的版本非常适合我，我也可以将它应用到其他东西上！

【解决方案2】：

到目前为止，Len Greski 和 InfiniteFlashChess 首先发布的所有答案都建议对每个年龄段使用重复子集语句或重复调用 ifelse()。

恕我直言，这不能被认为是干净的，因为它不能很好地适应年龄范围的数量。 Onyambu in his comment 建议的唯一数据驱动解决方案是使用基础 R 中的 cut() 函数。

在这里，我建议另一种数据驱动的解决方案，它使用具有年龄范围和相关标签的上下界的查找表，并且在非等值连接中更新。这将允许我们指定任意数量的范围，而无需对代码进行任何更改：

library(data.table)
# define lookup table
lookup <- data.table(
  lower = c(25L, 51L),
  upper = c(50L, 100L)
)
lookup[, label := sprintf("%i-%i Years", lower, upper)][]

   lower upper        label
1:    25    50  25-50 Years
2:    51   100 51-100 Years

# create sample data set
Data <- data.frame(Age = c(24:26, 49:52, 100:102))

# update in non-equi join
setDT(Data)[lookup, on =.(Age >= lower, Age <= upper), Age2 := label][]

    Age         Age2
 1:  24           NA
 2:  25  25-50 Years
 3:  26  25-50 Years
 4:  49  25-50 Years
 5:  50  25-50 Years
 6:  51 51-100 Years
 7:  52 51-100 Years
 8: 100 51-100 Years
 9: 101           NA
10: 102           NA

请注意，NA 表示查找表中定义的年龄范围之间的差距。

基准测试

InfiniteFlashChess 询问了基准测试结果。

任何基准测试都取决于Data 中的行数以及组数，即年龄范围。因此，我们将对 100 和 1 M 行以及 2 个组（由 OP 指定）和 8 个组进行基准测试。

2组的基准代码：

library(data.table)
library(dplyr)
n_row <- 1E2L
set.seed(123L)
Data0 <- data.frame(Age = sample.int(105L, n_row, TRUE))

lookup <- data.table(
  lower = c(25L, 51L),
  upper = c(50L, 100L)
)
lookup[, label := sprintf("%i-%i Years", lower, upper)][]

microbenchmark::microbenchmark(
  ifelse = {
    copy(Data0) %>%
      mutate(Age2 = ifelse(between(Age, 25, 50), "25 - 50 Years", 
                           ifelse(between(Age, 51, 100), "51 - 100 Years", 
                                  "")))
  },
  cut = {
    copy(Data0) %>% 
      mutate(Age2 = cut(Age, c(24,50,100), c("25-50 years","51-100 Years")))
  },
  baseR = {
    data <- copy(Data0)
    data$age2 <- ""
    data$age2[data$Age %in% 51:100] <- "51 - 100 years"
    data$age2[data$Age %in% 25:50] <- "25 - 50 years"
  },
  join_dt = {
    Data <- copy(Data0)
    setDT(Data)[lookup, on =.(Age >= lower, Age <= upper), Age2 := label]
  },
  times = 100L
)

100 行的基准测试结果：

Unit: microseconds
    expr      min       lq       mean    median       uq       max neval cld
  ifelse 2280.588 2415.006 2994.83792 2501.8495 2827.513 20545.672   100   c
     cut 2272.280 2407.455 2716.67432 2537.3425 2827.135  7351.495   100   c
   baseR   57.016   83.446   94.80729   91.1865  106.667   164.248   100 a  
 join_dt 1165.970 1318.889 1560.19394 1485.4025 1691.939  2803.159   100  b

1 M 行的基准测试结果：

Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval cld
  ifelse 618.08286 626.72757 672.28875 639.04973 758.83435 773.25566    10   c
     cut 197.16467 200.53571 219.58635 203.77460 214.24227 343.56061    10  b 
   baseR  52.96059  59.36964  76.09431  62.19055  66.32506 198.73654    10 a  
 join_dt  66.89256  67.61147  73.33428  72.55457  78.18675  81.69146    10 a

8组基准测试需要编写嵌套ifelse()或重复子集操作：

breaks <- seq(20, 100, 10)

lookup <- data.table(
  lower = head(breaks, -1L),
  upper = tail(breaks, -1L)
)
lookup[, label := sprintf("%i-%i Years", lower + 1L, upper)][]


microbenchmark::microbenchmark(
  ifelse = {
    copy(Data0) %>%
      mutate(
        Age2 = ifelse(
          between(Age, 21, 30), "21 - 20 Years", ifelse(
            between(Age, 31, 40), "31 - 40 Years", ifelse(
              between(Age, 41, 50), "41 - 50 Years", ifelse(
                between(Age, 51, 60), "51 - 60 Years", ifelse(
                  between(Age, 61, 70), "61 - 70 Years", ifelse(
                    between(Age, 71, 80), "71 - 80 Years", ifelse(
                      between(Age, 81, 90), "81 - 90 Years", ifelse(
                        between(Age, 91, 100), "91 - 100 Years", "")))))))))
  },
  cut = {
    copy(Data0) %>% 
      mutate(Age2 = cut(Age, breaks))
  },
  subset = {
    data <- copy(Data0)
    data$age2 <- ""
    data$age2[data$Age %in% 21:30] <- "21 - 30 years"
    data$age2[data$Age %in% 31:40] <- "31 - 40 years"
    data$age2[data$Age %in% 41:50] <- "41 - 50 years"
    data$age2[data$Age %in% 51:60] <- "51 - 60 years"
    data$age2[data$Age %in% 61:70] <- "61 - 70 years"
    data$age2[data$Age %in% 71:80] <- "71 - 80 years"
    data$age2[data$Age %in% 81:90] <- "81 - 90 years"
    data$age2[data$Age %in% 91:100] <- "91 - 100 years"
  },
  join_dt = {
    Data <- copy(Data0)
    setDT(Data)[lookup, on =.(Age > lower, Age <= upper), Age2 := label]
  },
  times = 100L
)

100 行的基准测试结果：

Unit: microseconds
    expr      min       lq      mean    median        uq      max neval  cld
  ifelse 2522.617 2663.832 2955.2448 2740.1030 2886.4155 7717.748   100    d
     cut 2340.622 2470.699 2664.9381 2538.6635 2646.6520 7608.627   100   c 
  subset  174.820  199.741  219.6505  210.5015  231.4575  402.501   100 a   
 join_dt 1198.819 1290.949 1406.2354 1399.1255 1488.4240 1810.500   100  b

1 M 行的基准测试结果：

Unit: milliseconds
    expr       min         lq       mean     median         uq        max neval cld
  ifelse 2427.0599 2429.42131 2539.88611 2457.06191 2565.14682 2992.68891    10   c
     cut  220.3553  221.53939  243.49476  222.82165  230.06289  406.57277    10  b 
  subset  176.0096  177.92958  199.13398  184.26878  192.60274  323.90338    10  b 
 join_dt   62.7471   64.26875   67.94099   65.07508   75.03169   75.38813    10 a

【讨论】：

@Uwe - 很好的分析。我同意如果有很多年龄类别，最好的解决方案是加入。我不会提出具有 8 个年龄类别的提取运算符解决方案。另一方面，cut() 方法的 0.2 秒和 join_dt 方法的 0.062 秒之间的差异在您测试的数据大小上没有实质意义，因此cut() 在代码的简约性和简约性之间提供了最佳平衡。表现。您的分析突出了 R 的挑战之一，特定问题的解决方案的多样性需要进行大量分析才能选择“最佳”解决方案。

【解决方案3】：

这是使用基数 R 的解决方案。请注意，由于 age2 不能同时为 25 - 50 和 50 - 100，因此我将类别设为互斥：

data <- data.frame(age = round(runif(100)*100,0),
                   age2 = rep(" ",100),stringsAsFactors=FALSE)
data$age2[data$age %in% 51:100] <- "51 - 100 years"
data$age2[data$age %in% 25:50] <- "25 - 50 years"
data[1:15,]

...和输出：

> data[1:15,]
   age           age2
1    0               
2   45  25 - 50 years
3   58 51 - 100 years
4   59 51 - 100 years
5   84 51 - 100 years
6   79 51 - 100 years
7    5               
8   78 51 - 100 years
9   46  25 - 50 years
10   6               
11  73 51 - 100 years
12  37  25 - 50 years
13   5               
14  41  25 - 50 years
15  58 51 - 100 years
>

【讨论】：

这有点难以理解？如果年龄是 30.5 岁呢？你会怎么处理？只是好奇，因为30.5%in%1:50 会返回FALSE 你怎么看？
也许使用findInterval(data$age,c(25,50,100),highest/right=T) 然后设置 1=="20-50" 和 2="50-100"
@Onyambu - 好点。在社会科学中，age 通常是一个整数，这就是我将样本数据四舍五入为整数的原因。如果数据不是以整数表示，更准确的比较是between()，如data$age2[between(data$age,25,50)] <- "25 - 50 years"。
为什么不findInterval？ between 不在你知道的基础包中。 data$age2[findInterval(data$age,c(25,50,100)==1] <- "25 - 50 years"
@Onyambu - findInterval() 也可以，但between() 由代码作者以外的人阅读时更容易理解。 R 真正让我着迷的一件事是可以通过多少种方式完成给定任务。