Create a new column based on another variable: answers

【Question title】: Create new column based on another variable
【Posted】: 2021-04-04 11:58:16
【Question description】:

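I have a data frame with several columns. One of them is the participant column, which lists the different participant codes. These all fall in the 100 range, the 200 range, or the 500 range, for example: 101, 203, 209, 504, 103, 512, and so on.

I want to create an additional column in the data frame, called group, with 3 possible values: 100, 200, and 500. Each participant would be assigned one of these 3 labels according to the digit their code starts with.

I tried a combination of startsWith() and ifelse statements, but I couldn't get it to work.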

data$group = ifelse(startsWith(as.character(data$participant), "1"), "100", 
                    ((ifelse(startsWith(as.character(data$participant), "2"), "200",
                           (ifelse(startsWith(as.character(data$participant), "5"), "500")), NULL)))

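【Question discussion】:

Tags: r

【Solution 1】:

Based on your example and comments, it seems you want to bin numeric values into ranges and assign character labels.

case_when offers a simple option. It takes longer to type, but it may be more readable for people unfamiliar with cut or with more mathematical approaches.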

library(dplyr)  # provides tibble(), mutate(), case_when() and %>%

tibble(old = c(101, 203, 209, 504, 103, 512)) %>%
    mutate(
        new = case_when(
            old < 100 ~ NA_character_,
            old < 200 ~ "100",
            old < 300 ~ "200",
            old < 400 ~ "300",
            old < 500 ~ "400",
            old < 600 ~ "500",
            TRUE ~ NA_character_
        )
    )

Result

# A tibble: 6 x 2
    old new  
  <dbl> <chr>
1   101 100  
2   203 200  
3   209 200  
4   504 500  
5   103 100  
6   512 500

That said, the cut function is designed to do exactly what you describe, and it lets you specify the output labels.

old <- c(101, 203, 209, 504, 103, 512)

new <- cut(
    x = old, 
    breaks = seq(from = 100, to = 600, by = 100), 
    labels = seq(from = 100, to = 500, by = 100),
    right = FALSE  # bins are [100,200), [200,300), ... so 100 and 200 get their own label
)

as.character(new)

Result

[1] "100" "200" "200" "500" "100" "500"
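
One detail worth knowing about cut: by default its intervals are closed on the right, so a code sitting exactly on a break (say 200) lands in the lower bin, and 100 itself falls outside the first interval. The sketch below contrasts the default with right = FALSE. The boundary codes in it are made up for illustration and are not part of the question's data.

```r
# Hypothetical boundary codes, chosen only to illustrate interval closure
codes <- c(100, 150, 200, 599)
breaks <- seq(from = 100, to = 600, by = 100)
labels <- seq(from = 100, to = 500, by = 100)

# Default (right = TRUE): intervals are (100,200], (200,300], ...
default_bins <- as.character(cut(codes, breaks = breaks, labels = labels))

# right = FALSE: intervals are [100,200), [200,300), ...
left_closed_bins <- as.character(
    cut(codes, breaks = breaks, labels = labels, right = FALSE)
)

print(default_bins)      # NA    "100" "100" "500"
print(left_closed_bins)  # "100" "100" "200" "500"
```

If participant codes such as 100 or 200 can occur, the left-closed variant is probably the one that matches "the digit the code starts with".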

【Discussion】:

【Solution 2】:

A simple tidyverse solution (similar to s__'s solution).

library(tibble)

tibble(
    participant = c(101, 203, 209, 504, 103, 512),
    group = round(participant, -2)  # later columns may refer to earlier ones
)

# A tibble: 6 x 2
  participant group
        <dbl> <dbl>
1         101   100
2         203   200
3         209   200
4         504   500
5         103   100
6         512   500
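
Note that round(x, -2) rounds to the nearest hundred, so it only agrees with the leading digit while the last two digits stay below 50. A small sketch compares it with rounding down. The codes 160 and 275 are hypothetical and do not appear in the question's data.

```r
# Hypothetical codes, two of them from the upper half of their range
participant <- c(101, 160, 209, 275, 512)

nearest_hundred <- round(participant, -2)         # rounds to the nearest hundred
hundred_below <- floor(participant / 100) * 100   # always rounds down

print(nearest_hundred)  # 100 200 200 300 500
print(hundred_below)    # 100 100 200 200 500
```

For the sample codes in the question both give the same answer; rounding down is the safer choice if codes like 160 are possible.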

【Discussion】:

【Solution 3】:

Perhaps this can be done more easily with integer division:

(data$participant %/% 100) * 100
#[1] 100 200 200 500 100 500

In the OP's code, the last "no" argument should be NA_character_ rather than NULL, because NULL has length 0. For example:

v1 <- c(10, 20, 5, 2, 40)
ifelse(v1 > 50, 3, NULL)
# Error in ans[npos] <- rep(no, length.out = len)[npos] : 
#   replacement has length zero

ifelse(v1 > 50, 3, NA)
#[1] NA NA NA NA NA

Data

data <- structure(list(participant = c(101, 203, 209, 504, 103, 512)), 
     class = "data.frame", row.names = c(NA, -6L))
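
Putting that fix into the question's code gives something like the sketch below. It also rebalances the parentheses and stores the character conversion in a helper variable named code; both are assumptions about what the OP intended rather than part of the original answer.

```r
# Same sample data as above, repeated so the block runs on its own
data <- data.frame(participant = c(101, 203, 209, 504, 103, 512))

# Convert once instead of repeating as.character() in every condition
code <- as.character(data$participant)

# Nested ifelse with NA_character_ as the final fallback instead of NULL
data$group <- ifelse(startsWith(code, "1"), "100",
              ifelse(startsWith(code, "2"), "200",
              ifelse(startsWith(code, "5"), "500", NA_character_)))

print(data$group)  # "100" "200" "200" "500" "100" "500"
```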

【Discussion】:

【Solution 4】:

You can also handle it with round():

x <- c(101, 203, 209, 504, 103, 512)
round(x, -2)
[1] 100 200 200 500 100 500

In your case:

data$group <- round(data$participant, -2)

【Discussion】:

【Solution 5】:

Using ifelse:

data$group <- ifelse(data$participant >= 100 & data$participant < 200, 100,
                     ifelse(data$participant >= 200 & data$participant < 300, 200, 500))

Result:

data
  participant group
1         101   100
2         203   200
3         209   200
4         504   500
5         103   100
6         512   500

【Discussion】:

【Solution 6】:

It is rather verbose, but it is just another way:

library(dplyr)

participant <- c(101, 203, 209, 504, 103, 512)

df <- tibble(participant)

df %>%
  mutate(group = case_when(
    participant %in% 100:199 ~ 100,
    participant %in% 200:299 ~ 200,
    participant %in% 500:599 ~ 500
  ))

# A tibble: 6 x 2
  participant group
        <dbl> <dbl>
1         101   100
2         203   200
3         209   200
4         504   500
5         103   100
6         512   500

【Discussion】:

【Solution 7】:

Another option you can try, using data.table:

library(data.table)
df <- data.table(participants=c(101, 203, 209, 504, 103, 512))
df[,groups:= (participants - participants%%100)]
   participants groups
1:          101    100
2:          203    200
3:          209    200
4:          504    500
5:          103    100
6:          512    500

Not exactly what you asked for, but you could also use the cut function. In data.table it might look like this:

library(data.table)

df <- data.table(participants = c(101, 203, 209, 504, 103, 512))
df[, groups:=cut(participants, seq(100,600,100))]

   participants    groups
1:          101 (100,200]
2:          203 (200,300]
3:          209 (200,300]
4:          504 (500,600]
5:          103 (100,200]
6:          512 (500,600]
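
To get the 100/200/500 labels from cut instead of interval names, one option is to pass labels and then convert the resulting factor back to numbers. This is a sketch under the assumption that numeric group values are wanted; the right = FALSE argument is an addition that makes each bin include its lower bound.

```r
library(data.table)

df <- data.table(participants = c(101, 203, 209, 504, 103, 512))

# labels replaces the default interval names; right = FALSE makes the
# bins [100,200), [200,300), ... so a code of exactly 200 is labelled 200
df[, groups := cut(
    participants,
    breaks = seq(100, 600, 100),
    labels = seq(100, 500, 100),
    right = FALSE
)]

# cut returns a factor; go through character to recover the numeric labels
df[, groups := as.numeric(as.character(groups))]

print(df$groups)  # 100 200 200 500 100 500
```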

【Discussion】: