Create an indicator variable in one data frame based on values in another data frame: answers

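【Question title】: Create an indicator variable in one data frame based on values in another data frame
【Posted】: 2021-05-19 23:10:59
【Question】:

Suppose I have the iris dataset. I want to create an indicator variable in it called sepal_length_group, with the values p25, p50, p75, and p100. For example, if the species is "setosa" and Sepal.Length is at or below the 25th percentile of all observations classified as "setosa", I want sepal_length_group to equal "p25" for that observation. I wrote the following code, but it produces all NA: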

library(skimr)

sepal_length_distribution <- iris %>% group_by(Species) %>% skim(Sepal.Length) %>% select(3, 9:12)

iris_2 <- iris %>% mutate(sepal_length_group = ifelse(Sepal.Length <= sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),2], "p25", NA))

iris_2 <- iris %>% mutate(sepal_length_group = ifelse(Sepal.Length > sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),2] &
                                                Sepal.Length <= sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),3], "p50", NA))

iris_2 <- iris %>% mutate(sepal_length_group = ifelse(Sepal.Length > sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),3] &
                                                        Sepal.Length <= sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),4], "p75", NA))

iris_2 <- iris %>% mutate(sepal_length_group = ifelse(Sepal.Length > sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),4] &
                                                        Sepal.Length < sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),5], "p100", NA))

Any help would be much appreciated!

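【Comments】:

So, quantiles by group?
Some posts that should help: stackoverflow.com/q/60291876/5325862, stackoverflow.com/q/42948306/5325862
So you specifically want to use the skimr output? When you say indicator variable, do you mean you basically want an ordered factor?

Tags: r if-statement dplyr skimr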

【Solution 1】:

This can be done simply with the cut function, as @Camille commented:

library(tidyverse)
iris %>%
  group_by(Species) %>%
  mutate(cat = cut(Sepal.Length, 
                   quantile(Sepal.Length, c(0,.25,.5,.75, 1)),
                   paste0('p', c(25,50, 75, 100)), include.lowest = TRUE))
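To see what cut does with quantile breaks, here is a minimal base R sketch on a toy vector. The vector x is made up for illustration. The ordered_result = TRUE argument is an addition not used in the answer above; it is only relevant if an ordered factor is wanted, as one of the comments asks.

```r
# Toy data, purely illustrative (not from the original question).
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Five break points define four intervals: (b0, b1], (b1, b2], ...
breaks <- quantile(x, c(0, .25, .5, .75, 1))

# include.lowest = TRUE closes the first interval on the left, so the
# minimum value falls into "p25" instead of becoming NA.
groups <- cut(
  x,
  breaks = breaks,
  labels = paste0("p", c(25, 50, 75, 100)),
  include.lowest = TRUE,
  ordered_result = TRUE  # assumption: an ordered factor is wanted
)

table(groups)
```

Note that cut stops with an error if two break points coincide, which can happen when a group has many tied values.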

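【Discussion】:

Thanks. This solves this particular problem for me. But I was thinking of a more general case, where I might want to create an indicator variable based on specific cells in a different data frame. That is why I tried to use which.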
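For the more general case raised in this comment, one possible approach (a sketch, not part of the original answer) is to keep the thresholds in a separate lookup data frame, join it onto the main data by the grouping column, and compare row by row with case_when. The names sepal_cutoffs, p25, p50, p75, and setosa_p25 are hypothetical, and the lookup table is built here with summarise rather than taken from the skimr output.

```r
library(dplyr)

# Hypothetical lookup table: one row per species with its quartile cut points.
sepal_cutoffs <- iris %>%
  group_by(Species) %>%
  summarise(
    p25 = quantile(Sepal.Length, 0.25, names = FALSE),
    p50 = quantile(Sepal.Length, 0.50, names = FALSE),
    p75 = quantile(Sepal.Length, 0.75, names = FALSE),
    .groups = "drop"
  )

# Attach each row's own thresholds, then classify.
# case_when uses the first condition that is TRUE for each row.
iris_2 <- iris %>%
  left_join(sepal_cutoffs, by = "Species") %>%
  mutate(
    sepal_length_group = case_when(
      Sepal.Length <= p25 ~ "p25",
      Sepal.Length <= p50 ~ "p50",
      Sepal.Length <= p75 ~ "p75",
      TRUE ~ "p100"
    )
  ) %>%
  select(-p25, -p50, -p75)

# Pulling one specific cell as a plain number: `$` plus a logical index
# returns a bare numeric vector.
setosa_p25 <- sepal_cutoffs$p25[sepal_cutoffs$Species == "setosa"]
```

If a single cell really is needed, it helps to extract it as a bare value as in the last line. Single-bracket indexing such as df[rows, col] on a tibble returns another tibble rather than a number, which may not behave as expected inside a comparison.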