【Question title】:Creating a factor variable with dplyr?
【Posted】:2014-09-30 14:37:17
【Question description】:

Suppose I have a data frame that looks like this:

df1 = structure(list(
  Name = structure(1:6, .Label = c("N1", "N2", "N3", "N4", "N5", "N6", "N7"),
                   class = "factor"),
  sector = structure(c(4L, 4L, 4L, 3L, 3L, 2L),
                     .Label = c("other stuff",
                                "Private for-profit, 4-year or above",
                                "Private not-for-profit, 4-year or above",
                                "Public, 4-year or above"),
                     class = "factor"),
  flagship = c(1, 0, 0, 0, 0, 0)),
  .Names = c("Name", "sector", "flagship"), row.names = c(NA, 6L),
  class = "data.frame")

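I want to create a new factor variable, "Sector". I can do it with many lines of code, but I am sure there is a more efficient way.

This is what I am doing right now: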

# One 0/1 indicator column per category
df1$PublicFlag=0
df1$PublicFlag[df1$sector=="Public, 4-year or above" & df1$flagship==1]=1
df1$Public=0
df1$Public[df1$sector=="Public, 4-year or above" & df1$flagship==0]=1
df1$PrivateNP=0
df1$PrivateNP[df1$sector=="Private not-for-profit, 4-year or above"]=1
df1$Private4P=0
df1$Private4P[df1$sector=="Private for-profit, 4-year or above"]=1

# Reshape to long format and keep, for each row, the indicator that is set
library(reshape)
df2 = melt(df1, id=c("Name", "sector", "flagship"))
df2 = df2[df2$value==1,c("Name", "sector", "flagship", "variable")]
library(plyr)
df2 = rename(df2, c("variable"="Sector"))

Thanks for your help!

【Question comments】:

    Tags: r dplyr


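    【Solution 1】:

    This is an old post, but I stumble across it often, so I want to give an up-to-date answer. Version 0.5.0 of dplyr introduced a lot of useful vector functions for this kind of problem.

    Use case_when() to avoid nested ifelse calls (and thereby keep many kittens alive):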

    library(dplyr)

    df1 %>% 
      mutate(Sector = case_when(
            sector=="Public, 4-year or above" & flagship==1 ~ "PublicFlag",
            sector=="Public, 4-year or above" & flagship==0 ~ "Public",
            sector=="Private not-for-profit, 4-year or above" ~ "PrivateNP",
            sector=="Private for-profit, 4-year or above" ~ "Private4P"),
        # Convert to a factor with an explicit level order
        Sector = factor(Sector, levels=c("Public","PublicFlag","PrivateNP","Private4P"))
      )
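As a rough check of how case_when() behaves here, the following sketch uses a small hypothetical data frame (`toy`, not part of the question) with the same column names as df1. It assumes dplyr is installed. Conditions are tried in order, and a row that matches none of them gets NA.

```r
library(dplyr)

# Hypothetical toy data with the same column names as df1 in the question
toy <- data.frame(
  sector = c("Public, 4-year or above", "Public, 4-year or above",
             "Private for-profit, 4-year or above", "other stuff"),
  flagship = c(1, 0, 0, 0)
)

toy <- toy %>%
  mutate(
    Sector = case_when(
      sector == "Public, 4-year or above" & flagship == 1 ~ "PublicFlag",
      # Reached only by rows that failed the first condition
      sector == "Public, 4-year or above" ~ "Public",
      sector == "Private for-profit, 4-year or above" ~ "Private4P"
    ),
    # Level order follows `levels`, not alphabetical order
    Sector = factor(Sector, levels = c("Public", "PublicFlag", "Private4P"))
  )

levels(toy$Sector)
is.na(toy$Sector)   # TRUE only for the "other stuff" row
```

Because the first matching condition wins, the second condition does not need to repeat `flagship == 0`; whether to spell it out anyway is a readability choice.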
    

    Use recode_factor() to create a factor from a character (or numeric) variable:

    df1 %>%
        mutate(Sector = recode_factor(sector,
                                   "Public, 4-year or above" = "Public",
                                   "Private not-for-profit, 4-year or above" = "PrivateNP",
                                   "Private for-profit, 4-year or above" = "Private4P"))
    

    【Comments】:

      【Solution 2】:

      Try:

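      # For this data, 1 + 2*as.numeric(sector) + 4*flagship yields the codes 5, 7, 9 and 13;
      # factor() ranks them 1 to 4 in sorted order, which indexes the labels below
      df1$Sector <-  with(df1, c("Private4P", NA, "Public",
                       "PublicFlag")[as.numeric(factor(1+2*as.numeric(sector)+4*flagship))])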
      
      
      
       subset(df1, !is.na(Sector))
       #  Name                              sector flagship     Sector 
       #1   N1             Public, 4-year or above        1 PublicFlag
       #2   N2             Public, 4-year or above        0     Public
       #3   N3             Public, 4-year or above        0     Public
       #6   N6 Private for-profit, 4-year or above        0  Private4P
      

      【Comments】:

        【Solution 3】:

        You don't even need dplyr:

        df1$Sector <- factor(ifelse(df1$sector=="Public, 4-year or above" & df1$flagship==1, "PublicFlag",
                               ifelse(df1$sector=="Public, 4-year or above" & df1$flagship==0, "Public",
                                 ifelse(df1$sector=="Private not-for-profit, 4-year or above", "PrivateNP", 
                                   ifelse(df1$sector=="Private for-profit, 4-year or above", "Private4P", NA)))))
        
        
        df1
        
        ##   Name                                  sector flagship     Sector
        ## 1   N1                 Public, 4-year or above        1 PublicFlag
        ## 2   N2                 Public, 4-year or above        0     Public
        ## 3   N3                 Public, 4-year or above        0     Public
        ## 4   N4 Private not-for-profit, 4-year or above        0  PrivateNP
        ## 5   N5 Private not-for-profit, 4-year or above        0  PrivateNP
        ## 6   N6     Private for-profit, 4-year or above        0  Private4P
        

        If needed, you can replace the NA with one last catch-all factor level.
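One way to do that replacement after the fact is sketched below in base R. The vector `Sector` stands in for df1$Sector, and the label "Other" is a hypothetical choice, not something from the original answer.

```r
# Hypothetical labels standing in for df1$Sector; NA marks a row no condition matched
Sector <- factor(c("PublicFlag", "Public", NA, "Private4P"),
                 levels = c("Public", "PublicFlag", "Private4P"))

# Register the new level first; assigning an unknown level would only produce NA again
levels(Sector) <- c(levels(Sector), "Other")

# Fill the NA positions with the catch-all level
Sector[is.na(Sector)] <- "Other"

Sector
```

Alternatively, the innermost `NA` in the nested ifelse() call could be replaced by the catch-all string directly, which avoids the second step.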

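        【Comments】:

        • Every time you nest if-else calls that deep, I kill a kitten.
        【Solution 4】:

        The selected answer did not work for the specific problem I was working on, because I was assigning numeric values in case_when() and trying to give them character levels. I am adding what I did to solve my particular problem as an alternative, in case someone finds it useful in the future.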

        df1 %>% 
          mutate(Sector = case_when(
                sector=="Public, 4-year or above" & flagship==1 ~ "PublicFlag",
                sector=="Public, 4-year or above" & flagship==0 ~ "Public",
                sector=="Private not-for-profit, 4-year or above" ~ "PrivateNP",
                sector=="Private for-profit, 4-year or above" ~ "Private4P") %>%
          # factor(levels = ...) sets the level order; overwriting the levels after
          # as.factor() would relabel the alphabetically sorted levels by position
          factor(levels = c("Public","PublicFlag","PrivateNP","Private4P"))
          )
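For the case this answer describes, numeric values that need character levels, one possible base R sketch is to pass both `levels` and `labels` to factor(). The vector `code` and the code-to-label mapping are hypothetical and only illustrate the idea.

```r
# Hypothetical numeric codes, e.g. as produced by case_when(... ~ 1, ... ~ 2, ...)
code <- c(2, 1, 1, 3, 4)

# `levels` lists the codes and `labels` gives the name for each code, in the same order
Sector <- factor(code,
                 levels = c(1, 2, 3, 4),
                 labels = c("Public", "PublicFlag", "PrivateNP", "Private4P"))

Sector
```

Stating the code-to-label mapping explicitly avoids relying on the default sorted order of the levels.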
        

        【Comments】:
