【Question title】: Split one variable into multiple variables in R
【Posted】: 2018-05-13 17:24:05
【Question】:

I'm fairly new to R, and my problem is not quite as simple as the title suggests. Here is a sample of df:

id    amenities
1     wireless internet, air conditioning, pool, kitchen
2     pool, kitchen, washer, dryer
3     wireless internet, kitchen, dryer
4     
5     wireless internet

This is what I want df to look like:

id    wireless internet   air conditioning   pool   kitchen   washer   dryer
1     1                   1                  1      1         0        0
2     0                   0                  1      1         1        1
3     1                   0                  0      1         0        1
4     0                   0                  0      0         0        0
5     1                   0                  0      0         0        0

Sample code to reproduce the data:

df <- data.frame(id = c(1, 2, 3, 4, 5),
      amenities = c("wireless internet, air conditioning, pool, kitchen",  
                    "pool, kitchen, washer, dryer", 
                    "wireless internet, kitchen, dryer", 
                    "", 
                    "wireless internet"), 
      stringsAsFactors = FALSE)

【Question comments】:

    Tags: r string variables dataframe split


    【Solution 1】:

    For the sake of completeness, here is also a concise data.table solution:

    library(data.table)
    setDT(df)[, strsplit(amenities, ", "), by = id][
      , dcast(.SD, id ~ V1, length)]
    
       id air conditioning dryer kitchen pool washer wireless internet
    1:  1                1     0       1    1      0                 1
    2:  2                0     1       1    1      1                 0
    3:  3                0     1       1    0      0                 1
    4:  5                0     0       0    0      0                 1
    

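    After coercion to data.table, amenities is split on ", " into a separate row for each item (long format). This is then reshaped to wide format, using the length() function for aggregation. Note that id 4, which has no amenities, produces no rows in the long format and is therefore absent from the result above.

    【Comments】: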
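If the ids without any amenities should be kept as all-zero rows, one option is to join the wide result back onto the full list of ids. The following is only a sketch under assumptions: the names dt, long, and wide are chosen here for illustration, and setnafill() requires a reasonably recent data.table (1.12.4 or later).

```r
library(data.table)

dt <- data.table(
  id = c(1, 2, 3, 4, 5),
  amenities = c("wireless internet, air conditioning, pool, kitchen",
                "pool, kitchen, washer, dryer",
                "wireless internet, kitchen, dryer",
                "",
                "wireless internet")
)

# Long format: one row per (id, amenity); id 4 contributes no rows here
long <- dt[, .(amenity = unlist(strsplit(amenities, ", ", fixed = TRUE))), by = id]

# Wide format: count how often each amenity occurs per id
wide <- dcast(long, id ~ amenity, fun.aggregate = length, value.var = "amenity")

# Join onto the complete id list so id 4 reappears (as NA), then replace NA by 0
wide <- wide[dt[, .(id)], on = "id"]
setnafill(wide, fill = 0L, cols = setdiff(names(wide), "id"))

wide
```

The join is what restores the missing id; the dcast() step alone only sees ids that have at least one amenity.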

      【Solution 2】:

      The dummies package is useful here. Try

      library(dplyr); library(tidyr); library(dummies)
      df2 <- df %>% separate_rows(amenities, sep = ",")
      df2$amenities <- trimws(df2$amenities, "both") # remove spaces (left and right) - so that you will not have 2 "pool" columns in your final data frame
      df2 <- dummy.data.frame(df2)[, -2]
      colnames(df2) <- trimws(gsub("amenities", "", colnames(df2)), "both") # arrange colnames
      df3 <- df2 %>% 
        group_by(id) %>%
        summarise_all(sum) ## sum each dummy column within id (funs() is deprecated in current dplyr)
      df3
      
      # A tibble: 5 x 7
      #id `air conditioning` dryer kitchen  pool washer `wireless internet`
      #<dbl>              <int> <int>   <int> <int>  <int>               <int>
      #     1                  1     0       1     1      0                   1
      #     2                  0     1       1     1      1                   0
      #     3                  0     1       1     0      0                   1
      #     4                  0     0       0     0      0                   0
      #     5                  0     0       0     0      0                   1
      

      【Comments】:

        【Solution 3】:

        FWIW, here is a base R approach (assuming df contains your data as shown in the question)

        dat <- with(df, strsplit(amenities, ', '))
        df2 <- data.frame(id = factor(rep(df$id, times = lengths(dat)),
                                      levels = df$id),
                          amenities = unlist(dat))
        df3 <- as.data.frame(cbind(id = df$id,
                             table(df2$id, df2$amenities)))
        

        This results in

        > df3
          id air conditioning dryer kitchen pool washer wireless internet
        1  1                1     0       1    1      0                 1
        2  2                0     1       1    1      1                 0
        3  3                0     1       1    0      0                 1
        4  4                0     0       0    0      0                 0
        5  5                0     0       0    0      0                 1
        

        Breaking down what is happening:

        1. dat <- with(df, strsplit(amenities, ', ')) splits the amenities variable on ', ', resulting in

          > dat
          [[1]]
          [1] "wireless internet" "air conditioning"  "pool"             
          [4] "kitchen"          
          
          [[2]]
          [1] "pool"    "kitchen" "washer"  "dryer"  
          
          [[3]]
          [1] "wireless internet" "kitchen"           "dryer"            
          
          [[4]]
          character(0)
          
          [[5]]
          [1] "wireless internet"
          
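        2. The second line takes dat and flattens it into a vector; the id column is then added by repeating each original id value as many times as that id has amenities. Because id is stored as a factor with all original ids as levels, id 4 stays available as a level even though it has no rows. This results in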

          > df2
             id         amenities
          1   1 wireless internet
          2   1  air conditioning
          3   1              pool
          4   1           kitchen
          5   2              pool
          6   2           kitchen
          7   2            washer
          8   2             dryer
          9   3 wireless internet
          10  3           kitchen
          11  3             dryer
          12  5 wireless internet
          
        3. The table() function creates a contingency table, to which we then add the id column.

        【Comments】:

        • I have a slight variation: df2 <- stack(setNames(dat, seq_along(dat)))[2:1] followed by cbind(df["id"], unclass(table(df2)))
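The three steps above can be sketched as one reusable base R function. The function name split_to_indicators and its arguments are hypothetical and chosen here for illustration only; the sketch assumes at least one row has a non-empty value.

```r
# Hypothetical helper: one count column per distinct item in a delimited string column
split_to_indicators <- function(data, id_col = "id", text_col = "amenities", sep = ", ") {
  # Step 1: split each string into its items (an empty string gives character(0))
  items <- strsplit(data[[text_col]], sep, fixed = TRUE)

  # Step 2: long format; the factor levels keep ids that have no items at all
  long <- data.frame(
    id = factor(rep(data[[id_col]], times = lengths(items)), levels = data[[id_col]]),
    item = unlist(items)
  )

  # Step 3: contingency table of id by item, then put the id column in front
  counts <- table(long$id, long$item)
  out <- data.frame(data[[id_col]], unclass(counts),
                    check.names = FALSE, row.names = NULL)
  names(out)[1] <- id_col
  out
}

df <- data.frame(id = c(1, 2, 3, 4, 5),
                 amenities = c("wireless internet, air conditioning, pool, kitchen",
                               "pool, kitchen, washer, dryer",
                               "wireless internet, kitchen, dryer",
                               "",
                               "wireless internet"),
                 stringsAsFactors = FALSE)

res <- split_to_indicators(df)
res
```

check.names = FALSE keeps column names such as "air conditioning" intact instead of converting the space to a dot.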
        【Solution 4】:

        A solution using dplyr and tidyr. Note that I replace "" with None, because that makes the column names easier to handle later on.

        library(dplyr)
        library(tidyr)
        
        df2 <- df %>%
          # split on ", " (comma plus space) so that items carry no leading space;
          # splitting on "," alone yields both "pool" and " pool" columns
          separate_rows(amenities, sep = ", ") %>%
          mutate(amenities = ifelse(amenities %in% "", "None", amenities)) %>%
          mutate(value = 1) %>%
          spread(amenities, value, fill = 0) %>%
          select(-None)
        df2
        #   id  air conditioning  dryer  kitchen  pool  washer  wireless internet
        # 1  1                 1      0        1     1       0                  1
        # 2  2                 0      1        1     1       1                  0
        # 3  3                 0      1        1     0       0                  1
        # 4  4                 0      0        0     0       0                  0
        # 5  5                 0      0        0     0       0                  1
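In current tidyr, spread() is superseded by pivot_wider(). A minimal sketch of the same reshaping with pivot_wider() follows; it assumes a reasonably recent tidyr (a scalar values_fill needs 1.1.0 or later), and the result name wide is chosen here for illustration. any_of() is used so that the final select() does not fail when no row has an empty amenities string.

```r
library(dplyr)
library(tidyr)

df <- data.frame(id = c(1, 2, 3, 4, 5),
                 amenities = c("wireless internet, air conditioning, pool, kitchen",
                               "pool, kitchen, washer, dryer",
                               "wireless internet, kitchen, dryer",
                               "",
                               "wireless internet"),
                 stringsAsFactors = FALSE)

wide <- df %>%
  separate_rows(amenities, sep = ", ") %>%                  # one row per amenity
  mutate(amenities = ifelse(amenities == "", "None", amenities),
         value = 1) %>%                                      # placeholder for empty rows
  pivot_wider(names_from = amenities, values_from = value,
              values_fill = 0) %>%                           # absent combinations become 0
  select(-any_of("None"))                                    # drop the placeholder column

wide
```

Unlike spread(), pivot_wider() orders the new columns by first appearance in the data rather than alphabetically.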
        

        【Comments】:
