【Question title】: Converting Column Values into Their Own Binary Encoded Columns (Dummy Variables)
【Posted】: 2015-07-28 15:05:37
【Question】:

I have a number of CSV files containing columns such as gender, age, and diagnosis.

Currently, they are coded like this:

ID, gender, age, diagnosis
1,  male,   42,  asthma
1,  male,   42,  anxiety
2,  male,   19,  asthma
3,  female, 23,  diabetes
4,  female, 61,  diabetes
4,  female, 61,  copd

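The goal is to transform this data into the target format shown below.

Side note: if possible, it would be great to prepend the original column name to each new column name, e.g. "age_42" or "gender_female".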

ID, male, female, 42, 19, 23, 61, asthma, anxiety, diabetes, copd
1,  1,    0,      1,  0,  0,  0,  1,      1,       0,        0
2,  1,    0,      0,  1,  0,  0,  1,      0,       0,        0
3,  0,    1,      0,  0,  1,  0,  0,      0,       1,        0
4,  0,    1,      0,  0,  0,  1,  0,      0,       1,        1 

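I tried using reshape2's dcast() function, but it creates one column per combination of values, which results in an extremely sparse matrix. Here is a simplified example with only age and gender: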

data.train  <- dcast(data.raw, formula = id ~ gender + age, fun.aggregate = length)

ID, male19, male23, male42, male61, female19, female23, female42, female61
1,  0,      0,      1,      0,      0,        0,        0,        0
2,  1,      0,      0,      0,      0,        0,        0,        0
3,  0,      0,      0,      0,      0,        1,        0,        0
4,  0,      0,      0,      0,      0,        0,        0,        1   

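Given that this is a fairly common task in machine learning data preparation, I imagine there may be other libraries (that I'm unaware of) that can perform this transformation.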

【Question discussion】:

    Tags: r sparse-matrix reshape2


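    【Solution 1】:

    You need a melt/dcast combination here (available in reshape2 as recast), so that all columns are first converted into a single column and the combinations are avoided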

    library(reshape2)
    recast(df, ID ~ value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)
    #   ID 19 23 42 61 anxiety asthma copd diabetes female male
    # 1  1  0  0  1  0       1      1    0        0      0    1
    # 2  2  1  0  0  0       0      1    0        0      0    1
    # 3  3  0  1  0  0       0      0    0        1      1    0
    # 4  4  0  0  0  1       0      0    1        1      1    0
    

    As per your side note, you can add variable to the formula so that the original column names are included as well

    recast(df, ID ~ variable + value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)
    #   ID gender_female gender_male age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma diagnosis_copd
    # 1  1             0           1      0      0      1      0                 1                1              0
    # 2  2             0           1      1      0      0      0                 0                1              0
    # 3  3             1           0      0      1      0      0                 0                0              0
    # 4  4             1           0      0      0      0      1                 0                0              1
    #   diagnosis_diabetes
    # 1                  0
    # 2                  0
    # 3                  1
    # 4                  1
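
To make the recast() call above less opaque, here is a minimal sketch that spells it out as a separate melt() and dcast(). The names df1, long, and wide are illustrative, and the data frame is rebuilt from the question's sample rows; this is one reading of what recast() does under the hood, not a replacement for it.

```r
library(reshape2)

# Sample data from the question (rebuilt here so the sketch is self-contained)
df1 <- data.frame(
  ID = c(1L, 1L, 2L, 3L, 4L, 4L),
  gender = c("male", "male", "male", "female", "female", "female"),
  age = c(42L, 42L, 19L, 23L, 61L, 61L),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)

# Step 1: melt every non-ID column into (variable, value) pairs
long <- melt(df1, id.vars = "ID")

# Step 2: cast back to wide, one column per variable_value pair;
# the aggregate turns "any rows present" into 1 and "none" into 0
wide <- dcast(long, ID ~ variable + value, value.var = "value",
              fun.aggregate = function(x) (length(x) > 0) + 0L)
```

Because each original column is melted before casting, age and gender never meet on the right-hand side of the formula, so no gender-by-age combination columns are created.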
    

    【Discussion】:

      【Solution 2】:

      The caret package has a function to "dummify" data.

      library(caret)
      library(dplyr)
      predict(dummyVars(~ ., data = mutate_each(df, funs(as.factor))), newdata = df)
      

      【Discussion】:

        【Solution 3】:

        A base R option would be

         (!!table(cbind(df1[1],stack(df1[-1])[-2])))*1L
         #     values
         #ID  19 23 42 61 anxiety asthma copd diabetes female male
         # 1  0  0  1  0       1      1    0        0      0    1
         # 2  1  0  0  0       0      1    0        0      0    1
         # 3  0  1  0  0       0      0    0        1      1    0
         # 4  0  0  0  1       0      0    1        1      1    0
        

        If you also need the original column names

         (!!table(cbind(df1[1],Val=do.call(paste, c(stack(df1[-1])[2:1], sep="_")))))*1L
         #   Val
         #ID  age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma
         #1      0      0      1      0                 1                1
         #2      1      0      0      0                 0                1
         #3      0      1      0      0                 0                0
         #4      0      0      0      1                 0                0
         #  Val
         #ID  diagnosis_copd diagnosis_diabetes gender_female gender_male
         #1              0                  0             0           1
         #2              0                  0             0           1
         #3              0                  1             1           0
         #4              1                  1             1           0
        

        Data

        df1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 4L, 4L), gender = c("male", 
        "male", "male", "female", "female", "female"), age = c(42L, 42L, 
        19L, 23L, 61L, 61L), diagnosis = c("asthma", "anxiety", "asthma", 
        "diabetes", "diabetes", "copd")), .Names = c("ID", "gender", 
        "age", "diagnosis"), row.names = c(NA, -6L), class = "data.frame")
        

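        【Discussion】:

        • +1 for a good vanilla implementation; I'm going to try this one too. My data isn't formatted yet, but that is a data cleaning problem, not a structural one.
        • Update: accepting this as the answer; in testing, it gave the most consistent implementation and output.
        【Solution 4】:

        Using reshape from base R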

        d <- reshape(df, idvar="ID", timevar="diagnosis", direction="wide", v.names="diagnosis", sep="_")
        a <- reshape(df, idvar="ID", timevar="age", direction="wide", v.names="age", sep="_")
        g <- reshape(df, idvar="ID", timevar="gender", direction="wide", v.names="gender", sep="_")
        
        
        new.dat <- cbind(ID=d["ID"],
            g[,grepl("_", names(g))],
            a[,grepl("_", names(a))],
            d[,grepl("_", names(d))])
        
        # convert factors columns to character (if necessary)
        # taken from @Marek's answer here: http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231
        new.dat[sapply(new.dat, is.factor)] <- lapply(new.dat[sapply(new.dat, is.factor)], as.character)
        
        new.dat[which(is.na(new.dat), arr.ind=TRUE)] <- 0
        new.dat[-1][which(new.dat[-1] != 0, arr.ind=TRUE)] <- 1
        
        #  ID gender_male gender_female age_42 age_19 age_23 age_61 diagnosis_asthma
        #1  1           1             0      1      0      0      0                1
        #3  2           1             0      0      1      0      0                1
        #4  3           0             1      0      0      1      0                0
        #5  4           0             1      0      0      0      1                0
        #  diagnosis_anxiety diagnosis_diabetes diagnosis_copd
        #1                 1                  0              0
        #3                 0                  0              0
        #4                 0                  1              0
        #5                 0                  1              1
        

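        【Discussion】:

          【Solution 5】:

          Below is a slightly longer approach using dcast() and merge(). Because gender and age are not unique per ID, a function (dum()) is created to convert their length into a dummy variable. Diagnosis, on the other hand, is counted uniquely by adjusting the formula.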

          library(reshape2)
          data.raw <- read.table(header = T, sep = ",", text = "
          id, gender, age, diagnosis
          1,  male,   42,  asthma
          1,  male,   42,  anxiety
          2,  male,   19,  asthma
          3,  female, 23,  diabetes
          4,  female, 61,  diabetes
          4,  female, 61,  copd")
          
          # function to create a dummy variable
          dum <- function(x) { if(length(x) > 0) 1 else 0 }
          
          # length of diagnosis by id, gender and age
          diag <- dcast(data.raw, formula = id + gender + age ~ diagnosis, fun.aggregate = length)[,-c(2,3)]
          
          # length of gender by id
          gen <- dcast(data.raw, formula = id ~ gender, fun.aggregate = dum)
          
          # length of age by id
          age <- dcast(data.raw, formula = id ~ age, fun.aggregate = dum)
          
          merge(merge(gen, age, by = "id"), diag, by = "id")
          #  id   female   male 19 23 42 61   anxiety   asthma   copd   diabetes
          #1  1        0      1  0  0  1  0         1        1      0          0
          #2  2        0      1  1  0  0  0         0        1      0          0
          #3  3        1      0  0  1  0  0         0        0      0          1
          #4  4        1      0  0  0  0  1         0        0      1          1
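
One detail of the read.table() call above: with sep = ",", the spaces that follow each comma stay part of the field, so gender is read as "  male" rather than "male", which would explain the padded column names in the printed output. A small sketch, assuming the same inline text, that strips the padding with the strip.white argument:

```r
# strip.white = TRUE removes leading and trailing blanks from unquoted fields
data.raw <- read.table(header = TRUE, sep = ",", strip.white = TRUE, text = "
id, gender, age, diagnosis
1,  male,   42,  asthma
1,  male,   42,  anxiety
2,  male,   19,  asthma
3,  female, 23,  diabetes
4,  female, 61,  diabetes
4,  female, 61,  copd")

# The distinct gender values no longer carry leading spaces
gender_levels <- unique(as.character(data.raw$gender))
```

With the padding removed, the columns produced by dcast() would be named female and male without the leading blanks.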
          

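          I don't know your model well, but your setup may be more than you need, because R handles factors through the formula object. For example, if gender is the response, the following matrix is generated in R. So unless you plan to fit the model by hand, setting the data types and the formula appropriately should be enough.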

          data.raw$age <- as.factor(data.raw$age)
          model.matrix(gender ~ ., data = data.raw[,-1])
          #(Intercept) age23 age42 age61 diagnosis  asthma diagnosis  copd diagnosis  diabetes
          #1           1     0     1     0                 1               0                   0
          #2           1     0     1     0                 0               0                   0
          #3           1     0     0     0                 1               0                   0
          #4           1     1     0     0                 0               0                   1
          #5           1     0     0     1                 0               0                   1
          #6           1     0     0     1                 0               1                   0
          

          If you need all levels of every variable, you can get them by suppressing the intercept in model.matrix and using the small trick from all-levels-of-a-factor-in-a-model-matrix-in-r

          #  Using Akrun's df1, first change all variables, except ID, to factor
          df1[-1] <- lapply(df1[-1], factor)
          
          # Use model.matrix to derive dummy coding
          m <- data.frame(model.matrix( ~ 0 + . , data=df1, 
                       contrasts.arg = lapply(df1[-1], contrasts, contrasts=FALSE)))
          
          # Collapse to give final solution
          aggregate(. ~ ID, data=m, max)
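
The contrasts.arg part is what makes the difference here. As a small illustration on hypothetical toy data (the names toy, mm_default, and mm_full are made up for this sketch), dropping the intercept alone only keeps every level of the first factor, while passing full identity contrasts keeps every level of all factors:

```r
# Hypothetical two-factor toy data
toy <- data.frame(
  a = factor(c("x", "y", "x", "y")),
  b = factor(c("p", "p", "q", "q"))
)

# No intercept, default contrasts: the second factor still loses its first level
mm_default <- model.matrix(~ 0 + ., data = toy)
colnames(mm_default)
# "ax" "ay" "bq"

# No intercept, contrasts = FALSE for each factor: all levels are kept
mm_full <- model.matrix(~ 0 + ., data = toy,
                        contrasts.arg = lapply(toy, contrasts, contrasts = FALSE))
colnames(mm_full)
# "ax" "ay" "bp" "bq"
```

The resulting matrix has one row per input row, which is why the answer's final aggregate() step with max is needed to collapse it to one row per ID.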
          

          【讨论】:

          • Hi Jaehyeon ... I didn't think this was worth adding as a separate answer, and it seemed to fit here. Please roll back the edit if you don't want it.