Converting Column Values into Their Own Binary Encoded Columns (Dummy Variables): Answers

【Question title】: Converting Column Values into Their Own Binary Encoded Columns (Dummy Variables)
【Posted】: 2015-07-28 15:05:37
【Question description】:

I have a number of CSV files with columns such as gender, age, and diagnosis.

Currently, they are coded like this:

ID, gender, age, diagnosis
1,  male,   42,  asthma
1,  male,   42,  anxiety
2,  male,   19,  asthma
3,  female, 23,  diabetes
4,  female, 61,  diabetes
4,  female, 61,  copd

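The goal is to convert this data into the target format below.

Side note: if possible, it would be great to also prepend the original column name to each new column name, e.g. "age_42" or "gender_female".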

ID, male, female, 42, 19, 23, 61, asthma, anxiety, diabetes, copd
1,  1,    0,      1,  0,  0,  0,  1,      1,       0,        0
2,  1,    0,      0,  1,  0,  0,  1,      0,       0,        0
3,  0,    1,      0,  0,  1,  0,  0,      0,       1,        0
4,  0,    1,      0,  0,  0,  1,  0,      0,       1,        1

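I tried reshape2's dcast() function, but it produces combinations of the columns, which results in an extremely sparse matrix. Here is a simplified example with just age and gender: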

data.train  <- dcast(data.raw, formula = id ~ gender + age, fun.aggregate = length)

ID, male19, male23, male42, male61, female19, female23, female42, female61
1,  0,      0,      1,      0,      0,        0,        0,        0
2,  1,      0,      0,      0,      0,        0,        0,        0
3,  0,      0,      0,      0,      0,        1,        0,        0
4,  0,      0,      0,      0,      0,        0,        0,        1

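Since this is a fairly common task in machine learning data preparation, I imagine there are other libraries (that I'm unaware of) that can perform this transformation.

【Question discussion】:

Tags: r sparse-matrix reshape2

【Solution 1】:

You need a melt/dcast combination here (called recast), so that all columns are first converted into a single column, which avoids the combinations.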

library(reshape2)
recast(df, ID ~ value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)
#   ID 19 23 42 61 anxiety asthma copd diabetes female male
# 1  1  0  0  1  0       1      1    0        0      0    1
# 2  2  1  0  0  0       0      1    0        0      0    1
# 3  3  0  1  0  0       0      0    0        1      1    0
# 4  4  0  0  0  1       0      0    1        1      1    0

As per your side note, you can add variable to the formula so that the original column names are included too.

recast(df, ID ~ variable + value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)
#   ID gender_female gender_male age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma diagnosis_copd
# 1  1             0           1      0      0      1      0                 1                1              0
# 2  2             0           1      1      0      0      0                 0                1              0
# 3  3             1           0      0      1      0      0                 0                0              0
# 4  4             1           0      0      0      0      1                 0                0              1
#   diagnosis_diabetes
# 1                  0
# 2                  0
# 3                  1
# 4                  1
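
The custom fun.aggregate is what keeps the cells binary. A plain length would count duplicates: ID 1 appears in two rows, so its male cell would come out as 2. According to the reshape2 documentation, dcast() also fills missing cells by default with the result of applying fun.aggregate to a zero-length vector, which is where the zeros come from. Below is a minimal sketch of that function on its own, in base R only; the name to_flag is made up for illustration.

```r
# Hypothetical name for the anonymous function passed as fun.aggregate
to_flag <- function(x) (length(x) > 0) + 0L

# ID 1 has "male" in two rows: length() counts both, to_flag() collapses to 1
count_dup <- length(c("male", "male"))
flag_dup  <- to_flag(c("male", "male"))

# An ID/value pair that never occurs is an empty group, which maps to 0
flag_empty <- to_flag(character(0))

print(c(count = count_dup, flag = flag_dup, empty = flag_empty))
```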

【Discussion】:

【Solution 2】:

The caret package has a function that "dummifies" data.

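library(caret)
# Convert every column except ID to a factor so dummyVars() expands each level
df_f <- df
df_f[-1] <- lapply(df_f[-1], as.factor)
# Use the same converted frame for newdata; the result has one row per input row
predict(dummyVars(~ ., data = df_f), newdata = df_f)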

【Discussion】:

【Solution 3】:

A base R option would be

 (!!table(cbind(df1[1],stack(df1[-1])[-2])))*1L
 #     values
 #ID  19 23 42 61 anxiety asthma copd diabetes female male
 # 1  0  0  1  0       1      1    0        0      0    1
 # 2  1  0  0  0       0      1    0        0      0    1
 # 3  0  1  0  0       0      0    0        1      1    0
 # 4  0  0  0  1       0      0    1        1      1    0

If you also need the original column names:

 (!!table(cbind(df1[1],Val=do.call(paste, c(stack(df1[-1])[2:1], sep="_")))))*1L
 #   Val
 #ID  age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma
 #1      0      0      1      0                 1                1
 #2      1      0      0      0                 0                1
 #3      0      1      0      0                 0                0
 #4      0      0      0      1                 0                0
 #  Val
 #ID  diagnosis_copd diagnosis_diabetes gender_female gender_male
 #1              0                  0             0           1
 #2              0                  0             0           1
 #3              0                  1             1           0
 #4              1                  1             1           0

Data

df1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 4L, 4L), gender = c("male", 
"male", "male", "female", "female", "female"), age = c(42L, 42L, 
19L, 23L, 61L, 61L), diagnosis = c("asthma", "anxiety", "asthma", 
"diabetes", "diabetes", "copd")), .Names = c("ID", "gender", 
"age", "diagnosis"), row.names = c(NA, -6L), class = "data.frame")
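
The one-liner above packs several steps together. The sketch below unpacks it into named steps, as one possible reading rather than the answer's own code: it uses an explicit rep() where the original relies on cbind() recycling the ID column, and `counts > 0` where the original uses `!!`. The intermediate names long, counts, and dummies are made up for illustration.

```r
# Same data as df1 above, rebuilt so this block runs on its own
df1 <- data.frame(
  ID = c(1L, 1L, 2L, 3L, 4L, 4L),
  gender = c("male", "male", "male", "female", "female", "female"),
  age = c(42L, 42L, 19L, 23L, 61L, 61L),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)

# Step 1: stack all non-ID columns into a single "values" column
# (6 rows x 3 columns = 18 rows; "ind" records the source column)
long <- stack(df1[-1])

# Step 2: attach the IDs, repeated once per stacked column
long$ID <- rep(df1$ID, times = ncol(df1) - 1)

# Step 3: cross-tabulate ID against value; duplicates give counts above 1
counts <- table(ID = long$ID, values = long$values)

# Step 4: turn counts into 0/1 flags (logical times 1L gives integers)
dummies <- (counts > 0) * 1L
print(dummies)
```

To get prefixed names such as age_42, paste long$ind and long$values together before tabulating, as the second one-liner does.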

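【Discussion】:

+1 for a nice vanilla implementation; I'll try this one too. My data isn't formatted this cleanly yet, but that is a data cleaning problem, not a structural one.
Update: accepting this as the answer; in testing, it gave the most consistent implementation and output.

【Solution 4】:

Using reshape from base R: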

d <- reshape(df, idvar="ID", timevar="diagnosis", direction="wide", v.names="diagnosis", sep="_")
a <- reshape(df, idvar="ID", timevar="age", direction="wide", v.names="age", sep="_")
g <- reshape(df, idvar="ID", timevar="gender", direction="wide", v.names="gender", sep="_")


new.dat <- cbind(ID=d["ID"],
    g[,grepl("_", names(g))],
    a[,grepl("_", names(a))],
    d[,grepl("_", names(d))])

# convert factors columns to character (if necessary)
# taken from @Marek's answer here: http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231
new.dat[sapply(new.dat, is.factor)] <- lapply(new.dat[sapply(new.dat, is.factor)], as.character)

new.dat[which(is.na(new.dat), arr.ind=TRUE)] <- 0
new.dat[-1][which(new.dat[-1] != 0, arr.ind=TRUE)] <- 1

#  ID gender_male gender_female age_42 age_19 age_23 age_61 diagnosis_asthma
#1  1           1             0      1      0      0      0                1
#3  2           1             0      0      1      0      0                1
#4  3           0             1      0      0      1      0                0
#5  4           0             1      0      0      0      1                0
#  diagnosis_anxiety diagnosis_diabetes diagnosis_copd
#1                 1                  0              0
#3                 0                  0              0
#4                 0                  1              0
#5                 0                  1              1

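【Discussion】:

【Solution 5】:

Below is a slightly longer approach using dcast() and merge(). Because gender and age are not unique per ID, a function (dum()) is created to convert their group lengths into dummy variables. Diagnosis, on the other hand, is made unique by adjusting the formula (grouping by id, gender, and age), so length already yields 0 or 1.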

library(reshape2)
data.raw <- read.table(header = TRUE, sep = ",", text = "
id, gender, age, diagnosis
1,  male,   42,  asthma
1,  male,   42,  anxiety
2,  male,   19,  asthma
3,  female, 23,  diabetes
4,  female, 61,  diabetes
4,  female, 61,  copd")

# function to create a dummy variable
dum <- function(x) { if(length(x) > 0) 1 else 0 }

# length of diagnosis by id, gender and age
diag <- dcast(data.raw, formula = id + gender + age ~ diagnosis, fun.aggregate = length)[,-c(2,3)]

# length of gender by id
gen <- dcast(data.raw, formula = id ~ gender, fun.aggregate = dum)

# length of age by id
age <- dcast(data.raw, formula = id ~ age, fun.aggregate = dum)

merge(merge(gen, age, by = "id"), diag, by = "id")
#  id   female   male 19 23 42 61   anxiety   asthma   copd   diabetes
#1  1        0      1  0  0  1  0         1        1      0          0
#2  2        0      1  1  0  0  0         0        1      0          0
#3  3        1      0  0  1  0  0         0        0      0          1
#4  4        1      0  0  0  0  1         0        0      1          1

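I don't know your model well, but this setup may be more than you need, because R handles factors through formula objects. For example, if gender were the response, R would generate the following matrix. So unless you plan to do the fitting by hand, setting the data types and the formula appropriately is enough.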

data.raw$age <- as.factor(data.raw$age)
model.matrix(gender ~ ., data = data.raw[,-1])
#(Intercept) age23 age42 age61 diagnosis  asthma diagnosis  copd diagnosis  diabetes
#1           1     0     1     0                 1               0                   0
#2           1     0     1     0                 0               0                   0
#3           1     0     0     0                 1               0                   0
#4           1     1     0     0                 0               0                   1
#5           1     0     0     1                 0               0                   1
#6           1     0     0     1                 0               1                   0
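
The matrix above has no age19 column because, with an intercept and R's default treatment contrasts, model.matrix() uses the first level of each factor as the baseline. Below is a minimal sketch on a single factor; the data frame d is made up for illustration and only reuses the ages from the question.

```r
# Hypothetical single-column data with the ages from the question
d <- data.frame(age = factor(c(42, 42, 19, 23, 61, 61)))

# With an intercept, the first level (19) is the baseline and gets no column
with_intercept <- colnames(model.matrix(~ age, data = d))
print(with_intercept)

# Without an intercept, this single factor gets one column per level
no_intercept <- colnames(model.matrix(~ 0 + age, data = d))
print(no_intercept)
```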

If you need all levels of every variable, you can get them by suppressing the intercept in model.matrix and using a small trick from all-levels-of-a-factor-in-a-model-matrix-in-r:

#  Using Akrun's df1, first change all variables, except ID, to factor
df1[-1] <- lapply(df1[-1], factor)

# Use model.matrix to derive dummy coding
m <- data.frame(model.matrix( ~ 0 + . , data=df1, 
             contrasts.arg = lapply(df1[-1], contrasts, contrasts=FALSE)))

# Collapse to give final solution
aggregate(. ~ ID, data=m, max)
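
Suppressing the intercept alone restores the full set of levels only for the first factor, which is why the contrasts.arg step is needed. The final aggregate() call then collapses repeated IDs by taking the maximum of each 0/1 column. The sketch below shows both effects on a small made-up data frame (d and its values are hypothetical, not the question's data).

```r
# Hypothetical data: ID 1 appears twice with two different diagnoses
d <- data.frame(
  ID = c(1L, 1L, 2L, 3L),
  gender = factor(c("male", "male", "male", "female")),
  diagnosis = factor(c("asthma", "anxiety", "asthma", "copd"))
)

# No intercept, default contrasts: only the first factor keeps every level,
# the second still drops its first level ("anxiety")
partial <- colnames(model.matrix(~ 0 + gender + diagnosis, data = d))
print(partial)

# Turning contrasts off for each factor keeps every level of every factor
full <- model.matrix(~ 0 + ., data = d,
                     contrasts.arg = lapply(d[-1], contrasts, contrasts = FALSE))

# Collapse to one row per ID: a column is 1 if any of that ID's rows had it
collapsed <- aggregate(. ~ ID, data = data.frame(full), FUN = max)
print(collapsed)
```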

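【Discussion】:

Hi Jaehyeon ... I didn't think this was worth adding as a separate answer, and it seemed to fit here. Please roll back the edit if you don't want it.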