Back and forth to dummy variables in R: answers

【Question title】: Back and forth to dummy variables in R
【Posted】: 2012-12-05 19:52:26
【Question】:

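I have been using R on and off for two years and am still trying to get my head around vectorization. Since I often deal with dummy variables from multiple response sets in surveys, I thought this would be an interesting case to learn from.

The idea is to go from multiple responses to dummy variables (and back). For example: "Out of these 8 different chocolates, which are your favorites (choose up to 3)?"

Sometimes we code this as dummy variables (1 if the person likes "Cote d'Or", 0 if they don't), with 1 variable per option. Other times we code it as categorical variables (1 if the person likes "Cote d'Or", 2 if the person likes "Lindt", and so on), with 3 variables for the 3 choices.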

So, basically, I can end up with a matrix whose rows look like

1,0,0,1,0,0,1,0

or a matrix whose rows look like

1,4,7

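As mentioned, the idea is to get from one to the other. So far I have a loop solution for each direction and a vectorized solution for going from dummy to categorical. I would appreciate any further insight into this, as well as a vectorized solution for the categorical-to-dummy step.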
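For a single respondent, the two encodings and the round trip between them can be sketched as follows. This is only an illustration of the example row above; the variable names `dummy_row`, `categorical_row` and `rebuilt_row` are made up for this sketch.

```r
# One respondent who picked options 1, 4 and 7 out of 8 (the example row above).
dummy_row <- c(1, 0, 0, 1, 0, 0, 1, 0)

# Dummy -> categorical: the positions of the 1s.
categorical_row <- which(dummy_row == 1)
print(categorical_row)

# Categorical -> dummy: start from all zeros and set the chosen positions to 1.
rebuilt_row <- replace(numeric(length(dummy_row)), categorical_row, 1)
print(identical(rebuilt_row, dummy_row))
```

The rest of the question is about doing the same thing for every row of a matrix at once.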

DUMMY TO NOT DUMMY

vecOrig<-matrix(0,nrow=18,ncol=8)  # From this one
vecDest<-matrix(0,nrow=18,ncol=3)  # To this one

# Populating the original matrix.
# I'm pretty sure this could have been added to the definition of the matrix, 
# but I kept getting repeated numbers.
# How would you vectorize this?
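# Assumption: `vec` is not defined anywhere in the post; three 1s and five 0s
# matches the 8 options / 3 picks used in the rest of the example.
vec <- c(1, 1, 1, 0, 0, 0, 0, 0)
for (i in 1:length(vecOrig[,1])){
  vecOrig[i,]<-sample(vec)
}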

# Now, how would you vectorize this following step... 
for(i in 1:length(vecOrig[,1])){            
  vecDest[i,]<-grep(1,vecOrig[i,])
}

# Vectorized solution, I had to transpose it for some reason.
vecDest2<-t(apply(vecOrig,1,function(x) grep(1,x)))
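On the "How would you vectorize this?" comment about populating the matrix: one possible sketch uses `replicate()`, which calls `sample()` once per row and binds the results as columns. The definition of `vec` below is an assumption (the post never shows it), and `set.seed()` is only there to make the sketch reproducible.

```r
set.seed(1)  # only for reproducibility of this sketch

# Assumed pattern: three picks (1s) out of eight options.
vec <- c(1, 1, 1, 0, 0, 0, 0, 0)

# replicate() returns one shuffled copy of `vec` per column (8 x 18),
# so transpose to get one respondent per row (18 x 8).
vecOrig <- t(replicate(18, sample(vec)))

print(dim(vecOrig))
print(all(rowSums(vecOrig) == 3))  # every row should still contain three 1s
```

Strictly speaking `replicate()` is a wrapper around `sapply()`, so this hides the loop rather than removing it, much like the `apply()` call above.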

NOT DUMMY TO DUMMY

matOrig<-matrix(0,nrow=18,ncol=3)  # From this one
matDest<-matrix(0,nrow=18,ncol=8)  # To this one.

# We populate the origin matrix. Same thing as the other case. 
for (i in 1:length(matOrig[,1])){         
  matOrig[i,]<-sample(1:8,3,FALSE)
}

# this works, but how to make it vectorized?
for(i in 1:length(matOrig[,1])){          
  for(j in matOrig[i,]){
    matDest[i,j]<-1
  }
}

# Not a clue of how to vectorize this one. 
# The 'model.matrix' solution doesn't look neat.

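【Comments】:

Question: why do you want to do this? What is the end goal?
Haha, first answer: to learn. Next: to reshape the data as needed. Also: to build up my R skills!
In this particular case, I have a database with 239 variables and over 2000 cases. Some variables are coded as dummies and others as categorical. I use R, but as a team we work with SPSS. Quite often we need the "other" version for certain calculations in SPSS (cluster analysis, MCA, etc.).

Tags: r vectorization dummy-data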

【Solution 1】:

Vectorized solutions:

Dummy to not dummy

vecDest <- t(apply(vecOrig == 1, 1, which))

Not dummy to dummy (back to the original)

nCol <- 8

vecOrig <- t(apply(vecDest, 1, replace, x = rep(0, nCol), values = 1))
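As an alternative to `apply()`/`replace()` (not part of this answer), the categorical-to-dummy step can also be sketched with matrix indexing: R lets you assign into a matrix using a two-column matrix of (row, column) pairs. The example data and the names `idx` and `viaApply` below are made up for illustration.

```r
set.seed(1)  # only for reproducibility of this sketch
nCol <- 8

# Example categorical data: 18 respondents, 3 distinct picks each.
matOrig <- t(replicate(18, sample(nCol, 3)))

# One (row, column) pair per pick: the respondent's row index and the chosen option.
idx <- cbind(as.vector(row(matOrig)), as.vector(matOrig))

matDest <- matrix(0, nrow = nrow(matOrig), ncol = nCol)
matDest[idx] <- 1  # set all chosen cells in a single assignment

# Compare with the apply()/replace() approach shown above.
viaApply <- t(apply(matOrig, 1, replace, x = rep(0, nCol), values = 1))
print(all(matDest == viaApply))
```

This version needs no transpose and makes no function call per row, which may matter for large matrices.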

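【Comments】:

Thanks, this looks like what I wanted... The first one is similar to mine, but with more sophisticated syntax. The second one is exactly what I was looking for! Cheers.
Can anyone provide some insight into why it has to be transposed?
It has to be transposed because apply automatically uses the returned vectors as the columns of the new matrix.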
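The point about `apply()` returning one column per call can be checked on a tiny matrix. This is just an illustrative sketch; `m` and `res` are made-up names.

```r
# 2 respondents, 4 options, 3 picks each.
m <- matrix(c(1, 1, 0, 1,
              0, 1, 1, 1), nrow = 2, byrow = TRUE)

# Each call to which() returns a vector of length 3; apply() stores each
# returned vector as a COLUMN of the result.
res <- apply(m == 1, 1, which)

print(dim(res))     # 3 x 2: one column per input row
print(dim(t(res)))  # 2 x 3: one row per respondent again
print(t(res))
```

The same applies to any row-wise `apply()` whose function returns vectors of equal length greater than 1.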

【Solution 2】:

This might provide some insight into the first part:

#Create example data
set.seed(42)
vecOrig<-matrix(rbinom(20,1,0.2),nrow=5,ncol=4)

     [,1] [,2] [,3] [,4]
[1,]    1    0    0    1
[2,]    1    0    0    1
[3,]    0    0    1    0
[4,]    1    0    0    0
[5,]    0    0    0    0

Note that this does not assume the same number of 1s in each row (e.g., you wrote "choose up to 3").

#use algebra to create position numbers
vecDest <- t(t(vecOrig)*1:ncol(vecOrig))

     [,1] [,2] [,3] [,4]
[1,]    1    0    0    4
[2,]    1    0    0    4
[3,]    0    0    3    0
[4,]    1    0    0    0
[5,]    0    0    0    0
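The double transpose is needed because R recycles `1:ncol(vecOrig)` down the columns, so the matrix has to be flipped before multiplying. As a hedged alternative that is not part of this answer, base R's `col()` returns a matrix of column indices, which gives the same result without transposing:

```r
set.seed(42)
vecOrig <- matrix(rbinom(20, 1, 0.2), nrow = 5, ncol = 4)

# Version from the answer: transpose, multiply by the column numbers, transpose back.
viaTranspose <- t(t(vecOrig) * 1:ncol(vecOrig))

# col() has the same shape as vecOrig and holds each cell's column index,
# so an element-wise product gives position numbers directly.
viaCol <- vecOrig * col(vecOrig)

print(all(viaTranspose == viaCol))
```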

Now we remove the zeros. Since the rows can then have different lengths, we have to turn the object into a list.

vecDest <- split(t(vecDest), rep(1:nrow(vecDest), each = ncol(vecDest)))
lapply(vecDest,function(x) x[x>0])

$`1`
[1] 1 4

$`2`
[1] 1 4

$`3`
[1] 3

$`4`
[1] 1

$`5`
numeric(0)
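The same list can probably be built in one step by grouping the column index of every 1 by its row index. This is a sketch of an alternative, not this answer's method; `hit` and `picks` are made-up names, and the explicit factor levels are there so that rows without any 1 still appear as empty entries.

```r
set.seed(42)
vecOrig <- matrix(rbinom(20, 1, 0.2), nrow = 5, ncol = 4)

hit <- vecOrig == 1

# Column index of every 1, grouped by the row it sits in. Using a factor with
# all row numbers as levels keeps rows with no picks as empty vectors.
picks <- split(col(vecOrig)[hit],
               factor(row(vecOrig)[hit], levels = seq_len(nrow(vecOrig))))
print(picks)

# Reference: the two-step version from the answer, for comparison.
pos <- t(t(vecOrig) * 1:ncol(vecOrig))
ref <- lapply(split(t(pos), rep(1:nrow(pos), each = ncol(pos))),
              function(x) x[x > 0])
print(all(mapply(function(a, b) identical(as.numeric(a), as.numeric(b)), picks, ref)))
```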

【Comments】: