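Have you considered using caret's dummyVars? It works for me and seems quite fast.
?dummyVars compares the default behavior of model.matrix and dummyVars, but doesn't say much more.
A small performance benchmark on a reproducible example: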
n = 1e3 # observations
m = 1e2 # variables
some_levels <- sort(c(LETTERS, letters))
library('microbenchmark')
set.seed(1234)
df <- data.frame(
    lapply(1:m, function(x){
        switch(sample.int(3,1),
            # "some continuous, some 0-1"
            '1' = rnorm(n), '2' = rbinom(n, 1, 0.5),
            # "some factors with many levels"
            '3' = factor(sample(some_levels, n, TRUE),
                         levels=some_levels )
        )
    })
)
names(df) <- paste0('V',1:m)
#------------- it sounds like you are doing something like this --------------
frm <- as.formula( paste('~', paste(names(df), collapse='+') ) )
library('Matrix')
microbenchmark(
    mm <- sparse.model.matrix(frm, df)
) # mean = .133 sec (YMMV)
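#---------------- you could try something like this --------------------------
library('caret')
# NB: dummyVars() only builds the encoding object; predict(mm2, df) is still
# needed to get the actual matrix, so the timing below covers only the first step
microbenchmark(
    mm2 <- dummyVars(frm, df, fullRank=TRUE)
) # mean = .00954 sec (YMMV)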
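To compare like with like, the `predict()` step that turns the `dummyVars` object into a matrix can be timed together with its construction. A rough sketch using base `system.time()` on a small made-up data frame; `toy`, `lv`, `enc`, `sp`, and `dense` are illustrative names, not from the benchmark above, and no timings are quoted because they depend on the machine:

```r
library(Matrix)
library(caret)

set.seed(1)
# small illustrative data: one numeric column, one factor with 52 levels
lv  <- sort(c(LETTERS, letters))
toy <- data.frame(
    x = rnorm(1000),
    f = factor(sample(lv, 1000, TRUE), levels = lv)
)

# sparse.model.matrix() builds the design matrix in a single call
t_sparse <- system.time(sp <- sparse.model.matrix(~ x + f, toy))

# dummyVars() only records the encoding; predict() does the actual expansion
t_dummy <- system.time({
    enc   <- dummyVars(~ x + f, toy, fullRank = TRUE)
    dense <- predict(enc, toy)
})

t_sparse
t_dummy
dim(sp)     # includes an intercept column
dim(dense)  # dummyVars output has no intercept column
```

Note that `predict()` on a `dummyVars` object returns an ordinary dense matrix, so memory use may matter more than speed for very wide data.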
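Note fullRank = TRUE, so that "factors are encoded to be consistent with model.matrix and the resulting [sic] there are no linear dependencies induced between the columns", per ?dummyVars. You may wish to remove fullRank = TRUE to induce the sparse=TRUE behavior in contr.ltfr, as in sparse.model.matrix. I can't find explicit documentation on this.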
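A minimal sketch of what fullRank changes, on a tiny made-up data frame (`toy`, `full`, and `ltfr` are illustrative names). It only compares the columns produced by the two settings and does not address the sparse question:

```r
library(caret)

# tiny made-up data: one numeric column and one 3-level factor
toy <- data.frame(
    x = c(0.5, 1.2, -0.3, 2.0),
    f = factor(c("a", "b", "c", "a"))
)

# fullRank = TRUE: the first factor level is dropped, as model.matrix would do
full <- predict(dummyVars(~ x + f, toy, fullRank = TRUE), toy)

# default (fullRank = FALSE): one indicator column per factor level
ltfr <- predict(dummyVars(~ x + f, toy), toy)

colnames(full)
colnames(ltfr)
ncol(full)  # x plus 2 indicator columns
ncol(ltfr)  # x plus 3 indicator columns
```

With the default setting the indicator columns for a factor sum to 1 in every row, which is the linear dependency that fullRank = TRUE avoids.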