How to normalize a model.matrix? Answers

【Question title】: How to normalize a model.matrix?
【Posted】: 2015-06-18 14:20:02
【Question】:

# first, create your data.frame
mydf <- data.frame(a = c(1,2,3), b = c(1,2,3), c = c(1,2,3))

# then, create your model.matrix
mym <- model.matrix(as.formula("~ a + b + c"), mydf)

# how can I normalize the model.matrix?

Currently, I have to convert my model.matrix back to a data.frame in order to run my normalization function:

normalize <- function(x) { return ((x - min(x)) / (max(x) - min(x))) }
m.norm <- as.data.frame(lapply(m, normalize))

Is there a way to avoid this step by normalizing the model.matrix directly?

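【Comments】:

What is m? Is it mym, or some transformed data set?
@DavidArenburg m is a faithful stand-in for mym. m comes from my actual code, while mym is something I made up for this question.

Tags: r model.matrix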

【Solution 1】:

You can use the apply function to normalize each column, without converting to a data frame:

apply(mym, 2, normalize)
#   (Intercept)   a   b   c
# 1         NaN 0.0 0.0 0.0
# 2         NaN 0.5 0.5 0.5
# 3         NaN 1.0 1.0 1.0

You probably want to leave the intercept unchanged, with something like:

cbind(mym[,1,drop=FALSE], apply(mym[,-1], 2, normalize))
#   (Intercept)   a   b   c
# 1           1 0.0 0.0 0.0
# 2           1 0.5 0.5 0.5
# 3           1 1.0 1.0 1.0
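
One caveat with the line above: if the formula has only one predictor, mym[,-1] collapses to a plain vector and apply() fails because it needs an object with dimensions. A minimal sketch of a wrapper that avoids this, assuming the intercept column carries the default name "(Intercept)" (normalize_mm is a hypothetical helper name, not from any package):

```r
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical helper: min-max scale every column except the intercept.
# Constant columns still come out as NaN, exactly as with normalize() itself.
normalize_mm <- function(mm) {
  keep <- colnames(mm) != "(Intercept)"
  # drop = FALSE keeps a matrix even when only one column is selected
  mm[, keep] <- apply(mm[, keep, drop = FALSE], 2, normalize)
  mm
}

mydf <- data.frame(a = c(1, 2, 3), b = c(10, 20, 40))
two <- normalize_mm(model.matrix(~ a + b, mydf))
one <- normalize_mm(model.matrix(~ a, mydf))  # single predictor also works
two
one
```

Because the result is assigned back into mm, the matrix keeps its "assign" attribute and column names.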

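【Comments】:

Thanks, this seems to work well. However, could you explain or simplify the syntax around mym[,1,drop=F]? I don't understand what that is doing. Also, why is mym[,-1] needed inside apply()?
I see that mym[,-1] is used because you don't want to normalize the "(Intercept)" column. But what about mym[,1,drop=F]?
I see it now. From the docs: "drop: logical. If TRUE the result is coerced to the lowest possible dimension. The default is to drop if only one column is left, but not to drop if only one row is left." Basically, we are just combining the normalized columns with the original Intercept column.
@user1477388 You got it.
I don't think drop is actually needed here. The cbind function converts a numeric vector into a column matrix.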
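
To make the drop discussion concrete, here is a small sketch using the mym matrix from the question. It suggests that cbind() does work without drop = FALSE, but the first column then loses its "(Intercept)" name, which may or may not matter for your use:

```r
mydf <- data.frame(a = c(1, 2, 3), b = c(1, 2, 3), c = c(1, 2, 3))
mym <- model.matrix(~ a + b + c, mydf)

dim(mym[, 1])                # NULL: a single column is dropped to a plain vector
dim(mym[, 1, drop = FALSE])  # still a one-column matrix, column name included

with_drop_false <- cbind(mym[, 1, drop = FALSE], mym[, -1])
without_drop    <- cbind(mym[, 1], mym[, -1])

colnames(with_drop_false)  # first name is "(Intercept)"
colnames(without_drop)     # first name is an empty string
```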

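【Solution 2】:

Another option is to vectorize this with the very useful matrixStats package (although, TBH, apply is usually also quite efficient when applied to the columns of a matrix). This way you also keep the original data structure: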

library(matrixStats)
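X <- mym[, -1, drop = FALSE]
Max <- colMaxs(X)
Min <- colMins(X)
# sweep() applies each statistic to its own column; a bare `X - Min`
# would recycle Min down the rows and only works here by coincidence
mym[, -1] <- sweep(sweep(X, 2, Min, "-"), 2, Max - Min, "/")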
mym
#   (Intercept)   a   b   c
# 1           1 0.0 0.0 0.0
# 2           1 0.5 0.5 0.5
# 3           1 1.0 1.0 1.0
# attr(,"assign")
# [1] 0 1 2 3
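
A note on column-wise arithmetic in general: when you subtract a vector from a matrix, R recycles the vector down the rows, so per-column statistics need sweep() or a transpose to line up with their columns. A minimal sketch on hypothetical data whose columns have different ranges, using base apply() instead of matrixStats so it runs without extra packages:

```r
X <- cbind(a = c(1, 2, 3), b = c(10, 20, 40))  # hypothetical data
Min <- apply(X, 2, min)
Max <- apply(X, 2, max)

# sweep() pairs Min[j] and (Max - Min)[j] with column j
scaled <- sweep(sweep(X, 2, Min, "-"), 2, Max - Min, "/")
scaled

# Same result via transposition: in t(X) the variables are rows,
# so the default recycling lines up with them
scaled_t <- t((t(X) - Min) / (Max - Min))
all.equal(scaled, scaled_t)
```

In the question's example every column has the same minimum and maximum, so the distinction does not show up there.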

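【Comments】:

【Solution 3】:

If you want to "normalize" in the sense of standardizing, you can use the scale function, which centers each column and rescales it to a standard deviation of 1.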

> scale( mym )
  (Intercept)  a  b  c
1         NaN -1 -1 -1
2         NaN  0  0  0
3         NaN  1  1  1
attr(,"assign")
[1] 0 1 2 3
attr(,"scaled:center")
(Intercept)           a           b           c 
          1           2           2           2 
attr(,"scaled:scale")
(Intercept)           a           b           c 
          0           1           1           1 
> mym
  (Intercept) a b c
1           1 1 1 1
2           1 2 2 2
3           1 3 3 3
attr(,"assign")
[1] 0 1 2 3

As you can see, "normalizing" the entire model matrix makes no sense when there is an "(Intercept)" term. So you could do this instead:

> mym[ , -1 ] <- scale( mym[,-1] )
> mym
  (Intercept)  a  b  c
1           1 -1 -1 -1
2           1  0  0  0
3           1  1  1  1
attr(,"assign")
[1] 0 1 2 3
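
As a quick check of what scale() does here, the sketch below (on hypothetical data with two differently scaled predictors) confirms the column means and standard deviations, and shows that the "scaled:center" and "scaled:scale" attributes are enough to undo the transformation. Note that assigning the result into mym[, -1] stores only the values, so keep the scale() result in its own variable if you need those attributes later:

```r
mydf <- data.frame(a = c(1, 2, 3), b = c(10, 20, 40))  # hypothetical data
mym <- model.matrix(~ a + b, mydf)

z <- scale(mym[, -1])
colMeans(z)      # approximately 0 for every column
apply(z, 2, sd)  # 1 for every column

# scale() records what it subtracted and what it divided by
center <- attr(z, "scaled:center")
spread <- attr(z, "scaled:scale")

# Undo the standardization from those two attributes
back <- sweep(sweep(z, 2, spread, "*"), 2, center, "+")
max(abs(back - mym[, -1]))  # should be numerically zero
```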

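This is actually the model matrix you would get if your default contrasts option were set to "contr.sum" and the columns were factors. Requesting this inside the model.matrix call itself is only accepted when the variables to be "normalized" are factors: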

> mym <- model.matrix(as.formula("~ a + b + c"), mydf, contrasts.arg=list(a="contr.sum"))
Error in `contrasts<-`(`*tmp*`, value = contrasts.arg[[nn]]) : 
  contrasts apply only to factors
> mydf <- data.frame(a = factor(c(1,2,3)), b = c(1,2,3), c = c(1,2,3))
> mym <- model.matrix(as.formula("~ a + b + c"), mydf, contrasts.arg=list(a="contr.sum"))
> mym
  (Intercept) a1 a2 b c
1           1  1  0 1 1
2           1  0  1 2 2
3           1 -1 -1 3 3
attr(,"assign")
[1] 0 1 1 2 3
attr(,"contrasts")
attr(,"contrasts")$a
[1] "contr.sum"

【Comments】: