R中的主成分分析答案

【问题标题】：Principal Component Analysis in RR中的主成分分析
【发布时间】：2012-06-04 03:32:25
【问题描述】：

使用prcomp 函数，我如何使用无监督主成分从同一数据集上的数据集派生而来，该数据集分为测试和训练？

train <- sample(1:nrow(auto), 60000)
x <- as.matrix(auto[,-1])  ##Covariates
y <- auto[,1]                   ##Response
pc <- prcomp(x)             ##Find Principal Components

data <- data.frame(y=y, (x %*% pc$rotation[,1:9]))
fit <- glm(y ~ ., data=data[train,], family="binomial")   ##Train It

prediction <- predict(fit, newdata=data) > 0  ##Prediction on Entire Data Set

error <- mean(y[-train]] != prediction[-train])  ##Mean out of Sample error

【问题讨论】：

请阅读faq 和reproducible examples 上的这个问题。我不知道你在做什么，你需要更深入地描述你想要的东西。而且我什至无法运行你的代码来检查你到底在做什么......

标签： r pca

【解决方案1】：

这是一个可重现的例子：

set.seed(1)
want <- sample(50, 40)
Iris <- iris[c(51:100, 101:150), ] ## only keep versicolor and virginica
## take our training and test sets
train <- droplevels(Iris[c((1:50)[want], (51:100)[want]), , drop = FALSE])
test <- droplevels(Iris[c((1:50)[-want], (51:100)[-want]), , drop = FALSE])

## fit the PCA
pc <- prcomp(train[, 1:4])

现在请注意，pc$x 是旋转后的数据。您使用了X %*% pc$rotation（其中X 是训练数据矩阵），但没有先将数据居中，但它们是等价的。在回归中居中预测变量可能很有用。

## create data frame for logistic regression
mydata <- data.frame(Species = train[, "Species"], pc$x)
## ...and fit the model
mod <- glm(Species ~ PC1, data = mydata, family = binomial)

预测PC1上测试集数据的分数；也就是说，使用与形成训练数据的 PC 相同的旋转来旋转测试集。为此，我们可以为类"prcomp" 使用predict() 方法

test.p <- predict(pc, newdata = test[, 1:4])

现在用它来预测类别

pred <- predict(mod, newdata = data.frame(test.p), type = "response")
pred

> pred
         56          66          67          71          72 
0.080427399 0.393133104 0.092661480 0.395813527 0.048277608 
         74          76          82          87          95 
0.226191156 0.333553423 0.003860679 0.617977807 0.029469167 
        106         116         117         121         122 
0.999648054 0.922145431 0.924464339 0.989271655 0.318477762 
        124         126         132         137         145 
0.581235903 0.995224501 0.999770995 0.964825109 0.988121496 
> 1 - pred
          56           66           67           71           72 
0.9195726006 0.6068668957 0.9073385196 0.6041864731 0.9517223918 
          74           76           82           87           95 
0.7738088439 0.6664465767 0.9961393215 0.3820221934 0.9705308332 
         106          116          117          121          122 
0.0003519463 0.0778545688 0.0755356606 0.0107283449 0.6815222382 
         124          126          132          137          145 
0.4187640970 0.0047754987 0.0002290047 0.0351748912 0.0118785036

pred 包含测试观察是 Iris virginica 的概率。请注意，在glm() 中，当响应是一个因素（如本例中）时，该因素的第一级（此处为versicolor）被视为失败或0，第二个和后续级别指示成功或@987654334 @。因为在这个例子中只有两个类，模型参数化为versicolor； 1 - pred 将给出virginica 的预测概率。

我不遵循您在问题中包含的错误计算，因此将由您自行解决。然而，可以通过以下方式生成模型成功的交叉分类表：

> predSpecies <- factor(ifelse(pred >= 0.5, "virginica", "versicolor"))
> table(test$Species, predSpecies)
            predSpecies
             versicolor virginica
  versicolor          9         1
  virginica           1         9

表明我们的模型有两个测试集观察错误。

【讨论】：

【解决方案2】：

第一步，您需要将数据拆分为训练和测试：否则 PC 分数远非独立。

即PCA 旋转仅从 x [train,] 计算！

然后将相同的旋转应用于 x [test,]

正如@Joran 所说，对于其他一切，都需要可重现的代码。

【讨论】：

理论上你不应该在无人监督的情况下拆分它们。
@JohnSmith 如果您首先使用来自测试集的信息来形成 PC，您将如何解释样本外错误？这不会真实反映真实的样本外表现。
@JohnSmith：仅基于每个样本的计算意味着您不需要拆分。使用更多情况的计算需要拆分，这里是居中，可能是缩放以及旋转。