How to customize a model in caret to perform a PLS-[Classifier] two-step classification model? Answers

【Question Title】: How to customize a model in caret to perform a PLS-[Classifier] two-step classification model?
【Posted】: 2014-02-01 07:02:52
【Question Description】:

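This question is a continuation of the same thread here. Below is a minimal working example taken from this book:

Wehrens R. Chemometrics with R: Multivariate Data Analysis in the Natural Sciences and Life Sciences. 1st edition. Heidelberg; New York: Springer. 2011. (page 250).

The example is taken from this book and its package ChemometricsWithR. It highlights some pitfalls of modeling with cross-validation techniques.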

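Aim:
A cross-validated methodology, in the spirit of the caret package, that uses the same set of repeated CV folds to perform the known strategy of PLS followed typically by LDA, or by cousins such as logistic regression, SVM, C5.0 or CART. PLS would therefore be needed before each call to the waiting classifier, so that the classifier works on the PLS score space instead of on the observations themselves. The nearest approach in the caret package is to use PCA as a pre-processing step before modeling with any classifier. Below is a PLS-LDA procedure with only a single train/test split to test the performance of the classifier; there is no 10-fold CV and no repetition. The code below was taken from the book mentioned above, with some corrections, because otherwise it throws an error: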

library(ChemometricsWithR)
data(prostate)
prostate.clmat <- classvec2classmat(prostate.type) # convert Y to a dummy var

odd <- seq(1, length(prostate.type), by = 2) # training
even <- seq(2, length(prostate.type), by = 2) # holdout test

prostate.pls <- plsr(prostate.clmat ~ prostate, ncomp = 16, validation = "CV", subset=odd)

Xtst <- scale(prostate[even,], center = colMeans(prostate[odd,]), scale = apply(prostate[odd,],2,sd))

tst.scores <- Xtst %*% prostate.pls$projection # scores for the waiting trained LDA to test

prostate.ldapls <- lda(scores(prostate.pls)[,1:16],prostate.type[odd]) # LDA for scores
table(predict(prostate.ldapls, new = tst.scores[,1:16])$class, prostate.type[even])

predictionTest <- predict(prostate.ldapls, new = tst.scores[,1:16])$class

library(caret)    
confusionMatrix(data = predictionTest, reference= prostate.type[even]) # from caret

Output:

Confusion Matrix and Statistics

          Reference
Prediction bph control pca
   bph       4       1   9
   control   1      35   7
   pca      34       4  68

Overall Statistics

               Accuracy : 0.6564          
                 95% CI : (0.5781, 0.7289)
    No Information Rate : 0.5153          
    P-Value [Acc > NIR] : 0.0001874       

                  Kappa : 0.4072          
 Mcnemar's Test P-Value : 0.0015385       

Statistics by Class:

                     Class: bph Class: control Class: pca
Sensitivity             0.10256         0.8750     0.8095
Specificity             0.91935         0.9350     0.5190
Pos Pred Value          0.28571         0.8140     0.6415
Neg Pred Value          0.76510         0.9583     0.7193
Prevalence              0.23926         0.2454     0.5153
Detection Rate          0.02454         0.2147     0.4172
Detection Prevalence    0.08589         0.2638     0.6503
Balanced Accuracy       0.51096         0.9050     0.6643

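However, the confusion matrix does not match the one in the book. The code in the book was broken anyway, but the version here works for me!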

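Notes:
Although this is only a single split, the intention is to agree on the methodology first: the sd and mean of the training set are applied to the test set, which is then projected onto PLS scores based on a specific number of components, ncomp. I want this to happen in every round of the CV in caret. If the methodology as coded here is correct, it can serve as a good starting point for a minimal working example while modifying the code of the caret package.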
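The step described in the note, reusing the training-set mean and sd on the held-out data, can be sketched in base R as follows. This is a minimal illustration on simulated data, not the prostate example; the object names (`X`, `train_idx`, `test_idx`, `train_center`, `train_scale`) are made up for the sketch.

```r
# Simulated data: 40 observations, 3 variables (hypothetical stand-in)
set.seed(42)
X <- matrix(rnorm(40 * 3, mean = 5, sd = 2), ncol = 3)
train_idx <- seq(1, nrow(X), by = 2)  # odd rows for training
test_idx  <- seq(2, nrow(X), by = 2)  # even rows held out

# Statistics are estimated on the training rows only
train_center <- colMeans(X[train_idx, ])
train_scale  <- apply(X[train_idx, ], 2, sd)

# The same statistics are applied to both sets
X_train <- scale(X[train_idx, ], center = train_center, scale = train_scale)
X_test  <- scale(X[test_idx, ],  center = train_center, scale = train_scale)

colMeans(X_train)  # zero up to floating-point error
colMeans(X_test)   # close to zero, but not exactly zero
```

The test set is deliberately not standardized with its own statistics, so its column means are only approximately zero. Inside a resampling loop the same two lines would be recomputed from each fold's training rows.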

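Side notes:
Scaling and centering can get very messy. I think some PLS functions in R scale internally, with or without centering; I am not sure. Building a custom model in caret should therefore be handled with care to avoid missing or multiple scalings or centerings (I am wary of these things).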

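Perils of multiple centering/scaling
The code below is only meant to show how multiple centering/scaling can change the data. Only centering is shown here, but the same problem applies to scaling.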

set.seed(1)
x <- rnorm(200, 2, 1)
xCentered1 <- scale(x, center=TRUE, scale=FALSE)
xCentered2 <- scale(xCentered1, center=TRUE, scale=FALSE)
xCentered3 <- scale(xCentered2, center=TRUE, scale=FALSE)
sapply (list(xNotCentered= x, xCentered1 = xCentered1, xCentered2 = xCentered2, xCentered3 = xCentered3), mean)

Output:

xNotCentered    xCentered1    xCentered2    xCentered3 
 2.035540e+00  1.897798e-16 -5.603699e-18 -5.332377e-18
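To put a number on the size of the change shown above, the following sketch compares data standardized once with the same data standardized twice. It reuses the seed and `x` from the example; `once` and `twice` are names introduced only for this check.

```r
set.seed(1)
x <- rnorm(200, 2, 1)

# Standardize once, then standardize the result again with its own statistics
once  <- scale(x, center = TRUE, scale = TRUE)
twice <- scale(once, center = TRUE, scale = TRUE)

# Largest element-wise difference between the two versions
max(abs(once - twice))

# TRUE means the two versions agree within all.equal()'s default tolerance
isTRUE(all.equal(as.numeric(once), as.numeric(twice)))
```

In this setting the repeated step only moves the values at floating-point level, because the second call recomputes a mean of about 0 and an sd of about 1. The riskier case is a second centering or scaling that uses different statistics, for example ones computed on another subset of the data.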

Please comment if I have missed something along the way. Thanks.

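【Comments on the Question】:

I don't think caret supports custom methods for pre-processing yet. However, you can build a custom model pipeline that includes the pre-processing: caret.r-forge.r-project.org/custom_models.html
Thanks. I looked at the custom models. The complication I could not find addressed there is how to tell train, in code, that the scores of the training CV folds are needed. I would try PLS-LDA first and, if it works, do the same with other classifiers; it acts as a prototype model. So could you first provide code showing how to customize PLS-LDA?
I voted to move this to stackoverflow, where I think it fits better.
@doctorate You would define a custom model that fits a pls-lda model given a dataset. You would then write a predict function for your model that makes predictions given the model fit and a test set. You then supply these functions to caret as a custom method, and caret takes care of passing the correct data to each function for every resample of your dataset.
@Zach: the package I mentioned in the old thread has model fitting for plslda and predict.plslda (and more functions, such as coef and some post-processing). It does not support caret at the moment, though.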
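The fit/predict split described in the comments above can be sketched without caret as a hand-written resampling loop. This is only an illustration of the idea, not caret's implementation: PCA via `prcomp()` stands in for PLS to keep the example free of extra packages, and `fit_two_step()` and `predict_two_step()` are hypothetical helper names.

```r
library(MASS)  # for lda(); MASS ships with standard R installations

# Hypothetical two-step fit: PCA (standing in for PLS) followed by LDA on the scores
fit_two_step <- function(x, y, ncomp) {
  pca <- prcomp(x, center = TRUE, scale. = TRUE)
  scores <- pca$x[, seq_len(ncomp), drop = FALSE]
  list(pca = pca, lda = lda(scores, grouping = y), ncomp = ncomp)
}

# Prediction projects new data with the stored PCA, then classifies the scores
predict_two_step <- function(fit, newdata) {
  scores <- predict(fit$pca, newdata)[, seq_len(fit$ncomp), drop = FALSE]
  predict(fit$lda, scores)$class
}

set.seed(1)
x <- as.matrix(iris[, 1:4])
y <- iris$Species
folds <- sample(rep(1:5, length.out = nrow(x)))  # 5-fold assignment

# Both steps are refitted on the training part of every fold
acc <- sapply(1:5, function(k) {
  held_out <- folds == k
  fit <- fit_two_step(x[!held_out, ], y[!held_out], ncomp = 2)
  mean(predict_two_step(fit, x[held_out, ]) == y[held_out])
})
mean(acc)  # cross-validated accuracy
```

The point of the sketch is that the projection is estimated inside each fold, so the held-out rows never influence the scores they are later classified on.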

Tags: r classification cross-validation r-caret

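【Solution 1】:

If you want to fit these types of models with caret, you need the latest version on CRAN. The last update was made so that people can use non-standard models as they see fit.

My approach below is to fit the PLS model and the other model jointly (I use random forest in the example below) and to tune them at the same time. So for each fold, a 2D grid of ncomp and mtry is used.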
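As a small illustration of such a 2D grid, the sketch below builds one with base R for assumed values of 10 predictors (`p`) and a grid length of 4 (`len`); both numbers are made up. Combinations with mtry larger than ncomp are dropped, because the random forest only sees ncomp score columns.

```r
p   <- 10  # assumed number of predictors
len <- 4   # assumed grid length per parameter

# All combinations of the two tuning parameters
grid <- expand.grid(ncomp = seq(1, min(p - 1, len), by = 1),
                    mtry  = 1:len)

# mtry cannot exceed the number of PLS components passed to the forest
grid <- subset(grid, mtry <= ncomp)
nrow(grid)  # 10 combinations remain out of 16
```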

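The "trick" is to attach the PLS loadings to the random forest object so that they can be used at prediction time. Here is the code that defines the model (classification only):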

 modelInfo <- list(label = "PLS-RF",
              library = c("pls", "randomForest"),
              type = "Classification",
              parameters = data.frame(parameter = c('ncomp', 'mtry'),
                                      class = c("numeric", 'numeric'),
                                      label = c('#Components', 
                                                '#Randomly Selected Predictors')),
              grid = function(x, y, len = NULL) {
                grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1),
                            mtry = 1:len)
                grid <- subset(grid, mtry <= ncomp)
                },
              loop = NULL,
              fit = function(x, y, wts, param, lev, last, classProbs, ...) { 
                     ## First fit the pls model, generate the training set scores,
                     ## then attach what is needed to the random forest object to 
                     ## be used later
                     pre <- plsda(x, y, ncomp = param$ncomp)
                     scores <- pls:::predict.mvr(pre, x, type = "scores")
                     mod <- randomForest(scores, y, mtry = param$mtry, ...)
                     mod$projection <- pre$projection
                     mod
                   },
                   predict = function(modelFit, newdata, submodels = NULL) {       
                     scores <- as.matrix(newdata)  %*% modelFit$projection
                     predict(modelFit, scores)
                   },
                   prob = NULL,
                   varImp = NULL,
                   predictors = function(x, ...) rownames(x$projection),
                   levels = function(x) x$obsLevels,
                   sort = function(x) x[order(x[,1]),])
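One detail worth checking in the predict step above: multiplying new data by a stored projection matrix only reproduces the model's scores if the new data have been centered like the training data. As far as I know, models fitted by the pls package center X internally and keep the column means in `Xmeans`, so the code above relies on train's preProc = c("center", "scale") having centered the data beforehand. The sketch below illustrates the effect with PCA from base R standing in for PLS; all object names are made up.

```r
set.seed(3)
X_train <- matrix(rnorm(50 * 4, mean = 10), ncol = 4)
X_new   <- matrix(rnorm(5 * 4,  mean = 10), ncol = 4)

pca <- prcomp(X_train, center = TRUE, scale. = FALSE)
P  <- pca$rotation[, 1:2]  # projection matrix for two components
mu <- pca$center           # training column means

raw_scores      <- X_new %*% P                 # centering skipped
centered_scores <- sweep(X_new, 2, mu) %*% P   # training means removed first
reference       <- predict(pca, X_new)[, 1:2]  # what the model itself returns

# TRUE: centering with the training means reproduces the reference scores
isTRUE(all.equal(unname(centered_scores), unname(reference)))
# FALSE: the uncentered projection is shifted
isTRUE(all.equal(unname(raw_scores), unname(reference)))
```

If the custom model might be called without centering in preProc, a more defensive variant would compute the scores with the PLS object's own predict method, as the fit function already does for the training data.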

Here is the call to train:

 library(ChemometricsWithR)
 data(prostate)

 set.seed(1)
 inTrain <- createDataPartition(prostate.type, p = .90)
 trainX <-prostate[inTrain[[1]], ]
 trainY <- prostate.type[inTrain[[1]]]
 testX <-prostate[-inTrain[[1]], ]
 testY <- prostate.type[-inTrain[[1]]]

 ## These will take a while for these data
 set.seed(2)
 plsrf <- train(trainX, trainY, method = modelInfo,
                preProc = c("center", "scale"),
                tuneLength = 10,
                trControl = trainControl(method = "repeatedcv",
                                         repeats = 5))

 ## How does random forest do on its own?
 set.seed(2)
 rfOnly <- train(trainX, trainY, method = "rf",
                tuneLength = 10,
                trControl = trainControl(method = "repeatedcv",
                                         repeats = 5))

Just for fun, I got:

 > getTrainPerf(plsrf)
   TrainAccuracy TrainKappa method
 1     0.7940423    0.65879 custom
 > getTrainPerf(rfOnly)
   TrainAccuracy TrainKappa method
 1     0.7794082  0.6205322     rf

and

 > postResample(predict(plsrf, testX), testY)
  Accuracy     Kappa 
 0.7741935 0.6226087 
 > postResample(predict(rfOnly, testX), testY)
  Accuracy     Kappa 
 0.9032258 0.8353982

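Max

【Comments】:

Thanks a lot. One thing: I compared the code above with the code for method="rf" in caret's models subdirectory, and there predict is called with type = "prob". So should the code above be predict(modelFit, scores, type="prob")?
Could you check the PLS-LDA code below? My guess is that it is not right. Why? I tried the Sonar data from the caret vignette, once with the built-in method="pls" and a second time with the custom PLS-LDA below, and the results were identical down to the last digit, which cannot be. So for some reason the code does not do what is intended; please revise it.
Your code looks fine. I'm guessing that the results are the same (in this case) because my plsda function can produce class probabilities using Bayes' rule (via the NaiveBayes function in the klaR package). For two classes, LDA and naive Bayes probably produce the same results (if normality is assumed when computing the naive Bayes model).
You are most likely right about the two-class problem. I posted the iris data, which shows no such problem. Thanks for your support.
@topepo: Can this also be used for model ensembles, for example random forest and SVR? Thanks!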

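【Solution 2】:

Based on Max's valuable comments, I felt the need to bring in iris as a referee. It is famous for classification and, more importantly, its Species outcome has more than two classes, which makes it a good data set for testing the PLS-LDA custom model in caret: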

data(iris)
names(iris)
head(iris)
dim(iris) # 150x5
set.seed(1)
inTrain <- createDataPartition(y = iris$Species,
                               ## the outcome data are needed
                               p = .75,
                               ## The percentage of data in the
                               ## training set
                               list = FALSE)
## The format of the results
## The output is a set of integers for the rows of Iris
## that belong in the training set.
training <- iris[ inTrain,] # 114
testing <- iris[-inTrain,] # 36

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     classProbs = TRUE)
set.seed(2)
plsFitIris <- train(Species ~ .,
                   data = training,
                   method = "pls",
                   tuneLength = 4,
                   trControl = ctrl,
                   preProc = c("center", "scale"))
plsFitIris
plot(plsFitIris)


set.seed(2)
plsldaFitIris <- train(Species ~ .,
                      data = training,
                      method = modelInfo,
                      tuneLength = 4,
                      trControl = ctrl,
                      preProc = c("center", "scale"))

plsldaFitIris
plot(plsldaFitIris)

Now compare the two models:

getTrainPerf(plsFitIris)
  TrainAccuracy TrainKappa method
1     0.8574242  0.7852462    pls
getTrainPerf(plsldaFitIris)
  TrainAccuracy TrainKappa method
1      0.975303  0.9628179 custom
postResample(predict(plsFitIris, testing), testing$Species)
Accuracy    Kappa 
   0.750    0.625 
postResample(predict(plsldaFitIris, testing), testing$Species)
 Accuracy     Kappa 
0.9444444 0.9166667

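So the expected difference finally appears, along with an improvement in the metrics. This supports Max's view that the identical results in the two-class problem are due to the Bayesian probability approach of the plsda function.

【Comments】:

Hi, how can I implement MASS::polr ordinal logistic regression as a caret method? Many thanks.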

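【Solution 3】:

You need to wrap the CV around both PLS and LDA.
Yes, both plsr and lda center the data in their own way.
I had a closer look at caret::preProcess (): as it is defined now, you will not be able to use PLS as a pre-processing method, because PLS is supervised, while caret::preProcess () uses only unsupervised methods (there is no way to hand over the dependent variable). That would probably make patching rather difficult.
So within the caret framework, you will need to go for a custom model.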

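【Comments】:

+1, noted that a custom model is the way to go. If lda and pls do their own centering and caret then centers/scales again, we are in trouble (see the perils of multiple centering/scaling). I wonder whether the caret package is aware of this subtle issue? I would appreciate it if you could provide code for the prototype model PLS-LDA.
@doctorate: please send me an email at chemmetrie@beleites.de saying whether you need .zip for Win or .tar.gz (or both), as you probably don't want to check out the whole hyperSpec svn repo from r-forge (> 4GB) to get a few kB of code...
@doctorate Caret does not automatically center and scale your data unless you set preProcess=c('center', 'scale').
@Zach, then based on cbeleites' answer, if lda does its own centering and I naively pass preProc = c("center","scale"), I am in trouble, right? Otherwise, what do you usually do to avoid the problem of multiple centering and scaling?
@cbeleites, thanks a lot for your generous offer; email sent.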

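【Solution 4】:

For a custom model of the PLS-LDA type, built on the code provided by Max (the maintainer of caret), something in the code below is not right, but I could not figure out what. I used the same Sonar data set as in the caret vignette and tried to reproduce the results, once with method="pls" and once with the PLS-LDA custom model below, and the results were identical down to the last digit, which makes no sense. For benchmarking, a known data set is needed (I think a cross-validated PLS-LDA on the iris data set fits here, since iris is famous for this type of analysis and a cross-validated treatment of it should exist somewhere). Everything except the code in question should be the same (set.seed(xxx) and the number of K-CV repetitions), so that the code below can be compared and judged properly: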

modelInfo <- list(label = "PLS-LDA",
                  library = c("pls", "MASS"),
                  type = "Classification",
                  parameters = data.frame(parameter = c("ncomp"),
                                          class = c("numeric"),
                                          label = c("#Components")),
                  grid = function(x, y, len = NULL) {
                    grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1))
                  },
                  loop = NULL,
                  fit = function(x, y, wts, param, lev, last, classProbs, ...) { 
                    ## First fit the pls model, generate the training set scores,
                    ## then attach what is needed to the lda object to 
                    ## be used later
                    pre <- plsda(x, y, ncomp = param$ncomp)
                    scores <- pls:::predict.mvr(pre, x, type = "scores")
                    mod <- lda(scores, y, ...)
                    mod$projection <- pre$projection
                    mod
                  },
                  predict = function(modelFit, newdata, submodels = NULL) {       
                    scores <- as.matrix(newdata)  %*% modelFit$projection
                    predict(modelFit, scores)$class
                  },
                  prob = function(modelFit, newdata, submodels = NULL) {       
                    scores <- as.matrix(newdata)  %*% modelFit$projection
                    predict(modelFit, scores)$posterior
                  },
                  varImp = NULL,
                  predictors = function(x, ...) rownames(x$projection),
                  levels = function(x) x$obsLevels,
                  sort = function(x) x[order(x[,1]),])

At Zach's request, the code below uses method="pls" in caret; it is exactly the concrete example from the caret vignette on CRAN:

library(mlbench) # data set from here
data(Sonar)
dim(Sonar) # 208x60
set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class,
                               ## the outcome data are needed
                               p = .75,
                               ## The percentage of data in the
                               ## training set
                               list = FALSE)
## The format of the results
## The output is a set of integers for the rows of Sonar
## that belong in the training set.
training <- Sonar[ inTrain,] #157
testing <- Sonar[-inTrain,] # 51

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 3,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(108)
plsFitSon <- train(Class ~ .,
                data = training,
                method = "pls",
                tuneLength = 15,
                trControl = ctrl,
                metric = "ROC",
                preProc = c("center", "scale"))
plsFitSon
plot(plsFitSon) # might differ slightly from the vignette due to randomness

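Now, the code below is a pilot run that classifies the Sonar data with the custom PLS-LDA model, which is the model in question. It is expected to give numbers that differ from those obtained with PLS alone: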

set.seed(108)
plsldaFitSon <- train(Class ~ .,
                   data = training,
                   method = modelInfo,
                   tuneLength = 15,
                   trControl = ctrl,
                   metric = "ROC",
                   preProc = c("center", "scale"))

Now compare the results of the two models:

getTrainPerf(plsFitSon)
   TrainROC TrainSens TrainSpec method
1 0.8741154 0.7638889 0.8452381    pls
getTrainPerf(plsldaFitSon)
   TrainROC TrainSens TrainSpec method
1 0.8741154 0.7638889 0.8452381 custom

postResample(predict(plsFitSon, testing), testing$Class)
Accuracy    Kappa 
0.745098 0.491954 
postResample(predict(plsldaFitSon, testing), testing$Class)
Accuracy    Kappa 
0.745098 0.491954

So the results are exactly the same, which cannot be. It looks as if the lda model was not added?

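【Comments】:

If I copy and paste your modelInfo and Max's train/test code, I get different results for method=modelInfo and method='lda'. Please post the code (including random seeds) that produces the results you describe.
@Zach, I added the Sonar example. However, if you try method="pls" and then modelInfo, which is the custom PLS-LDA, I guess you will get the same results.
I was able to reproduce the same thing with the Sonar data. See my comment above about why (or at least my guess as to why).
Hi, how can I implement MASS::polr ordinal logistic regression as a caret method? Many thanks.
For training, it looks like your modelInfo is written for the train(X, Y) form, but you are supplying the train(Class ~ ., ...) form. Try extracting X and Y from training and using the train(X, Y) format.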
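The last suggestion, separating predictors and outcome before calling train, could look like the sketch below. The small data frame is a made-up stand-in for the Sonar `training` set (only the outcome column name `Class` is taken from the post), and the train call is left commented out because it needs caret plus the `modelInfo` and `ctrl` objects defined above.

```r
# Hypothetical stand-in for the Sonar training data frame
set.seed(4)
training <- data.frame(V1 = rnorm(10),
                       V2 = rnorm(10),
                       Class = factor(rep(c("M", "R"), each = 5)))

# Predictors as a matrix, outcome as a factor
trainX <- as.matrix(training[, setdiff(names(training), "Class")])
trainY <- training$Class
dim(trainX)

# With caret loaded and modelInfo/ctrl defined as in the post:
# plsldaFitSon <- train(x = trainX, y = trainY, method = modelInfo,
#                       tuneLength = 15, trControl = ctrl, metric = "ROC",
#                       preProc = c("center", "scale"))
```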