如何在 R 中为 Bootstrap Probit 模型编码选择答案

【问题标题】：How to Code Selection for Bootstrap Probit Models in R如何在 R 中为 Bootstrap Probit 模型编码选择
【发布时间】：2014-09-19 18:28:37
【问题描述】：

这个问题是关于如何在具有边际效应的概率模型中编码变量选择（直接或通过调用一些预先存在的包）。

作为与 TLAPD 相关的blog post，我正在对电影的免费和商业可用性对这些电影的盗版水平的影响进行一点概率回归。

在 R 中运行概率的简单方法通常是通过glm，即：

probit <- glm(y ~ x1 + x2, data=data, family =binomial(link = "probit"))

但这对于解释来说是有问题的，因为它不提供边际效应。

通常，如果我想从概率回归中获得边际效应，我会定义这个函数（我不记得原始来源，但它是一个很受欢迎的函数，经常被重新发布）：

mfxboot <- function(modform,dist,data,boot=500,digits=3){
  x <- glm(modform, family=binomial(link=dist),data)
  # get marginal effects
  pdf <- ifelse(dist=="probit",
                mean(dnorm(predict(x, type = "link"))),
                mean(dlogis(predict(x, type = "link"))))
  marginal.effects <- pdf*coef(x)
  # start bootstrap
  bootvals <- matrix(rep(NA,boot*length(coef(x))), nrow=boot)
  set.seed(1111)
  for(i in 1:boot){
    samp1 <- data[sample(1:dim(data)[1],replace=T,dim(data)[1]),]
    x1 <- glm(modform, family=binomial(link=dist),samp1)
    pdf1 <- ifelse(dist=="probit",
                   mean(dnorm(predict(x, type = "link"))),
                   mean(dlogis(predict(x, type = "link"))))
    bootvals[i,] <- pdf1*coef(x1)
  }
  res <- cbind(marginal.effects,apply(bootvals,2,sd),marginal.effects/apply(bootvals,2,sd))
  if(names(x$coefficients[1])=="(Intercept)"){
    res1 <- res[2:nrow(res),]
    res2 <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),res1)),nrow=dim(res1)[1])
    rownames(res2) <- rownames(res1)
  } else {
    res2 <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep="")),nrow=dim(res)[1]))
    rownames(res2) <- rownames(res)
  }
  colnames(res2) <- c("marginal.effect","standard.error","z.ratio")
  return(res2)
}

然后像这样运行回归：

mfxboot(modform = "y ~ x1 + x2",
        dist = "probit",
        data = piracy)

但是使用这种方法我不知道我可以运行任何变量选择算法，例如向前、向后、逐步等。

解决此问题的最佳方法是什么？是否有更好的方法在 R 中运行概率，报告边际效应并允许自动模型选择？或者我应该专注于使用mfxboot 并使用该函数进行变量选择？

谢谢！

【问题讨论】：

这个问题的答案需要大量的统计成分。如果您将此迁移到 stats.SE 或在那里提出新问题，我将很乐意回答这个问题。
@fgnu 谢谢，虽然我尝试在 Crossvalidated/stats.SE 上提出一个非常相似/相关的问题，但他们发给我说它对 R 来说太具体了。感谢您在下面的回答。我会投票和评论。

标签： r regression feature-selection

【解决方案1】：

尚不清楚为什么会出现问题。模型（变量）的选择和给定模型的边际效应计算是连续的，没有理由尝试将两者结合起来。

以下是计算边际效应及其自举标准效应后模型（变量）选择的方法：

使用您喜欢的模型选择程序执行变量选择（包括下面讨论的引导模型选择技术，不要与您将用于计算边际效应的标准误差的引导程序混淆最终模型）。

这是this question 中提供的数据集的示例。另请注意，这绝不是对使用逐步变量选择方法的认可。

#================================================
# read in data, and perform variable selection for
#   a probit model
#================================================
dfE = read.csv("ENAE_Probit.csv")
formE = emploi ~ genre + 
  filiere + satisfaction + competence + anglais
glmE = glm(formula = formE, 
           family = binomial(link = "probit"),
           data = dfE)

# perform model (variable) selection
glmStepE = step(object = glmE)

现在将所选模型传递给计算边际效应的函数。

#================================================
# function: compute marginal effects for logit and probit models
# NOTE: this assumes that an intercept has been included by default
#================================================
fnMargEffBin = function(objBinGLM) {
  stopifnot(objBinGLM$family$family == "binomial")
  vMargEff = switch(objBinGLM$family$link, 
                    probit = colMeans(outer(dnorm(predict(objBinGLM, 
                                                         type = "link")),
                                           coef(objBinGLM))[, -1]),
                    logit = colMeans(outer(dlogis(predict(objBinGLM, 
                                                        type = "link")),
                                          coef(objBinGLM))[, -1])
  )
  return(vMargEff)
}

# test the function
fnMargEffBin(glmStepE)

这是输出：

> fnMargEffBin(glmStepE)
     genre    filiere 
0.06951617 0.04571239

要获得边际效应的标准误差，您可以引导边际效应，例如，使用 car 函数中的 Boot 函数，因为它提供了如此简洁的接口来引导从 @ 导出的统计信息987654329@ 估计。

#================================================
# compute bootstrap std. err. for the marginal effects
#================================================
margEffBootE = Boot(object = glmStepE, f = fnMargEffBin, 
     labels = names(coef(glmE))[-1], R = 100)
summary(margEffBootE)

这是输出：

> summary(margEffBootE)
          R original  bootBias   bootSE  bootMed
genre   100 0.069516 0.0049706 0.045032 0.065125
filiere 100 0.045712 0.0013197 0.011714 0.048900

附录：

出于理论上的兴趣，有两种方法可以解释您的自举变量选择问题。

您可以使用引导模型拟合标准作为拟合度量来执行模型选择（变量选择）。 Shao (1996) 中概述了这方面的理论，并且需要二次抽样方法。
然后，您可以根据上面选择的最佳模型计算边际效应及其引导标准误差。
您可以对多个 bootstrap 样本执行变量选择，并通过查看多个 bootstrap 模型选择中保留的变量来得出任一最佳模型，或者使用模型平均估计器。这方面的理论在Austin and Tu (2004) 中讨论。
然后，您可以根据上面选择的最佳模型计算边际效应及其引导标准误差。

【讨论】：

非常感谢您的详细回答。我会 +1，如果没有更好的答案，我会在一段时间后将其标记为解决方案。但是，出于几个原因，我实际上确实想一步完成选择+引导，我应该更好地澄清这一点。你最后提到的原因是其中很大一部分，另一个是我通常处理大数据，运行模型选择需要很长时间，引导程序需要更长的时间，因为我认为有一些方法可以做到这一点认为在一个函数/步骤中估计所有内容可能会更快。