Incorrect logistic regression output

【Question title】: Incorrect logistic regression output
【Posted】: 2018-04-11 13:18:36
【Question】:

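I am running a logistic regression on the Boston data using a column high.medv (yes/no) that indicates whether the median house price given by the column medv is above 25.

Below is my logistic regression code.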

high.medv <- ifelse(Boston$medv>25, "Y", "N") # Apply the desired condition to medv and store the result in a new variable called "high.medv"

ourBoston <- data.frame (Boston, high.medv)
ourBoston$high.medv <- as.factor(ourBoston$high.medv)
attach(Boston)
# 70% of data <- Train
train2<- subset(ourBoston,sample==TRUE)
# 30% will be Test
test2<- subset(ourBoston, sample==FALSE)
glm.fit <- glm (high.medv ~ lstat,data = train2, family = binomial)
summary(glm.fit)

The output is as follows:

Deviance Residuals: 
[1]  0

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -22.57   48196.14       0        1
lstat             NA         NA      NA       NA

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 0.0000e+00  on 0  degrees of freedom
Residual deviance: 3.1675e-10  on 0  degrees of freedom
AIC: 2

Number of Fisher Scoring iterations: 21

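I also need to use the misclassification rate as the error measure for two cases:

using lstat as the only predictor, and

using all predictors except high.medv and medv. But I am stuck on the regression itself.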

【Comments】:

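Why is the output incorrect? NAs are usually missing values, incorrect formatting, or a by-product of the modelling. Share a sample of your data so we can spot the problem.
The Boston data is in the MASS package. @elle - what is sample in your subset? It does not appear to be a variable in the ourBoston df.
@FelipeAlvarenga - The data is available in library(MASS) as Boston. The NA is an erroneous output caused by incorrect formatting; can you help me correct that error?
@Mike - The variable I am talking about here is "high.medv", which I created (the first four lines of my code).
@FelipeAlvarenga I did manage to get rid of the error; there was a mistake in my sampling. However, if anyone can still help me with the later (misclassification) part, that would be great.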
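
The sampling problem raised in these comments can be sketched as follows. This is only a guess at what was intended: it assumes `sample` was meant to be a logical vector marking roughly 70% of the rows, and it uses the hypothetical names `in_train`, `fit_lstat`, `fit_all` and `misclass`, plus an arbitrary seed and a 0.5 threshold, none of which come from the thread.

```r
library(MASS)  # provides the Boston data frame

set.seed(1)  # arbitrary seed, used only for reproducibility
ourBoston <- data.frame(Boston,
                        high.medv = factor(ifelse(Boston$medv > 25, "Y", "N")))

# Assumption: a logical indicator marking about 70% of rows as training data.
# It is called in_train rather than sample to avoid clashing with base::sample().
in_train <- runif(nrow(ourBoston)) < 0.7
train2 <- ourBoston[in_train, ]
test2  <- ourBoston[!in_train, ]

# Case 1: lstat only. Case 2: every predictor except medv
# (high.medv is the response, so it is excluded automatically).
fit_lstat <- glm(high.medv ~ lstat, data = train2, family = binomial)
fit_all   <- glm(high.medv ~ . - medv, data = train2, family = binomial)

# Misclassification rate on new data at a given (hypothetical) threshold
misclass <- function(fit, newdata, threshold = 0.5) {
  p_yes <- predict(fit, newdata = newdata, type = "response")  # P(high.medv == "Y")
  pred  <- ifelse(p_yes > threshold, "Y", "N")
  mean(pred != newdata$high.medv)
}

misclass(fit_lstat, test2)
misclass(fit_all, test2)
```

With a training set that actually contains rows, the lstat coefficient should no longer be NA. The second model may warn about fitted probabilities of 0 or 1, which is a warning rather than an error.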

Tags: r

【Solution 1】:

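For every classification algorithm, the art lies in choosing the threshold by which you decide whether an outcome is positive or negative.

When you predict the outcomes in the test data set, you estimate the probability that the response variable is 1 rather than 0. You therefore need to say where to cut: the threshold at which a prediction switches from 0 to 1.

A higher threshold is more conservative about labelling a case as positive, which makes false positives less likely and false negatives more likely. A low threshold does the opposite.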
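
To make that trade-off concrete, here is a minimal sketch on simulated data. The seed, the sample size, the thresholds and the helper name `errors_at` are all illustrative choices, and the counts are computed in-sample for brevity.

```r
set.seed(1)                          # arbitrary seed for reproducibility
x <- rnorm(500)
p <- 1 / (1 + exp(-(1 + 2 * x)))     # true probabilities via the inverse logit
y <- rbinom(500, 1, p)

fit  <- glm(y ~ x, family = binomial)
phat <- fitted(fit)                  # in-sample predicted probabilities

# Count false positives and false negatives at a given threshold
errors_at <- function(threshold) {
  pred <- as.numeric(phat > threshold)
  c(threshold = threshold,
    false_pos = sum(pred == 1 & y == 0),
    false_neg = sum(pred == 0 & y == 1))
}

res <- t(sapply(c(0.3, 0.5, 0.7, 0.9), errors_at))
res
```

Reading down the rows, the false positive count can only fall or stay the same as the threshold rises, while the false negative count can only rise or stay the same.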

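The usual procedure is to plot the rates you care about, for example true positives against false positives, and then pick the trade-off that suits you best.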

set.seed(666)
# simulation of logistic data
x1 = rnorm(1000)            # some continuous variables 
z  = 1 + 2*x1               # linear combination with a bias
pr = 1/(1 + exp(-z))        # pass through an inv-logit function
y  = rbinom(1000, 1, pr)    

df       = data.frame(y = y, x1 = x1)
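df$train = 0
# note: sample() only permutes 1:666 here, so the first two thirds of the rows
# become the training set; that is harmless because the simulated rows are i.i.d.
df$train[sample(1:(2*nrow(df)/3))] = 1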

df$new_y = NA
# modelling the response variable
mod = glm(y ~ x1, data = df[df$train == 1,], family = "binomial")
df$new_y[df$train == 0] = predict(mod, newdata = df[df$train == 0,], type = 'response') # predicted probabilities

dat = df[df$train==0,] # test data

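To evaluate your model by its misclassification error, you first need to set a threshold. For this you can use the roc function from the pROC package, which computes the rates and gives the corresponding thresholds: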

library(pROC)

rates =roc(dat$y, dat$new_y)
plot(rates) # visualize the trade-off

rates$specificities # true negatives as a share of all actual negatives, one value per threshold
rates$thresholds # shows you the corresponding thresholds

dat$jj = as.numeric(dat$new_y>0.7) # using 0.7 as a threshold to indicate that we predict y = 1
table(dat$y, dat$jj) # confusion table (rows: actual y, columns: prediction) at the 0.7 threshold
      0   1
  0  86  20
  1  64 164

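The accuracy of your model can be computed as the number of observations you classified correctly divided by the size of your sample. The misclassification rate is one minus that.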
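
As a worked example, the counts from the table above can be typed in by hand and turned into both measures. The object names `conf`, `accuracy` and `misclass` are only illustrative.

```r
# Confusion table from the answer: rows are the actual y, columns the prediction
conf <- matrix(c(86, 64, 20, 164), nrow = 2,
               dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy <- sum(diag(conf)) / sum(conf)   # (86 + 164) / 334
misclass <- 1 - accuracy                  # (20 + 64) / 334

round(c(accuracy = accuracy, misclassification = misclass), 3)
```

At the 0.7 threshold this gives an accuracy of about 0.749 and a misclassification rate of about 0.251.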

【Comments】: