【Question title】: Logistic regression results different in Scikit python and R?
【Posted】: 2016-10-18 19:39:44
【Question】:

I am running logistic regression on the iris dataset in both R and Python, but the two give different results (coefficients, intercept, and scores).

# Python code.
    In[23]: iris_df.head(5)
    Out[23]: 
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
    0           5.1          3.5           1.4          0.2        0
    1           4.9          3.0           1.4          0.2        0
    2           4.7          3.2           1.3          0.2        0
    3           4.6          3.1           1.5          0.2        0
    In[35]: iris_df.shape
    Out[35]: (100, 5)
    #looking at the levels of the Species dependent variable..

        In[25]: iris_df['Species'].unique()
        Out[25]: array([0, 1], dtype=int64)

    #creating dependent and independent variable datasets..

        # .ix has been removed from pandas; .iloc does the same positional slicing
        x = iris_df.iloc[:, 0:4]
        y = iris_df.iloc[:, -1]

    #modelling starts..
    y = np.ravel(y)
    logistic = LogisticRegression()
    model = logistic.fit(x,y)
    #getting the model coefficients..
    model_coef= pd.DataFrame(list(zip(x.columns, np.transpose(model.coef_))))
    model_intercept = model.intercept_
    In[30]: model_coef
    Out[36]: 
                  0                  1
    0  Sepal.Length  [-0.402473917528]
    1   Sepal.Width   [-1.46382924771]
    2  Petal.Length    [2.23785647964]
    3   Petal.Width     [1.0000929404]
    In[31]: model_intercept
    Out[31]: array([-0.25906453])
    #scores...
    In[34]: logistic.predict_proba(x)
    Out[34]: 
    array([[ 0.9837306 ,  0.0162694 ],
           [ 0.96407227,  0.03592773],
           [ 0.97647105,  0.02352895],
           [ 0.95654126,  0.04345874],
           [ 0.98534488,  0.01465512],
           [ 0.98086592,  0.01913408],

R code.

> str(irisdf)
'data.frame':   100 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : int  0 0 0 0 0 0 0 0 0 0 ...

 > model <- glm(Species ~ ., data = irisdf, family = binomial)
Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 
> summary(model)

Call:
glm(formula = Species ~ ., family = binomial, data = irisdf)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-1.681e-05  -2.110e-08   0.000e+00   2.110e-08   2.006e-05  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)       6.556 601950.324       0        1
Sepal.Length     -9.879 194223.245       0        1
Sepal.Width      -7.418  92924.451       0        1
Petal.Length     19.054 144515.981       0        1
Petal.Width      25.033 216058.936       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.3863e+02  on 99  degrees of freedom
Residual deviance: 1.3166e-09  on 95  degrees of freedom
AIC: 10

Number of Fisher Scoring iterations: 25

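Because of the convergence problem, I increased the maximum number of iterations and loosened the convergence tolerance epsilon (0.01 in the call below).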

> model <- glm(Species ~ ., data = irisdf, family = binomial,control = glm.control(epsilon=0.01,trace=FALSE,maxit = 100))
> summary(model)

Call:
glm(formula = Species ~ ., family = binomial, data = irisdf, 
    control = glm.control(epsilon = 0.01, trace = FALSE, maxit = 100))

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-0.0102793  -0.0005659  -0.0000052   0.0001438   0.0112531  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)     1.796    704.352   0.003    0.998
Sepal.Length   -3.426    215.912  -0.016    0.987
Sepal.Width    -4.208    123.513  -0.034    0.973
Petal.Length    7.615    159.478   0.048    0.962
Petal.Width    11.835    285.938   0.041    0.967

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.3863e+02  on 99  degrees of freedom
Residual deviance: 5.3910e-04  on 95  degrees of freedom
AIC: 10.001

Number of Fisher Scoring iterations: 12

#R scores..
> scores = predict(model, newdata = irisdf, type = "response")
> head(scores,5)
           1            2            3            4            5 
2.844996e-08 4.627411e-07 1.848093e-07 1.818231e-06 2.631029e-08

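The scores, intercept, and coefficients are completely different between R and Python. Which one is correct? I want to continue in Python, but I am now confused about which result is accurate.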

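【Comments】:

One possible reason I can think of is that R and sklearn use different unconstrained optimization methods for the MLE; this could cause them to end up at different local optima of the log-likelihood function.
You may also want to follow the developments on a similar question: stackoverflow.com/questions/37872536/…

Tags: python r machine-learning regression logistic-regression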

【Solution 1】:

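Update: The problem is that there is perfect separation along the petal width variable. In other words, this variable alone can perfectly predict whether a sample in the given dataset is setosa or versicolor. This breaks the maximum likelihood estimation used by logistic regression in R: the log-likelihood can be pushed ever closer to its upper bound of 0 by letting the petal width coefficient grow toward infinity, so no finite maximum exists.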
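To make the separation argument concrete, here is a minimal sketch in plain Python. The petal widths, the threshold of 0.8, and the helper names are assumptions chosen for illustration, not values taken from the post. It evaluates the log-likelihood of a one-feature logistic model for increasingly large coefficients and shows that it keeps rising toward 0 without ever reaching a maximum.

```python
import math

def log_sigmoid(t):
    """Numerically stable log(1 / (1 + exp(-t)))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def log_likelihood(w, threshold, widths, labels):
    """Log-likelihood of the model P(y=1 | x) = sigmoid(w * (x - threshold))."""
    total = 0.0
    for x, y in zip(widths, labels):
        z = w * (x - threshold)
        total += log_sigmoid(z if y == 1 else -z)
    return total

# Hypothetical petal widths: every class-0 value lies below every class-1 value.
widths = [0.1, 0.2, 0.3, 0.4, 0.6, 1.0, 1.2, 1.3, 1.5, 1.8]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Scale the coefficient up and watch the log-likelihood approach 0 from below.
coefs = [1.0, 10.0, 100.0, 1000.0]
lls = [log_likelihood(w, 0.8, widths, labels) for w in coefs]
for w, ll in zip(coefs, lls):
    print(f"w = {w:7.1f}  log-likelihood = {ll:.3e}")
```

Each larger coefficient gives a strictly better fit, which is consistent with glm running out of iterations and reporting enormous standard errors.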

Some background and strategies are discussed here.

There is also a good thread on CrossValidated that discusses strategies.

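So why does sklearn's LogisticRegression work? Because it fits a "regularized logistic regression": by default it applies an L2 penalty (C=1.0). Regularization penalizes large parameter estimates, which keeps the coefficients finite even when the classes are perfectly separated.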
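The effect of the penalty can be sketched with a one-parameter toy problem in plain Python. This is not sklearn's solver; it only mirrors the general shape of an L2-penalized objective, 0.5*w**2 + C * (logistic loss), where C plays the role of sklearn's inverse regularization strength. The data, the fixed threshold of 0.8, and the function names are assumptions for illustration.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def penalized_coefficient(margins, C, iterations=200):
    """Minimize 0.5*w**2 + C * sum(log(1 + exp(-w*m))) over w by bisection.

    Each m is the positive distance of a correctly classified point from the
    threshold. The objective is convex, so its derivative has a single root.
    """
    def gradient(w):
        return w - C * sum(m * sigmoid(-w * m) for m in margins)

    lo, hi = 0.0, C * sum(margins)  # gradient(lo) < 0 < gradient(hi)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if gradient(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical, perfectly separated petal widths with a threshold at 0.8.
widths = [0.1, 0.2, 0.3, 0.4, 0.6, 1.0, 1.2, 1.3, 1.5, 1.8]
margins = [abs(x - 0.8) for x in widths]

# Weaker penalties (larger C) give larger, but still finite, coefficients.
strengths = [0.01, 1.0, 100.0, 10000.0]
estimates = [penalized_coefficient(margins, C) for C in strengths]
for C, w in zip(strengths, estimates):
    print(f"C = {C:8.2f}  w = {w:.4f}")
```

The coefficient stays finite for every C but keeps growing as the penalty is weakened, so a penalized fit and an unpenalized glm fit are not expected to agree on separated data.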

In the example below, I use Firth's bias-reduced logistic regression, from the logistf package, to produce a model that converges.

library(logistf)

# Load the two-class data into irisdf, the name the models below expect
irisdf <- read.table("path_to _iris.txt", sep="\t", header=TRUE)
irisdf$Species <- as.factor(irisdf$Species)
sapply(irisdf, class)

model1 <- glm(Species ~ ., data = irisdf, family = binomial)
# Does not converge, throws warnings.

model2 <- logistf(Species ~ ., data = irisdf, family = binomial)
# Does converge.

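Original answer: Judging by the std. errors and z values in the R output, I think your model is badly specified. A z value close to 0 essentially tells you that there is no correlation between the model and the dependent variable, so this is a nonsensical model.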

My first thought is that you need to convert the Species field to a categorical variable. In your example it is of type int. Try using as.factor:

How to convert integer into categorical data in R?

【Comments】:

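I tried converting the dependent variable to a factor, but the results are the same.
Strange. Which Iris dataset are you using? I can take a look today.
By the way, I should have caught this without even seeing the data. The hint is in the warning message: "fitted probabilities numerically 0 or 1 occurred".
Another clue is that the Petal.Width coefficient estimate is large (anything >10 in a logistic regression suggests that perfect separation may be occurring...)
The two methods produce different results because they maximize different formulations of a penalized likelihood function. If the term "penalized likelihood" means nothing to you, you may want to read up on shrinkage methods for regression. Section 6.2 of ISLR gives a good introduction to Ridge and Lasso regression. They are not presented there for logistic regression, but you should get the idea.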
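The separation mentioned in these comments can also be checked directly before fitting anything. The sketch below, in plain Python, flags every column whose values for the two classes do not overlap. The four class-0 rows are the ones shown in the question's head() output; the class-1 rows and the helper name is_separated are assumptions added for illustration.

```python
def is_separated(values_a, values_b):
    """Return True if the two groups of values do not overlap at all."""
    return max(values_a) < min(values_b) or max(values_b) < min(values_a)

columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]

# Class-0 rows as printed in the question.
class0 = [
    (5.1, 3.5, 1.4, 0.2),
    (4.9, 3.0, 1.4, 0.2),
    (4.7, 3.2, 1.3, 0.2),
    (4.6, 3.1, 1.5, 0.2),
]
# Class-1 rows: illustrative values, not taken from the post.
class1 = [
    (7.0, 3.2, 4.7, 1.4),
    (6.4, 3.2, 4.5, 1.5),
    (6.9, 3.1, 4.9, 1.5),
    (5.5, 2.3, 4.0, 1.3),
]

# Compare the two classes column by column.
separated = {
    name: is_separated([row[i] for row in class0], [row[i] for row in class1])
    for i, name in enumerate(columns)
}
for name, flag in separated.items():
    print(f"{name:13s} perfectly separated: {flag}")
```

A column flagged here is one where an unpenalized logistic regression has no finite maximum likelihood estimate. Note that this check only catches separation along a single variable, not along a combination of variables.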