Multivariate Linear Regression using sklearn not matching the Normal equation of Cost Function

【Question title】: Multivariate Linear Regression using sklearn not matching the Normal equation of Cost Function
【Posted】: 2018-03-23 17:49:17
【Question】:

I have to fit my data to a multivariate linear model, but sklearn.linear_model gives a different answer from the one predicted by the normal equation. Here is the code for both:

   import numpy as np
   from sklearn import linear_model

   x = np.arange(12).reshape(3, 4)
   y = np.arange(3, 6).reshape(3, 1)
   x = np.insert(x, 0, 1, axis=1)  # prepend a column of ones for the intercept

   def normal(X, y):
       # normal equation; pinv is used in case X.T @ X is not invertible
       return np.dot(np.dot(np.linalg.pinv(np.dot(X.T, X)), X.T), y)

   normal(x, y)
   # [[ 0.4375 ]
   #  [-0.59375]
   #  [-0.15625]
   #  [ 0.28125]
   #  [ 0.71875]]

   reg = linear_model.LinearRegression()
   reg.fit(x, y)
   reg.coef_
   # [[ 0.    ,  0.0625,  0.0625,  0.0625,  0.0625]]

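Is my code correct?

【Comments】:

I don't think the normal function is correct. np.linalg.pinv returns the pseudo-inverse of its input, which can be computed as np.linalg.inv(X.T.dot(X)).dot(X.T). So you are doing some combination of an inverse and a pseudo-inverse. normal should be return np.linalg.pinv(X).dot(y).
It is there to cover the case of a non-invertible matrix. It does not affect the answer in any way.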
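
The two forms discussed in these comments can be compared numerically. The following is a minimal sketch and not part of the original thread: the names `theta_normal` and `theta_pinv` and the explicit `RCOND` cutoff are assumptions added for illustration (the cutoff only makes the rank decision on this rank-deficient matrix explicit).

```python
import numpy as np

# Same data as in the question, with a leading column of ones.
X = np.insert(np.arange(12).reshape(3, 4), 0, 1, axis=1).astype(float)
y = np.arange(3, 6).reshape(3, 1).astype(float)

RCOND = 1e-10  # assumed cutoff below which singular values count as zero

# The asker's form: pinv(X^T X) X^T y
theta_normal = np.linalg.pinv(X.T @ X, rcond=RCOND) @ X.T @ y

# The commenter's form: pinv(X) y
theta_pinv = np.linalg.pinv(X, rcond=RCOND) @ y

print(np.allclose(theta_normal, theta_pinv))
print(theta_pinv.ravel())
```

On this data the two forms should agree, which is consistent with the reply that wrapping `X.T @ X` in `pinv` does not change the answer.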

Tags: python machine-learning scikit-learn linear-regression

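【Solution 1】:

What is happening is that you are including the intercept term in your data matrix. By default, scikit-learn's LinearRegression class finds the intercept term automatically, so you do not need to insert the column of ones into the matrix: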

import numpy as np
from sklearn import linear_model

x = np.arange(12).reshape(3, 4)
y = np.arange(3, 6).reshape(3, 1)
reg = linear_model.LinearRegression()  # fit_intercept=True by default
reg.fit(x, y)

This gives us the coefficients and the intercept term:

In [32]: reg.coef_
Out[32]: array([[ 0.0625,  0.0625,  0.0625,  0.0625]])

In [33]: reg.intercept_
Out[33]: array([ 2.625])

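We can verify that the output is correct by taking the dot product of each row of the matrix with the coefficients and then adding the intercept term: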

In [34]: x.dot(reg.coef_.T) + reg.intercept_
Out[34]:
array([[ 3.],
       [ 4.],
       [ 5.]])
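
The same numbers can be reproduced by hand, which may help show what finding the intercept automatically amounts to. This is a sketch under the assumption that the fit is equivalent to centering `x` and `y`, solving the minimum-norm least-squares problem on the centered data, and recovering the intercept from the column means. The variable names and the `rcond` value are illustrative, and this is not a copy of scikit-learn's implementation.

```python
import numpy as np

x = np.arange(12).reshape(3, 4).astype(float)
y = np.arange(3, 6).reshape(3, 1).astype(float)

# Column means of the features and of the target.
x_mean = x.mean(axis=0)
y_mean = y.mean(axis=0)

# Minimum-norm least-squares solution on the centered data (assumed cutoff).
coef = np.linalg.pinv(x - x_mean, rcond=1e-10) @ (y - y_mean)

# The intercept makes the fitted plane pass through the point of means.
intercept = y_mean - x_mean @ coef

print(coef.ravel(), intercept)
print(x @ coef + intercept)
```

Under that assumption, `coef` and `intercept` should line up with `reg.coef_` and `reg.intercept_` shown above.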

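Now, if you specifically want to match what the normal equation gives you, that is fine: you can insert the column of ones. However, you then need to disable intercept fitting, because you have manually inserted a feature that does that job for you.

Thus: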

x=np.arange(12).reshape(3,4)
y=np.arange(3,6).reshape(3,1)
x=np.insert(x,0,1,axis=1)
reg = linear_model.LinearRegression(fit_intercept=False)
reg.fit(x,y)

With this change, we now get the following coefficients:

In [37]: reg.coef_
Out[37]: array([[ 0.4375 , -0.59375, -0.15625,  0.28125,  0.71875]])

This matches the output of the normal equation.
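
One way to see why the two fits return different numbers and yet are both valid: with 3 samples and 5 columns the system is underdetermined, so many parameter vectors reproduce `y` exactly. The sketch below is an illustration added here, not part of the original answer. The names `theta_intercept_fit` and `theta_min_norm` are hypothetical, and their values are copied from the outputs shown above.

```python
import numpy as np

# Data matrix with the leading column of ones, as in the question.
X = np.insert(np.arange(12).reshape(3, 4), 0, 1, axis=1).astype(float)
y = np.arange(3, 6).reshape(3, 1).astype(float)

# Default fit: intercept 2.625 followed by the four coefficients 0.0625.
theta_intercept_fit = np.array([[2.625], [0.0625], [0.0625], [0.0625], [0.0625]])

# fit_intercept=False fit on the augmented matrix (also the normal-equation result).
theta_min_norm = np.array([[0.4375], [-0.59375], [-0.15625], [0.28125], [0.71875]])

# Both parameter vectors reproduce y exactly.
print(np.allclose(X @ theta_intercept_fit, y))
print(np.allclose(X @ theta_min_norm, y))

# Their Euclidean norms differ.
print(np.linalg.norm(theta_intercept_fit), np.linalg.norm(theta_min_norm))
```

Both vectors fit the data exactly, and the second has the smaller norm, as expected of a pseudo-inverse solution over all five columns. A plausible reading is that the default fit treats the intercept separately instead of including it in that minimization, which is why the numbers differ.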
