【Question title】: Multiple linear regression in pandas statsmodels: ValueError
【Posted】: 2015-03-22 08:37:29
【Question】:

Data: https://courses.edx.org/c4x/MITx/15.071x_2/asset/NBA_train.csv

I know how to fit this data to a multiple linear regression model using statsmodels.formula.api:

import pandas as pd
NBA = pd.read_csv("NBA_train.csv")
import statsmodels.formula.api as smf
model = smf.ols(formula="W ~ PTS + oppPTS", data=NBA).fit()
model.summary()

However, I find this R-like formula notation awkward and would like to use the usual pandas syntax:

import pandas as pd
NBA = pd.read_csv("NBA_train.csv")    
import statsmodels.api as sm
X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()

With the second approach I get the following error:

ValueError: shapes (835,2) and (835,2) not aligned: 2 (dim 1) != 835 (dim 0)

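Why does this happen, and how can I fix it?

【Comments】:

  • The R syntax is y = x1 + x2. How is that awkward? This notation is fairly common in math.
  • Maybe "awkward" is not the right word, but I ran into problems with unusual column names (e.g. "C-11").
  • Those are not valid variable names, so that is probably your problem.
  • @rawr How do you fit the log of a column? (In R: log(y) ~ x1 + x2)

Tags: python pandas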


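【Solution 1】:

In sm.OLS(y, X), y is the dependent variable and X holds the independent variables.

In the formula W ~ PTS + oppPTS, W is the dependent variable, and PTS and oppPTS are the independent variables.

Therefore, use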

y = NBA['W']
X = NBA[['PTS', 'oppPTS']]

instead of

X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
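
To make the role of each argument concrete, here is a minimal sketch of the shapes involved. It uses synthetic stand-ins for the W, PTS and oppPTS columns (the numbers are made up, not the NBA data) and plain NumPy least squares rather than statsmodels, so it only illustrates the layout: a 1-D y of length n and a 2-D X with one column per regressor plus the constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 835  # same number of rows as in the error message

# Synthetic stand-ins for the PTS, oppPTS and W columns (assumed values, not the real data).
pts = rng.normal(8000, 500, size=n)
opp_pts = rng.normal(8000, 500, size=n)
w = 41 + 0.03 * (pts - opp_pts) + rng.normal(0, 3, size=n)

# Correct roles: y is 1-D with shape (n,), X is 2-D with shape (n, k).
y = w
X = np.column_stack([np.ones(n), pts, opp_pts])  # constant + two regressors
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(y.shape, X.shape, coef.shape)

# Swapped roles, as in the question: two "dependent" columns,
# and a single regressor plus the constant.
y_bad = np.column_stack([pts, opp_pts])
X_bad = np.column_stack([np.ones(n), w])
print(y_bad.shape, X_bad.shape)
```

With the roles swapped, both arrays come out as (835, 2), which is consistent with the shapes reported in the ValueError.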

The complete corrected script:

import pandas as pd
import statsmodels.api as sm

NBA = pd.read_csv("NBA_train.csv")    
y = NBA['W']
X = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()

This yields

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.942
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                     6799.
Date:                Sat, 21 Mar 2015   Prob (F-statistic):               0.00
Time:                        14:58:05   Log-Likelihood:                -2118.0
No. Observations:                 835   AIC:                             4242.
Df Residuals:                     832   BIC:                             4256.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         41.3048      1.610     25.652      0.000        38.144    44.465
PTS            0.0326      0.000    109.600      0.000         0.032     0.033
oppPTS        -0.0326      0.000   -110.951      0.000        -0.033    -0.032
==============================================================================
Omnibus:                        1.026   Durbin-Watson:                   2.238
Prob(Omnibus):                  0.599   Jarque-Bera (JB):                0.984
Skew:                           0.084   Prob(JB):                        0.612
Kurtosis:                       3.009   Cond. No.                     1.80e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
