【Question title】: R's relevel() and factor variables in linear regression in pandas
【Posted】: 2015-05-25 09:57:00
【Question】:

Data:

a,b,c,d
1,5,9,red
2,6,10,blue
3,7,11,green
4,8,12,red
3,4,3,orange
3,4,3,blue
3,4,3,red

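In R, if I want to build a linear regression model that takes categorical data into account (I believe these are called factor variables in R), I can simply do this: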

df$d = relevel(df$d, 'green')

After that, when building the model, R adds an indicator column for each color, for example:

dblue
0
1
0
0
0
1
0

There is no green column: when all the other color columns are 0, the row is green (that is our reference level). Now, fit a regression model:

mod = lm(a ~ b + c + d, data=df)
summary(mod)

Call:
lm(formula = a ~ b + c + d, data = rel)

Residuals:
         1          2          3          4          5          6          7 
 4.708e-16 -7.061e-16  2.219e-31  2.354e-16 -1.233e-31  7.061e-16 -7.061e-16 

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -1.600e+00  3.622e-15 -4.418e+14 1.44e-15 ***
b            1.600e+00  9.403e-16  1.702e+15 3.74e-16 ***
c           -6.000e-01  3.766e-16 -1.593e+15 4.00e-16 ***
dblue        8.829e-16  1.823e-15  4.840e-01    0.713    
dorange      1.589e-15  2.294e-15  6.930e-01    0.614    
dred         2.295e-15  1.631e-15  1.407e+00    0.393    

I am trying to achieve the same thing in Python with pandas. So far I have only come up with this:

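import pandas as pd

d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3], 'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'], dtype='category')}
df = pd.DataFrame(d)
df['d'] = pd.Categorical(df['d'], ordered=False)
# Add one indicator column per color, skipping the reference level 'green'
for r in df['d'].cat.categories:
    if r != 'green':
        df['d%s' % r] = df['d'] == r
# Drop the original categorical column (axis must be passed by keyword in recent pandas)
df = df.drop('d', axis=1)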

This works and produces the same result, but I would like to know whether pandas has a built-in way to do this.

【Comments】:

    Tags: python r pandas statsmodels


    【Solution 1】:

    You can use pd.get_dummies:

    import pandas as pd

    d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3], 
         'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'], 
                        dtype='category')}
    df = pd.DataFrame(d)
    # One 0/1 indicator column per color (dtype=int avoids boolean columns in newer pandas)
    dummies = pd.get_dummies(df['d'], dtype=int)
    df = pd.concat([df, dummies], axis=1)
    # Drop the original column and the reference level 'green'
    df = df.drop(['d', 'green'], axis=1)
    print(df)
    

    yields

       a  b   c  blue  orange  red
    0  1  5   9     0       0    1
    1  2  6  10     1       0    0
    2  3  7  11     0       0    0
    3  4  8  12     0       0    1
    4  3  4   3     0       1    0
    5  3  4   3     1       0    0
    6  3  4   3     0       0    1
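If the color names could clash with existing column names, a prefix keeps the indicator columns distinct. The following is a minimal sketch rather than part of the original answer: the prefix `d` and the variable name `df_prefixed` are illustrative choices, and `dtype=int` is passed so the columns hold 0/1 regardless of the pandas version's default.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 3, 3],
    'b': [5, 6, 7, 8, 4, 4, 4],
    'c': [9, 10, 11, 12, 3, 3, 3],
    'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
})

# Prefixed 0/1 indicators: d_blue, d_green, d_orange, d_red
dummies = pd.get_dummies(df['d'], prefix='d', dtype=int)

# Drop the original column and the reference level's indicator (d_green)
df_prefixed = pd.concat(
    [df.drop(columns='d'), dummies.drop(columns='d_green')],
    axis=1,
)
print(df_prefixed)
```

With prefixed names, the regression formula would refer to `d_blue`, `d_orange` and `d_red` instead of the bare color names.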
    

    Using statsmodels:

    import statsmodels.formula.api as smf
    model = smf.ols('a ~ b + c + blue + orange + red', df).fit()
    print(model.summary())
    

    yields

                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                      a   R-squared:                       1.000
    Model:                            OLS   Adj. R-squared:                  1.000
    Method:                 Least Squares   F-statistic:                 2.149e+25
    Date:                Sun, 22 Mar 2015   Prob (F-statistic):           1.64e-13
    Time:                        05:57:33   Log-Likelihood:                 200.74
    No. Observations:                   7   AIC:                            -389.5
    Df Residuals:                       1   BIC:                            -389.8
    Df Model:                           5                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    Intercept     -1.6000   6.11e-13  -2.62e+12      0.000        -1.600    -1.600
    b              1.6000   1.59e-13   1.01e+13      0.000         1.600     1.600
    c             -0.6000   6.36e-14  -9.44e+12      0.000        -0.600    -0.600
    blue         1.11e-16   3.08e-13      0.000      1.000     -3.91e-12  3.91e-12
    orange      7.994e-15   3.87e-13      0.021      0.987     -4.91e-12  4.93e-12
    red         4.829e-15   2.75e-13      0.018      0.989     -3.49e-12   3.5e-12
    ==============================================================================
    Omnibus:                          nan   Durbin-Watson:                   0.203
    Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.752
    Skew:                           0.200   Prob(JB):                        0.687
    Kurtosis:                       1.445   Cond. No.                         85.2
    ==============================================================================
    
    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
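As a cross-check on the coefficients in the summary above, the same least-squares problem can be solved directly with NumPy. This is only a sketch under the assumption that the design matrix is an intercept, `b`, `c`, and the three non-reference indicators; it reports coefficients only, with none of the statistics statsmodels provides.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 3, 3],
    'b': [5, 6, 7, 8, 4, 4, 4],
    'c': [9, 10, 11, 12, 3, 3, 3],
    'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
})

# Indicator columns for every color except the reference level 'green'
dummies = pd.get_dummies(df['d'], dtype=float).drop(columns='green')

# Design matrix: intercept, b, c, then the blue/orange/red indicators
X = np.column_stack([np.ones(len(df)), df['b'], df['c'], dummies.to_numpy()])
y = df['a'].to_numpy(dtype=float)

# Ordinary least squares via NumPy
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

names = ['Intercept', 'b', 'c'] + list(dummies.columns)
for name, value in zip(names, coef):
    print(f'{name:>9}: {value:.4f}')
```

The data happen to satisfy a = -1.6 + 1.6*b - 0.6*c exactly, which is why the color coefficients come out as numerical noise around zero in both the R and statsmodels output.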
    

    Alternatively, you can use a patsy formula to specify the dummy contrast:

    import pandas as pd
    import statsmodels.formula.api as smf
    
    d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3], 
         'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red']}
    df = pd.DataFrame(d)
    
    model = smf.ols('a ~ b + c + C(d, Treatment(reference="green"))', df).fit()
    print(model.summary())
    

    【Comments】:

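    • Not as simple as in R, but much better than my workaround, thanks! I see that get_dummies takes a prefix argument; I will use it to avoid column name collisions.
    【Solution 2】:

    It can also be simplified like this: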

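    import pandas as pd
    d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3], 
         'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'], 
                        dtype='category')}
    df = pd.DataFrame(d)
    # Note: drop_first=True drops the first category, which is 'blue' here
    # (categories are sorted alphabetically), so 'blue' becomes the reference, not 'green'
    df = pd.get_dummies(df, prefix='color', drop_first=True)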
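To make `drop_first=True` drop `green` as R's `relevel()` would, the category order can be changed so that `green` comes first. The sketch below assumes this is the desired behavior; `relevel` is a hypothetical helper written for illustration, not a pandas function, and `encoded` is an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 3, 3],
    'b': [5, 6, 7, 8, 4, 4, 4],
    'c': [9, 10, 11, 12, 3, 3, 3],
    'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
                   dtype='category'),
})


def relevel(series, ref):
    """Hypothetical helper: return `series` with `ref` as its first category."""
    others = [c for c in series.cat.categories if c != ref]
    return series.cat.reorder_categories([ref] + others)


# Put 'green' first so that it is the level drop_first removes
df['d'] = relevel(df['d'], 'green')
encoded = pd.get_dummies(df, prefix='color', drop_first=True, dtype=int)
print(encoded)
```

Only the category order changes; the values in `d` stay the same, so rows that are green end up with 0 in all three remaining indicator columns.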
    

    【Comments】:
