Finding all the variables that give the highest Adjusted R squared value (answers)

【Question title】: Finding all the variables that give the highest Adjusted R squared value
【Posted】: 2021-08-19 02:51:20
【Question】:

I have a data frame that stores different variables. I am using OLS linear regression with all of the variables to predict the 'price' column.

import pandas as pd
import statsmodels.api as sm

data = {'accommodates':[2, 2, 3, 2, 2, 6, 8, 4, 3, 2],
        'bedrooms':[1, 2, 1, 1, 3, 4, 2, 2, 2, 3],
        'instant_bookable':[1, 0, 1, 1, 1, 1, 0, 0, 0, 1],
        'availability_365':[123, 3, 33, 14, 15, 16, 3, 41, 61, 74],
        'minimum_nights':[3, 12, 1, 4, 6, 7, 2, 3, 6, 10],
        'beds':[2, 2, 3, 4, 1, 5, 6, 2, 3, 2],
        'price':[59, 234, 15, 162, 56, 42, 28, 52, 22, 31]}

df = pd.DataFrame(data, columns = ['accommodates', 'bedrooms', 'instant_bookable', 'availability_365',
                                   'minimum_nights', 'beds', 'price'])

I have a for loop that computes the adjusted R squared value for each variable:

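fit_d = {}

Y = df['price']

for column in [x for x in df.columns if x != 'price']:
    # regress price on a single predictor plus an intercept
    X = sm.add_constant(df[column])
    model = sm.OLS(Y, X, missing='drop').fit()
    # rsquared_adj is the adjusted R squared (rsquared is the plain one)
    fit_d[column] = model.rsquared_adj

fit_d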

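How can I modify my code to find the combination of variables that gives the largest adjusted R squared value? Ideally, the function would first find the single variable with the largest adjusted R squared value, then pair that variable with each of the remaining variables to get the 2 variables that give the highest value, then 3 variables, and so on, until the value cannot be increased any further. I would like the output to look something like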

Best variables: {'accommodates', 'availability', 'bedrooms'}

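【Comments】:

Are you sure this is a good approach: pick the best first variable and then only look for combinations that include it? The problem I see is this: suppose the best R is for accommodates, and adding any other column does not increase the score. The combination of 'availability' and 'bedrooms' might still give a higher R, even though each of them alone has a lower R than accommodates. In that case your final solution is not optimal. Does that make sense?
Yes, that makes complete sense, and it is not ideal. For this case I would still like to do it this way, even if it is not optimal. That said, since you brought it up, I would be curious about a more optimal approach!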

Tags: python pandas linear-regression statsmodels

【Solution 1】:

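Here is a "brute force" way that tries all combinations (itertools.combinations) of different lengths to find the variables with the highest R value. The idea is to use 2 loops: one over the number of variables to try, and one over all combinations with that number of variables.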
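As a small illustration of what those two loops enumerate, here is a sketch that uses only the standard library. The column names are copied from the question's data frame; the variable name subsets is made up for this example. It also shows why the approach gets expensive: k candidate columns give 2**k - 1 models to fit.

```python
from itertools import combinations

# predictor columns copied from the question's data frame
cols = ['accommodates', 'bedrooms', 'instant_bookable',
        'availability_365', 'minimum_nights', 'beds']

# outer loop: subset size i; inner loop: every subset of that size
subsets = [comb for i in range(1, len(cols) + 1)
           for comb in combinations(cols, i)]

print(len(subsets))         # 63, i.e. 2**6 - 1 non-empty subsets
print(subsets[0])           # ('accommodates',)
print(len(subsets[-1]))     # 6, the last subset contains every column
```

With 6 columns that is only 63 regressions, but the count doubles with every extra column, so the brute force approach is practical only for small numbers of candidates.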

from itertools import combinations

# all possible columns for X
cols = [x for x in df.columns if x != 'price']
# define Y once; it is the same across the loops
Y = df['price']
# define result dictionary
fit_d = {}

# loop for any length of combinations
for i in range(1, len(cols)+1):
    # loop for any combinations with length i
    for comb in combinations(cols, i):
        # Define X from the combination
        X = df[list(comb)]
        X = sm.add_constant(X)
        # perform the OLS operation
        model = sm.OLS(Y,X, missing = 'drop').fit()
        # save the R squared in a dictionary
        # (plain R squared; model.rsquared_adj would give the adjusted value)
        fit_d[comb] = model.rsquared

# extract the key for the max R value
key_max = max(fit_d, key=fit_d.get)

print(f'Best variables {key_max} for a R-value of {round(fit_d[key_max], 5)}')
# Best variables ('accommodates', 'bedrooms', 'instant_bookable', 'availability_365', 'minimum_nights', 'beds') for a R-value of 0.78506
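Note that the code above stores model.rsquared, the plain R squared. In OLS with an intercept this value cannot decrease when a column is added, which is why the full set of 6 columns wins. The adjusted value penalises the number of predictors: adj R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), with n rows and p predictors. The sketch below applies that standard formula to the numbers printed above (n = 10, p = 6); the helper name adjusted_r2 is made up for illustration. In statsmodels the same quantity is available directly as model.rsquared_adj.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R squared for a model that includes an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# values taken from the printed output above: all 6 columns, 10 rows
print(round(adjusted_r2(0.78506, 10, 6), 5))  # 0.35518
```

So the 6-variable model that maximises plain R squared has an adjusted R squared of only about 0.355. Replacing model.rsquared with model.rsquared_adj in the loop above should let the same brute force search rank subsets by the adjusted value instead.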

【Comments】:
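For the greedy search the question asks for (start with the single best variable, keep adding whichever remaining variable raises adjusted R squared the most, stop when nothing helps), a minimal sketch might look like the following. The names forward_select and make_adj_r2_scorer are invented for illustration. The scorer assumes statsmodels is installed and that it is given a data frame like the one in the question. As pointed out in the question's comments, a greedy search is not guaranteed to find the best subset overall.

```python
def forward_select(candidates, score):
    """Greedy forward selection.

    candidates: iterable of column names
    score: callable taking a tuple of names, returning a number (higher is better)
    Returns (selected_names, best_score).
    """
    remaining = list(candidates)
    selected = []
    best = float("-inf")
    while remaining:
        # score every one-variable extension of the current selection
        trials = [(score(tuple(selected + [c])), c) for c in remaining]
        trial_score, trial_col = max(trials, key=lambda t: t[0])
        if trial_score <= best:
            break  # no extension improves the score, so stop
        selected.append(trial_col)
        remaining.remove(trial_col)
        best = trial_score
    return selected, best


def make_adj_r2_scorer(df, target):
    """Build a scorer that returns the adjusted R squared of an OLS fit.

    Assumption: df is a pandas DataFrame and statsmodels is available.
    The import is done here so forward_select itself has no dependencies.
    """
    import statsmodels.api as sm

    y = df[target]

    def score(cols):
        X = sm.add_constant(df[list(cols)])
        return sm.OLS(y, X, missing="drop").fit().rsquared_adj

    return score
```

With the question's data this would be called as selected, best = forward_select(cols, make_adj_r2_scorer(df, 'price')), where cols is the list of predictor columns. Passing a scorer built on the brute force dictionary instead would let the two approaches be compared on the same numbers.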