[Question title]: Understanding warm_start in sklearn MLP
[Posted]: 2019-11-21 15:55:31
[Question description]:

A few notes:

Python version: Python 3.5.0

Sklearn version: 0.20.3

I am using an MLPRegressor from the sklearn package, and it is getting remarkably good results.

The code I'm running is as follows:

from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn import preprocessing
import pandas as pd, numpy as np
import sklearn

def compare_values(arr1, arr2):
    thediff = 0
    thediffs = []
    for thing1, thing2 in zip(arr1, arr2):
        thediff = abs(thing1 - thing2)
        thediffs.append(thediff)

    return thediffs

def robustscale(data):
    scaler = RobustScaler()
    df_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
    return df_scaled  

total_avgs = []

def driver(data, labels, model, scaling):
    best_model = None
    best = 1000000
    avgs = []

    for x in range(5):
        X_train, X_test, y_train, y_test = train_test_split(data, labels, shuffle=True, test_size=0.2)
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        differences = np.average(compare_values(y_test, preds))
        print("CURRENT MODEL Average: {}".format(differences))
        if differences < best:  # lower average error is better
            best = differences
            best_model = model
        avgs.append(differences)
        total_avgs.append(differences)

    print("Average Performance Overall: {}".format(np.average(avgs)))
    print("Best Performance Overall: {}".format(np.min(avgs)))

data = pd.read_csv('new.csv')

# handle some data manipulation. Dropping columns and such. Nothing important
# (the target vector `label` used below is assumed to be extracted here)

data = data
rb_data = robustscale(data)

mlp = MLPRegressor(
    activation = 'tanh',
    hidden_layer_sizes = (1000, 1000, 1000),
    alpha = 0.009,
    learning_rate = 'invscaling',
    learning_rate_init = 0.01,
    max_iter = 200,
    momentum = 0.9,
    solver = 'lbfgs',
    warm_start = False
)

print("############################################")
print("NOW TESTING ROBUST SCALE DATA: ")
driver(rb_data, label, mlp, "rb")
print("############################################")

print("\n")

print("BEST MODEL PERFORMANCE: {}".format(np.min(total_avgs)))

I am trying to understand why I am getting such good results on a regression problem.

My MLP is configured like this (parameters chosen after running GridSearchCV):

mlp = MLPRegressor(
    activation = 'tanh',
    hidden_layer_sizes = (1000, 1000, 1000),
    alpha = 0.009,
    learning_rate = 'invscaling',
    learning_rate_init = 0.01,
    max_iter = 200,
    momentum = 0.9,
    solver = 'lbfgs',
    warm_start = True
)

(Yes, I also find it strange that relu was never selected. But it never was.)

When I set warm_start = True, I get output like this:

############################################
NOW TESTING ROBUST SCALE DATA:
CURRENT MODEL Average: 21.163831505120193
CURRENT MODEL Average: 12.44361687293673
CURRENT MODEL Average: 5.687720697116947
CURRENT MODEL Average: 4.225979713815092
CURRENT MODEL Average: 5.235999000929669
Average Performance Overall: 9.751429557983725
Best Performance Overall: 4.225979713815092
############################################

Clearly, the performance improves with each run.

However, when I set warm_start = False, I get:

############################################
NOW TESTING ROBUST SCALE DATA: 
CURRENT MODEL Average: 25.221720858740714
CURRENT MODEL Average: 20.3609370299473
CURRENT MODEL Average: 23.385534335200845
CURRENT MODEL Average: 21.89668702232435
CURRENT MODEL Average: 15.38606220618026
Average Performance Overall: 21.250188290478693
Best Performance Overall: 15.38606220618026
############################################

Clearly, warm_start = True is affecting performance in a positive way. But how? On every pass through the loop I randomly re-split my data, create a brand-new model, and run the test. How can the new model learn anything from the old one?

[Question discussion]:

  • From the docs, you are creating a new model but telling the regressor to "reuse the solution of the previous call to fit as initialization". Sorry, but it's not entirely clear what you are asking to have clarified.
  • I have read the docs too. What I don't understand is exactly that quoted line: how does the new model know anything about a previous instantiation? Hypothetically, if I ran this 1000 times, wouldn't my data just end up overfit rather than well tuned? I guess I don't understand how the new model learns from the previous one, and where the overfitting comes from.
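The mechanism the comments are asking about can be seen in isolation: with warm_start=True, a second call to fit starts from the coefficients left behind by the first call instead of fresh random weights, because the coefficients live on the estimator object itself. A minimal sketch on synthetic data (the shapes and hyperparameters here are illustrative, not the question's):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X.sum(axis=1)

model = MLPRegressor(hidden_layer_sizes=(10,), max_iter=50,
                     warm_start=True, random_state=0)
model.fit(X, y)
first_weights = model.coefs_[0].copy()  # input-to-hidden weights after fit #1

model.fit(X, y)  # continues training from first_weights, not from scratch
# with warm_start=False (and the same random_state) this second call
# would re-initialize the weights and repeat fit #1 exactly
```

So "the new model" in the question's loop is not new at all: it is the same Python object, carrying its trained weights into every subsequent fit call.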

Tags: python machine-learning scikit-learn


[Solution 1]:

The simple explanation is that your model has already "seen" the data you test on in each pass of the loop and has a "memory" of it. In other words, when you use a warm start your test data is no longer independent of your training data, and that is why you get unrealistically good results. You should not use a warm start if you are trying to do a cross-validation-style evaluation. The test data must also be kept out of the scaling step: scaling the whole dataset before splitting and training has a similar effect of "leaking" information between the training and test portions. See also:

https://scikit-learn.org/stable/modules/compose.html#pipeline
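Following that link, a leak-free version of the setup would put the scaler inside a Pipeline, so each cross-validation fold fits the scaler on its own training portion only, and keep warm_start=False so every fold trains a fresh model. A sketch on synthetic data (the data and hyperparameters are illustrative placeholders, not the question's):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X.sum(axis=1)

pipe = make_pipeline(
    RobustScaler(),  # fit on the training split of each fold only
    MLPRegressor(hidden_layer_sizes=(10,), max_iter=500,
                 warm_start=False, random_state=0),
)
scores = cross_val_score(pipe, X, y, cv=5,
                         scoring='neg_mean_absolute_error')
print("MAE per fold:", -scores)
```

Because the scaler is refit inside every fold, no statistics from the held-out fold ever influence the training, which is exactly what the question's scale-then-split code violates.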

[Discussion]:
