How to correctly plot loss curves for training and validation sets? (Answers)

【Question title】: How to correctly plot loss curves for training and validation sets?
【Posted】: 2021-02-07 12:06:23
【Question description】:

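I want to plot loss curves for my training and validation sets the same way Keras does, but using scikit-learn. I picked a specific dataset for a regression problem, which is available at: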

http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/

So I converted the data to CSV, and the first version of my program is as follows:

Model 1

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor

df=pd.read_csv("Concrete_Data.csv")
train,validate,test=np.split(df.sample(frac=1),[int(.8*len(df)),int(.90*len(df))])
Xtrain=train.drop(["ConcreteCompStrength"],axis="columns")
ytrain=train["ConcreteCompStrength"]
Xval=validate.drop(["ConcreteCompStrength"],axis="columns")
yval=validate["ConcreteCompStrength"]
mlp=MLPRegressor(activation="relu",max_iter=5000,solver="adam",random_state=2)
mlp.fit(Xtrain,ytrain)

plt.plot(mlp.loss_curve_,label="train")
mlp.fit(Xval,yval)                           #doubt
plt.plot(mlp.loss_curve_,label="validation") #doubt
plt.legend()

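The resulting plot is as follows:

In this model, I doubt whether the lines marked "doubt" are correct, because as far as I know the validation and test sets should be kept separate from training, so calling fit there is probably wrong. The score I got is 0.95.

Model 2

In this model, I tried to use the validation scores instead: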

df=pd.read_csv("Concrete_Data.csv")
train,validate,test=np.split(df.sample(frac=1),[int(.8*len(df)),int(.90*len(df))])
Xtrain=train.drop(["ConcreteCompStrength"],axis="columns")
ytrain=train["ConcreteCompStrength"]
Xval=validate.drop(["ConcreteCompStrength"],axis="columns")
yval=validate["ConcreteCompStrength"]
mlp=MLPRegressor(activation="relu",max_iter=5000,solver="adam",random_state=2,early_stopping=True)
mlp.fit(Xtrain,ytrain)

plt.plot(mlp.loss_curve_,label="train")
plt.plot(mlp.validation_scores_,label="validation")   #line changed
plt.legend()

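For this model, I had to set early_stopping to True and plot validation_scores_, but the resulting plot looks a bit strange:

The score I got is 0.82, but I have read that this kind of plot appears when the model finds the validation data easier to predict than the training data. I believe it is caused by my use of validation_scores_, but I could not find any online reference for this particular attribute.

What is the correct way to plot these loss curves in scikit-learn so that I can tune my hyperparameters?

Update: I have programmed the module following the suggestion, like this: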

mlp=MLPRegressor(activation="relu",max_iter=1,solver="adam",random_state=2,early_stopping=True)
training_mse = []
validation_mse = []
epochs = 5000
for epoch in range(1,epochs):
    mlp.fit(X_train, Y_train) 
    Y_pred = mlp.predict(X_train)
    curr_train_score = mean_squared_error(Y_train, Y_pred) # training performances
    Y_pred = mlp.predict(X_valid) 
    curr_valid_score = mean_squared_error(Y_valid, Y_pred) # validation performances
    training_mse.append(curr_train_score) # list of training perf to plot
    validation_mse.append(curr_valid_score) # list of valid perf to plot
plt.plot(training_mse,label="train")
plt.plot(validation_mse,label="validation")
plt.legend()

But the resulting plot is two flat lines:

I seem to be missing something.

【Question discussion】:

Tags: python machine-learning scikit-learn

【Solution 1】:

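You should not fit your model on the validation set. The validation set is normally used to decide which hyperparameters to use, not to set the parameter values.

The standard way to train is to split the dataset into three parts:

Training
Validation
Test

for example, split 80 / 10 / 10 %.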
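As a minimal sketch of such an 80 / 10 / 10 split, one option is to call scikit-learn's train_test_split twice. The synthetic arrays and the variable names below are assumptions made for illustration; they are not the concrete dataset from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 samples, 8 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = rng.normal(size=1000)

# First hold out 20% of the data, then cut that part in half
# to obtain 10% for validation and 10% for test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

print(len(X_train), len(X_valid), len(X_test))  # 800 100 100
```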

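Typically you choose a neural network (how many layers, how many nodes, which activation function), train it on the training set only, check the results on the validation set, and finally evaluate on the test set.

I will show a pseudo-algorithm to make this clear: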

for model in my_networks: #hyperparameters selection
    model.fit(X_train, Y_train) # parameters fitting
    model.predict(X_valid) # no train, only check on performances
    save model performances on validation

pick the best model (the one with best scores on the validation set)
then check results on the test set
model.predict(X_test) # this will be the estimated performance of your model
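The pseudo-algorithm above could be turned into runnable code roughly as follows. This is only a sketch under stated assumptions: the data is synthetic, the two candidate architectures are arbitrary, and MSE is picked as the validation metric for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic regression data (assumption, not the concrete dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=600)

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Hypothetical candidate architectures (the hyperparameters being compared).
candidates = {"one hidden layer": (16,), "two hidden layers": (32, 16)}

models = {}
valid_mse = {}
for name, hidden in candidates.items():
    model = MLPRegressor(hidden_layer_sizes=hidden, max_iter=500, random_state=0)
    model.fit(X_train, y_train)                    # parameters are fitted on the training set
    valid_mse[name] = mean_squared_error(y_valid, model.predict(X_valid))  # no training here
    models[name] = model

# Pick the candidate with the lowest validation error ...
best_name = min(valid_mse, key=valid_mse.get)
# ... and only then touch the test set, once, for the final estimate.
test_mse = mean_squared_error(y_test, models[best_name].predict(X_test))
print(best_name, valid_mse[best_name], test_mse)
```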

If your dataset is large enough, you can also use something like cross-validation.
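For reference, a cross-validated estimate can be sketched with cross_val_score. The data, the network size, and the choice of 5 folds are assumptions for illustration; scikit-learn reports MSE negated under the "neg_mean_squared_error" scorer, so the sign is flipped back here.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

# Synthetic regression data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=300)

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=300, random_state=0)

# Each of the 5 folds serves once as the held-out part.
neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
mse_per_fold = -neg_mse
print(mse_per_fold.mean())
```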

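In any case, remember:

Parameters are the network weights.
You fit the parameters on the training set.
Hyperparameters are the settings that define the network architecture (layers, nodes, activation function).
You choose the best hyperparameters by checking your model's results on the validation set.
After this selection (best parameters, best hyperparameters), you measure the model's performance by evaluating it on the test set.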
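To make the distinction in the list above concrete, here is a small sketch that looks at both kinds of values on an MLPRegressor. The data and the layer sizes are assumptions chosen only to keep the example short.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Tiny synthetic dataset (assumption): 200 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X.sum(axis=1)

mlp = MLPRegressor(hidden_layer_sizes=(8, 4), activation="relu",
                   max_iter=50, random_state=0)

# Hyperparameters: chosen by you, available before any training.
hyper = mlp.get_params()
print(hyper["hidden_layer_sizes"], hyper["activation"])

# Parameters: the weights, which only exist after fitting.
# (A ConvergenceWarning is possible here because max_iter is small.)
mlp.fit(X, y)
weight_shapes = [w.shape for w in mlp.coefs_]
print(weight_shapes)  # [(3, 8), (8, 4), (4, 1)]
```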

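To get the same result as Keras, you should understand that when you call the .fit method on a model, training stops after a fixed number of epochs (200 with the default parameters), after the number of epochs you set (5000 in your case), or earlier if you enabled early_stopping and it triggers.

max_iter: int, default=200

Maximum number of iterations. The solver iterates until convergence (determined by 'tol') or this number of iterations. For stochastic solvers ('sgd', 'adam'), note that this determines the number of epochs (how many times each data point will be used), not the number of gradient steps.

Check the model definition and its parameters on the scikit-learn documentation page.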
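As a hedged illustration of these settings, the sketch below fits one model with early_stopping=True and inspects what it recorded. Per the scikit-learn documentation, loss_curve_ holds the training loss per epoch, while validation_scores_ holds the R² score on an internally held-out fraction of the training data (validation_fraction), so the two are different quantities on different scales. The data and the max_iter value are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic regression data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)

mlp = MLPRegressor(solver="adam", max_iter=100, early_stopping=True,
                   validation_fraction=0.1, random_state=0)
mlp.fit(X, y)

# One entry per epoch that was actually run, never more than max_iter.
n_epochs = mlp.n_iter_
print(n_epochs, len(mlp.loss_curve_), len(mlp.validation_scores_))

# loss_curve_ is a loss (lower is better); validation_scores_ is R²
# (higher is better, at most 1.0).
print(mlp.loss_curve_[-1], mlp.validation_scores_[-1])
```

Plotting both lists on the same axis therefore mixes a loss with a score, which may be why the second plot in the question looks unusual.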

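To get the same result as Keras, you can fix the number of training epochs per call (for example, one step per call), check the results on the validation set, and then train again until you reach the desired number of epochs.

For example, something like this (if your model uses MSE):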

from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

epochs = 5000

# early_stopping is left off: partial_fit does not support early_stopping=True
mlp = MLPRegressor(activation="relu", solver="adam", random_state=2)
training_mse = []
validation_mse = []
for epoch in range(epochs):
    # partial_fit runs one more pass over the training data and keeps the current
    # weights; calling fit() here would re-initialize the network on every iteration
    mlp.partial_fit(X_train, Y_train)
    Y_pred = mlp.predict(X_train)
    curr_train_score = mean_squared_error(Y_train, Y_pred) # training performance
    Y_pred = mlp.predict(X_valid)
    curr_valid_score = mean_squared_error(Y_valid, Y_pred) # validation performance
    training_mse.append(curr_train_score) # list of training perf to plot
    validation_mse.append(curr_valid_score) # list of valid perf to plot

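【Discussion】:

Thanks @Nikaido, I understand, but how do I plot the validation curve in this model? I don't want to use CV for this purpose.
@Little I made an update. It is not exact, and maybe there is something faster, but it should give you the idea.
Thanks @Nikaido, but I believe the mlp statement should be inside the for loop, right?
Now it works with partial fit, but the effect is more noticeable with fewer epochs, such as 500. Thanks for your patience :). One last question: would you suggest sticking with Keras in this case? Plotting this seems easier there than in scikit. Any advice?
@Little, it depends. If you need to do deep learning, Keras is the better choice. If you are working with small data, I think scikit-learn is better. Keep in mind that Keras is built for neural networks, whereas scikit-learn has no single focus and covers many different models. It depends on your goal. You're welcome :). Don't forget to accept the best answer to your question!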