Validation accuracy metrics reported by Keras model.fit log and Sklearn.metrics.confusion_matrix don't match each other

【Question title】: Validation accuracy metrics reported by Keras model.fit log and Sklearn.metrics.confusion_matrix don't match each other
【Posted】: 2020-01-05 20:52:39
【Question description】:

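The problem is that the validation accuracy reported in the Keras model.fit history is significantly higher than the validation accuracy I get from the sklearn.metrics functions.

The results I get from model.fit are summarized below: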

Last Validation Accuracy: 0.81
Best Validation Accuracy: 0.84

The results from sklearn (normalized) are quite different:

True Negatives: 0.78
True Positives: 0.77

Validation Accuracy = (TP + TN) / (TP + TN + FP + FN) = 0.775 

(see confusion matrix below for reference)

Edit: this calculation is incorrect, because one can not 
use the normalized values to calculate the accuracy, since 
it does not account for differences in the total absolute 
number of points in the dataset. Thanks to the comment by desertnaut

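Here is the plot of the validation accuracy data from the model.fit history: (image not included)
And here is the confusion matrix generated from sklearn: (image not included)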

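I think this problem is somewhat similar to the question "Sklearn metrics values are very different from Keras values", but I have already checked that both methods validate on the same pool of data, so that answer probably does not fit my case.

Also, the question "Keras binary accuracy metric gives too high accuracy" seems to address how binary cross-entropy affects multi-class problems, but it probably does not apply in my case, since this is a true binary classification problem.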
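The mechanism described in that linked question can be illustrated without Keras. The following is a minimal NumPy sketch, assuming (this is not confirmed for the model in this post) that the reported accuracy is computed element-wise over the output units, the way a binary accuracy metric with a 0.5 threshold would do it. The arrays and the two helper functions are made up for illustration only.

```python
import numpy as np

# Hypothetical one-hot labels and softmax outputs: 4 samples, 3 output units.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]], dtype=float)
y_prob = np.array([[0.60, 0.30, 0.10],   # argmax correct
                   [0.40, 0.35, 0.25],   # argmax wrong, nothing above 0.5
                   [0.20, 0.20, 0.60],   # argmax correct
                   [0.30, 0.60, 0.10]])  # argmax wrong

def elementwise_binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of individual output units that match after thresholding."""
    return float(np.mean((y_prob > threshold).astype(float) == y_true))

def categorical_accuracy(y_true, y_prob):
    """Fraction of samples whose argmax matches the true class."""
    return float(np.mean(np.argmax(y_prob, axis=1) == np.argmax(y_true, axis=1)))

binary_acc = elementwise_binary_accuracy(y_true, y_prob)   # 9 of 12 units match
categorical_acc = categorical_accuracy(y_true, y_prob)     # 2 of 4 samples match
print(binary_acc, categorical_acc)
```

In this toy case the element-wise figure is higher than the per-sample one because the many correctly predicted zeros in the one-hot vectors count as hits. Whether this is what happens in the model above depends on how the Keras version in use resolves 'acc'.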

Here are the commands used:

Model definition:

inputs = Input((Tx, ))
n_e = 30
embeddings = Embedding(n_x, n_e, input_length=Tx)(inputs)
out = Bidirectional(LSTM(32, recurrent_dropout=0.5, return_sequences=True))(embeddings)
out = Bidirectional(LSTM(16, recurrent_dropout=0.5, return_sequences=True))(out)
out = Bidirectional(LSTM(16, recurrent_dropout=0.5))(out)
out = Dense(3, activation='softmax')(out)
modelo = Model(inputs=inputs, outputs=out)
modelo.summary()

Model summary:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 30)           86610     
_________________________________________________________________
bidirectional (Bidirectional (None, 100, 64)           16128     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 32)           10368     
_________________________________________________________________
bidirectional_2 (Bidirection (None, 32)                6272      
_________________________________________________________________
dense (Dense)                (None, 3)                 99        
=================================================================
Total params: 119,477
Trainable params: 119,477
Non-trainable params: 0
_________________________________________________________________

Model compilation:

mymodel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

Model fit call:

num_epochs = 30
myhistory = mymodel.fit(X_pad, y, epochs=num_epochs, batch_size=50, validation_data=[X_val_pad, y_val_oh], shuffle=True, callbacks=callbacks_list)

Model fit log:

Train on 505 samples, validate on 127 samples

Epoch 1/30
500/505 [============================>.] - ETA: 0s - loss: 0.6135 - acc: 0.6667
[...]
Epoch 10/30
500/505 [============================>.] - ETA: 0s - loss: 0.1403 - acc: 0.9633
Epoch 00010: val_acc improved from 0.77953 to 0.79528, saving model to modelo-10-melhor-modelo.hdf5
505/505 [==============================] - 21s 41ms/sample - loss: 0.1393 - acc: 0.9637 - val_loss: 0.5203 - val_acc: 0.7953
Epoch 11/30
500/505 [============================>.] - ETA: 0s - loss: 0.0865 - acc: 0.9840
Epoch 00011: val_acc did not improve from 0.79528
505/505 [==============================] - 21s 41ms/sample - loss: 0.0860 - acc: 0.9842 - val_loss: 0.5257 - val_acc: 0.7953
Epoch 12/30
500/505 [============================>.] - ETA: 0s - loss: 0.0618 - acc: 0.9900
Epoch 00012: val_acc improved from 0.79528 to 0.81102, saving model to modelo-10-melhor-modelo.hdf5
505/505 [==============================] - 21s 42ms/sample - loss: 0.0615 - acc: 0.9901 - val_loss: 0.5472 - val_acc: 0.8110
Epoch 13/30
500/505 [============================>.] - ETA: 0s - loss: 0.0415 - acc: 0.9940
Epoch 00013: val_acc improved from 0.81102 to 0.82152, saving model to modelo-10-melhor-modelo.hdf5
505/505 [==============================] - 21s 42ms/sample - loss: 0.0413 - acc: 0.9941 - val_loss: 0.5853 - val_acc: 0.8215
Epoch 14/30
500/505 [============================>.] - ETA: 0s - loss: 0.0443 - acc: 0.9933
Epoch 00014: val_acc did not improve from 0.82152
505/505 [==============================] - 21s 42ms/sample - loss: 0.0453 - acc: 0.9921 - val_loss: 0.6043 - val_acc: 0.8136
Epoch 15/30
500/505 [============================>.] - ETA: 0s - loss: 0.0360 - acc: 0.9933
Epoch 00015: val_acc improved from 0.82152 to 0.84777, saving model to modelo-10-melhor-modelo.hdf5
505/505 [==============================] - 21s 42ms/sample - loss: 0.0359 - acc: 0.9934 - val_loss: 0.5663 - val_acc: 0.8478
[...]
Epoch 30/30
500/505 [============================>.] - ETA: 0s - loss: 0.0039 - acc: 1.0000
Epoch 00030: val_acc did not improve from 0.84777
505/505 [==============================] - 20s 41ms/sample - loss: 0.0039 - acc: 1.0000 - val_loss: 0.8340 - val_acc: 0.8110
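
A quick arithmetic check on the logged values may help here. With 127 validation samples, a per-sample accuracy has to be a multiple of 1/127. The sketch below, using only numbers copied from the log and the model summary, tests which logged val_acc values fit that pattern and which only fit multiples of 1/(127 × 3). The helper name is_fraction_of is made up for this example, and the result is a consistency check, not proof of how Keras computed the metric.

```python
# val_acc values copied from the checkpoint messages in the log (5 decimals).
logged_val_acc = [0.77953, 0.79528, 0.81102, 0.82152, 0.84777]
n_val = 127      # "validate on 127 samples"
n_outputs = 3    # units in the final Dense layer

def is_fraction_of(value, denominator, decimals=5):
    """True if value equals k/denominator for an integer k, up to print rounding."""
    k = round(value * denominator)
    return abs(k / denominator - value) < 0.5 * 10 ** -decimals

# Can each value be "correct samples / 127"?
per_sample = [is_fraction_of(v, n_val) for v in logged_val_acc]
# Can each value be "matching output units / (127 * 3)"?
per_element = [is_fraction_of(v, n_val * n_outputs) for v in logged_val_acc]

for v, s, e in zip(logged_val_acc, per_sample, per_element):
    print(f"{v:.5f}  k/127: {s}  k/381: {e}")
```

Values such as 0.82152 and 0.84777 cannot be written as k/127 but do match 313/381 and 323/381, which is consistent with the metric being averaged over 127 × 3 output elements rather than over 127 samples.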

Confusion matrix from sklearn:

from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_values, predicted_values)

The predicted and gold values are determined as follows:

preds = mymodel.predict(X_val)
preds_ints = [[el] for el in np.argmax(preds, axis=1)]
values_pred = tokenizer_y.sequences_to_texts(preds_ints)
values_gold = tokenizer_y.sequences_to_texts(y_val)

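Finally, I would like to add that I have printed out the data and all the prediction errors, and I believe the sklearn values are more reliable, since they seem to match the predictions I printed out from the saved "best" model.

On the other hand, I can't understand how the metrics can be so different. Since both are very well-known pieces of software, I conclude that I am the one making a mistake here, but I can't pin down where or how.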

【Comments】:

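What is your metric: acc in your Keras part? And how are the predicted values calculated?
Since we don't have access to the amount of data in each class, we can't really compare the Keras accuracy with the sklearn confusion matrix... I couldn't find documentation about it, but from memory, the Keras accuracy is the average of the per-batch accuracies. For example, say 1 epoch has 10 batches. On the first batch you have 80% accuracy, the model adjusts its weights, so on the second batch you have 81% accuracy, and so on... The reported accuracy will then be lower than the accuracy computed on all the data at the end of the epoch.
You don't actually show the scikit-learn accuracy (TP and TN are not accuracy); also, it is highly questionable whether you can really tell apart accuracies of 0.84 and 0.78 with such a "manual" experiment. Please show the actual (i.e. non-normalized) output of your confusion_matrix command; also, use the scikit-learn accuracy_score method, and update your post with the results.
@PV8 metric: as shown in the compile line, it is 'acc'
You are calculating the accuracy incorrectly (where exactly does this (TP+TN)/2 come from??); see the answer below. If the discrepancy with Keras persists, please do not change the question: that would invalidate the answer, which does address an issue in your approach. Instead, please open a new question.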

Tags: python machine-learning keras scikit-learn classification

【Solution 1】:

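Your question is ill-posed; as already commented, you have not computed the actual accuracy of your model with scikit-learn, so you seem to be comparing apples to oranges. The calculation (TP + TN)/2 from the normalized confusion matrix does not give the accuracy. Here is a simple demonstration using toy data, adapted from the plot_confusion_matrix docs: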

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# toy data
y_true = [0, 1, 0, 1, 0, 0, 0, 1]
y_pred =  [1, 1, 1, 0, 1, 1, 0, 1]
class_names=[0,1]

# plot_confusion_matrix function

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

Computing the normalized confusion matrix gives:

plot_confusion_matrix(y_true, y_pred, classes=class_names, normalize=True)
# result:
Normalized confusion matrix
[[ 0.2         0.8       ]
 [ 0.33333333  0.66666667]]

According to your incorrect rationale, the accuracy should be:

(0.67 + 0.2)/2
# 0.435

(Notice how in the normalized matrix each row adds up to 100%, which does not happen in the full confusion matrix.)
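
As a side note, the quantity obtained by averaging the diagonal of a row-normalized confusion matrix is not meaningless: it is the mean of the per-class recalls, which scikit-learn exposes as balanced_accuracy_score. The sketch below checks this on the same toy data; it uses the unrounded recalls, so the result differs slightly from the 0.435 obtained with the rounded 0.67.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# Same toy data as in the answer.
y_true = [0, 1, 0, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)
# Row-normalize: each row is divided by the number of true samples in that class.
cm_norm = cm / cm.sum(axis=1, keepdims=True)

# Mean of the diagonal = mean of per-class recalls.
mean_diag = float(np.mean(np.diag(cm_norm)))
bal_acc = float(balanced_accuracy_score(y_true, y_pred))
print(mean_diag, bal_acc)
```

Balanced accuracy and plain accuracy coincide only when every class has the same number of samples, which is why the two figures should not be compared directly.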

But now let's see what the real accuracy is, from the non-normalized confusion matrix:

plot_confusion_matrix(y_true, y_pred, classes=class_names) # normalize=False by default
# result
Confusion matrix, without normalization
[[1 4]
 [1 2]]

From which, by the definition of accuracy as (TP + TN) / (TP + TN + FP + FN), we get:

(1+2)/(1+2+4+1)
# 0.375
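
The same calculation can be written in a form that also works for more than two classes: accuracy is the sum of the diagonal of the count matrix divided by the total. The sketch below additionally shows that a row-normalized matrix only yields the accuracy once each diagonal entry is weighted by its class size, which is exactly the information that normalization discards. The variable names are chosen for this example.

```python
import numpy as np

# Non-normalized confusion matrix from the toy example (rows = true class).
cm = np.array([[1, 4],
               [1, 2]])

# Accuracy from counts: correct predictions (diagonal) over all predictions.
acc_from_counts = float(np.trace(cm) / cm.sum())

# Row-normalized matrix, as produced with normalize=True.
support = cm.sum(axis=1)                  # true samples per class: [5, 3]
cm_norm = cm / support[:, np.newaxis]

# Recovering accuracy from the normalized matrix requires the class sizes.
acc_from_normalized = float(np.sum(np.diag(cm_norm) * support) / support.sum())
print(acc_from_counts, acc_from_normalized)
```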

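Of course, we don't need the confusion matrix to get something as elementary as the accuracy; as already advised in the comments, we can simply use scikit-learn's built-in accuracy_score method: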

from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
# 0.375

Unsurprisingly, this agrees with our direct calculation from the confusion matrix.

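Bottom line:

When specific methods (like accuracy_score) exist, it is definitely preferable to use them rather than ad hoc inspiration, especially when something doesn't look right (like a discrepancy between the accuracies reported by Keras and scikit-learn).
The fact that in this example the actual accuracy is lower than the one calculated your way obviously says nothing about the specific case you report.
If, even after computing the correct accuracy for your data, the discrepancy with Keras persists, please do not alter the question to reflect the new situation, as that would invalidate the answer, despite the fact that it highlights a mistake in your approach; open a new question instead.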
