Dealing with an imbalanced dataset in text classification with Keras and Theano: answers

【Question title】: Deal with imbalanced dataset in text classification with Keras and Theano
【Posted】: 2019-09-07 18:57:09
【Question description】:

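I have a dataset of ~20,000 texts, with ~5,000 true samples and ~15,000 false samples. A two-channel textCNN built with Keras and Theano does the classification, and the F1 score is the evaluation metric. The F1 score is not bad, but the confusion matrix shows that the accuracy on the true samples is relatively low (~40%). In practice, predicting the true samples correctly is very important. I therefore want to design a custom binary cross-entropy loss function that puts more weight on misclassified true samples, so that the model focuses more on predicting the true samples correctly.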

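I tried class_weight (computed with sklearn) in the model.fit method, but it did not work well, because the weights are applied to all samples rather than only to the misclassified ones.
I also tried and adapted the approach mentioned here: https://github.com/keras-team/keras/issues/2115, but that loss function is a categorical cross-entropy, which does not work well for a binary classification problem. I tried to change it into a binary loss function, but ran into problems with the input dimensions.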

The example code for a cost-sensitive loss function that targets misclassified samples is:

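from itertools import product
from keras import backend as K

def w_categorical_crossentropy(y_true, y_pred, weights):
    # weights[c_t, c_p]: cost of predicting class c_p when the true class is c_t
    nb_cl = len(weights)
    final_mask = K.zeros_like(y_pred[:, 0])
    # One-hot mask marking the highest-probability class of each sample
    y_pred_max = K.max(y_pred, axis=1)
    y_pred_max = K.reshape(y_pred_max, (K.shape(y_pred)[0], 1))
    y_pred_max_mat = K.cast(K.equal(y_pred, y_pred_max), K.floatx())
    for c_p, c_t in product(range(nb_cl), range(nb_cl)):
        final_mask += (weights[c_t, c_p] * y_pred_max_mat[:, c_p] * y_true[:, c_t])
    # Keras 2 expects (target, output); Keras 1 used (output, target)
    return K.categorical_crossentropy(y_true, y_pred) * final_mask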
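To make the role of final_mask concrete, here is a small NumPy sketch of what the double loop computes: for every sample it looks up weights[true_class, predicted_class]. The cost matrix, the labels, and the predictions are made-up illustration values, and the lookup assumes one-hot labels and no ties in the predictions.

```python
import numpy as np

# Hypothetical cost matrix: rows = true class, columns = predicted class.
# Index 1 stands for the "true" class; missing a true sample costs 5x.
weights = np.array([[1.0, 1.0],
                    [5.0, 1.0]])

# One-hot labels and predicted probabilities for three made-up samples.
y_true = np.array([[0, 1], [0, 1], [1, 0]])
y_pred = np.array([[0.3, 0.7], [0.8, 0.2], [0.9, 0.1]])

# Same result as the loop over (c_p, c_t): one weight per sample,
# selected by the true class and the arg-max predicted class.
final_mask = weights[y_true.argmax(axis=1), y_pred.argmax(axis=1)]
print(final_mask)
```

Only the second sample, a true sample predicted as false, receives the larger weight; the correctly classified samples keep weight 1.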

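A custom loss function for misclassified samples, implemented with Keras and Theano, really matters for an imbalanced dataset like this one. Please help me solve this problem. Thanks!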

【Question discussion】:

Tags: python keras binary conv-neural-network text-classification

【Solution 1】:

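Well, when I have to deal with an imbalanced dataset in Keras, what I do is first compute a weight for each class and then pass those weights to the model instance during training. It looks something like this: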

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(targets)
# recent scikit-learn versions require keyword arguments here
w = compute_class_weight(class_weight='balanced', classes=classes, y=targets)

# here I am adding only two categories with their corresponding weights
# you can spin a loop or continue by hand until you include all of your categories
weights = {
    classes[0]: w[0],  # first class with its weight w[0]
    classes[1]: w[1],  # second class with its weight w[1]
}

# then during training you do like this
model.fit(x=features, y=targets, class_weight=weights)  # plus your other fit arguments

I believe this will solve your problem.
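For reference, the 'balanced' option of compute_class_weight uses the formula n_samples / (n_classes * class_count). A pure-Python sketch of that formula, using the approximate class sizes from the question (the helper name balanced_class_weights and the label encoding 1 = true, 0 = false are assumptions for illustration):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Class sizes from the question: ~5,000 true (1) vs ~15,000 false (0).
labels = [1] * 5000 + [0] * 15000
weights = balanced_class_weights(labels)
print(weights)
```

With these counts the true class gets weight 2.0 and the false class about 0.67. The weight depends only on a sample's label, not on whether the model currently classifies it correctly.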

【Discussion】:

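Hi Vasil, I actually tried this before. It turned out that although the accuracy on the true samples improved, the accuracy on the lower-weighted samples dropped as a result. That is why I want to optimize it by penalizing the misclassified samples instead. Have you run into a similar problem?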
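A binary version of the misclassification-weighted loss could look roughly like the sketch below. It is a NumPy reference that shows the arithmetic only, not a tested Keras/Theano loss; the function name, the fn_weight and fp_weight parameters, and the 0.5 decision threshold are assumptions introduced here. Porting it would mean replacing the NumPy calls with their Keras backend counterparts (for example K.clip, K.log, K.cast, K.mean). Flattening both inputs is one possible way around shape mismatches such as (n, 1) versus (n,).

```python
import numpy as np

def weighted_binary_crossentropy(y_true, y_pred, fn_weight=1.0, fp_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy with extra weight on misclassified samples.

    fn_weight scales the loss of true samples predicted as false (p < 0.5);
    fp_weight scales the loss of false samples predicted as true (p >= 0.5).
    Correctly classified samples keep weight 1.
    """
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.clip(np.asarray(y_pred, dtype=float).ravel(), eps, 1.0 - eps)

    # Plain per-sample binary cross-entropy.
    bce = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

    # Hard decision at 0.5, then one mask per kind of error.
    y_hat = (y_pred >= 0.5).astype(float)
    false_neg = y_true * (1.0 - y_hat)
    false_pos = (1.0 - y_true) * y_hat

    weights = 1.0 + (fn_weight - 1.0) * false_neg + (fp_weight - 1.0) * false_pos
    return float(np.mean(bce * weights))

# Made-up example: the second sample is a true sample predicted as false.
y_true = [1, 1, 0, 0]
y_pred = [0.9, 0.2, 0.1, 0.7]
plain = weighted_binary_crossentropy(y_true, y_pred)
costly = weighted_binary_crossentropy(y_true, y_pred, fn_weight=5.0)
```

Unlike class_weight, the extra weight here applies only to samples on the wrong side of the threshold, so correctly classified true samples are not up-weighted. Note that the hard threshold makes the weight itself non-differentiable; it acts as a per-sample constant during the gradient step.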