Resampling dataset for spam classification: answers

【Question title】: Resampling dataset for spam classification
【Posted】: 2021-05-20 11:49:03
【Question description】:

The following dataset has a class imbalance problem:

Text                             is_it_capital?     is_it_upper?      contains_num?   Label
an example of text                      0                  0               0            0
ANOTHER example of text                 1                  1               0            1
What's happening?Let's talk at 5        1                  0               1            1

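and so on. I have 5000 rows/texts (4500 in class 0 and 500 in class 1).

I need to resample my classes, but I don't know where in my model to include this step, so I would appreciate it if you could take a look and tell me whether I have missed a step or whether you spot any inconsistency in the approach.

For the train/test split I am using the following: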

X_train, X_test, y_train, y_test  = train_test_split(X, y, test_size=0.25, random_state=40)

where X and y are

X=df[['Text','is_it_capital?', 'is_it_upper?', 'contains_num?']]
y=df['Label']

df_train= pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)


# Separating classes

spam = df_train[df_train.Label == 1]
not_spam = df_train[df_train.Label == 0]

# Oversampling  

oversampl = resample(spam,replace=True,n_samples=len(not_spam), random_state=42)

oversampled = pd.concat([not_spam, oversampl])
df_train = oversampled.copy()

Output (wrong?):

              precision    recall  f1-score   support

         0.0       0.94      0.98      0.96      3600
         1.0       0.76      0.52      0.62       400

    accuracy                           0.93      4000
   macro avg       0.86      0.77      0.80      4000

weighted avg       0.92      0.93      0.93      4000

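Do you think there is something wrong with my oversampling steps, given that the classification report shows a support of 400 for class 1 rather than something higher?

Sorry for the long post, but I thought it was worth reporting all the steps to give a better picture of the approach I have taken.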

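【Comments】:

You don't seem to be using the oversampled variable to train your model. I think the line logR_pipeline.fit(df_train['Text'], df_train['Label']) should be logR_pipeline.fit(oversampled['Text'], oversampled['Label']).
I can't tell what question you want answered. Are you asking for advice on how to use oversampling, or on how to train your model? How familiar are you with machine learning?
Please provide an executable example. Your code seems to be missing key parts: build_confusion_matrix is undefined and the c parameter is unused.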

Tags: python scikit-learn classification text-classification resampling

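【Solution 1】:

There is nothing wrong with your approach. It is normal for the evaluation report to show imbalanced support, because:

Resampling is (correctly) applied only to the training set, to force the model to give more weight to the minority class.
Evaluation is (correctly) done on the test set, which still follows the original imbalanced distribution. Resampling the test set as well would be a mistake, because evaluation must be done on the true distribution of the data.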
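
The two points above can be sketched end to end. This is a minimal illustration and not the asker's actual code: the toy DataFrame is synthetic, stratify= is an added assumption so that the class counts are exact, and the TfidfVectorizer + LogisticRegression pipeline merely stands in for logR_pipeline, whose definition is not shown in the question.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.utils import resample

# Hypothetical toy data with the same imbalance as in the question:
# 4500 rows of class 0 and 500 rows of class 1.
rng = np.random.default_rng(0)
ham_words = ["meeting", "report", "lunch", "project", "thanks"]
spam_words = ["free", "win", "prize", "call", "now"]

def make_text(words):
    # Build a five-word pseudo message from the given vocabulary.
    return " ".join(rng.choice(words, size=5))

df = pd.DataFrame({
    "Text": [make_text(ham_words) for _ in range(4500)]
          + [make_text(spam_words) for _ in range(500)],
    "Label": [0] * 4500 + [1] * 500,
})

# 1. Split first. stratify= (an assumption, not in the question) keeps
#    the 9:1 class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    df[["Text"]], df["Label"], test_size=0.25, random_state=40,
    stratify=df["Label"],
)

# 2. Oversample the minority class in the training part only.
df_train = pd.concat([X_train, y_train], axis=1)
spam = df_train[df_train.Label == 1]
not_spam = df_train[df_train.Label == 0]
spam_over = resample(spam, replace=True, n_samples=len(not_spam),
                     random_state=42)
train_bal = pd.concat([not_spam, spam_over])

# 3. Fit on the balanced training data.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_bal["Text"], train_bal["Label"])

# 4. Evaluate on the untouched test set.
print(train_bal["Label"].value_counts())  # balanced after oversampling
print(y_test.value_counts())              # still imbalanced
print(classification_report(y_test, model.predict(X_test["Text"])))
```

The support column of the report counts the rows of y_test per class, so it reflects the original imbalance no matter how the training set was resampled.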

【Discussion】: