How can I use SMOTE in a Sklearn Pipeline for an NLP classification problem? (Answers)

【Question title】: How can I use SMOTE in a Sklearn Pipeline for a NLP Classification problem?
【Posted】: 2021-11-05 07:10:36
【Question description】:

I'm working on a multi-class classification problem in which some classes are heavily imbalanced. My data looks like this:

product_description                  class
"This should be used to clean..."    1
"Beauty product, natural..."         2
"Cleaning product, be careful..."    2
"Food, prepared with fruits..."      2
"T-shirt, sports, white, light..."   3
"Cleaning product, used to ..."      2
"Blue pants, two pockets, men..."    3

So I need to build a classification model. This is what my pipeline currently looks like:

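import string

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# df is the DataFrame shown above
X = df['product_description']
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)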

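def text_process(mess):
    """Strip punctuation and English stopwords; return the remaining tokens."""
    # Requires the NLTK stopwords corpus: nltk.download("stopwords")
    STOPWORDS = set(stopwords.words("english"))

    # Drop every punctuation character, then join the rest back into a string.
    nopunc = "".join(char for char in mess if char not in string.punctuation)

    # Return a list of tokens: a callable `analyzer` must yield tokens, and a
    # joined string would make CountVectorizer count single characters instead.
    return [word for word in nopunc.split() if word.lower() not in STOPWORDS]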

pipe = Pipeline(
    steps=[
        ("vect", CountVectorizer(analyzer=text_process)),
        ("feature_selection", SelectKBest(chi2, k=20)),
        ("polynomial", PolynomialFeatures(2)),
        ("reg", LogisticRegression()),
    ]
)

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

print(classification_report(y_test, y_pred))

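However, my dataset is heavily imbalanced, with the following distribution: class 1 - 80%, class 2 - 10%, class 3 - 5%, class 4 - 4%, class 5 - 1%. So I am trying to apply SMOTE, but I still don't understand where it should be applied.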

At first I tried applying SMOTE before the Pipeline, but I got this error:

ValueError: could not convert string to float: '...'

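So I considered using SMOTE together with the Pipeline, but that also raised an error. I tried putting SMOTE() as the first step and as the second step, after CountVectorizer (which seemed logical to me), but both attempts returned the same error: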

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE()' (type <class 'imblearn.over_sampling._smote.base.SMOTE'>) doesn't

Any ideas on how to solve this? What am I missing here?

Thanks

【Discussion】:

Tags: python scikit-learn nlp pipeline smote

【Solution 1】:

Using a resampler such as SMOTE requires the imblearn version of Pipeline (imblearn.pipeline.Pipeline).

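This is because a resampler has to change both X and y, which a plain sklearn pipeline does not allow. The imblearn pipeline accommodates this by letting its intermediate steps be either transformers or resamplers (the resampling method is fit_resample in current imblearn releases; older releases called it sample). Importantly, resampling only happens during fitting, on the training data, and not during later transform or predict calls. Otherwise it should behave just like a normal sklearn pipeline.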

【Discussion】: