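[Posted]: 2021-11-05 07:10:36
[Question]:
I am working on a multi-class classification problem in which some of the classes are highly imbalanced. My data looks like this: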
product_description class
"This should be used to clean..." 1
"Beauty product, natural..." 2
"Cleaning product, be careful..." 2
"Food, prepared with fruits..." 2
"T-shirt, sports, white, light..." 3
"Cleaning product, used to ..." 2
"Blue pants, two pockets, men..." 3
So I need to build a classification model. This is what my pipeline currently looks like:
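import string

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# df is the DataFrame shown above
X = df['product_description']
y = df['class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

def text_process(mess):
    STOPWORDS = set(stopwords.words("english"))
    # Check characters to see if they are in punctuation
    nopunc = [char for char in mess if char not in string.punctuation]
    # Join the characters again to form the string.
    nopunc = "".join(nopunc)
    # Now just remove any stopwords. Return a list of tokens: a callable
    # analyzer must yield tokens, and a joined string would be iterated
    # character by character.
    return [word for word in nopunc.split() if word.lower() not in STOPWORDS]

pipe = Pipeline(
    steps=[
        ("vect", CountVectorizer(analyzer=text_process)),
        ("feature_selection", SelectKBest(chi2, k=20)),
        ("polynomial", PolynomialFeatures(2)),
        ("reg", LogisticRegression()),
    ]
)
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))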
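However, my dataset is highly imbalanced, with the following distribution: class 1: 80%, class 2: 10%, class 3: 5%, class 4: 4%, class 5: 1%. So I am trying to apply SMOTE, but I still don't understand where it should be applied.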
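To make the imbalance concrete, here is a minimal sketch that tallies class shares with the standard library. The label list is hypothetical: it simply reproduces the percentages quoted above over an assumed 1000 rows, and stands in for `df['class']`.

```python
from collections import Counter

# Hypothetical labels reproducing the quoted shares (assumed 1000 rows in total).
y = [1] * 800 + [2] * 100 + [3] * 50 + [4] * 40 + [5] * 10

counts = Counter(y)
shares = {label: n / len(y) for label, n in sorted(counts.items())}
for label, share in shares.items():
    print(f"class {label}: {counts[label]} rows ({share:.0%})")

# Rows of the rarest class expected in the training split when test_size=0.3
# (70% of the rows go to training; integer arithmetic avoids float rounding).
rarest_label, rarest_count = min(counts.items(), key=lambda kv: kv[1])
expected_train_rows = rarest_count * 7 // 10
print(f"rarest class {rarest_label}: about {expected_train_rows} training rows")
```

The row count of the rarest class matters for oversampling, because SMOTE interpolates between a minority sample and its nearest minority neighbours and so needs more minority rows than the `k_neighbors` it is configured with.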
At first I tried applying SMOTE before the Pipeline, but that raised an error:
ValueError: could not convert string to float: '...'
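So I considered using SMOTE inside the Pipeline, but that also failed. I tried adding SMOTE() as the first step and as the second step, after CountVectorizer (which seemed the logical place to me), but both returned the same error: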
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE()' (type <class 'imblearn.over_sampling._smote.base.SMOTE'>) doesn't
Any ideas on how to solve this? What am I missing here?
Thanks
[Comments]:
Tags: python scikit-learn nlp pipeline smote