【Posted】:2021-09-24 22:13:15
【Problem description】:
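I am currently working on a multi-label text classification problem with 4 labels, represented as 4 dummy variables. I have tried several ways of transforming the data into a form suitable for multi-label classification (MLC).
Right now I am running a pipeline, but as far as I can tell this does not fit one model that covers all labels; instead it fits one model per label. Do you agree?
I have tried using MultiLabelBinarizer and LabelBinarizer, but without success.
Do you have any suggestions for how I can solve this so that a single model covers all labels while also taking the different label combinations into account?
A subset of the data and my code are here: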
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score  # used in the evaluation loop below
from sklearn.pipeline import Pipeline
# Import data
df = import_data("product_data")
# Define dataframe to only include relevant columns
df = df.loc[:,['text','TV','Internet','Mobil','Fastnet']]
# Define dataframe with labels
df_labels = df.loc[:,['TV','Internet','Mobil','Fastnet']]
# Sum the number of labels per text
sum_column = df["TV"] + df["Internet"] + df["Mobil"] + df["Fastnet"]
df["label_sum"] = sum_column
# Remove texts with no labels
df.drop(df[df['label_sum'] == 0].index, inplace = True)
# Split dataset
train, test = train_test_split(df, random_state=42, test_size=0.2, shuffle=True)
X_train = train.text
X_test = test.text
categories = ['TV','Internet','Mobil','Fastnet']
# Model
LogReg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)),
])
# Refit the same pipeline once per label, i.e. one independent binary model per category
for category in categories:
    print('... Processing {}'.format(category))
    LogReg_pipeline.fit(X_train, train[category])
    prediction = LogReg_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
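For comparison with the per-label loop above, here is a minimal sketch of one way to fit all four labels with a single estimator. It is not part of the original code. It assumes the four dummy columns are passed together as a 2D indicator matrix and uses scikit-learn's ClassifierChain, which is a swapped-in technique: internally it still trains one binary model per label, but each link receives the earlier labels as extra features, so label co-occurrence can influence the predictions. The `toy` frame is hypothetical data standing in for `product_data`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.multioutput import ClassifierChain
from sklearn.pipeline import Pipeline

categories = ['TV', 'Internet', 'Mobil', 'Fastnet']

# Hypothetical toy data (assumption): stands in for the real product_data frame.
toy = pd.DataFrame({
    'text': [
        'tv channels package', 'fiber internet speed',
        'mobile phone plan', 'landline phone line',
        'tv and internet bundle', 'mobile and internet data',
        'tv sports channels', 'landline and internet',
    ],
    'TV':       [1, 0, 0, 0, 1, 0, 1, 0],
    'Internet': [0, 1, 0, 0, 1, 1, 0, 1],
    'Mobil':    [0, 0, 1, 0, 0, 1, 0, 0],
    'Fastnet':  [0, 0, 0, 1, 0, 0, 0, 1],
})

X = toy['text']
# 2D indicator matrix, one column per label, shape (n_samples, 4)
Y = toy[categories].to_numpy()

chain_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word')),
    # Each link in the chain sees the TF-IDF features plus the preceding labels
    ('clf', ClassifierChain(LogisticRegression(class_weight='balanced'))),
])

# A single fit call covers all four labels at once
chain_pipeline.fit(X, Y)
pred = chain_pipeline.predict(X).astype(int)

# With a 2D indicator target, accuracy_score is the exact-match (subset) accuracy:
# a row counts as correct only if all four labels match.
print('Subset accuracy on the toy data: {}'.format(accuracy_score(Y, pred)))
```

With the real data, the same pattern would presumably use `train[categories]` and `test[categories]` from the existing split instead of the toy frame. Because the dummy columns are already a 0/1 indicator matrix, MultiLabelBinarizer should not be needed in this setup.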
【Discussion】:
Tags: python scikit-learn multilabel-classification