Making predictions using all labels in multilabel text classification

【Question title】: Making predictions using all labels in multilabel text classification
【Posted】: 2021-09-24 22:13:15
【Question】:

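I am currently working on a multi-label text classification problem where I have 4 labels, represented as 4 dummy variables. I have tried several ways of transforming the data so that it is suitable for building a multi-label classification (MLC) model.

Right now I am running a pipeline, but as far as I can tell this does not fit a single model that includes all labels; instead it produces one model per label. Do you agree?

I have tried MultiLabelBinarizer and LabelBinarizer, without success.

Do you have any suggestions on how I can solve this, so that a single model includes all the labels and takes the different label combinations into account?

A subset of the data and my code are here: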

import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Import data
df  = import_data("product_data")
# Define dataframe to only include relevant columns
df = df.loc[:,['text','TV','Internet','Mobil','Fastnet']]
# Define dataframe with labels
df_labels = df.loc[:,['TV','Internet','Mobil','Fastnet']]
# Sum the number of labels per text
sum_column = df["TV"] + df["Internet"] + df["Mobil"] + df["Fastnet"]
df["label_sum"] = sum_column
# Remove texts with no labels
df.drop(df[df['label_sum'] == 0].index, inplace = True)
# Split dataset
train, test = train_test_split(df, random_state=42, test_size=0.2, shuffle=True)
X_train = train.text
X_test = test.text

categories = ['TV','Internet','Mobil','Fastnet']

# Model
LogReg_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(analyzer = 'word', max_df=0.20)),
                ('clf', LogisticRegression(solver='lbfgs', multi_class = 'ovr', class_weight = 'balanced', n_jobs=-1)),
                 ])
    
for category in categories:
    print('... Processing {}'.format(category))
    LogReg_pipeline.fit(X_train, train[category])
    prediction = LogReg_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))

https://www.transfernow.net/dl/20210921NbWDt3eo

【Question comments】:

Tags: python scikit-learn multilabel-classification

【Solution 1】:

Code analysis

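The scikit-learn LogisticRegression classifier with OVR (one-vs-rest) can only predict a single output/label at a time. Because your loop trains the model in the pipeline on the labels one at a time, you produce one trained model per label. The algorithm is the same for every model, but each one is trained on a different target.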
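To make the "one model per label" point concrete, here is a minimal sketch. It assumes a small made-up dataset (the texts and label values below are not from the question) and uses sklearn.base.clone so that each label keeps its own fitted pipeline; refitting a single pipeline object in a loop, as in the question, only leaves the last fit behind.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration; label names mirror the question.
texts = [
    "tv channel package", "slow internet router", "mobile sim card",
    "tv and internet bundle", "internet speed drops", "mobile data plan",
    "tv box remote", "mobile phone and tv offer",
]
labels = {
    "TV":       [1, 0, 0, 1, 0, 0, 1, 1],
    "Internet": [0, 1, 0, 1, 1, 0, 0, 0],
    "Mobil":    [0, 0, 1, 0, 0, 1, 0, 1],
}

base = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(class_weight="balanced")),
])

# clone() returns an unfitted copy, so every label gets its own fitted
# pipeline instead of overwriting the previous fit.
models = {name: clone(base).fit(texts, y) for name, y in labels.items()}

# Each model answers one yes/no question about the new text.
new_text = ["tv remote broken"]
per_label = {name: int(m.predict(new_text)[0]) for name, m in models.items()}
print(per_label)
```

Each entry in models is an independent binary classifier, which is exactly what the loop in the question trains.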

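Multi-output regressor

A multi-output regressor accepts several independent labels and produces one prediction per target.
The output should be the same as yours, but you only need to maintain a single model object and call fit once; internally the wrapper still fits one copy of the estimator per target.
To use this approach, wrap your LR model in MultiOutputRegressor. Since LogisticRegression is a classifier, scikit-learn's MultiOutputClassifier is the matching wrapper and works the same way.
Here is a good tutorial on multi-output regression models.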

from sklearn.multioutput import MultiOutputRegressor

model = LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)

pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(analyzer = 'word', max_df=0.20)),
                ('clf', MultiOutputRegressor(model))])

# df_labels must contain exactly the rows of X_train, e.g. train[categories]
preds = pipeline.fit(X_train, df_labels).predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=categories)

combine_data() merges all the data into a single DataFrame for convenience:

import pandas as pd

def combine_data(X, Y, y_cols):
    """ X is a DataFrame or Series, Y is a np array, y_cols is a list """
    df_out = pd.DataFrame(Y, columns=y_cols)
    df_out.index = X.index  # align predictions with the input rows
    return pd.concat([X, df_out], axis=1).sort_index()
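As a variation on the wrapper idea above, the following sketch uses MultiOutputClassifier from sklearn.multioutput, which is built for classifiers and also exposes per-label probabilities. The texts and label matrix are made up for illustration; only the column order (TV, Internet, Mobil, Fastnet) follows the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration.
texts = [
    "tv channel package", "slow internet router", "mobile sim card",
    "landline phone socket", "tv and internet bundle",
    "mobile data and landline", "internet speed drops", "tv box remote",
]
# Columns: TV, Internet, Mobil, Fastnet
Y = np.array([
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
    [1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 0, 0], [1, 0, 0, 0],
])

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(LogisticRegression(class_weight="balanced"))),
])
pipeline.fit(texts, Y)  # one fit call; one estimator per column internally

new_texts = ["tv and mobile offer"]
preds = pipeline.predict(new_texts)        # array of shape (n_samples, 4)
probs = pipeline.predict_proba(new_texts)  # list with one array per label
print(preds)
```

Note that the labels are still modelled independently here: the wrapper does not learn which label combinations tend to occur together.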

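Multinomial logistic regression

To use the LogisticRegression classifier on all labels at once, set multi_class='multinomial'.
The softmax function is used to find the predicted probability that a sample belongs to each class. These probabilities sum to 1 across classes, so the model assigns exactly one class to each sample.
You need to reverse the one-hot encoding of the labels to get back a categorical variable (here, answer). If you have the original labels from before the one-hot encoding, use those.
Here is a good tutorial on multinomial logistic regression.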

label_col = "text_source"  # name of the single categorical target column
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
# df_train[input_cols] must already hold numeric features (e.g. TF-IDF output),
# and X_test must hold the same feature columns for the test rows
model = clf.fit(df_train[input_cols], df_train[label_col])

# Generate a table of probabilities for each class (one column per class)
probs = model.predict_proba(X_test)
df_probs = combine_data(X=X_test, Y=probs, y_cols=model.classes_)

# Predict the class for a sample, i.e. the one with the highest probability
preds = model.predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=[label_col])
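Because the multinomial model picks exactly one class per sample, one way to reverse the one-hot encoding for texts that carry several labels is to treat each label combination as its own class. The sketch below shows that idea under stated assumptions: the data is made up, and the combo column and combo_name helper are hypothetical names introduced here, not part of the original answer. LogisticRegression is left at its defaults, which in the scikit-learn releases I am aware of fit a multinomial model for multiclass targets with the lbfgs solver.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

categories = ["TV", "Internet", "Mobil", "Fastnet"]

# Toy data, made up for illustration.
df = pd.DataFrame({
    "text": [
        "tv channel package", "slow internet router", "mobile sim card",
        "landline phone socket", "tv and internet bundle",
        "mobile data and landline", "internet speed drops", "tv box remote",
    ],
    "TV":       [1, 0, 0, 0, 1, 0, 0, 1],
    "Internet": [0, 1, 0, 0, 1, 0, 1, 0],
    "Mobil":    [0, 0, 1, 0, 0, 1, 0, 0],
    "Fastnet":  [0, 0, 0, 1, 0, 1, 0, 0],
})

def combo_name(row):
    """Hypothetical helper: join the active labels, e.g. 'TV+Internet'."""
    return "+".join(c for c in categories if row[c] == 1)

# One categorical target whose classes are the observed label combinations.
df["combo"] = df[categories].apply(combo_name, axis=1)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(df["text"], df["combo"])

new_texts = ["tv and internet bundle"]
classes = pipeline.named_steps["clf"].classes_
probs = pd.DataFrame(pipeline.predict_proba(new_texts), columns=classes)
pred = pipeline.predict(new_texts)[0]
pred_labels = pred.split("+")  # back to a list of individual labels
print(pred_labels)
```

The trade-off is that only combinations seen during training can ever be predicted, and rare combinations get very few training examples.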

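【Comments】:

Hi, thanks for your reply, it is very helpful. I tried using your MultiLogReg code, but ran into a problem because X_train has shape (42141,) while df_labels has shape (55194, 4). I need the rows in df_labels to match the rows in X_train, so that both have 42141 rows, but I can't figure out how to do that
Right. That is because you drop rows from df when the sum is 0, but not from df_labels. You need to drop the rows from both, for example while the data frames are still combined.
I have tried that now, but it doesn't seem to fix it. It looks to me like the problem is that I split df into train/test, while df_labels is not split the same way... Edit: I have fixed that part now
Good catch. It is probably a good idea to do all operations on the data and labels in the same dataframe, and then extract what you need once you are done.
Your model seems to be working now! I get an array now, so it should work. Do you know how I can convert this array into a more interpretable output? array([[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], ..., [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 1, 0]])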
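One possible way to make such an array easier to read is to map each row back to the names of the labels that are set. This is a minimal sketch: the preds array is made up (it only imitates the shape of the output quoted above), and the column order is assumed to match the categories list from the question.

```python
import numpy as np

# Column order assumed to match the label columns used for training.
categories = ["TV", "Internet", "Mobil", "Fastnet"]

# Made-up predictions in the same 0/1 format as the array above.
preds = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
])

# For every row, keep the names of the columns that are set to 1.
label_names = [
    [cat for cat, flag in zip(categories, row) if flag == 1]
    for row in preds
]
print(label_names)
# → [['TV'], ['Internet'], ['TV', 'Mobil'], ['Mobil']]
```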