[Question title]: Inconsistent shape error with MultiLabelBinarizer on y_test, sklearn multi-label classification
[Posted]: 2017-11-24 01:38:17
[Question description]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC

data = r'C:\Users\...\Downloads\news_v1.xlsx'

df = pd.read_excel(data)
df = pd.DataFrame(df.groupby(["id", "doc"]).label.apply(list)).reset_index()

X = np.array(df.doc)
y = np.array(df.label)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

mlb = preprocessing.MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])

classifier.fit(X_train, Y_train)
predicted = classifier.predict(X_test)

Y_test = mlb.fit_transform(y_test)

print("Y_train: ", Y_train.shape)
print("Y_test: ", Y_test.shape)
print("Predicted: ", predicted.shape)
print("Accuracy Score: ", accuracy_score(Y_test, predicted))

I can't seem to compute any metrics, because Y_test has different matrix dimensions after calling fit_transform of MultiLabelBinarizer on it.

Result and error:

Y_train:  (1278, 49)
Y_test:  (630, 42)
Predicted:  (630, 49)
Traceback (most recent call last):
  File "C:/Users/../PycharmProjects/MultiAutoTag/classifier.py", line 41, in <module>
    print("Accuracy Score: ", accuracy_score(Y_test, predicted))
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\classification.py", line 174, in accuracy_score
    differing_labels = count_nonzero(y_true - y_pred, axis=1)
  File "C:\ProgramData\Anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 361, in __sub__
    raise ValueError("inconsistent shapes")
ValueError: inconsistent shapes

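Looking at the printed output, the shape of Y_test differs from the others. What am I doing wrong, and why does MultiLabelBinarizer return a different size for Y_test? Thanks in advance for your help!

Edit: new error: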

Traceback (most recent call last):
  File "C:/Users/../PycharmProjects/MultiAutoTag/classifier.py", line 47, in <module>
    Y_test = mlb.transform(y_test)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 763, in transform
    yt = self._transform(y, class_to_index)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 787, in _transform
    indices.extend(set(class_mapping[label] for label in labels))
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 787, in <genexpr>
    indices.extend(set(class_mapping[label] for label in labels))
KeyError: 'Sanction'

This is what y_test looks like:

print(y_test)

[['App'] ['Contract'] ['Pay'] ['App'] 
 ['App'] ['App']
 ['Reports'] ['Reports'] ['Executive', 'Pay']
 ['Change'] ['Reports']
 ['Reports'] ['Issue']]

[Comments]:

    Tags: numpy scikit-learn text-classification multilabel-classification


    [Solution 1]:

    You should only call transform() on the test data. Never call fit() or its variants such as fit_transform() or fit_predict() on it; those should be used only on the training data.

    So change the line:

    Y_test = mlb.fit_transform(y_test)

    to:

    Y_test = mlb.transform(y_test)

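    Explanation

    When you call fit() or fit_transform(), the MultiLabelBinarizer (mlb) discards what it learned previously and learns from the newly supplied data. This is a problem when Y_train and Y_test differ in their labels, as they do in your case.

    In your case, Y_train has 49 distinct labels, whereas Y_test has only 42. But this does not mean that Y_test is simply 7 labels short of Y_train. Y_test may contain an entirely different set of labels which, when binarized, yields 42 columns, and that will affect the results.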
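
    The effect described above can be reproduced on a small scale. The following is a minimal sketch using made-up label lists (not the asker's data): fitting once on the training labels and then only transforming the test labels keeps the column count identical, while refitting on the test labels produces a narrower matrix.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists, invented for illustration only.
y_train = [["App"], ["Contract", "Pay"], ["Reports"], ["Executive", "Pay"]]
y_test = [["Pay"], ["App", "Reports"]]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)  # learns the class set from the training labels
Y_test = mlb.transform(y_test)        # reuses the columns learned above

# Refitting on the test labels learns a different, smaller class set.
Y_test_refit = MultiLabelBinarizer().fit_transform(y_test)

print(mlb.classes_)
print(Y_train.shape, Y_test.shape, Y_test_refit.shape)
```

    Here Y_train and Y_test share the same number of columns, whereas Y_test_refit has only as many columns as there are distinct labels in the test lists, which is the same kind of mismatch as (630, 42) versus (630, 49) in the question.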

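    [Discussion]:

    • Thank you very much for the explanation! It works!! I have a new error, but will accept this as the answer. Thanks, man.
    • @Otje You can ask about the new error by editing this question or in a new question.
    • I edited the question with the new error. Thanks for your help!
    • @Otje This error means that the test set contains some new labels that are not present in the training set. That means the estimator will not have learned to classify them. So how do you want to handle them?
    • I decided to use StratifiedShuffleSplit to get an equal share of the target classes in train and test. I have one class with only a single instance, which cannot be split; any alternative solution would be much appreciated. Thanks! [sklearn]:scikit-learn.org/stable/modules/generated/…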
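
    For the unseen-label situation raised in the comments, two possible ways of handling it are sketched below. This is not from the original thread; the label lists are invented, and which option is appropriate depends on how the metrics should treat labels the classifier never saw. Option 1 declares the full class set up front through the `classes` parameter of MultiLabelBinarizer; option 2 drops labels from the test lists that are absent from the training lists.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data, invented for illustration: 'Sanction' appears only in the test labels.
y_train = [["App"], ["Contract"], ["Executive", "Pay"]]
y_test = [["Pay"], ["Sanction"]]

# Option 1: fix the class set from all labels before binarizing.
# The 'Sanction' column is all zeros in training, so a classifier
# cannot learn it, but the shapes stay consistent.
all_labels = sorted({label for labels in y_train + y_test for label in labels})
mlb_all = MultiLabelBinarizer(classes=all_labels)
Y_train_all = mlb_all.fit_transform(y_train)
Y_test_all = mlb_all.transform(y_test)

# Option 2: remove labels the training set never contained.
known = {label for labels in y_train for label in labels}
y_test_known = [[label for label in labels if label in known] for labels in y_test]
mlb_known = MultiLabelBinarizer()
Y_train_known = mlb_known.fit_transform(y_train)
Y_test_known = mlb_known.transform(y_test_known)

print(Y_train_all.shape, Y_test_all.shape)
print(Y_train_known.shape, Y_test_known.shape)
```

    With option 2, a test sample whose only label was unseen ends up as an all-zero row, so it may be preferable to drop such samples from the evaluation entirely.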