Issue with the train_test_split function of the sklearn library

【Question title】: Issue with the train_test_split function of the sklearn library
【Posted】: 2020-03-18 03:51:13
【Question description】:

from sklearn.utils import shuffle
dataset,labels = shuffle(dataset,labels)
print("Shuffling of dataset is completed")
print(" ")

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False)
X = vectorizer.fit_transform([dataset])


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,labels,test_size=0.1,stratify=labels)
print("Completing the splitting of data.")
print(" ")

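This is a classification model that I am building with the sklearn library.

The train_test_split() function raises the following error:

ValueError: Found input variables with inconsistent numbers of samples: [1, 88702]

I have tried several ways to resolve this error, such as changing the shapes of X and Y, but I still cannot get the desired result.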

【Comments】:

Can you tell me the shapes of X and labels? What does the dataset look like?

Tags: python scikit-learn classification

【Solution 1】:

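This error means that the number of samples (rows) in X does not equal the number of labels. If you print the shape of X, I expect its first dimension to be 1.

It looks like this line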

X = vectorizer.fit_transform([dataset])

should be changed to

X = vectorizer.fit_transform(dataset)

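according to the documentation for sklearn.feature_extraction.text: fit_transform expects an iterable of documents, and [dataset] is an iterable that contains only a single document.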
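
To make the fix concrete, here is a minimal sketch under a few assumptions: the toy dataset and labels are made up, each document is assumed to be a list of tokens already (which is what tokenizer=lambda doc: doc suggests), and random_state=0 is added only so the split is repeatable. It is not the asker's actual data or pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 pre-tokenized documents with balanced labels.
dataset = [
    ["good", "movie"], ["bad", "movie"], ["great", "plot"], ["poor", "plot"],
    ["good", "acting"], ["bad", "acting"], ["great", "film"], ["poor", "film"],
    ["good", "story"], ["bad", "story"],
] * 2
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] * 2

# Wrapping the dataset in another list yields ONE "document",
# which is where the 1 in "[1, 88702]" comes from.
print(len([dataset]), len(dataset))  # 1 20

# Same vectorizer settings as in the question; pass the documents directly.
vectorizer = TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False)
X = vectorizer.fit_transform(dataset)

# Sanity check before splitting: one row of X per label.
if X.shape[0] != len(labels):
    raise ValueError(f"X has {X.shape[0]} rows but there are {len(labels)} labels")

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.1, stratify=labels, random_state=0
)
print(X_train.shape[0], X_test.shape[0])  # 18 2
```

Checking X.shape[0] against len(labels) right after vectorizing tends to surface this kind of mismatch earlier, and with a clearer message, than waiting for train_test_split to fail.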

【Comments】: