【Posted】: 2021-06-01 07:28:46
【Question】:
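I get a dimension error when I try to predict with a naive Bayes classifier.
The data consists of a column of sentences and a column of sentiments (i.e. labels). I want to use a naive Bayes classifier to predict the sentiment of each sentence.
I start by splitting the data into training, test, and validation sets: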
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import SelectKBest, chi2
training_set, sentence_split_further, training_set_sentiments, sentiments_split_further = train_test_split(
    sentence_data.Sentence, sentence_data.Sentiment, test_size=.5, train_size=.5, random_state=1)
testing_set, validation_set, testing_set_sentiments, validation_set_sentiments = train_test_split(
    sentence_split_further, sentiments_split_further, test_size=.5, train_size=.5, random_state=1)
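As a rough check of what the two calls above produce, the following sketch applies the same two-stage split to a made-up DataFrame (the toy `sentence_data` here is an assumption, not the real data). Half of the rows go to training, and the held-back half is divided evenly into test and validation sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 8 sentences with alternating labels.
sentence_data = pd.DataFrame({
    "Sentence": [f"example sentence number {i}" for i in range(8)],
    "Sentiment": ["positive", "negative"] * 4,
})

# Stage 1: 50% training, 50% held back.
train_x, rest_x, train_y, rest_y = train_test_split(
    sentence_data.Sentence, sentence_data.Sentiment,
    test_size=0.5, train_size=0.5, random_state=1)

# Stage 2: split the held-back half evenly into test and validation.
test_x, val_x, test_y, val_y = train_test_split(
    rest_x, rest_y, test_size=0.5, train_size=0.5, random_state=1)

split_sizes = (len(train_x), len(test_x), len(val_x))
print(split_sizes)
```

With 8 rows this yields a 4/2/2 split, i.e. roughly 50% / 25% / 25% of the data.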
Then I create a feature matrix, apply TF-IDF, and prune to the best k words. I do all of this in a function I wrote called feature_selection_vector:
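tfidf_training_feature_matrix = feature_selection_vector(training_set, training_set_sentiments)
tfidf_testing_feature_matrix = feature_selection_vector(testing_set, testing_set_sentiments)
tfidf_validation_feature_matrix = feature_selection_vector(validation_set, validation_set_sentiments)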
This is the code of the feature_selection_vector function:
def feature_selection_vector(sentence_data, sentiments):
    # creates the feature vector and calculates TF-IDF
    vectorizer = CountVectorizer(analyzer='word',
                                 token_pattern=r'\b[a-zA-Z]{3,}\b',
                                 ngram_range=(1, 1)
                                 )
    count_vectorized = vectorizer.fit_transform(sentence_data)
    tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
    vectorized = tfidf_transformer.fit_transform(count_vectorized)
    # get_feature_names() was replaced by get_feature_names_out() in scikit-learn >= 1.0
    vector = pd.DataFrame(vectorized.toarray(),
                          index=['sentence ' + str(i)
                                 for i in range(1, 1 + len(sentence_data))],
                          columns=vectorizer.get_feature_names())
    selector = SelectKBest(chi2, k=1000)
    selector.fit(vector, sentiments)
    return vector
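One detail worth noting about the function above: `SelectKBest.fit` only scores the features, and the reduced matrix comes from `transform` (or `fit_transform`), so returning `vector` hands back every column. A minimal sketch on a made-up count matrix, assuming `k=2` instead of the 1000 used above:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative feature matrix: 6 samples, 4 features.
X = np.array([
    [3, 0, 1, 1],
    [2, 0, 1, 0],
    [4, 1, 1, 1],
    [0, 3, 1, 0],
    [1, 4, 1, 1],
    [0, 2, 1, 0],
])
y = np.array([1, 1, 1, 0, 0, 0])

selector = SelectKBest(chi2, k=2)
selector.fit(X, y)                  # computes chi2 scores; X itself is unchanged
X_reduced = selector.transform(X)   # keeps only the k best-scoring columns

print(X.shape, X_reduced.shape)
```

The original matrix keeps its four columns, and only the transformed copy is narrowed to two.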
Now I want to fit a naive Bayes classifier on the training data and then use the model to predict on the test data.
naive_bayes = MultinomialNB()
naive_bayes.fit(tfidf_training_feature_matrix,training_set_sentiments)
NBC_tfidf_sentiment_predicted=naive_bayes.predict(tfidf_testing_feature_matrix)
But I keep getting this error:
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 892 is different from 348)
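The two sizes it complains about are the number of columns in the training set (892) and the number of columns in the test set (348).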
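The mismatch can be reproduced on a small scale. In the sketch below (toy sentences, all made up), fitting a separate `CountVectorizer` on each split gives each matrix its own vocabulary and therefore its own column count, while fitting once on the training text and calling `transform` on the test text keeps the columns aligned. This is one common way to arrange it, not necessarily the only fix.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the real sentences and labels.
train_text = ["the movie was great fun", "terrible plot and awful acting",
              "great acting and great fun", "awful movie with terrible pacing"]
train_labels = ["positive", "negative", "positive", "negative"]
test_text = ["great fun", "awful plot"]

pattern = r'\b[a-zA-Z]{3,}\b'

# Separate fits: each vectorizer learns only the words it has seen.
cols_train = CountVectorizer(token_pattern=pattern).fit_transform(train_text).shape[1]
cols_test = CountVectorizer(token_pattern=pattern).fit_transform(test_text).shape[1]
print(cols_train, cols_test)  # different column counts

# Shared fit: learn vocabulary and IDF weights from the training text only,
# then reuse the same fitted objects on the test text.
vectorizer = CountVectorizer(token_pattern=pattern)
tfidf = TfidfTransformer(smooth_idf=True, use_idf=True)
X_train = tfidf.fit_transform(vectorizer.fit_transform(train_text))
X_test = tfidf.transform(vectorizer.transform(test_text))

model = MultinomialNB().fit(X_train, train_labels)
predicted = model.predict(X_test)
print(X_train.shape[1] == X_test.shape[1], list(predicted))
```

The same idea extends to `SelectKBest`: fit it on the training matrix and apply its `transform` to the test and validation matrices.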
【Comments】:
Tags: python machine-learning scikit-learn text-processing naivebayes