【Posted】: 2021-10-23 15:23:27
【Question】:
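I am working on an NLP problem: https://www.kaggle.com/c/nlp-getting-started. I want to vectorize the text after train_test_split, but when I do, the resulting sparse matrix has size = 1, which cannot be right.
My train_x has shape (4064, 1), yet after tfidf.fit_transform I get
size = 1. How is this possible? Here is my code: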
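import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()  # assumption: `ps` is not defined in the post; a Porter stemmer fits ps.stem
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word

def clean_text(text):
    tokens = nltk.word_tokenize(text)  # tokenize the text
    lower = [word.lower() for word in tokens]  # convert words to lowercase
    remove_stopwords = [word for word in lower if word not in stop_words]
    remove_char = [word for word in remove_stopwords if word.isalpha()]  # keep alphabetic tokens only
    stemmed = [ps.stem(word) for word in remove_char]  # stem the words (this is stemming, not lemmatizing)
    return " ".join(stemmed)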
x['clean_text']= x["text"].map(clean_text)
x.drop(['text'], axis = 1, inplace = True)
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = 0.2, random_state = 69,
                                                    stratify = y)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
tfidf = TfidfVectorizer()
train_x_vect = tfidf.fit_transform(train_x)
test_x1 = tfidf.transform(test_x)
pd.DataFrame.sparse.from_spmatrix(train_x_vect,
                                  index=train_x.index,
                                  columns=tfidf.get_feature_names())
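When I try to convert the sparse matrix (size = 1) to a DataFrame, it raises an error.
The DataFrame x has size = 4064 while my sparse matrix has size = 1, which is why the conversion fails. Any help would be greatly appreciated!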
【Discussion】:
Tags: python nlp vectorization tfidfvectorizer data-preprocessing