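【Posted】: 2021-06-11 22:54:11
【Question】:
The task is to encode all text and categorical features and then combine them into a single data matrix, but I get an "incompatible row dimensions" error.
My work so far:
Encoding the categorical feature with LabelEncoder: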
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
enc.fit(x_train[' Round'])
round_train_le = enc.transform(x_train[' Round'])
round_test_le = enc.transform(x_test[' Round'])
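A quick shape check on the label-encoded feature is worth doing at this point, in the same way the TF-IDF outputs are printed further down. The sketch below uses a small hypothetical array in place of x_train[' Round'] (the values are made up for illustration) and shows that LabelEncoder returns a 1-D array rather than a column:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for x_train[' Round']; the real column is not shown in the post.
rounds = np.array(["Jeopardy!", "Double Jeopardy!", "Final Jeopardy!", "Jeopardy!"])

encoded = LabelEncoder().fit_transform(rounds)

# LabelEncoder yields one integer per sample in a 1-D array, shape (n_samples,),
# whereas TfidfVectorizer yields a 2-D sparse matrix, shape (n_samples, n_features).
print(encoded.shape)
print(encoded.ndim)
```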
Encoding the text feature Category with TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer1 = TfidfVectorizer(max_features=500)
vectorizer1.fit(x_train[' Category'])
category_train_enc = vectorizer1.transform(x_train[' Category'])
category_test_enc = vectorizer1.transform(x_test[' Category'])
print(category_train_enc.shape)
print(category_test_enc.shape)
Encoding the text feature Question with TfidfVectorizer:
vectorizer2 = TfidfVectorizer(max_features=5000)
vectorizer2.fit(x_train[' Question'])
question_train_enc = vectorizer2.transform(x_train[' Question'])
question_test_enc = vectorizer2.transform(x_test[' Question'])
print(question_train_enc.shape)
print(question_test_enc.shape)
Encoding the text feature Answer with TfidfVectorizer:
vectorizer3 = TfidfVectorizer(max_features=1000)
vectorizer3.fit(x_train[' Answer'])
answer_train_enc = vectorizer3.transform(x_train[' Answer'])
answer_test_enc = vectorizer3.transform(x_test[' Answer'])
print(answer_train_enc.shape)
print(answer_test_enc.shape)
Combining the encoded features:
from scipy.sparse import hstack
x_tr = hstack((round_train_le, category_train_enc, question_train_enc, answer_train_enc))
x_te = hstack((round_test_le, category_test_enc, question_test_enc, answer_test_enc))
print("Final Data matrix")
print(x_tr.shape, y_train.shape)
print(x_te.shape, y_test.shape)
Then I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-60-12e131ba4df4> in <module>
1 # merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
2 from scipy.sparse import hstack
----> 3 x_tr = hstack((round_train_le, category_train_enc, question_train_enc, answer_train_enc))
4 x_te = hstack((round_test_le, category_test_enc, question_test_enc, answer_test_enc))
5
~\anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
463
464 """
--> 465 return bmat([blocks], format=format, dtype=dtype)
466
467
~\anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
584 exp=brow_lengths[i],
585 got=A.shape[0]))
--> 586 raise ValueError(msg)
587
588 if bcol_lengths[j] == 0:
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 145341, expected 1.
Please suggest what changes I need to make to the code to resolve this error.
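One likely cause, judging from the message "expected 1": the first block passed to hstack is the 1-D LabelEncoder output, which scipy appears to have interpreted as a single row, while the TF-IDF blocks have 145341 rows. A minimal sketch of a possible fix is to reshape the label array into a single column before stacking. The data below are small made-up stand-ins, not the original dataset, and the names round_le, category_enc and round_col are illustrative only:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-ins for the question's data (assumed shapes, 4 rows instead of 145341).
round_le = np.array([0, 2, 1, 0])          # 1-D, shape (4,), like LabelEncoder output
category_enc = csr_matrix(np.eye(4, 3))    # sparse, shape (4, 3), like TfidfVectorizer output

# Turn the labels into an explicit (n_samples, 1) sparse column so that
# every block passed to hstack has the same number of rows.
round_col = csr_matrix(round_le.reshape(-1, 1))

x = hstack((round_col, category_enc)).tocsr()
print(x.shape)
```

Applied to the code in the question, this would mean passing round_train_le.reshape(-1, 1) and round_test_le.reshape(-1, 1) to hstack in place of the raw 1-D arrays; this is an untested suggestion, since the original data are not available here.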
【Discussion】:
Tags: python numpy scikit-learn scipy