Incompatible row dimensions: answers

【Question title】: Incompatible row dimensions
【Posted】: 2021-06-11 22:54:11
【Question】:

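The task is to encode all the text and categorical features and then combine them again into a single data matrix, but I get an "incompatible row dimensions" error.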

My work so far:

Encoding the categorical feature with LabelEncoder

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()

enc.fit(x_train[' Round'])

round_train_le = enc.transform(x_train[' Round'])
round_test_le = enc.transform(x_test[' Round'])

Encoding the text feature Category with TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer1 = TfidfVectorizer(max_features=500)

vectorizer1.fit(x_train[' Category'])

category_train_enc = vectorizer1.transform(x_train[' Category'])
category_test_enc = vectorizer1.transform(x_test[' Category'])

print(category_train_enc.shape)
print(category_test_enc.shape)

Encoding the text feature Question with TfidfVectorizer

vectorizer2 = TfidfVectorizer(max_features=5000)

vectorizer2.fit(x_train[' Question'])

question_train_enc = vectorizer2.transform(x_train[' Question'])
question_test_enc = vectorizer2.transform(x_test[' Question'])

print(question_train_enc.shape)
print(question_test_enc.shape)

Encoding the text feature Answer with TfidfVectorizer

vectorizer3 = TfidfVectorizer(max_features=1000)

vectorizer3.fit(x_train[' Answer'])

answer_train_enc = vectorizer3.transform(x_train[' Answer'])
answer_test_enc = vectorizer3.transform(x_test[' Answer'])

print(answer_train_enc.shape)
print(answer_test_enc.shape)

Combining the encoded features:

from scipy.sparse import hstack
x_tr = hstack((round_train_le, category_train_enc, question_train_enc, answer_train_enc))
x_te = hstack((round_test_le, category_test_enc, question_test_enc, answer_test_enc))

print("Final Data matrix")
print(x_tr.shape, y_train.shape)
print(x_te.shape, y_test.shape)

Then I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-60-12e131ba4df4> in <module>
      1 # merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
      2 from scipy.sparse import hstack
----> 3 x_tr = hstack((round_train_le, category_train_enc, question_train_enc, answer_train_enc))
      4 x_te = hstack((round_test_le, category_test_enc, question_test_enc, answer_test_enc))
      5 

~\anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
    463 
    464     """
--> 465     return bmat([blocks], format=format, dtype=dtype)
    466 
    467 

~\anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
    584                                                     exp=brow_lengths[i],
    585                                                     got=A.shape[0]))
--> 586                     raise ValueError(msg)
    587 
    588                 if bcol_lengths[j] == 0:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 145341, expected 1.

Please suggest what changes I need to make to the code to resolve the error.

【Comments on the question】:

Tags: python numpy scikit-learn scipy

【Answer 1】:

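When using scipy.sparse.hstack(), you must make sure that all the blocks you are stacking have the same size along axis 0, i.e. the same number of rows. See the following example: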

import numpy as np
from scipy.sparse import hstack

a = np.array([1, 2, 3, 4, 5])
b = np.array([1, 2, 3, 5])

c = hstack([a, b])
print(c)

Output:

  (0, 0)    1
  (0, 1)    2
  (0, 2)    3
  (0, 3)    4
  (0, 4)    5
  (0, 5)    1
  (0, 6)    2
  (0, 7)    3
  (0, 8)    5
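The example above succeeds even though a and b have different lengths, because a 1-D NumPy array is treated as a single row when it is converted to a sparse matrix, so both blocks have exactly one row. The short sketch below makes that visible; using coo_matrix directly for the conversion is an illustrative assumption about what happens to each block, not a quote from the SciPy source.

```python
import numpy as np
from scipy.sparse import coo_matrix

a = np.array([1, 2, 3, 4, 5])   # 1-D array, shape (5,)
b = np.array([1, 2, 3, 5])      # 1-D array, shape (4,)

# Converting a 1-D array to a sparse matrix gives a matrix with ONE row.
a_sparse = coo_matrix(a)
b_sparse = coo_matrix(b)

print(a.shape, "->", a_sparse.shape)  # (5,) -> (1, 5)
print(b.shape, "->", b_sparse.shape)  # (4,) -> (1, 4)
```

This is consistent with the output shown above, where every entry sits in row 0 and the columns run from 0 to 8.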

On the other hand, when the number of rows does not match, you get the error you received:

import numpy as np
from scipy.sparse import hstack

a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([[1, 2, 3], [4, 5, 6]])

# b comes first (2 rows); the 1-D array a is then treated as a single row
c = hstack([b, a])
print(c)

Output:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 2.

So you should check that all the items have the same number of rows, so that they can be stacked side by side.
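For the case in the question, the error message says the first block has 1 row while the next one has 145341. That points at the LabelEncoder output, which is a 1-D array and is therefore treated as a single row. A minimal sketch of one possible fix follows, assuming the label array only needs to be reshaped into a column; the small arrays here are made-up stand-ins for the real encoded features.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

n_rows = 4

# Stand-in for a LabelEncoder output: 1-D integer array, shape (n_rows,).
round_le = np.array([0, 2, 1, 0])

# Stand-in for a TfidfVectorizer output: sparse matrix, shape (n_rows, 3).
category_enc = csr_matrix(np.array([
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.3, 0.0],
]))

# Reshape the labels into one column so the block has n_rows rows.
round_col = csr_matrix(round_le.reshape(-1, 1))

blocks = (round_col, category_enc)

# Every block must have the same number of rows before stacking.
assert all(block.shape[0] == n_rows for block in blocks)

x = hstack(blocks).tocsr()
print(x.shape)  # (4, 4)
```

Applied to the question's variables, this would mean reshaping round_train_le and round_test_le with .reshape(-1, 1) before passing them to hstack. Wrapping the column in csr_matrix is a cautious choice here, so that every block is already a 2-D sparse matrix.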

Cheers.

【Comments】:

Thank you for such a clear explanation.
My pleasure. Cheers.