【Question title】: Incompatible row dimensions
【Posted】: 2021-06-11 22:54:11
【Question】:

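The task is to encode all the text and categorical features and then combine them again to form the data matrix, but I get an "incompatible row dimensions" error.

My work so far:

Encoding the categorical feature Round with LabelEncoder: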

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()

enc.fit(x_train[' Round'])

round_train_le = enc.transform(x_train[' Round'])
round_test_le = enc.transform(x_test[' Round'])

Encoding the text feature Category with TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer1 = TfidfVectorizer(max_features=500)

vectorizer1.fit(x_train[' Category'])

category_train_enc = vectorizer1.transform(x_train[' Category'])
category_test_enc = vectorizer1.transform(x_test[' Category'])

print(category_train_enc.shape)
print(category_test_enc.shape)

Encoding the text feature Question with TfidfVectorizer:

vectorizer2 = TfidfVectorizer(max_features=5000)

vectorizer2.fit(x_train[' Question'])

question_train_enc = vectorizer2.transform(x_train[' Question'])
question_test_enc = vectorizer2.transform(x_test[' Question'])

print(question_train_enc.shape)
print(question_test_enc.shape)

Encoding the text feature Answer with TfidfVectorizer:

vectorizer3 = TfidfVectorizer(max_features=1000)

vectorizer3.fit(x_train[' Answer'])

answer_train_enc = vectorizer3.transform(x_train[' Answer'])
answer_test_enc = vectorizer3.transform(x_test[' Answer'])

print(answer_train_enc.shape)
print(answer_test_enc.shape)

Combining the encoded features:

from scipy.sparse import hstack
x_tr = hstack((round_train_le, category_train_enc, question_train_enc, answer_train_enc))
x_te = hstack((round_test_le, category_test_enc, question_test_enc, answer_test_enc))

print("Final Data matrix")
print(x_tr.shape, y_train.shape)
print(x_te.shape, y_test.shape)

Then I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-60-12e131ba4df4> in <module>
      1 # merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
      2 from scipy.sparse import hstack
----> 3 x_tr = hstack((round_train_le, category_train_enc, question_train_enc, answer_train_enc))
      4 x_te = hstack((round_test_le, category_test_enc, question_test_enc, answer_test_enc))
      5 

~\anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
    463 
    464     """
--> 465     return bmat([blocks], format=format, dtype=dtype)
    466 
    467 

~\anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
    584                                                     exp=brow_lengths[i],
    585                                                     got=A.shape[0]))
--> 586                     raise ValueError(msg)
    587 
    588                 if bcol_lengths[j] == 0:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 145341, expected 1.

Please suggest what changes I need to make to the code to resolve the error.

【Discussion】:

    Tags: python numpy scikit-learn scipy


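    【Solution 1】:

    When using scipy.sparse.hstack(), you must make sure that all the blocks you are stacking have the same size along dimension 0, i.e. the same number of rows. Note that a 1-D array counts as a single row. See the following example: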

    import numpy as np
    from scipy.sparse import hstack
    
    a = np.array([1, 2, 3, 4, 5])
    b = np.array([1, 2, 3, 5])
    
    c = hstack([a, b])
    print(c)
    

    Output:

      (0, 0)    1
      (0, 1)    2
      (0, 2)    3
      (0, 3)    4
      (0, 4)    5
      (0, 5)    1
      (0, 6)    2
      (0, 7)    3
      (0, 8)    5
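
The two arrays above have different lengths (5 and 4) and still stack without complaint. A minimal sketch of why, assuming a reasonably recent SciPy: the sparse constructors interpret a 1-D array as one row, so both blocks have exactly one row.

```python
import numpy as np
from scipy.sparse import coo_matrix, hstack

a = np.array([1, 2, 3, 4, 5])
b = np.array([1, 2, 3, 5])

# A 1-D array of length n is interpreted as one row with n columns.
shape_a = coo_matrix(a).shape
shape_b = coo_matrix(b).shape
print(a.shape, "->", shape_a)
print(b.shape, "->", shape_b)

# Both blocks have a single row, so they can be placed side by side;
# the result is one row with 5 + 4 columns.
c = hstack([a, b])
print(c.shape)
```

So a matching length is not what matters here; only the number of rows of each block, after this 1-D to 2-D interpretation, has to agree.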
    

    On the other hand, when the numbers of rows do not match, you get the error you received:

    import numpy as np
    from scipy.sparse import hstack
    
    a = np.array([1, 2, 3, 4, 5, 6])
    b = np.array([[1, 2, 3], [4, 5, 6]])
    
    c = hstack([a, b])
    print(c)
    
    

    Output:

    ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 2, expected 1.
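
The traceback in the question ends with "Got blocks[0,1].shape[0] == 145341, expected 1", which suggests that the first block was seen as a single row. One way to locate such a mismatch is to print the row count of every block before stacking. The sketch below uses a hypothetical helper, `row_counts`, and small stand-in data (not the question's dataset); it assumes the first block is 1-D, as the output of `LabelEncoder.transform` is.

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix


def row_counts(**blocks):
    """Hypothetical helper: map each block name to the number of rows
    SciPy's sparse constructors would assign to it."""
    return {name: coo_matrix(block).shape[0] for name, block in blocks.items()}


# Stand-in data with 4 samples (illustrative only).
labels = np.array([0, 2, 1, 0])        # 1-D, like LabelEncoder output
tfidf = csr_matrix(np.ones((4, 3)))    # 2-D sparse, like TfidfVectorizer output

counts = row_counts(labels=labels, tfidf=tfidf)
print(counts)  # {'labels': 1, 'tfidf': 4}
```

The block whose count differs from the number of samples is the one that needs reshaping.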
    

    So you should check that all the items have the same number of rows before stacking them side by side.

    Cheers.
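
For a 1-D label-encoded feature, a likely fix is to turn it into a column with `reshape(-1, 1)` so that it has one row per sample. The sketch below reuses the question's variable names on small made-up data; the values and the 4-sample size are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-ins for the question's variables (4 samples, made-up values).
round_train_le = np.array([0, 2, 1, 0])                  # shape (4,)
category_train_enc = csr_matrix(np.array([
    [0.0, 0.5, 0.0],
    [0.3, 0.0, 0.0],
    [0.0, 0.0, 0.9],
    [0.1, 0.2, 0.0],
]))                                                      # shape (4, 3)

# reshape(-1, 1) makes the 1-D array a column: one row per sample.
round_train_col = round_train_le.reshape(-1, 1)          # shape (4, 1)

# Every block now has 4 rows, so the blocks can be stacked horizontally.
x_tr = hstack((round_train_col, category_train_enc)).tocsr()
print(x_tr.shape)  # (4, 4)
```

Applied to the question's code, the same reshape would be needed for both `round_train_le` and `round_test_le` before they are passed to `hstack`.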

    【Comments】:

    • Thank you for such a clear explanation.
    • My pleasure. Cheers.