Sklearn: FeatureUnion of heterogeneous features gives incompatible row dimensions error with classifier in the pipeline

【Question title】: Sklearn: FeatureUnion of heterogeneous features gives incompatible row dimensions error with classifier in the pipeline
【Posted】: 2017-12-24 22:26:14
【Question】:

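I want to do binary classification based on the different features I have (text and numeric). The training data is a pandas DataFrame. My pipeline looks like this: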

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

final_pipeline = Pipeline([('union', FeatureUnion(
                transformer_list=[('body_trans', Pipeline([('selector', ItemSelector(key='body')),
                                                           ('count_vect', CountVectorizer())])),
                                  ('body_trans2', Pipeline([('selector', ItemSelector(key='body2')),
                                                            ('count_vect', TfidfVectorizer())])),
                                  ('length_trans', Pipeline([('selector', ItemSelector(key='length')),
                                                             ('min_max_scaler', MinMaxScaler())]))],
                transformer_weights={'body_trans': 1.0, 'body_trans2': 1.0, 'length_trans': 1.0})),
                           ('svc', SVC())])

ItemSelector looks like this:

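from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_frame):
        # Double brackets select a one-column DataFrame (2-D), not a Series
        return data_frame[[self.key]]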

Now, when I try final_pipeline.fit(X_train, y_train), it raises ValueError: blocks[0,:] has incompatible row dimensions.

X_train, X_test, y_train, y_test = train_test_split(train_set, target_set)

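is how I obtain the training data. train_set is a DataFrame with the fields body, body2, length, and so on. target_set is a DataFrame with a single field named label, which is the actual label I want to classify.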

Edit:

I think the data I am feeding into the pipeline is not in the right format.

train_set is my training data with the features. Sample:

   body           length  body2
0  blah-blah      193     blah-blah-2
1  blah-blah-blah 153     blah-blah-blah-2

And target_set, which is the DataFrame with the classification labels:

  label
0  True
1  False

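If there is a tutorial on the input format for fitting a Pipeline with DataFrames, please give me a link! I cannot find proper documentation on how to load DataFrames as input to a pipeline when using multiple columns as separate features.

Any help is appreciated!

【Comments on the question】: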

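Please post some sample data along with code that is easy to copy and run, plus the full stack trace of the error.
Added some data samples! Thanks
The problem is in your ItemSelector. It outputs a 2-D DataFrame, but CountVectorizer and TfidfVectorizer require a 1-D array of strings.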

Tags: python pandas scikit-learn classification feature-extraction

【Solution 1】:

The problem is in your ItemSelector. It outputs a 2-D DataFrame, but CountVectorizer and TfidfVectorizer require a 1-D array of strings.

Code showing the output of ItemSelector:

import numpy as np
from pandas import DataFrame
df = DataFrame(columns=['body', 'length', 'body2'],
               data=np.array([['blah-blah', 193, 'blah-blah-2'],
                              ['blah-blah-2', 153, 'blah-blah-blah-2']]))

body_selector = ItemSelector(key='body')
df_body = body_selector.fit_transform(df)

df_body.shape
# (2,1)
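
To make the consequence of that 2-D shape concrete, here is a minimal sketch, assuming a toy DataFrame modelled on the question's sample (the variable names two_d and one_d are illustrative only). Iterating over a DataFrame yields its column labels rather than its rows, so a vectorizer given a one-column DataFrame appears to treat the column name as the only document.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy data modelled on the question's sample (assumed values)
df = pd.DataFrame({"body": ["blah-blah", "blah-blah-blah"],
                   "body2": ["blah-blah-2", "blah-blah-blah-2"]})

# One-column DataFrame: iteration yields the label "body",
# so the vectorizer is fed a single document.
two_d = CountVectorizer().fit_transform(df[["body"]])

# 1-D Series of strings: one document per sample.
one_d = CountVectorizer().fit_transform(df["body"])

print("rows from 2-D input:", two_d.shape[0])
print("rows from 1-D input:", one_d.shape[0])
```

The two row counts differ, which is the kind of mismatch FeatureUnion cannot stack horizontally.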

You can define another class that presents the data to the next step in the correct form.

Add this class to your code:

class Converter(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, data_frame):
        return data_frame.values.ravel()
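
As a quick check of what that transform does, the following sketch applies the same .values.ravel() call to a one-column DataFrame. The toy data is assumed and only mirrors the question's sample.

```python
import pandas as pd

# Toy one-column frame, standing in for ItemSelector's output (assumed values)
selected = pd.DataFrame({"body": ["blah-blah", "blah-blah-blah"]})

# Same operation as Converter.transform: flatten the 2-D values to 1-D
flattened = selected.values.ravel()

print(selected.shape)   # 2-D: (n_samples, 1)
print(flattened.shape)  # 1-D: (n_samples,)
```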

Then define your pipeline like this:

final_pipeline = Pipeline([('union', FeatureUnion(
                transformer_list=[('body_trans', Pipeline([('selector', ItemSelector(key='body')),
                                                           ('converter', Converter()),
                                                          ('count_vect', CountVectorizer())])),
                                  ('body_trans2', Pipeline([('selector', ItemSelector(key='body2')),
                                                            ('converter', Converter()),
                                                          ('count_vect', TfidfVectorizer())])),
                                 ('length_trans', Pipeline([('selector', ItemSelector(key='length')),
                                                           ('min_max_scaler',  MinMaxScaler())]))],
                transformer_weights={'body_trans': 1.0,'body_trans2': 1.0,'length_trans': 1.0})),
                          ('svc', SVC())])

There is no need to add it to the third branch, because MinMaxScaler requires 2-D input.
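
A small sketch of that difference, assuming a toy length column taken from the question's sample: the scaler accepts the one-column DataFrame and rejects a flattened 1-D array with a ValueError. The rejected flag is only there for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy numeric column modelled on the question's sample (assumed values)
df = pd.DataFrame({"length": [193, 153]})

# 2-D input, as produced by ItemSelector: accepted
scaled = MinMaxScaler().fit_transform(df[["length"]])
print(scaled.shape)

# 1-D input, as Converter would produce: rejected by input validation
rejected = False
try:
    MinMaxScaler().fit_transform(df["length"].values)
except ValueError as exc:
    rejected = True
    print("1-D input rejected:", exc)
```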

Feel free to ask if you have any questions.

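【Comments】:

It seems this was the problem! Could you also walk me through the steps you took to debug it? I find it hard to debug this kind of type issue in Python. Thanks!
@void One thing I knew from previous experience is that this error comes from features of different shapes in the FeatureUnion step. So I broke your steps apart and printed the output shape of each inner pipeline in the FeatureUnion. There I found that the first two pipelines were outputting [1,1], while the last one gave [2,1] for the given demo data. I then broke the first two pipelines down further to check their input and output shapes, and that is where I found the problem.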
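
The debugging approach described in that last comment could be sketched roughly as follows. This is an illustration, not the commenter's actual code: the toy DataFrame and the shapes dictionary are assumptions, and the union is rebuilt here without the Converter step so that the mismatch is visible.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler


class ItemSelector(BaseEstimator, TransformerMixin):
    """Selector from the question: returns a one-column (2-D) DataFrame."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_frame):
        return data_frame[[self.key]]


# Toy data modelled on the question's sample (assumed values)
df = pd.DataFrame({"body": ["blah-blah", "blah-blah-blah"],
                   "length": [193, 153],
                   "body2": ["blah-blah-2", "blah-blah-blah-2"]})

# The union from the question, before the fix
union = FeatureUnion([
    ("body_trans", Pipeline([("selector", ItemSelector(key="body")),
                             ("count_vect", CountVectorizer())])),
    ("body_trans2", Pipeline([("selector", ItemSelector(key="body2")),
                              ("count_vect", TfidfVectorizer())])),
    ("length_trans", Pipeline([("selector", ItemSelector(key="length")),
                               ("min_max_scaler", MinMaxScaler())])),
])

# Fit each branch on its own and record the shape it produces;
# every branch must return the same number of rows to be stackable.
shapes = {}
for name, branch in union.transformer_list:
    shapes[name] = branch.fit_transform(df).shape
    print(name, shapes[name])
```

Any branch whose row count differs from the number of samples is the one to split further, step by step, in the same way.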