[Question title]: Sklearn FeatureUnion returns TypeError: no supported conversion for types: (dtype('int64'), dtype('O'))
[Posted]: 2020-08-19 05:03:22
[Question]:

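I am trying to union two pipelines:

  • pipeline_1 returns a sparse matrix of float64
  • pipeline_2 returns the original column (str) as a pandas DataFrame (returning a Series instead leads to the error ValueError: blocks[0,:] has incompatible row dimensions.)

When doing this, I get the error:

TypeError: no supported conversion for types: (dtype('int64'), dtype('O'))

My goal is to find a general way to keep the original columns of the DataFrame in the pipeline so that a classifier can use them later.

Code: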

import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion


class ColumnSelector(BaseEstimator, TransformerMixin):

    def __init__(self, key, transform_function=None):
        self.key = key
        self.transform_function = transform_function

    def fit(self, X, y=None, *parg, **kwarg):
        return self

    def transform(self, X):
        result = X[self.key]
        if self.transform_function:
            result = self.transform_function(result)
        return result


data = [
    {'col1': 'hello my friend', 'col2': 'somestring_'},
    {'col1': 'my friend', 'col2': 'somestring__'},
    {'col1': 'hello friend', 'col2': 'somestring___'}
]
df = pd.DataFrame(data)



pipeline_1 = Pipeline([
    ('selector', ColumnSelector(key='col1')),
    ('vectorizer', CountVectorizer())
])

pipeline_2 = Pipeline([
    ('test', ColumnSelector(key='col2'))#, transform_function=lambda col: col.to_frame())),
])

feats = FeatureUnion([('count_vectorize', pipeline_1), ('original_column', pipeline_2)])

feats.fit_transform(df)

[Comments]:

    Tags: python types scikit-learn


    [Solution 1]:

    FeatureUnion joins the outputs of its transformers using numpy or scipy sparse operations. As a result, no step inside a FeatureUnion may return non-numeric values.
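
To illustrate the point above outside of scikit-learn, here is a minimal sketch that stacks a sparse count matrix next to a column of strings with `scipy.sparse.hstack`. The arrays `counts`, `strings`, and `lengths` are made up for this example; they only mimic what the two pipelines in the question return. The exact exception type may differ between SciPy versions, so the sketch catches both `TypeError` and `ValueError`.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for the CountVectorizer output (int64 counts).
counts = sparse.csr_matrix(
    np.array([[1, 1, 1], [0, 1, 1], [1, 1, 0]], dtype=np.int64)
)

# Hypothetical stand-in for the raw string column (object dtype).
strings = np.array(
    [["somestring_"], ["somestring__"], ["somestring___"]], dtype=object
)

# Stacking a numeric block with an object block is expected to fail.
error = None
try:
    sparse.hstack([counts, strings])
except (TypeError, ValueError) as exc:
    error = exc
print("stacking strings failed with:", type(error).__name__)

# A numeric column (here: the string lengths) stacks without problems.
lengths = np.array([[len(row[0])] for row in strings], dtype=np.int64)
stacked = sparse.hstack([counts, lengths])
print(stacked.shape)
```

The second call succeeds because every block is numeric, which is the same reason the length-based transform in this answer works.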

    If I change your pipeline_2 to return the number of characters in each string, it starts working.

    Note: you can also use ColumnTransformer from sklearn.compose.

    import pandas as pd
    
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import Pipeline, FeatureUnion
    
    
    class ColumnSelector(BaseEstimator, TransformerMixin):
    
        def __init__(self, key, transform_function=None):
            self.key = key
            self.transform_function = transform_function
    
        def fit(self, X, y=None, *parg, **kwarg):
            return self
    
        def transform(self, X):
            result = X[self.key]
            if self.transform_function:
                result = self.transform_function(result)
            return result
    
    
    data = [
        {'col1': 'hello my friend', 'col2': 'somestring_'},
        {'col1': 'my friend', 'col2': 'somestring__'},
        {'col1': 'hello friend', 'col2': 'somestring___'}
    ]
    df = pd.DataFrame(data)
    
    
    
    pipeline_1 = Pipeline([
        ('selector', ColumnSelector(key='col1')),
        ('vectorizer', CountVectorizer())
    ])
    
    pipeline_2 = Pipeline([
        # Return one numeric feature per row: the length of each string in col2.
        ('test', ColumnSelector(key='col2', transform_function=lambda x: [[len(i)] for i in x]))
    ])
    
    feats = FeatureUnion([('count_vectorize', pipeline_1), ('original_column', pipeline_2)])
    
    feats.fit_transform(df)
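
As a rough sketch of the ColumnTransformer route mentioned in the note, the example below vectorizes col1 and one-hot encodes col2 so that its values reach the classifier in numeric form. Using OneHotEncoder here is an assumption made for illustration, not something the answer prescribes; any transformer that returns numeric output would do. The step names are arbitrary.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame([
    {'col1': 'hello my friend', 'col2': 'somestring_'},
    {'col1': 'my friend', 'col2': 'somestring__'},
    {'col1': 'hello friend', 'col2': 'somestring___'},
])

column_transformer = ColumnTransformer([
    # A scalar column name passes a 1-D Series, which CountVectorizer expects.
    ('count_vectorize', CountVectorizer(), 'col1'),
    # A list of column names passes a 2-D frame, which OneHotEncoder expects.
    ('original_column', OneHotEncoder(handle_unknown='ignore'), ['col2']),
])

features = column_transformer.fit_transform(df)

# The result may be sparse or dense depending on its density, so normalize it.
dense = features.toarray() if sparse.issparse(features) else np.asarray(features)
print(dense.shape)
```

With this layout no custom ColumnSelector class is needed, since ColumnTransformer handles the column selection itself.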
    
