【Posted】:2020-08-19 05:03:22
【Question】:
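I am trying to combine two pipelines with a FeatureUnion:
- pipeline_1 returns a sparse matrix of float64
- pipeline_2 returns the original column (str) as a pandas DataFrame (returning a Series instead raises ValueError: blocks[0,:] has incompatible row dimensions)
When I do this, I get the error:
TypeError: no supported conversion for types: (dtype('int64'), dtype('O'))
My goal is to find a generic way to keep an original DataFrame column in the pipeline so that a classifier can use it later.
Code: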
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion


class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key, transform_function=None):
        self.key = key
        self.transform_function = transform_function

    def fit(self, X, y=None, *parg, **kwarg):
        return self

    def transform(self, X):
        result = X[self.key]
        if self.transform_function:
            result = self.transform_function(result)
        return result


data = [
    {'col1': 'hello my friend', 'col2': 'somestring_'},
    {'col1': 'my friend', 'col2': 'somestring__'},
    {'col1': 'hello friend', 'col2': 'somestring___'}
]
df = pd.DataFrame(data)

pipeline_1 = Pipeline([
    ('selector', ColumnSelector(key='col1')),
    ('vectorizer', CountVectorizer())
])
pipeline_2 = Pipeline([
    ('test', ColumnSelector(key='col2'))  # , transform_function=lambda col: col.to_frame())),
])

feats = FeatureUnion([('count_vectorize', pipeline_1), ('original_column', pipeline_2)])
feats.fit_transform(df)
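
For comparison, here is a minimal sketch of one possible direction. It is not taken from the post. It assumes that col2 can be treated as a categorical column and one-hot encoded, so that every block being stacked is numeric. Sparse matrices cannot hold object (str) values, which is consistent with the TypeError above. The sketch swaps the custom ColumnSelector and FeatureUnion for scikit-learn's ColumnTransformer, and the variable names `features` and `combined` are illustrative only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame([
    {'col1': 'hello my friend', 'col2': 'somestring_'},
    {'col1': 'my friend', 'col2': 'somestring__'},
    {'col1': 'hello friend', 'col2': 'somestring___'},
])

# Assumption: col2 is categorical, so a one-hot encoding is an acceptable
# numeric stand-in for the raw strings.
features = ColumnTransformer([
    # A scalar column name passes a 1-D Series, which CountVectorizer expects.
    ('count_vectorize', CountVectorizer(), 'col1'),
    # A list of column names passes a 2-D DataFrame, which OneHotEncoder expects.
    ('original_column', OneHotEncoder(handle_unknown='ignore'), ['col2']),
])

# 3 vocabulary terms from col1 plus 3 categories from col2.
combined = features.fit_transform(df)
print(combined.shape)  # (3, 6)
```

If the classifier really needs the raw strings rather than an encoding of them, this approach does not help, and the column would have to be carried alongside the feature matrix instead of inside it.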
【Discussion】:
Tags: python types scikit-learn