【Posted】:2018-11-30 03:21:47
【Question】:
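I have a DataFrame with 14 columns. I am using custom transformers to
- select the desired columns from my DataFrame (five of the fourteen),
- select columns of a specific data type (category, object, int, etc.),
- preprocess the columns according to their type.
My custom ColumnSelector transformer is: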
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        try:
            return X[self.columns]
        except KeyError:
            # Report exactly which requested columns are missing.
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include the columns: %s" % cols_error)
It is followed by a custom type selector:
class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])
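For reference, here is a minimal sketch of what `select_dtypes` hands back, on a made-up three-row frame (the values are invented; only the column names and dtypes follow the ones described in this post). The point it illustrates is that the result is always a two-dimensional DataFrame, even when a single column matches.

```python
import pandas as pd

# Hypothetical toy frame; dtypes are set explicitly so they match the post.
toy = pd.DataFrame({
    "meeting_subject_stem_sentence": pd.Series(
        ["budget review", "team sync", "budget plan"], dtype="object"),
    "day_of_week": pd.Series(["Mon", "Tue", "Mon"], dtype="category"),
    "meeting_time_mins": pd.Series([30, 60, 45], dtype="int64"),
})

# Each call returns a DataFrame (2-D), never a Series.
text_part = toy.select_dtypes(include=["object"])
cat_part = toy.select_dtypes(include=["category"])
int_part = toy.select_dtypes(include=["int64"])

print(type(text_part).__name__, text_part.shape)
```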
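The original DataFrame from which I select the desired columns is df_with_types; it has 981 rows. The columns I want to extract and their data types are listed below:
meeting_subject_stem_sentence: 'object', priority_label_stem_sentence: 'object', attendees: 'category', day_of_week: 'category', meeting_time_mins: 'int64'
I then build my pipeline as follows: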
preprocess_pipeline = make_pipeline(
    ColumnSelector(columns=['meeting_subject_stem_sentence',
        'attendees', 'day_of_week', 'meeting_time_mins', 'priority_label_stem_sentence']),
    FeatureUnion(transformer_list=[
        ("integer_features", make_pipeline(
            TypeSelector('int64'),
            StandardScaler()
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            OneHotEnc()
        )),
        ("text_features", make_pipeline(
            TypeSelector("object"),
            TfidfVectorizer(stop_words=stopWords)
        ))
    ])
)
The error thrown when I fit the pipeline to the data is:
preprocess_pipeline.fit_transform(df_with_types)
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,2].shape[0] == 2, expected 981.
I have a hunch that the TF-IDF vectorizer is causing this. Fitting only the TfidfVectorizer, without the FeatureUnion...
the_pipe = Pipeline([('col_sel', ColumnSelector(columns=['meeting_subject_stem_sentence',
'attendees', 'day_of_week', 'meeting_time_mins', 'priority_label_stem_sentence'])),
('type_selector', TypeSelector('object')), ('tfidf', TfidfVectorizer())])
When I fit the_pipe:
a = the_pipe.fit_transform(df_with_types)
This gives me a 2×2 matrix instead of one with 981 rows.
(0, 0) 1.0
(1, 1) 1.0
Calling the feature-names method through named_steps, I get:
the_pipe.named_steps['tfidf'].get_feature_names()
[u'meeting_subject_stem_sentence', u'priority_label_stem_sentence']
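The output above can be reproduced in isolation. The sketch below uses a made-up two-column text frame (the column names `subject` and `priority` and all values are invented for illustration) and shows that iterating over a DataFrame yields its column labels, so a vectorizer given the whole frame sees one "document" per column, whereas a single column (a Series) is iterated value by value.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical text frame; values are made up for illustration.
frame = pd.DataFrame({
    "subject": pd.Series(["budget review", "team sync", "budget plan"], dtype="object"),
    "priority": pd.Series(["high", "low", "high"], dtype="object"),
})

# Iterating a DataFrame yields column labels, not rows.
docs_seen = list(frame)

# Fitting on the frame treats each column label as one document.
on_frame = TfidfVectorizer().fit_transform(frame)

# Fitting on a single column iterates over the cell values instead.
on_series = TfidfVectorizer().fit_transform(frame["subject"])

print(docs_seen, on_frame.shape, on_series.shape)
```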
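It seems to be fitting only on the column names rather than iterating over the documents. How can I make it iterate over the documents within the pipeline above? Also, if I want to apply a pairwise distance/similarity function to each feature as part of the pipeline, after ColumnSelector and TypeSelector, what do I have to do?
An example would be...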
preprocess_pipeline = make_pipeline(
    ColumnSelector(columns=['meeting_subject_stem_sentence',
        'attendees', 'day_of_week', 'meeting_time_mins', 'priority_label_stem_sentence']),
    FeatureUnion(transformer_list=[
        ("integer_features", make_pipeline(
            TypeSelector('int64'),
            StandardScaler(),
            'Pairwise manhattan distance between each element of the integer feature'
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            OneHotEnc(),
            'Pairwise dice coefficient here'
        )),
        ("text_features", make_pipeline(
            TypeSelector("object"),
            TfidfVectorizer(stop_words=stopWords),
            'Pairwise cosine similarity here'
        ))
    ])
)
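One way the placeholder strings above might be turned into real pipeline steps is a small transformer that stores the data seen in `fit` and compares incoming rows against it. This is only a sketch under assumptions: `PairwiseTransformer` is a hypothetical helper (not part of scikit-learn), the input data is made up, and the Dice step is left out. `manhattan_distances` and `cosine_similarity` are existing functions in `sklearn.metrics.pairwise`.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, manhattan_distances
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class PairwiseTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical helper: compares each row of X with the rows seen in fit."""

    def __init__(self, metric):
        self.metric = metric  # a callable such as manhattan_distances

    def fit(self, X, y=None):
        self.X_fit_ = X  # keep the training rows as the reference set
        return self

    def transform(self, X):
        # Result has shape (n_rows_in_X, n_rows_seen_in_fit).
        return self.metric(X, self.X_fit_)


# Made-up inputs standing in for one integer column and one text column.
minutes = np.array([[30.0], [60.0], [45.0]])
texts = pd.Series(["budget review", "team sync", "budget plan"], dtype="object")

int_pipe = make_pipeline(StandardScaler(), PairwiseTransformer(manhattan_distances))
text_pipe = make_pipeline(TfidfVectorizer(), PairwiseTransformer(cosine_similarity))

int_dist = int_pipe.fit_transform(minutes)   # 3 x 3 distance matrix
text_sim = text_pipe.fit_transform(texts)    # 3 x 3 similarity matrix
```

Note that each branch then outputs one column per training row, so a FeatureUnion of such branches grows with the size of the training set.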
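Please help. As a beginner I have been racking my brain over this, to no avail. I have gone through zac_stewart's blog and many similar posts, but none of them seems to cover how to use TF-IDF together with a TypeSelector or ColumnSelector. Thank you very much for any help. I hope I have stated the question clearly.
Edit 1:
If I use a TextSelector transformer, as follows...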
class TextSelector(BaseEstimator, TransformerMixin):
    """ Transformer that selects text column from DataFrame by key."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        '''Create X attribute to be transformed'''
        return self

    def transform(self, X, y=None):
        '''the key passed here indicates column name'''
        return X[self.key]
text_processing_pipe_line_1 = Pipeline([('selector', TextSelector(key='meeting_subject')), ('text_1', TfidfVectorizer(stop_words=stopWords))])
t = text_processing_pipe_line_1.fit_transform(df_with_types)
(0, 656) 0.378616399898
(0, 75) 0.378616399898
(0, 117) 0.519159384271
(0, 545) 0.512337545421
(0, 223) 0.425773433566
(1, 154) 0.5
(1, 137) 0.5
(1, 23) 0.5
(1, 355) 0.5
(2, 656) 0.497937369182
This works, and it does iterate over the documents. So would it be correct to make TypeSelector return a Series? Thanks again for the help.
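Since a dtype-based selector can match more than one text column, a possible alternative is to give every text column its own vectorizer. The sketch below swaps the custom selectors for scikit-learn's `ColumnTransformer` (available in `sklearn.compose` since version 0.20); this is a substitution, not the approach used in this post. Passing a column name as a plain string hands the transformer a one-dimensional Series, which is what `TfidfVectorizer` expects. The toy data is made up, and `OneHotEncoder` is assumed to stand in for `OneHotEnc`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame using column names from the post; values are invented.
df = pd.DataFrame({
    "meeting_subject_stem_sentence": pd.Series(
        ["budget review", "team sync", "budget plan"], dtype="object"),
    "priority_label_stem_sentence": pd.Series(["high", "low", "high"], dtype="object"),
    "day_of_week": pd.Series(["Mon", "Tue", "Mon"], dtype="category"),
    "meeting_time_mins": pd.Series([30, 60, 45], dtype="int64"),
})

preprocess = ColumnTransformer([
    # A string column spec passes a 1-D Series to the vectorizer.
    ("subject_tfidf", TfidfVectorizer(), "meeting_subject_stem_sentence"),
    ("priority_tfidf", TfidfVectorizer(), "priority_label_stem_sentence"),
    # A list column spec passes a 2-D DataFrame, as these transformers expect.
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["day_of_week"]),
    ("integer", StandardScaler(), ["meeting_time_mins"]),
])

features = preprocess.fit_transform(df)
print(features.shape)  # one row per input row
```

The same idea should carry over to the FeatureUnion layout, with one TextSelector plus TfidfVectorizer branch per text column.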
【Discussion】:
Tags: python scikit-learn pipeline