[Posted]: 2017-12-16 08:19:04
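[Question]:
I am trying to predict whether a Kickstarter project will succeed, based on a set of integer features and some text features. I am thinking of building a pipeline along the following lines.
Here is my ItemSelector and the pipeline code: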
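from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select one or more columns from a DataFrame by key."""
    def __init__(self, keys):
        self.keys = keys
    def fit(self, x, y=None):
        # Nothing to learn; selection is stateless.
        return self
    def transform(self, data_dict):
        return data_dict[self.keys]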
I verified that ItemSelector works as expected:
t = ItemSelector(['cleaned_text'])
t.transform(df)
It extracts the requested columns.
The pipeline:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    # Use FeatureUnion to combine the text features and the integer features
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for pulling TF-IDF features from the cleaned text
            ('text', Pipeline([
                ('selector', ItemSelector(['cleaned_text'])),
                ('counts', CountVectorizer()),
                ('tf_idf', TfidfTransformer())
            ])),
            # Pass the integer columns through unchanged
            ('integer_features', ItemSelector(int_features)),
        ]
    )),
    # Use an SVC classifier on the combined features
    ('svc', SVC(kernel='linear')),
])
But when I run pipeline.fit(X_train, y_train), I get the error below. Any idea how to fix it?
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-27-317e1c402966> in <module>()
----> 1 pipeline.fit(X_train, y_train)
~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
266 This estimator
267 """
--> 268 Xt, fit_params = self._fit(X, y, **fit_params)
269 if self._final_estimator is not None:
270 self._final_estimator.fit(Xt, y, **fit_params)
~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
232 pass
233 elif hasattr(transform, "fit_transform"):
--> 234 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
235 else:
236 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \
~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
740 self._update_transformer_list(transformers)
741 if any(sparse.issparse(f) for f in Xs):
--> 742 Xs = sparse.hstack(Xs).tocsr()
743 else:
744 Xs = np.hstack(Xs)
~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
456
457 """
--> 458 return bmat([blocks], format=format, dtype=dtype)
459
460
~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
577 exp=brow_lengths[i],
578 got=A.shape[0]))
--> 579 raise ValueError(msg)
580
581 if bcol_lengths[j] == 0:
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 81096, expected 1.
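The "expected 1" in the message suggests that the text branch produced a single row. One plausible cause, offered as a guess rather than a confirmed diagnosis, is that ItemSelector(['cleaned_text']) returns a one-column DataFrame, and iterating over a DataFrame yields its column names instead of its rows. The sketch below reproduces that behaviour on hypothetical toy data; the sample strings and the `goal` column are made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the real DataFrame.
df = pd.DataFrame({
    "cleaned_text": ["great board game", "new smart watch", "indie film project"],
    "goal": [1000, 5000, 2000],
})

# Selecting with a list keeps a DataFrame. Iterating over a DataFrame yields
# its column names, so CountVectorizer sees one "document": "cleaned_text".
as_frame = CountVectorizer().fit_transform(df[["cleaned_text"]])

# Selecting with a plain string returns a Series. Iterating over a Series
# yields one string per row, which is what CountVectorizer expects.
as_series = CountVectorizer().fit_transform(df["cleaned_text"])

print("list key   ->", as_frame.shape)   # one row, regardless of len(df)
print("string key ->", as_series.shape)  # one row per sample
```

If the first shape has one row while the second has one row per sample, the mismatch in the traceback comes from stacking that single text row next to the integer block.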
[Comments]:
-
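You should post the full error stack trace. You can also use TfidfVectorizer on its own in place of CountVectorizer followed by TfidfTransformer. One more thing: make sure the data returned by ItemSelector is two-dimensional, with shape (n_samples, n_features).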
-
Can you post some sample data that reproduces the error?
-
Also, what is the output shape of the integer_features ItemSelector? Something seems off there.
-
These are the shapes before the train/test split: (108129, 7).
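
Putting the comments together, a minimal sketch of a corrected pipeline might look like the following. It assumes the text column should be selected with a string key (so the vectorizer receives a Series of documents), keeps the list key for the integer columns (so that branch stays two-dimensional), and uses TfidfVectorizer as suggested above. The toy data and the column names `backers` and `duration_days` are hypothetical. The class is written mixin-first, which is the base-class order scikit-learn's documentation recommends; it is otherwise the same selector as in the question.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC


class ItemSelector(TransformerMixin, BaseEstimator):
    """Select one column (string key) or several columns (list key)."""

    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.keys]


# Hypothetical toy data; the real X_train has many more rows and columns.
X_train = pd.DataFrame({
    "cleaned_text": ["great board game", "new smart watch",
                     "indie film project", "handmade leather wallet"],
    "backers": [12, 3, 40, 5],
    "duration_days": [30, 45, 30, 60],
})
y_train = [1, 0, 1, 0]
int_features = ["backers", "duration_days"]

pipeline = Pipeline([
    ("union", FeatureUnion(transformer_list=[
        ("text", Pipeline([
            # String key -> Series of documents, one per sample.
            ("selector", ItemSelector("cleaned_text")),
            ("tfidf", TfidfVectorizer()),
        ])),
        # List key -> 2-D block of shape (n_samples, n_features).
        ("integer_features", ItemSelector(int_features)),
    ])),
    ("svc", SVC(kernel="linear")),
])

pipeline.fit(X_train, y_train)

# Both branches now have one row per sample, so they stack cleanly.
features = pipeline.named_steps["union"].transform(X_train)
print(features.shape)
```

The only change relative to the question's pipeline that matters for the error is the selector key for the text branch; swapping in TfidfVectorizer is a simplification, not a fix.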
Tags: python pandas numpy machine-learning scikit-learn