[Question Title]: Problem when branching an sklearn Pipeline into a GridSearchCV
[Posted]: 2020-08-08 10:05:07
[Question]:

I am trying to build a pipeline with my own functions. To do so, I inherited BaseEstimator and TransformerMixin from sklearn.base and defined my own transform methods.

When I call pipeline.fit(X, y), it works fine.

The problem comes when I try to create a GridSearchCV object with the pipeline. I get the following error: ValueError: operands could not be broadcast together with shapes (730,36) (228,) (730,36).

730 is just the number of rows of the matrix X divided by 'cv' = 2, the number of folds I chose for cross-validation in GridSearchCV.

I have no idea how to debug this. I tried some prints in the middle of my functions, and the results are quite strange.

I am attaching the functions I created as well as the pipeline. I would be very glad if anyone can help.

Here are the functions I created for the pipeline:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

class MissingData(BaseEstimator, TransformerMixin):


    def fit( self, X, y = None  ):
        return self

    def transform(self, X , y = None, strategies = ( "most_frequent", "mean") ):
        print('Started MissingData')
        X_ = X.copy()

        #Categorical Variables handling
        categorical_variables = list(X_.select_dtypes(include=['category','object']))
        imp_category = SimpleImputer(strategy = strategies[0])
        X_[categorical_variables] = pd.DataFrame(imp_category.fit_transform(X_[categorical_variables]))


        #Numeric variables handling
        numerical_variables = list(set(X_.columns) - set(categorical_variables))
        imp_numerical = SimpleImputer(strategy = strategies[1])
        X_[numerical_variables] = pd.DataFrame(imp_numerical.fit_transform(X_[numerical_variables]))
        print('Finished MissingData')


        print('Inf: ',X_.isnull().sum().sum())
        return X_

class OHEncode(BaseEstimator, TransformerMixin):
    def fit(self, X, y = None  ):
        return self

    def encode_and_drop_original_and_first_dummy(self,df, feature_to_encode):
        dummies = pd.get_dummies(df[feature_to_encode], prefix = feature_to_encode, drop_first=True) #drop_first=True takes care of the dummy variable trap
        res = pd.concat([df, dummies], axis=1)
        res = res.drop([feature_to_encode], axis=1)
        return(res) 

    def transform(self, X , y = None, categorical_variables  = None ):
        X_ = X.copy()
        if categorical_variables is None:
            categorical_variables  = list(X_.select_dtypes(include=['category','object']))
        print('Started Encoding')
        #Let's update the matrix X with the one-hot encoded version of all features in categorical_variables
        for feature_to_encode in categorical_variables:
            X_ = self.encode_and_drop_original_and_first_dummy(X_ , feature_to_encode)
        print('Finished Encoding')
        print('Inf: ',X_.isnull().sum().sum())
        return X_

Here is the pipeline with GridSearchCV:

from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pca = PCA(n_components=10)
pipeline = Pipeline([('MissingData', MissingData()), ('OHEncode', OHEncode()),
          ('scaler', StandardScaler()), ('pca', pca), ('rf', LinearRegression())])

parameters = {'pca__n_components': [5, 15, 30, 45, 64]}

grid = GridSearchCV(pipeline, param_grid=parameters, cv = 2)
grid.fit(X, y)

And finally the full output, including my prints and the error:

Started MissingData
Finished MissingData
Inf:  57670
Started Encoding
Finished Encoding
Inf:  26280
Started MissingData
Finished MissingData
Inf:  0
Started Encoding
C:\Users\menoci\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\extmath.py:765: RuntimeWarning: invalid value encountered in true_divide
  updated_mean = (last_sum + new_sum) / updated_sample_count
C:\Users\menoci\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\extmath.py:706: RuntimeWarning: Degrees of freedom <= 0 for slice.
  result = op(x, *args, **kwargs)
C:\Users\menoci\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

  FitFailedWarning)
Finished Encoding
Inf:  0
Started MissingData
Finished MissingData
Inf:  57670
Started Encoding
Finished Encoding
Inf:  26280
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-67-f78b56dad89d> in <module>
     15 
     16 #pipeline.set_params(rf__n_estimators = 50)
---> 17 grid.fit(X, y)
     18 
     19 #rf_val_predictions = pipeline.predict(X)

~\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    710                 return results
    711 
--> 712             self._run_search(evaluate_candidates)
    713 
    714         # For multi-metric evaluation, store the best_index_, best_params_ and

~\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
   1151     def _run_search(self, evaluate_candidates):
   1152         """Search all candidates in param_grid"""
-> 1153         evaluate_candidates(ParameterGrid(self.param_grid))
   1154 
   1155 

~\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params)
    689                                for parameters, (train, test)
    690                                in product(candidate_params,
--> 691                                           cv.split(X, y, groups)))
    692 
    693                 if len(out) < 1:

~\AppData\Roaming\Python\Python37\site-packages\joblib\parallel.py in __call__(self, iterable)
   1005                 self._iterating = self._original_iterator is not None
   1006 
-> 1007             while self.dispatch_one_batch(iterator):
   1008                 pass
   1009 

~\AppData\Roaming\Python\Python37\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
    833                 return False
    834             else:
--> 835                 self._dispatch(tasks)
    836                 return True
    837 

~\AppData\Roaming\Python\Python37\site-packages\joblib\parallel.py in _dispatch(self, batch)
    752         with self._lock:
    753             job_idx = len(self._jobs)
--> 754             job = self._backend.apply_async(batch, callback=cb)
    755             # A job can complete so quickly than its callback is
    756             # called before we get here, causing self._jobs to

~\AppData\Roaming\Python\Python37\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
    207     def apply_async(self, func, callback=None):
    208         """Schedule a func to be run"""
--> 209         result = ImmediateResult(func)
    210         if callback:
    211             callback(result)

~\AppData\Roaming\Python\Python37\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
    588         # Don't delay the application, to avoid keeping the input
    589         # arguments in memory
--> 590         self.results = batch()
    591 
    592     def get(self):

~\AppData\Roaming\Python\Python37\site-packages\joblib\parallel.py in __call__(self)
    254         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    255             return [func(*args, **kwargs)
--> 256                     for func, args, kwargs in self.items]
    257 
    258     def __len__(self):

~\AppData\Roaming\Python\Python37\site-packages\joblib\parallel.py in <listcomp>(.0)
    254         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    255             return [func(*args, **kwargs)
--> 256                     for func, args, kwargs in self.items]
    257 
    258     def __len__(self):

~\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
    542     else:
    543         fit_time = time.time() - start_time
--> 544         test_scores = _score(estimator, X_test, y_test, scorer)
    545         score_time = time.time() - start_time - fit_time
    546         if return_train_score:

~\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_validation.py in _score(estimator, X_test, y_test, scorer)
    589         scores = scorer(estimator, X_test)
    590     else:
--> 591         scores = scorer(estimator, X_test, y_test)
    592 
    593     error_msg = ("scoring must return a number, got %s (%s) "

~\AppData\Roaming\Python\Python37\site-packages\sklearn\metrics\_scorer.py in __call__(self, estimator, *args, **kwargs)
     87                                       *args, **kwargs)
     88             else:
---> 89                 score = scorer(estimator, *args, **kwargs)
     90             scores[name] = score
     91         return scores

~\AppData\Roaming\Python\Python37\site-packages\sklearn\metrics\_scorer.py in _passthrough_scorer(estimator, *args, **kwargs)
    369 def _passthrough_scorer(estimator, *args, **kwargs):
    370     """Function that wraps estimator.score"""
--> 371     return estimator.score(*args, **kwargs)
    372 
    373 

~\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

~\AppData\Roaming\Python\Python37\site-packages\sklearn\pipeline.py in score(self, X, y, sample_weight)
    611         Xt = X
    612         for _, name, transform in self._iter(with_final=False):
--> 613             Xt = transform.transform(Xt)
    614         score_params = {}
    615         if sample_weight is not None:

~\AppData\Roaming\Python\Python37\site-packages\sklearn\preprocessing\_data.py in transform(self, X, copy)
    804         else:
    805             if self.with_mean:
--> 806                 X -= self.mean_
    807             if self.with_std:
    808                 X /= self.scale_

ValueError: operands could not be broadcast together with shapes (730,36) (228,) (730,36) 

[Comments]:

  • Each of your transformers produces an array with different dimensions. So I suggest you run each one independently and check the output dimensions (e.g. x = MissingData() then x.fit(...)).
  • Thanks for your reply, @Ghanem. If that were the case, pipeline.fit alone shouldn't work either, right? But it does. Performing the transformations separately works, like this: X = MissingData().transform(X); X = OHEncode().transform(X); X = StandardScaler().fit_transform(X); X = pca.fit_transform(X); rf = LinearRegression(); rf.fit(X, y). What would you recommend in this case? Thanks!
  • Hard to tell with the information given. I suggest you try the following: 1) move PCA(n_components=10) inside the pipeline and check whether it works. 2) remove StandardScaler and apply GS again; next remove OHEncode and test.
  • The problem is indeed in OHEncode. It has nothing to do with PCA or StandardScaler. I think I know the reason: in OHEncode I one-hot encode all the categorical features. The problem is that, since cross-validation trains on only part of the data, some categorical values may never appear in the training fold and therefore never get encoded, so there is a problem when we try to predict on them. Do you have any suggestion on how to handle this? I am probably not the first person to face it. Should I give up doing this part of the processing in a Pipeline?
  • Great, now the problem is clear. I suggest you edit your question and add these final details about the error, for better archiving ;).
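The diagnosis in the penultimate comment is easy to reproduce: pd.get_dummies emits one column per category present in the slice it is given, so two folds can yield matrices of different widths. One way out (a sketch with a hypothetical helper of my own naming, not code from the thread) is to pin the category set from the full data before encoding:

```python
import pandas as pd

def lock_categories(df, full_df):
    # Hypothetical helper: cast object columns to categorical dtypes whose
    # category sets come from the FULL data, so pd.get_dummies always emits
    # the same dummy columns, whatever slice it is given.
    out = df.copy()
    for col in full_df.select_dtypes(include=['object', 'category']).columns:
        out[col] = pd.Categorical(out[col],
                                  categories=pd.Categorical(full_df[col]).categories)
    return out

full = pd.DataFrame({'city': ['NY', 'LA', 'SF', 'NY'], 'x': [1, 2, 3, 4]})
fold = full.iloc[:2]                                    # this "fold" never sees 'SF'
naive = pd.get_dummies(fold, columns=['city'])          # only city_LA, city_NY
fixed = pd.get_dummies(lock_categories(fold, full), columns=['city'])  # all three
```

With locked categories the fold still gets a city_SF column (all zeros), so the matrix width no longer depends on which rows the fold happens to contain.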

Tags: python scikit-learn pipeline gridsearchcv


[Solution 1]:

First, I hope you are using the OneHotEncoder (OHE) class from sklearn. Then, in OHEncode's constructor, define an OHE object and fit it on all the categorical values you have (to make them all "visible" in every GridSearch iteration). Then, in OHEncode's transform function, apply the transformation using that OHE object.

Don't put the OHE fitting inside the fit function, because then you would hit the same error: at every GridSearch iteration, both the fit and transform functions are applied.
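A minimal sketch of that idea (the class shape and names are my own, not code from the answer): the encoder is fitted once, on the full data, inside the constructor, and transform only applies it.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

class OHEncode(BaseEstimator, TransformerMixin):
    """Sketch: the encoder is fitted once, on the FULL data, in the
    constructor, so every category is 'visible' to every GridSearch fold."""

    def __init__(self, full_X):
        self.full_X = full_X
        self.cat_cols = list(full_X.select_dtypes(include=['category', 'object']))
        self.ohe = OneHotEncoder(handle_unknown='ignore')
        self.ohe.fit(full_X[self.cat_cols])

    def fit(self, X, y=None):
        return self  # nothing to (re)fit per fold

    def transform(self, X, y=None):
        X_ = X.copy()
        dummies = self.ohe.transform(X_[self.cat_cols]).toarray()
        numeric = X_.drop(columns=self.cat_cols).to_numpy(dtype=float)
        # Column count now depends only on the full data, not on the fold.
        return np.hstack([numeric, dummies])
```

Because the column set is fixed at construction time, StandardScaler and PCA downstream see the same shape in every fold, which is exactly what the broadcast error was complaining about.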

[Comments]:

  • Good idea. One problem, though: sklearn's OneHotEncoder does not handle missing values. And I am starting to think I don't really know how Pipeline works. When I tried your suggestion, I ran into the following issue:
  • This works: X_ = MissingData().transform(X); pipeline = Pipeline([('OHEncode', OHEncode(X_)), ('scaler', StandardScaler()), ('reduce_dim', PCA()), ('rf', RandomForestRegressor(random_state=1))]). But the following does not: X_ = MissingData().transform(X); pipeline = Pipeline([('missingdata', MissingData()), ('OHEncode', OHEncode(X_)), ('scaler', StandardScaler()), ('reduce_dim', PCA()), ('rf', RandomForestRegressor(random_state=1))]). If I have already imputed the missing values, adding MissingData to my pipeline should make no difference, yet I get the error: 'Input contains NaN'.
  • Impute the missing data with the sklearn imputer as well, here: scikit-learn.org/stable/modules/generated/… In OHEncode's constructor, create a pipeline using both OneHotEncoder and SimpleImputer and just fit it .. apply the transform later.
  • I am already using SimpleImputer. Why should I do both the one-hot encoding and the imputing in the OHEncode constructor? Do you have any idea why my approach doesn't work? As far as I have tested, my MissingData() function works fine; it always returns data without any missing values.
  • I hadn't noticed that in your code. Did you try putting the OneHotEncoder inside the constructor .. following my solution?
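The impute-then-encode pipeline suggested in the comments above can be sketched like this (variable names are mine): fit a SimpleImputer + OneHotEncoder pipeline once on the full categorical frame, then only transform inside each fold.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

full = pd.DataFrame({'city': ['NY', 'LA', np.nan, 'SF']})

# Fitted once, up front, so every category is known to every CV fold
# and missing values are filled before encoding.
cat_prep = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])
cat_prep.fit(full)

fold = full.iloc[:2]                          # this fold never sees 'SF'
encoded = cat_prep.transform(fold).toarray()  # still one column per known category
```

Since the inner pipeline is fitted outside the cross-validation loop, the imputation statistics and the dummy columns are identical in every fold; only transform runs per fold.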