Python, sklearn: Order of Pipeline operation with MinMaxScaler and SVC

【Question title】: Python, sklearn: Order of Pipeline operation with MinMaxScaler and SVC
【Posted】: 2016-08-03 18:01:00
【Question description】:

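I have a dataset on which I want to run sklearn's SVC model. Some of the feature values have magnitudes in the range [0, 1e+7]. I tried using SVC without preprocessing, but I either get unacceptably long compute times or 0 true positive predictions. So I am trying to implement a preprocessing step, in particular MinMaxScaler.

My code so far: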

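from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# feature_min, feature_max, features_train and labels_train are defined elsewhere.
selection_KBest = SelectKBest()
selection_PCA = PCA()
combined_features = FeatureUnion([("pca", selection_PCA),
                                  ("univ_select", selection_KBest)])
param_grid = dict(features__pca__n_components=list(range(feature_min, feature_max)),
                  features__univ_select__k=list(range(feature_min, feature_max)))
svm = SVC()
pipeline = Pipeline([("features", combined_features),
                     ("scale", MinMaxScaler(feature_range=(0, 1))),
                     ("svm", svm)])
param_grid["svm__C"] = [0.1, 1, 10]
# Written against sklearn.model_selection; the 2016 API took y and n_iter
# in the StratifiedShuffleSplit constructor instead of n_splits.
cv = StratifiedShuffleSplit(n_splits=10,
                            test_size=0.1,
                            random_state=42)
grid_search = GridSearchCV(pipeline,
                           param_grid=param_grid,
                           verbose=1,
                           cv=cv)
grid_search.fit(features_train, labels_train)
print("(grid_search.best_estimator_): ", grid_search.best_estimator_)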

My question concerns these lines:

pipeline = Pipeline([("features", combined_features), 
                     ("scale", MinMaxScaler(feature_range=(0, 1))),
                     ("svm", svm)])

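I would like to know what the best logic is for my program, and what the order of features, scale and svm in the pipeline should be. Specifically, I cannot decide whether features and scale should be swapped from how they are now.

Note 1: I want to use grid_search.best_estimator_ as my classifier model for making predictions.

Note 2: My concern is the proper way to formulate the pipeline so that, at the prediction step, features are selected and scaled the same way as they were in the training step.

Note 3: I noticed that svm does not appear in my grid_search.best_estimator_ result. Does that mean it is not being called properly?

Here are some results suggesting that the order may matter: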

pipeline = Pipeline([("scale", MinMaxScaler(feature_range=(0, 1))),
                     ("features", combined_features), 
                     ("svm", svm)]):

Pipeline(steps=[('scale', MinMaxScaler(copy=True, feature_range=(0, 1)))
('features', FeatureUnion(n_jobs=1, transformer_list=[('pca', PCA(copy=True, 
n_components=11, whiten=False)), ('univ_select', SelectKBest(k=2, 
score_func=<function f_classif at 0x000000001ED61208>))], 
transformer_weights=...f', max_iter=-1, probability=False, 
random_state=None, shrinking=True, tol=0.001, verbose=False))])

Accuracy: 0.86247   Precision: 0.38947  Recall: 0.05550 
F1: 0.09716 F2: 0.06699 Total predictions: 15000    
True positives:  111    False positives:  174   
False negatives: 1889   True negatives: 12826


pipeline = Pipeline([("features", combined_features),
                     ("scale", MinMaxScaler(feature_range=(0, 1))), 
                     ("svm", svm)]):

Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
transformer_list=[('pca', PCA(copy=True, n_components=1, whiten=False)), 
('univ_select', SelectKBest(k=1, score_func=<function f_classif at   
0x000000001ED61208>))],
transformer_weights=None)), ('scale', MinMaxScaler(copy=True, feature_range=
(0,...f', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False))])

Accuracy: 0.86680   Precision: 0.50463  Recall: 0.05450 
F1: 0.09838 F2: 0.06633 Total predictions: 15000    
True positives:  109    False positives:  107   
False negatives: 1891   True negatives: 12893

Edit 1 16041310: Note 3 is resolved. Use grid_search.best_estimator_.steps to get the full list of steps.

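【Comments】:

The SVM is there, but it seems to be hidden by whatever puts the ... in the output. max_iter=-1, probability=False are parameters of SVC.
Thanks for the tip, @joeln. Do you know how to get the full, untruncated printout?
There is currently no way to get a full, untruncated printout, I realise: it is hard-coded in BaseEstimator.__repr__: github.com/scikit-learn/scikit-learn/blob/master/sklearn/… Of course, you can repr each step individually...
@maxymoo's answer below seems to help reveal the full printout, i.e. "grid_search.best_estimator_.steps" (untruncated) versus "grid_search.best_estimator_" (truncated).
That is only because each step is shorter than 500 characters.

Tags: python machine-learning scikit-learn svm pipeline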

【Solution 1】:

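GridSearchCV has a refit parameter (True by default), which means the best estimator is refit on the full dataset passed to fit. You can then access this estimator through best_estimator_, or simply call predict on the GridSearchCV object, which delegates to it.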

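best_estimator_ will be the complete pipeline, and if you call predict on it you get the same preprocessing steps as in the training stage, using the parameters learned during training.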
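
To illustrate the point above, here is a minimal sketch, assuming a recent scikit-learn (sklearn.model_selection) and synthetic data from make_classification. The variable names, the reduced pipeline and the small grid are illustrative, not taken from the question. It checks that predict on the fitted GridSearchCV matches predict on best_estimator_, and that the scaler inside the refit pipeline learned its minimum from the training data only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the question's data (assumption, not the real dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X[:, 0] *= 1e7  # mimic one feature with a very large range
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

pipeline = Pipeline([("scale", MinMaxScaler()), ("svm", SVC())])
grid_search = GridSearchCV(pipeline, param_grid={"svm__C": [0.1, 1, 10]}, cv=3)
grid_search.fit(X_train, y_train)  # refit=True by default

# Both calls run the test data through the fitted scaler, then the fitted SVC.
pred_via_search = grid_search.predict(X_test)
pred_via_best = grid_search.best_estimator_.predict(X_test)
same_predictions = np.array_equal(pred_via_search, pred_via_best)

# The scaler's statistics come from the training data it was refit on.
fitted_scaler = grid_search.best_estimator_.named_steps["scale"]
scaler_uses_train_min = np.allclose(fitted_scaler.data_min_, X_train.min(axis=0))
print(same_predictions, scaler_uses_train_min)
```

Because scaling lives inside the pipeline, there is no separate transform call to remember at prediction time.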

If you want to print out all the steps, you can do

print(grid_search.best_estimator_.steps)

or

# Each entry in steps is a (name, estimator) tuple, so unpack it first.
for name, step in grid_search.best_estimator_.steps:
    print(name, type(step))
    print(step.get_params())
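
A single step can also be looked up by name, which is one way to confirm that the SVC really is part of the pipeline. The sketch below assumes the step names from the question and builds an unfitted pipeline with default parameters so that it runs on its own; in the thread's setting you would use grid_search.best_estimator_ in place of the pipeline variable.

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in for grid_search.best_estimator_ (assumption: same step names).
pipeline = Pipeline([
    ("features", FeatureUnion([("pca", PCA()), ("univ_select", SelectKBest())])),
    ("scale", MinMaxScaler(feature_range=(0, 1))),
    ("svm", SVC()),
])

step_names = [name for name, _ in pipeline.steps]
final_step = pipeline.named_steps["svm"]  # look up one step by its name
print(step_names)
print(type(final_step).__name__, final_step.get_params()["C"])
```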

【Discussion】:

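Thanks, @maxymoo. However, I am still unsure about my main dilemma: what is the best logic for the order of features, scale and svm in the pipeline?
Well, I would say you did the right thing by trying both and comparing the accuracy... from your results it seems it doesn't really matter... a 0.4% difference in accuracy is not very significant. But you might as well go with the one that has the higher accuracy?
I agree that in this case a small difference in accuracy is better than the alternative, and it is reassuring. But if the difference were large, the order of the pipeline would matter, so implementing and understanding the correct pipeline order would be crucial! That is the point of asking this question.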
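
One way to compare the two orders under identical splits is sketched below. It assumes synthetic data from make_classification with one feature blown up to roughly 1e7, fixed n_components=3 and k=3 instead of a grid search, and a helper make_union that is hypothetical. One reason the order can matter: PCA picks directions of largest variance, so on unscaled data its components tend to be dominated by the large-range feature, whereas the f_classif scores used by SelectKBest do not change when a single feature is rescaled.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic data (assumption); one feature gets a very large range.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=42)
X[:, 0] *= 1e7


def make_union():
    """Hypothetical helper: a fresh PCA + SelectKBest union for each pipeline."""
    return FeatureUnion([("pca", PCA(n_components=3)),
                         ("univ_select", SelectKBest(k=3))])


# Order A: scale first, so PCA sees features on a common [0, 1] range.
scale_first = Pipeline([("scale", MinMaxScaler()),
                        ("features", make_union()),
                        ("svm", SVC())])
# Order B (as in the question): PCA sees the raw, unscaled features.
features_first = Pipeline([("features", make_union()),
                           ("scale", MinMaxScaler()),
                           ("svm", SVC())])

# Same splitter and seed, so both pipelines are scored on identical splits.
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.1, random_state=42)
scores = {
    "scale_first": cross_val_score(scale_first, X, y, cv=cv),
    "features_first": cross_val_score(features_first, X, y, cv=cv),
}
for name, fold_scores in scores.items():
    print(name, fold_scores.mean())
```

The scores depend on the data, so this only shows how to run the comparison; it does not say which order wins on the question's dataset.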