【Question Title】: List of parameters in sklearn RandomizedSearchCV like GridSearchCV?
【Posted】: 2017-08-23 12:49:03
【Question Description】:

I have a case where I want to test multiple models, but the models do not all have the same named parameters. How would you use a list of pipeline parameters with RandomizedSearchCV, the way you can with GridSearchCV in this example?

Example from:
https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2

pipe = Pipeline([
    # the reduce_dim stage is populated by the param_grid
    ('reduce_dim', None),
    ('classify', LinearSVC())
])

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]

grid = GridSearchCV(pipe, cv=3, n_jobs=2, param_grid=param_grid)
digits = load_digits()
grid.fit(digits.data, digits.target)

【Comments】:

  • Did you ever find a solution?
  • Unfortunately, I never found one already implemented. Implementing it myself seems less difficult to me now, though. You would need to create a function that accepts a dictionary of input parameters (probably a dict keyed by model, where each value is a dict of that model's parameters) and returns the CV score. You would probably want to set up the CV train/test splits first, so that every experiment uses the same data. Then I think you just need to build an iterator over random permutations of the parameters, call the evaluation function, and store the results.
  • "I want to test multiple models, but the models do not all have the same named parameters." Your example code does not demonstrate that requirement.
  • I see. You want to search over different transformers. The way I did this was to make wrapper classes for the transformers, each with a boolean `enabled` parameter, and then include them all in the pipeline. If a transformer wrapper is not enabled, its fit/transform does nothing. I can post the code if you like.
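
The wrapper idea in the last comment can be sketched as follows. This is a minimal illustration, not code from the thread; the class name `OptionalTransformer` is assumed:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class OptionalTransformer(BaseEstimator, TransformerMixin):
    """Wrap a transformer; when `enabled` is False, fit/transform are no-ops."""

    def __init__(self, transformer, enabled=True):
        self.transformer = transformer
        self.enabled = enabled

    def fit(self, X, y=None):
        if self.enabled:
            self.transformer.fit(X, y)
        return self

    def transform(self, X):
        if self.enabled:
            return self.transformer.transform(X)
        return X  # disabled: pass the data through unchanged
```

Because `enabled` is a constructor parameter, it can be toggled from a search grid like any other pipeline parameter (e.g. `reduce_dim__enabled`).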

Tags: python scikit-learn


【Solution 1】:

I found a way that relies on duck typing and does not get too much in the way.

It relies on passing complete estimators as parameters to the pipeline: we first sample the model type, then its parameters. For that we define two classes that can be sampled:

import numpy as np

from sklearn.model_selection import ParameterSampler


class EstimatorSampler:
    """
    Class that holds a model and its parameters distribution.
    When sampled, the parameters are first sampled and set to the model, 
    which is returned.

    # Arguments
    ===========
    model : sklearn.base.BaseEstimator
    param_distributions : dict
        Input to ParameterSampler

    # Returns
    =========
    sampled : sklearn.base.BaseEstimator
    """
    def __init__(self, model, param_distributions):
        self.model = model
        self.param_distributions = param_distributions

    def rvs(self, random_state=None):
        sampled_params = next(iter(
            ParameterSampler(self.param_distributions, 
                             n_iter=1, 
                             random_state=random_state)))
        return self.model.set_params(**sampled_params)


class ListSampler:
    """
    List container that when sampled, returns one of its item, 
    with probabilities defined by `probs`.

    # Arguments
    ===========
    items : 1-D array-like
    probs : 1-D array-like of floats
        If not None, it should be the same length of `items`
        and sum to 1.

    # Returns
    =========
    sampled item
    """
    def __init__(self, items, probs=None):
        self.items = items
        self.probs = probs

    def rvs(self, random_state=None):
        # honor the random_state that RandomizedSearchCV passes in,
        # so the sampling is reproducible
        rng = (random_state if isinstance(random_state, np.random.RandomState)
               else np.random.RandomState(random_state))
        item = rng.choice(self.items, p=self.probs)
        if hasattr(item, 'rvs'):
            return item.rvs(random_state=random_state)
        return item

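The duck-typing contract these classes rely on is that `ParameterSampler` (and hence `RandomizedSearchCV`) calls `.rvs(random_state=...)` on any parameter value that exposes an `rvs` method, and treats everything else as a list to choose from. A minimal self-contained illustration of that contract (the `Coin` class is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import ParameterSampler


class Coin:
    """Any object with an `rvs` method is treated as a distribution."""

    def rvs(self, random_state=None):
        # random_state may arrive as None, an int, or a RandomState instance
        rng = (random_state if isinstance(random_state, np.random.RandomState)
               else np.random.RandomState(random_state))
        return rng.choice(['heads', 'tails'])


# ParameterSampler calls Coin.rvs once per iteration
samples = list(ParameterSampler({'flip': Coin()}, n_iter=3, random_state=0))
```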
The rest of the code is defined below.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC
    from sklearn.decomposition import PCA, NMF
    from sklearn.feature_selection import SelectKBest, chi2

    pipe = Pipeline([
        # both stages are populated from param_dist below
        ('reduce_dim', None),
        ('classify', None)
    ])

    N_FEATURES_OPTIONS = [2, 4, 8]
    dim_reducers = ListSampler([EstimatorSampler(est, {'n_components': N_FEATURES_OPTIONS})
                                for est in [PCA(iterated_power=7), NMF()]] + 
                               [EstimatorSampler(SelectKBest(chi2), {'k': N_FEATURES_OPTIONS})])

    C_OPTIONS = [1, 10, 100, 1000]
    classifiers = EstimatorSampler(LinearSVC(), {'C': C_OPTIONS})

    param_dist = {
        'reduce_dim': dim_reducers, 
        'classify': classifiers
    }

    grid = RandomizedSearchCV(pipe, cv=3, n_jobs=2, scoring='accuracy', param_distributions=param_dist)
    digits = load_digits()
    grid.fit(digits.data, digits.target)

【Discussion】:

【Solution 2】:

Hyperopt supports hyperparameter tuning across multiple estimators; see this wiki for more details (section 2.2, "Example search spaces: scikit-learn").

If you want to do this with sklearn's GridSearch, check out this post. It suggests implementing an EstimatorSelectionHelper that can run different estimators, each with its own parameter grid.
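
The helper from that post amounts to running one `GridSearchCV` per estimator, each with its own grid, and comparing the results afterwards. A minimal sketch of that idea (class and method names here are assumptions, not the post's exact code):

```python
from sklearn.model_selection import GridSearchCV


class EstimatorSelectionHelper:
    """Run a separate GridSearchCV for each named estimator."""

    def __init__(self, models, params):
        self.models = models    # {name: estimator}
        self.params = params    # {name: param_grid for that estimator}
        self.searches = {}

    def fit(self, X, y, **gs_kwargs):
        for name, model in self.models.items():
            search = GridSearchCV(model, self.params[name], **gs_kwargs)
            search.fit(X, y)
            self.searches[name] = search
        return self

    def best(self):
        # (name, fitted search) with the highest cross-validated score
        return max(self.searches.items(), key=lambda kv: kv[1].best_score_)
```

Because each estimator gets its own grid, the "not all models have the same named parameters" problem disappears, at the cost of an exhaustive rather than randomized search.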

【Discussion】:

【Solution 3】:

This is an old question, but it was resolved a while ago (I am not sure from which scikit-learn version onward).

You can now pass a list of dictionaries to RandomizedSearchCV in the param_distributions argument. Your example code becomes:

      import numpy as np
      import matplotlib.pyplot as plt
      from sklearn.datasets import load_digits
      from sklearn.model_selection import RandomizedSearchCV
      from sklearn.pipeline import Pipeline
      from sklearn.svm import LinearSVC
      from sklearn.decomposition import PCA, NMF
      from sklearn.feature_selection import SelectKBest, chi2
      
      pipe = Pipeline([
          # the reduce_dim stage is populated by the param_grid
          ('reduce_dim', None),
          ('classify', LinearSVC())
      ])
      
      N_FEATURES_OPTIONS = [2, 4, 8]
      C_OPTIONS = [1, 10, 100, 1000]
      param_grid = [
          {
              'reduce_dim': [PCA(iterated_power=7), NMF()],
              'reduce_dim__n_components': N_FEATURES_OPTIONS,
              'classify__C': C_OPTIONS
          },
          {
              'reduce_dim': [SelectKBest(chi2)],
              'reduce_dim__k': N_FEATURES_OPTIONS,
              'classify__C': C_OPTIONS
          },
      ]
      
      grid = RandomizedSearchCV(pipe, cv=3, n_jobs=2, param_distributions=param_grid)
      digits = load_digits()
      grid.fit(digits.data, digits.target)
      

I am using sklearn version 0.23.1.
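
One thing a list of dicts does not give you by itself is continuous distributions, which are the main draw of RandomizedSearchCV: any scipy distribution object can be mixed in alongside lists. A short sketch of that (assumes scipy >= 1.4 for `scipy.stats.loguniform`; only the PCA branch of the example is shown to keep it brief):

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([
    ('reduce_dim', PCA(iterated_power=7)),
    ('classify', LinearSVC(max_iter=5000)),
])

param_dist = {
    'reduce_dim__n_components': [2, 4, 8],   # discrete choice, sampled uniformly
    'classify__C': loguniform(1, 1000),      # continuous, log-uniform over [1, 1000]
}

search = RandomizedSearchCV(pipe, param_dist, n_iter=3, cv=2, random_state=0)
X, y = load_digits(return_X_y=True)
search.fit(X[:300], y[:300])  # small subset just to keep the demo fast
```

Each of the `n_iter` candidates draws `C` from the continuous distribution instead of a fixed grid of values.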

【Discussion】:
