Is it possible to cache the computationally expensive .fit() method of a scikit-learn estimator? Answers

【Question title】: Possibility to cache computationally expensive .fit() method of a scikit-learn estimator?
【Posted】: 2022-01-14 04:47:56
【Question description】:

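I am using sklearn.ensemble.BaggingClassifier to fit 1000 estimators on my data. I would like to know whether the .fit() method of this class can be cached, so that after the script has run once, later runs can simply reuse the stored output of .fit() instead of refitting.

Take this as an example: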

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

rng = np.random.RandomState(42)
X, y = make_classification(n_samples=1000,n_features=4,n_informative=2,
                           n_redundant=0,random_state=rng,shuffle=False)

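# initialize bagging classifier (the estimator is passed positionally because
# the `base_estimator` keyword was renamed to `estimator` in scikit-learn 1.2)
clf = BaggingClassifier(LogisticRegression(),n_estimators=1000,random_state=rng)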
result = clf.fit(X,y).estimators_[0].coef_

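Edit:

The answer below seems to address how to cache results during the fitting process, i.e. how to reuse precomputed results from already-fitted classifiers while fitting n estimators. What I am looking for instead is a way to cache the whole process. I wonder whether sklearn.pipeline.Pipeline could be used for this? The modified code would look like this (if this is correct, I would be glad to get feedback):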

from sklearn.pipeline import Pipeline
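# initialize bagging classifier (the estimator is passed positionally because
# the `base_estimator` keyword was renamed to `estimator` in scikit-learn 1.2)
clf = BaggingClassifier(LogisticRegression(),n_estimators=1000,random_state=rng)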
pipe = Pipeline([('clf',clf)],memory='./cache')
result = pipe.fit(X,y)._final_estimator.estimators_[0].coef_
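
A caveat on the Pipeline snippet above: as far as I know, the memory argument of Pipeline only caches fitted transformers and never the last step, so a pipeline whose only step is the classifier would still refit on every run. One possible alternative is to wrap the fit in a function cached with joblib.Memory. The sketch below is only an illustration under assumptions: the function name fit_bagging, the integer seed, the temporary cache directory and the reduced n_estimators=20 (to keep it fast) are all made up for the example and are not from the original post.

```python
import tempfile

import numpy as np
from joblib import Memory
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical cache directory; a real script would use a persistent
# path such as "./cache" so the result survives between sessions.
memory = Memory(location=tempfile.mkdtemp(), verbose=0)


@memory.cache
def fit_bagging(X, y, n_estimators, seed):
    """Fit the ensemble; joblib stores the result keyed on the arguments."""
    clf = BaggingClassifier(LogisticRegression(),
                            n_estimators=n_estimators, random_state=seed)
    return clf.fit(X, y)


X, y = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42, shuffle=False)

# First call: the model is fitted and written to the cache directory.
clf_first = fit_bagging(X, y, n_estimators=20, seed=42)
# Second call with identical arguments: the fitted model is read from disk.
clf_second = fit_bagging(X, y, n_estimators=20, seed=42)

coef = clf_second.estimators_[0].coef_
```

Calling fit_bagging with different data or different arguments triggers a fresh fit, since the cache key is derived from the inputs.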

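【Comments on the question】:

Could you be more specific about what you mean by caching the fit() method? When you fit on the same dataset you can use warm_start, but if you want to fit on different batches of data, partial_fit works.
Isn't it enough to have the fitted classifier? If not, in what way?
Sorry, could you be more specific? I'm not sure I understand your question.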

Tags: python caching scikit-learn

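【Solution 1】:

You can use warm_start to reuse the solution of the previous call to fit and add more estimators to the ensemble.

With this approach you can compute the estimators in batches and save the current model state after each batch.

Here is an example of warm_start: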

from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

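# The estimator is passed positionally because the `base_estimator`
# keyword was renamed to `estimator` in scikit-learn 1.2.
clf = BaggingClassifier(SVC(),
                        n_estimators=10, random_state=0, warm_start=True)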

for i in range(5):
    clf.fit(X, y)
    print(f'Iteration {i} score with {clf.n_estimators}: {clf.score(X, y)}')
    clf.n_estimators += 10

This outputs the following:

Iteration 0 score with 10: 0.92
Iteration 1 score with 20: 0.93
Iteration 2 score with 30: 0.92
Iteration 3 score with 40: 0.92
Iteration 4 score with 50: 0.92

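【Discussion】:

But if I understand correctly, this does not cache the whole fitting process, does it? In other words, if you fit 1000 estimators on a dataset, then close and restart the Python session and call .fit() again, it will not reuse the previously computed results, right?
If you want to save your progress and resume fitting later, you should serialize your model before restarting Python. You can then load the model and continue the training process.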
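
The serialize-then-reload idea from the last comment could look roughly like the sketch below, here using joblib.dump and joblib.load. The helper name fit_or_load, the file name bagging_clf.joblib and the temporary directory are assumptions made for illustration; in a real script the path would be a fixed location that survives a restart.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)


def fit_or_load(model_path, X, y, n_estimators=10):
    """Load a fitted model from disk if present, otherwise fit and save it."""
    model_path = Path(model_path)
    if model_path.exists():
        return joblib.load(model_path)
    clf = BaggingClassifier(SVC(), n_estimators=n_estimators,
                            random_state=0, warm_start=True)
    clf.fit(X, y)
    joblib.dump(clf, model_path)
    return clf


# Hypothetical location; a temporary directory keeps the example self-contained.
model_path = Path(tempfile.mkdtemp()) / "bagging_clf.joblib"

clf = fit_or_load(model_path, X, y)        # first run: fits and saves
clf_again = fit_or_load(model_path, X, y)  # later run: loads from disk

# Because warm_start=True was stored with the model, training can continue:
clf_again.n_estimators += 10
clf_again.fit(X, y)                        # only the 10 new estimators are fitted
joblib.dump(clf_again, model_path)         # save the updated state
```

Note that this sketch does not check whether X and y have changed since the model was saved, and that pickled models should only be loaded from trusted sources and ideally with the same scikit-learn version that wrote them.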