Combining random forest models in scikit-learn: answers

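【Question title】: Combining random forest models in scikit learn
【Posted】: 2015-04-13 21:48:30
【Question description】:

I have two RandomForestClassifier models, and I would like to combine them into one meta model. Both were trained on similar, but different, data. How can I do this?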

rf1 #this is my first fitted RandomForestClassifier object, with 250 trees
rf2 #this is my second fitted RandomForestClassifier object, also with 250 trees

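I want to create big_rf, with all the trees combined into one 500-tree model.

【Question discussion】:

Tags: python python-2.7 scikit-learn classification random-forest

【Solution 1】:

I believe this is possible by modifying the estimators_ and n_estimators attributes of the RandomForestClassifier object. Each tree in the forest is stored as a DecisionTreeClassifier object, and the list of these trees is stored in the estimators_ attribute. To keep the object consistent, it also makes sense to update n_estimators so that it matches the new number of trees.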
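As a quick check of the two attributes described above, here is a minimal sketch. It assumes a recent scikit-learn and Python 3; the dataset and the small tree count are chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# estimators_ is a plain Python list holding one fitted tree per estimator
trees = rf.estimators_
print(type(trees).__name__, len(trees), rf.n_estimators)
print(all(isinstance(t, DecisionTreeClassifier) for t in trees))
```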

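The advantage of this approach is that you can build a bunch of small forests in parallel across multiple machines and then combine them.

Here is an example using the iris data set: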

from functools import reduce

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def generate_rf(X_train, y_train, X_test, y_test):
    rf = RandomForestClassifier(n_estimators=5, min_samples_leaf=3)
    rf.fit(X_train, y_train)
    print("rf score", rf.score(X_test, y_test))
    return rf

def combine_rfs(rf_a, rf_b):
    # note: this extends rf_a in place, so rf_a itself becomes the combined model
    rf_a.estimators_ += rf_b.estimators_
    rf_a.n_estimators = len(rf_a.estimators_)
    return rf_a

iris = load_iris()
X, y = iris.data[:, [0, 1, 2]], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)
# in the line below, we create 10 random forest classifier models
rfs = [generate_rf(X_train, y_train, X_test, y_test) for i in range(10)]
# in this step below, we combine the list of random forest models into one giant model
rf_combined = reduce(combine_rfs, rfs)
# the combined model scores better than *most* of the component models
print("rf combined score", rf_combined.score(X_test, y_test))
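Note that combine_rfs above extends its first argument in place. If the original forests should stay usable on their own, a copying variant is one option. The sketch below is an illustration only: combine_rfs_copy is a hypothetical helper, not part of scikit-learn, and it assumes both forests were fitted on the same features and the same set of class labels. It also checks that, with equal tree counts, the combined forest's probabilities match the mean of the two forests' probabilities.

```python
import copy

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def combine_rfs_copy(rf_a, rf_b):
    """Return a new forest holding the trees of both inputs; the inputs are left untouched."""
    combined = copy.deepcopy(rf_a)
    combined.estimators_ = combined.estimators_ + copy.deepcopy(rf_b.estimators_)
    combined.n_estimators = len(combined.estimators_)
    return combined

X, y = load_iris(return_X_y=True)
rf_a = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
rf_b = RandomForestClassifier(n_estimators=5, random_state=1).fit(X, y)
big_rf = combine_rfs_copy(rf_a, rf_b)

# With equal tree counts, the combined probabilities are the mean of the two forests
mean_proba = (rf_a.predict_proba(X) + rf_b.predict_proba(X)) / 2
print(len(rf_a.estimators_), len(big_rf.estimators_))
print(np.allclose(big_rf.predict_proba(X), mean_proba))
```

If the two forests had seen different label sets, their trees would disagree about what each output column means, so this kind of merge should not be expected to give meaningful predictions in that case.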

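【Discussion】:

Is there a way to generalize this to other models, such as logistic regression, Gaussian NB, or SVM?
@mgoldwasser Hi, I just read your answer and I have a more general question. Can I use features that don't have the same length? For example, can one have 300 samples and the other 200? Sorry for going off topic, but after reading your answer I am thinking of building one forest per feature.
rf_a.n_estimators = len(rf_a.estimators_) .. Err.. shouldn't this be rf_a.n_estimators += len(rf_a.n_estimators) ????
@SoftwareMechanic The code is correct. rf_a.estimators_ is updated on the previous line, and its length is exactly what we want for n_estimators.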
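On the question above about other model types: merging estimators_ only works for forests, but already-fitted classifiers of different kinds can still be combined by averaging their predicted probabilities (soft voting). The sketch below is one possible way to do that, not something the answer proposes. average_proba is a hypothetical helper, and it assumes every model exposes predict_proba and was fitted on the same label set. scikit-learn also provides VotingClassifier, which fits its estimators itself rather than taking pre-fitted ones.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

def average_proba(models, X):
    """Average predict_proba over already-fitted models that share the same classes_."""
    classes = models[0].classes_
    for m in models[1:]:
        if not np.array_equal(m.classes_, classes):
            raise ValueError("all models must have been fitted on the same label set")
    proba = np.mean([m.predict_proba(X) for m in models], axis=0)
    return classes[np.argmax(proba, axis=1)], proba

X, y = load_iris(return_X_y=True)
models = [
    LogisticRegression(max_iter=1000).fit(X, y),
    GaussianNB().fit(X, y),
    RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y),
]
labels, proba = average_proba(models, X)
print(labels[:5], proba.shape)
```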

【Solution 2】:

In addition to @mgoldwasser's solution, an alternative is to use warm_start when training your forest. In Scikit-Learn 0.16-dev, you can now do the following:

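from sklearn.ensemble import RandomForestClassifier

# X1, y1 and X2, y2 are two separate training sets with the same feature columns
# First build 100 trees on X1, y1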
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X1, y1)

# Build 100 additional trees on X2, y2
clf.set_params(n_estimators=200)
clf.fit(X2, y2)
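For a version that runs end to end, here is a minimal sketch of the same warm_start pattern. The two training sets are assumptions made for illustration: the iris data is split into two stratified halves so that both contain all three classes, and the tree counts are reduced to keep it fast.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Stratify so that both halves contain all three classes
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, stratify=y, random_state=0)

# First build 10 trees on X1, y1
clf = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
clf.fit(X1, y1)
n_first = len(clf.estimators_)
first_tree = clf.estimators_[0]

# Raise the target and fit again: only the 10 additional trees are built, on X2, y2
clf.set_params(n_estimators=20)
clf.fit(X2, y2)
print(n_first, len(clf.estimators_))
print(clf.estimators_[0] is first_tree)
```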

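【Discussion】:

warm_start does not seem to work when the two datasets have different numbers of labels. For example, if you have (x1, y1) where y1 can take 3 labels, and then (x2, y2) where y2 can take an additional label, training with warm_start fails. Swapping the order around still results in an error.
@user929404 To point out the obvious, the model is being trained on nameless columns in a numpy array. When you initially train the model, it looks at y1 to determine how many features it is going to be training, and when you go on to train on y2 there has to be the same number of features, because it cannot magically understand how the variables of the first matrix line up with those of the second matrix unless it assumes they are the same.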
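Given the label mismatch described in the comments above, one defensive option is to compare the label sets before calling fit a second time. This is only a sketch: same_label_set is a hypothetical helper and the arrays are made-up examples.

```python
import numpy as np

def same_label_set(y1, y2):
    """Return True when both label arrays contain exactly the same set of classes."""
    return np.array_equal(np.unique(y1), np.unique(y2))

y1 = np.array([0, 1, 2, 1, 0])
y2 = np.array([0, 1, 2, 3])  # contains one extra label
print(same_label_set(y1, y1[::-1]), same_label_set(y1, y2))
```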
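Does the order in which the datasets are used matter with this approach? If there were 3 datasets, would it make any difference if they were trained in a different order each time?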