How to train Random Forest on large datasets in Python? Answers

【Question title】: How to train Random Forest on large datasets in python?
【Posted】: 2024-05-29 20:25:02
【Question】:

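I have a fairly large dataset with 1M samples and 1K features (a 1M x 1K matrix), and I am trying to train a random forest on it for a binary classification problem. This is the code I usually use to train a random forest when the data is not very large. I first read the data from a .csv file with pandas: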

    import random
    import numpy as np
    import pandas as pd
    # Random data stands in here for the contents of the .csv file
    training_all = pd.DataFrame(np.random.random_sample((100,4)), columns=list('ABCD'))
    training_all['Label'] = random.choices([0,1],k=100)
    test_data = pd.DataFrame(np.random.random_sample((20,4)), columns=list('ABCD'))
    test_data['Label'] = random.choices([0,1],k=20)

Then I create a pool of hyperparameters:

    # Number of trees in random forest
    n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
    # Number of features to consider at every split
    # ('auto' was removed in recent scikit-learn releases; for classifiers it meant 'sqrt')
    max_features = ['sqrt', 'log2']
    # Maximum number of levels in tree
    max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
    max_depth.append(None)
    # Minimum number of samples required to split a node
    min_samples_split = [2, 5, 10]
    # Minimum number of samples required at each leaf node
    min_samples_leaf = [1, 2, 4]
    # Method of selecting samples for training each tree
    bootstrap = [True, False]
    # Create the random grid
    hyperparameters = {'n_estimators': n_estimators,
                       'max_features': max_features,
                       'max_depth': max_depth,
                       'min_samples_split': min_samples_split,
                       'min_samples_leaf': min_samples_leaf,
                       'bootstrap': bootstrap}

Next, I shuffle the data:

    training_all_shuffled = training_all.sample(frac=1).reset_index(drop=True)  
    test_data_shuffled = test_data.sample(frac=1).reset_index(drop=True)

Finally, I create and train a random forest with sklearn:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    randomCV = RandomizedSearchCV(estimator=RandomForestClassifier(), param_distributions=hyperparameters, n_iter=10, cv=5, scoring="f1")
    randomCV.fit(training_all_shuffled.iloc[:,:-1], training_all_shuffled['Label'])
    best_rf_model = randomCV.best_estimator_

    rf_predictions = best_rf_model.predict(test_data_shuffled.iloc[:,:-1])

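What approaches would let this run in a reasonable amount of time on a 1M x 1K dataset? Any tips on how to read the data (it is also large, so it would be good not to have to load all of it into memory), on hyperparameter ranges, on parallelization, and so on would be very helpful. Thanks.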

【Question comments】:

Could you consider using a computing cluster and SparkML (spark.apache.org/docs/1.2.2/ml-guide.html)?

Tags: python pandas scikit-learn random-forest

【Solution 1】:

If you have a GPU, you can use cuML.

However, I don't know whether RandomizedSearchCV is among its features.
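
A minimal sketch of the cuML suggestion, assuming cuML's scikit-learn-style `cuml.ensemble.RandomForestClassifier`. The synthetic data, the sizes, and the parameter values are placeholders rather than recommendations, and the snippet falls back to scikit-learn when cuML is not installed so that it still runs on a CPU-only machine.

```python
import numpy as np

try:
    # GPU implementation from RAPIDS cuML (needs an NVIDIA GPU)
    from cuml.ensemble import RandomForestClassifier
    backend = "cuml"
except ImportError:
    # CPU fallback so the sketch still runs without cuML
    from sklearn.ensemble import RandomForestClassifier
    backend = "sklearn"

rng = np.random.default_rng(0)
# Small synthetic stand-in for the 1M x 1K matrix; float32 features suit cuML
X = rng.random((2000, 20), dtype=np.float32)
# Labels depend on the first two features so there is something to learn
y = (X[:, 0] + X[:, 1] > 1.0).astype(np.int32)

# Simple hold-out split (the rows are already in random order)
X_train, X_test = X[:1600], X[1600:]
y_train, y_test = y[:1600], y[1600:]

model = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
model.fit(X_train, y_train)

pred = np.asarray(model.predict(X_test))
accuracy = float((pred == y_test).mean())
print(f"{backend} accuracy: {accuracy:.3f}")
```

Because both classes expose `fit` and `predict`, the rest of the question's pipeline can stay largely unchanged; whether a given search utility accepts the cuML estimator is something to check against the installed versions.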

【Comments】:

【Solution 2】:

I suggest optuna, a hyperparameter optimization library that is very easy to set up, can run in parallel, and can use GPUs.

【Comments】: