Posted: 2024-05-29 20:25:02
Question:
I have fairly large data, 1M samples with 1K features (a 1M x 1K matrix), that I'm trying to use to train a random forest for a binary classification problem. This is the code I usually use to train a random forest when the data is not very big. I first read the data from a .csv file using pandas (the snippet below builds a small random stand-in for that data):
import random
import numpy as np
import pandas as pd
# Small random stand-in for the real data read from the .csv file
training_all = pd.DataFrame(np.random.random_sample((100, 4)), columns=list('ABCD'))
training_all['Label'] = random.choices([0, 1], k=100)
test_data = pd.DataFrame(np.random.random_sample((20, 4)), columns=list('ABCD'))
test_data['Label'] = random.choices([0, 1], k=20)
Then I create a pool of hyperparameters:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']  # note: 'auto' was removed in scikit-learn 1.3; use 'sqrt'/'log2' there
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
hyperparameters = {'n_estimators': n_estimators,
                   'max_features': max_features,
                   'max_depth': max_depth,
                   'min_samples_split': min_samples_split,
                   'min_samples_leaf': min_samples_leaf,
                   'bootstrap': bootstrap}
Then I shuffle the data:
training_all_shuffled = training_all.sample(frac=1).reset_index(drop=True)
test_data_shuffled = test_data.sample(frac=1).reset_index(drop=True)
And finally I create and train a random forest using sklearn:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

randomCV = RandomizedSearchCV(estimator=RandomForestClassifier(), param_distributions=hyperparameters,
                              n_iter=10, cv=5, scoring="f1")
randomCV.fit(training_all_shuffled.iloc[:, :-1], training_all_shuffled['Label'])
best_rf_model = randomCV.best_estimator_
rf_predictions = best_rf_model.predict(test_data_shuffled.iloc[:, :-1])
What are some ways to get this to run on the 1M x 1K dataset in a reasonable amount of time? Any tips on how to read the data (it is also large, and it would be nice not to have to read all of it into memory), on the hyperparameter ranges, on parallelization, etc. would be very helpful. Thanks!
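A minimal sketch of some of those levers, assuming a hypothetical training_all.csv laid out like the stand-in above (all-numeric features, Label column last): reading the matrix as float32 cuts memory roughly in half versus pandas' float64 default (about 4 GB instead of 8 GB for 1M x 1K), n_jobs=-1 grows trees on all cores, and max_samples caps the bootstrap sample each tree sees:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical file name; dtype=np.float32 halves memory versus float64.
training_all = pd.read_csv("training_all.csv", dtype=np.float32)

X = training_all.iloc[:, :-1]
y = training_all['Label'].astype(np.int8)

# n_jobs=-1 grows trees on all cores; max_samples=0.1 bootstraps only
# 10% of the rows per tree, which makes each fit considerably faster.
rf = RandomForestClassifier(n_estimators=200, bootstrap=True,
                            max_samples=0.1, n_jobs=-1)
rf.fit(X, y)

The same n_jobs=-1 can also be passed to RandomizedSearchCV to evaluate candidates in parallel, and dropping cv from 5 to 3 reduces the number of fits for n_iter=10 from 50 to 30.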
Comments:
- Could you consider using a computer cluster and SparkML (spark.apache.org/docs/1.2.2/ml-guide.html)?
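A rough sketch of that suggestion, assuming the same hypothetical training_all.csv and using pyspark.ml (the DataFrame-based successor to the MLlib guide linked above):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("rf-1m-x-1k").getOrCreate()

# Spark reads the CSV lazily and partitions it across the cluster,
# so the full 1M x 1K matrix never has to fit on one machine.
df = spark.read.csv("training_all.csv", header=True, inferSchema=True)

# pyspark.ml expects the features packed into a single vector column.
feature_cols = [c for c in df.columns if c != "Label"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data = assembler.transform(df).select("features", "Label")

rf = RandomForestClassifier(labelCol="Label", featuresCol="features",
                            numTrees=200)
model = rf.fit(data)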
Tags: python pandas scikit-learn random-forest