The problem with XGBoost and multiprocessing is that when you hand XGBoost over to multiprocessing, you don't know what it is doing behind the scenes. This is common not only with XGBoost but with other libraries too; I have found that many of them hang when you use separate processes to speed things up.
I tend to save the model, then create a new XGBClassifier inside the multiprocessing worker and load the model from there. It is safer, and I can be more confident nothing strange will happen.
Here is an example:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from concurrent.futures import ProcessPoolExecutor, as_completed
from sklearn.metrics import accuracy_score

def get_data():
    titanic = pd.read_csv('titanic.csv', usecols=['Age', 'Cabin', 'Sex', 'Survived'])
    # fill all missing values with the average
    titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())
    # create an indicator for cabin because there is a reason it was missing
    titanic['Cabin_ind'] = np.where(titanic['Cabin'].isna(), 0, 1)
    # convert sex to numeric
    titanic['Sex'] = np.where(titanic['Sex'] == 'male', 0, 1)
    # convert Survived to numeric
    titanic['Survived'] = np.where(titanic['Survived'] == 'yes', 1, 0)
    features = titanic[['Age', 'Sex', 'Cabin_ind']]
    labels = titanic['Survived']
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test

def train_model(X_train, y_train):
    print('Training model')
    base_model = XGBClassifier(objective='binary:logistic', nthread=4, seed=27, eval_metric='auc', use_label_encoder=False)
    base_model.fit(X_train, y_train)
    base_model.save_model('model.json')

def score_model(i, X_test, y_test):
    print('{}: Scoring model'.format(i))
    base_model = XGBClassifier(objective='binary:logistic', nthread=4, seed=27, eval_metric='auc', use_label_encoder=False)
    base_model.load_model('model.json')
    score = base_model.score(X_test, y_test)
    y_pred = base_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    return i, score, accuracy

def main():
    X_train, X_test, y_train, y_test = get_data()
    with ProcessPoolExecutor(max_workers=1) as executor:
        task = executor.submit(train_model, X_train, y_train)
        task.result()
    tasks = []
    with ProcessPoolExecutor(max_workers=1) as executor:
        # You can split your test set here; I am passing it as a whole
        for i in range(10):
            # i identifies the result later in case of out-of-order completion
            task = executor.submit(score_model, i, X_test, y_test)
            tasks.append(task)
        for task in as_completed(tasks):
            i, score, accuracy = task.result()
            print('Task {} completed: score={}, accuracy={}'.format(i, score, accuracy))

if __name__ == '__main__':
    main()
Edit:
- As discussed here, it is recommended to initialize xgboost after forking, which is why I also train XGBoost in a separate process, to avoid lock-ups
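An alternative way to avoid inheriting locks from a forked parent (a sketch of my own, not part of the answer's code) is to force the 'spawn' start method, so each worker starts as a fresh interpreter. The `work` function below is a stand-in for the real scoring worker so the sketch stays self-contained:

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def work(i):
    # In a real run this would build an XGBClassifier and load 'model.json';
    # squaring stands in here so the sketch runs without xgboost installed.
    return i * i

if __name__ == '__main__':
    # 'spawn' starts workers as fresh interpreters instead of forking,
    # so no library state or locks are inherited from the parent process.
    ctx = mp.get_context('spawn')
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
        results = list(executor.map(work, range(4)))
    print(results)  # [0, 1, 4, 9]
```

`spawn` is slower to start than `fork`, but it sidesteps the fork-after-init problem entirely (and is already the default on Windows and macOS).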
- When you use XGBoost, it is best to force use_label_encoder=False. The reason is that XGBoost, at least on my machine, complains:
  UserWarning: The use of label encoder in XGBClassifier is
  deprecated and will be removed in a future release.
  To remove this warning, do the following:
  1) Pass option use_label_encoder=False when constructing XGBClassifier object; and
  2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
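The second step the warning asks for, encoding labels as integers starting at 0, can be done by hand; here is a minimal stdlib sketch (`pd.factorize` or sklearn's `LabelEncoder` do the same job, and the `np.where` call in the example above is the two-class special case):

```python
# Map string labels to consecutive integers 0, 1, ..., num_class - 1,
# as the XGBoost warning asks for.
labels = ['yes', 'no', 'yes', 'yes', 'no']
classes = sorted(set(labels))               # ['no', 'yes']
to_int = {c: i for i, c in enumerate(classes)}
y = [to_int[c] for c in labels]
print(y)  # [1, 0, 1, 1, 0]
```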
- Try to avoid inplace in pandas, as discussed here
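Concretely, assigning the result back (as `get_data` above does with `fillna`) is preferred over `inplace=True`; a small sketch of the pattern:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [19.0, np.nan, 25.0, np.nan]})

# Preferred: assign the result back rather than mutating with inplace=True.
# inplace=True returns None, so it cannot be chained, and it generally
# offers no memory or speed benefit.
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df['Age'].tolist())  # [19.0, 22.0, 25.0, 22.0]
```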
- Use max_workers=some_cpu_number_that_makes_sense to use multiple CPUs.
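One common heuristic for that number (my own suggestion, not from the original answer) is one worker per core, leaving a core free for the parent process:

```python
import os

# os.cpu_count() can return None on exotic platforms, hence the fallback.
max_workers = max(1, (os.cpu_count() or 1) - 1)
print(max_workers)
```

Keep in mind that each worker in the example above also runs XGBoost with nthread=4, so max_workers × nthread should not exceed the machine's core count, or the processes will fight over the CPU.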
- I have trained a lot of things with XGBoost, and in my experience you don't really need multiprocessing; it is quite fast even with a 2 GB dataset. Unless you have a really huge dataset...
The csv I used:
Age,Cabin,Sex,Survived,Embarked,Name,Ticket
,,male,yes,,,
19,,male,no,,,
20,1,female,yes,,,
25,2,male,no,,,
,3,female,yes,,,
40,3,male,yes,,,
,3,female,yes,,,
,,female,yes,,,
,3,male,no,,,
15,3,female,yes,,,