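【Posted】: 2017-09-10 13:56:52
【Question】:
I am running GradientBoostingClassifier from sklearn, and the verbose output shows some strange numbers. I train on random 10% samples of my full dataset. Most runs look fine, but sometimes I get strange output and poor results. Can someone explain what is happening?
"Good" result: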
n features = 168
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.01, loss='deviance', max_depth=4,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=2000, presort='auto', random_state=None,
subsample=1.0, verbose=1, warm_start=False)
Iter Train Loss Remaining Time
1 0.6427 40.74m
2 0.6373 40.51m
3 0.6322 40.34m
4 0.6275 40.33m
5 0.6230 40.31m
6 0.6187 40.18m
7 0.6146 40.34m
8 0.6108 40.42m
9 0.6071 40.43m
10 0.6035 40.28m
20 0.5743 40.12m
30 0.5531 39.74m
40 0.5367 39.49m
50 0.5237 39.13m
60 0.5130 38.78m
70 0.5041 38.47m
80 0.4963 38.34m
90 0.4898 38.22m
100 0.4839 38.14m
200 0.4510 37.07m
300 0.4357 35.49m
400 0.4270 33.87m
500 0.4212 31.77m
600 0.4158 29.82m
700 0.4108 27.74m
800 0.4065 25.69m
900 0.4025 23.55m
1000 0.3987 21.39m
2000 0.3697 0.00s
predicting
this_file_MCC = 0.5777
"Bad" result:
Training the classifier
n features = 168
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=1.0, loss='deviance', max_depth=5,
max_features='sqrt', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=500, presort='auto', random_state=None,
subsample=1.0, verbose=1, warm_start=False)
Iter Train Loss Remaining Time
1 0.5542 1.07m
2 0.5299 1.18m
3 0.5016 1.14m
4 0.4934 1.16m
5 0.4864 1.19m
6 0.4756 1.21m
7 0.4699 1.24m
8 0.4656 1.26m
9 0.4619 1.24m
10 0.4572 1.26m
20 0.4244 1.27m
30 0.4063 1.24m
40 0.3856 1.20m
50 0.3711 1.18m
60 0.3578 1.13m
70 0.3407 1.10m
80 0.3264 1.09m
90 0.3155 1.06m
100 0.3436 1.04m
200 0.3516 46.55s
300 1605.5140 29.64s
400 52215150662014.0469 13.70s
500 585408988869401440279216573629431147797247696359586211550088082222979417986203510562624281874357206861232303015821113689812886779519405981626661580487933040706291550387961400555272759265345847455837036753780625546140668331728366820653710052494883825953955918423887242778169872049367771382892462080.0000 0.00s
predicting
this_file_MCC = 0.0398
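In the "bad" log the training loss stops decreasing between the lines logged for iterations 90 and 100 (0.3155, then 0.3436) and later explodes. A minimal sketch of a check that flags such a run is given below. The helper name `first_divergence` and the 5% `tolerance` are hypothetical choices for illustration, not part of sklearn, and the check assumes non-negative loss values.

```python
import math


def first_divergence(losses, tolerance=1.05):
    """Return the index of the first loss that is NaN/inf or that exceeds
    the best loss seen so far by more than `tolerance` (a hypothetical
    threshold). Return None if the sequence never does either.
    Assumes non-negative losses."""
    best = float("inf")
    for i, loss in enumerate(losses):
        if math.isnan(loss) or math.isinf(loss):
            return i
        if loss > best * tolerance:
            return i
        best = min(best, loss)
    return None


# (iteration, train loss) pairs copied from the "bad" log above.
# The log is sparse, so only the logged iterations are checked.
bad_log = [(80, 0.3264), (90, 0.3155), (100, 0.3436),
           (200, 0.3516), (300, 1605.5140)]

hit = first_divergence([loss for _, loss in bad_log])
if hit is not None:
    print(bad_log[hit][0])  # prints 100
```

For a fitted model, the per-iteration training loss is available as `clf.train_score_`, so `first_divergence(clf.train_score_)` would give the exact iteration rather than the first logged one.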
【Comments】:
-
Can you identify the data sample that triggers this problem?
-
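I am training on "bite-sized" samples of a dataset with about 1 million rows. Each sample has about 100k rows. The problem does not seem to be related to the input data, because I ran sklearn.ensemble.ExtraTreesClassifier on the same sample files without any errors.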
-
OK. I am asking so that we can have a reproducible example for filing a bug against sklearn.
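Both runs in the question use random_state=None, and the data is sampled at random, so a bad run cannot be replayed as configured. A reproducible example could start by fixing both sources of randomness. The sketch below is only an illustration under assumptions: the synthetic data from `make_classification`, the `SEED` value, and the reduced sizes are stand-ins, not the asker's setup. Only parameters that appear in the "bad" configuration are passed to the classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

SEED = 0  # hypothetical; any fixed integer works

# Synthetic stand-in for the real dataset (assumption, much smaller).
X, y = make_classification(n_samples=2000, n_features=20, random_state=SEED)

# Draw the 10% sample with a seeded generator so the same rows
# are selected on every run.
rng = np.random.RandomState(SEED)
idx = rng.choice(len(X), size=len(X) // 10, replace=False)


def fit_once():
    # Hyperparameters follow the "bad" run, with fewer estimators
    # to keep the sketch fast and a fixed random_state.
    clf = GradientBoostingClassifier(learning_rate=1.0, max_depth=5,
                                     max_features='sqrt', n_estimators=50,
                                     random_state=SEED)
    clf.fit(X[idx], y[idx])
    return clf.train_score_  # training loss per boosting iteration


first = fit_once()
second = fit_once()

# With both seeds fixed, the two loss curves should match.
print(np.allclose(first, second, equal_nan=True))
```

Once a seed that produces the exploding loss is found, the seed and the sampled rows together identify the run to attach to a bug report.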
Tags: python python-2.7 scikit-learn gradient-descent