How does sample_weight compare to class_weight in scikit-learn?

【Question title】: How does sample_weight compare to class_weight in scikit-learn?
【Posted】: 2018-05-04 02:31:33
【Question description】:

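I would like to use sklearn.ensemble.GradientBoostingClassifier on an imbalanced classification problem. I intend to optimize for Area Under the Receiver Operating Characteristic Curve (ROC AUC). To do this, I would like to reweight my classes so that the minority class matters more to the classifier.

This would normally be done by setting class_weight = "balanced" (in RandomForestClassifier, for example), but GradientBoostingClassifier has no such parameter.

The documentation says:

The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

If y_train is my target dataframe with elements in {0,1}, then the documentation implies that the following should reproduce class_weight = "balanced":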

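import numpy as np
from sklearn import ensemble
# One weight per class (index 0 and 1), later expanded to one weight per row
sample_weight = y_train.shape[0] / (2 * np.bincount(y_train))
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, y_train, sample_weight=sample_weight[y_train.values])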

Is this correct, or am I missing something?

【Question discussion】:

Tags: python machine-learning scikit-learn

【Solution 1】:

I suggest using the class_weight.compute_sample_weight utility in scikit-learn. For example:

from sklearn.utils.class_weight import compute_sample_weight
y = [1,1,1,1,0,0,1]
compute_sample_weight(class_weight='balanced', y=y)

Output:

array([ 0.7 ,  0.7 ,  0.7 ,  0.7 ,  1.75,  1.75,  0.7 ])

You can pass this array as the sample_weight keyword argument when fitting.
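As a minimal sketch of how those weights might be handed to the classifier from the question: the synthetic data from make_classification and the hyperparameter values below are assumptions for illustration only, not something taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical imbalanced data: roughly 90% class 0 and 10% class 1.
X_train, y_train = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# One weight per training row, inversely proportional to its class frequency.
weights = compute_sample_weight(class_weight="balanced", y=y_train)

# Illustrative settings; the question's own **params would go here instead.
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train, sample_weight=weights)
```

With "balanced" weights, rows of the minority class each receive a larger weight than rows of the majority class, and the weights sum to the number of samples.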

【Discussion】:

Thanks, this produces the same array as my code, but it is much cleaner.
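The equivalence mentioned in this comment can be checked directly. The sketch below assumes y_train is a pandas Series of integer labels in {0, 1}; the seven-element target is reused from the answer's example as a stand-in.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical binary target; a pandas Series stands in for y_train.
y_train = pd.Series([1, 1, 1, 1, 0, 0, 1])

# Manual version from the question: one weight per class, then indexed by label.
class_weights = y_train.shape[0] / (2 * np.bincount(y_train))
manual = class_weights[y_train.values]

# Utility version from the answer.
from_util = compute_sample_weight(class_weight="balanced", y=y_train.values)

# True when both approaches give the same per-row weights.
same = np.allclose(manual, from_util)
print(same)
```

Note that the manual indexing relies on the labels being the integers 0 and 1, since they are used as array positions. compute_sample_weight does not need that assumption, which is one reason it may be the safer choice.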