Balancing data in Python using the SMOTE library

【Question title】: Balancing data in Python by using SMOTE library
【Posted】: 2020-02-27 21:32:24
【Question】:

I want to balance a training set, already split into X_train and y_train, whose class proportions are roughly as follows:

class A: 54%
class B: 45%
class C: 1%

So I would like to resample my data to get the following:

class A: 49%
class B: 41%
class C: 10%

The library I want to use is:

https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html

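with SMOTE as the balancing algorithm. My problem is that I don't know how to do this with the library. I understand the SMOTE algorithm itself, but I'm having some difficulty using the library. Any help?

Thanks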

【Comments】:

Tags: python balance

【Answer 1】:

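Have you used sklearn before? imblearn works in a very similar way. In effect, using SMOTE is like fitting a model to your data in order to generate extra synthetic samples that balance the classes.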

This example from the imblearn page illustrates it well:

>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imblearn.over_sampling import SMOTE # doctest: +NORMALIZE_WHITESPACE
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({1: 900, 0: 100})
>>> sm = SMOTE(random_state=42)
>>> X_res, y_res = sm.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({0: 900, 1: 900})

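Specifically, once you have training data X and target y, you create a SMOTE() instance, optionally with a random state. You then apply it to your data with X_res, y_res = sm.fit_resample(X, y). fit_resample() combines two steps in one call: it fits the SMOTE algorithm to your dataset and then resamples it, returning the new over-sampled dataset.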
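For the proportions in the question, one possible approach (a sketch under assumptions, not something taken from the imblearn example above) is to pass a dict as sampling_strategy, which maps a class label to the number of samples wanted after resampling. Leaving A and B untouched and growing only C until it makes up 10% of the total gives roughly 49% / 41% / 10%. The helper names minority_target_count and oversample_to_share are made up for illustration, and the counts assume 1000 training rows.

```python
from collections import Counter


def minority_target_count(counts, label, share):
    """Samples `label` needs so that it makes up `share` of the total,
    with every other class left at its current size."""
    others = sum(n for cls, n in counts.items() if cls != label)
    # Solve n / (others + n) = share for n.
    target = round(share * others / (1.0 - share))
    # Over-sampling can only add samples, so never go below the current count.
    return max(target, counts[label])


def oversample_to_share(X, y, label, share, random_state=42):
    """Over-sample only `label` with SMOTE until it reaches `share`."""
    # Imported here so the helper above works without imbalanced-learn installed.
    from imblearn.over_sampling import SMOTE

    target = minority_target_count(Counter(y), label, share)
    # A dict sampling_strategy maps class label -> desired sample count.
    sm = SMOTE(sampling_strategy={label: target}, random_state=random_state)
    return sm.fit_resample(X, y)


# Class counts matching the question's 54% / 45% / 1% split of 1000 rows.
counts = Counter({"A": 540, "B": 450, "C": 10})
print(minority_target_count(counts, "C", 0.10))
```

With those counts the helper asks for 110 samples of class C, so the resampled set would hold 540 / 450 / 110 rows. Something like X_res, y_res = oversample_to_share(X_train, y_train, "C", 0.10) would then do the resampling. Note that SMOTE with its default k_neighbors=5 needs at least 6 samples in the class being over-sampled, and it should be applied to the training split only.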

【Comments】:

Is it possible to call sm.fit_resample on a single DataFrame without splitting the data into X and y?
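As far as I know, fit_resample always takes the features and the target as two separate arguments, so a single DataFrame would have to be split on its target column first and put back together afterwards. A minimal sketch, assuming the target sits in one column; resample_frame is a hypothetical helper, not part of imblearn:

```python
import numpy as np
import pandas as pd


def resample_frame(df, target, sampler):
    """Resample a DataFrame whose label is stored in column `target`.

    `sampler` is any object with fit_resample(X, y), e.g. SMOTE().
    """
    X = df.drop(columns=[target])
    y = df[target]
    X_res, y_res = sampler.fit_resample(X, y)
    # Rebuild a frame whether the sampler returned arrays or pandas objects.
    out = pd.DataFrame(X_res, columns=X.columns).reset_index(drop=True)
    out[target] = np.asarray(y_res)
    return out
```

Usage would look like balanced = resample_frame(train_df, "label", SMOTE(random_state=42)), where train_df and "label" stand in for your own frame and target column.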