Random Oversampling with Stratified KFold - Value Error: Answers

【Question title】: Random Oversampling with Stratified KFold - Value Error
【Posted】: 2021-08-07 11:28:54
【Question description】:

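I have a dataframe that looks like this. The dataset was standardized with StandardScaler, and dummy variables were added for all categorical variables. It has now been split into training and test sets.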

            amt    gender   city_pop    birth_year  distance        
153118  -0.786537   0.0    -0.318571    0.913779    -0.400876   
153226  -0.488455   0.0    -0.322397    0.741579     1.384297   
153228  0.437970    0.0    -0.329167    1.774776    -0.658839   
153303  -0.877627   0.0    -0.329656    1.258177    -1.100713   
153313  0.462143    1.0    -0.313817    1.372977     0.038791

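I am now trying to build some models on this data (such as logistic regression, decision tree, and random forest) using RandomOverSampler and StratifiedKFold cross-validation. This is because the minority class of my target variable makes up only 0.3% of the rows.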

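I have already built the models on the imbalanced data, and they work fine. But when I try to apply the oversampling, I get the error below. My code is included here as well.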

import pandas as pd
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

skf = StratifiedKFold(n_splits=5, random_state=None)

for fold, (train_index, test_index) in enumerate(skf.split(X,y), 1):
    X_train = X.reindex(index = train_index)
    y_train = y.reindex(index = train_index) 
    X_test = X.reindex(index = test_index)
    y_test = y.reindex(index = test_index)
    ROS = RandomOverSampler(sampling_strategy=0.5)
    X_over, y_over= ROS.fit_resample(X_train, y_train)
  
# Create a DataFrame for X_over
X_over = pd.DataFrame(data=X_over, columns=X_train.columns)

I get the following error.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-90-372645e869d1> in <module>
      4 oversample = RandomOverSampler(sampling_strategy=1)
      5 # fit and apply the transform
----> 6 X_over, y_over = oversample.fit_resample(X_train, y_train)

~\anaconda3\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
     73             The corresponding label of `X_resampled`.
     74         """
---> 75         check_classification_targets(y)
     76         arrays_transformer = ArraysTransformer(X, y)
     77         X, y, binarize_y = self._check_X_y(X, y)

~\anaconda3\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
    178     y : array-like
    179     """
--> 180     y_type = type_of_target(y)
    181     if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
    182                       'multilabel-indicator', 'multilabel-sequences']:

~\anaconda3\lib\site-packages\sklearn\utils\multiclass.py in type_of_target(y)
    301     if y.dtype.kind == 'f' and np.any(y != y.astype(int)):
    302         # [.1, .2, 3] or [[.1, .2, 3]] or [[1., .2]] and not [1., 2., 3.]
--> 303         _assert_all_finite(y)
    304         return 'continuous' + suffix
    305 

~\anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    104                     msg_err.format
    105                     (type_err,
--> 106                      msg_dtype if msg_dtype is not None else X.dtype)
    107             )
    108     # for object dtype data, we only check for NaNs (GH-13254)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
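
The traceback ends inside `type_of_target(y)`, so it is the target passed to `fit_resample` that contains NaN. One plausible explanation, assuming `X` and `y` kept their original row labels after the train/test split (the sample above shows labels such as 153118), is that `reindex` treats the positional indices from `skf.split` as labels and fills every label it cannot find with NaN. The sketch below reproduces that effect on a small made-up Series; the values and the `train_index` array are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical target with non-default index labels, similar to the
# 153118... labels in the question's sample (values are made up).
y = pd.Series([0, 0, 1, 0, 1], index=[153118, 153226, 153228, 153303, 153313])

# Positional indices of the kind StratifiedKFold.split() yields.
train_index = np.array([0, 1, 3])

# reindex() looks these numbers up as *labels*; none of them exist in the
# index, so every returned value is NaN and the dtype becomes float.
by_label = y.reindex(index=train_index)

# iloc selects by *position*, which is what the split indices refer to.
by_position = y.iloc[train_index]

print(by_label)
print(by_position)
```

If this is what happens with the real data, `y_train.isna().sum()` inside the loop would be greater than zero.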

【Question comments】:

Tags: python machine-learning data-science k-fold oversampling

【Solution 1】:

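It would be easier to answer after seeing the data. But I suggest oversampling before the cross-validation step. Please give it a try.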

【Comments】: