【Question title】: Resampling (bootstrapping) a data set of continuous data for a regression problem
【Posted】: 2020-04-18 15:07:31
【Question】:

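For a regression problem, I have a training data set with:
- 3 variables with a Gaussian distribution
- 20 variables with a uniform distribution

All my variables are continuous and lie in [0, 1].

The problem is that in the test data used to score my regression model, all variables are uniformly distributed. In practice I get poor results in the tails of the distributions, so I want to oversample my training set in order to duplicate the rarest rows.

So my idea is to bootstrap my training set (sampling with replacement) to obtain a data set with the same distribution as the test set.

To do that, my idea (I don't know whether it is a good one!) is to add 3 columns containing intervals for my 3 variables and to use these columns to stratify the resampling.

Example: first, generate the data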

from scipy.stats import truncnorm
def get_truncated_normal(mean=0.5, sd=0.15, min_value=0, max_value=1):
    return truncnorm(
        (min_value - mean) / sd, (max_value - mean) / sd, loc=mean, scale=sd)

generator = get_truncated_normal()


import numpy as np
from sklearn.preprocessing import MinMaxScaler
S1 = generator.rvs(1000)
S2 = generator.rvs(1000)
S3 = generator.rvs(1000)
u = np.random.uniform(0, 1, 1000)

Then check the distributions:

import seaborn as sns
sns.distplot(u);
sns.distplot(S2);

That looks fine, so I add the categorical columns:

import pandas as pd
df = pd.DataFrame({'S1':S1,'S2':S2,'S3':S3,'Unif':u})

BINS_NUMBER = 10
df['S1_range'] = pd.cut(df.S1, 
                            bins=BINS_NUMBER, 
                            precision=6,
                            right=True, 
                            include_lowest=True)
df['S2_range'] = pd.cut(df.S2, 
                            bins=BINS_NUMBER, 
                            precision=6,
                            right=True, 
                            include_lowest=True)
df['S3_range'] = pd.cut(df.S3, 
                            bins=BINS_NUMBER, 
                            precision=6,
                            right=True, 
                            include_lowest=True)

Check:

df.groupby('S1_range').size()
S1_range
(0.022025899999999998, 0.116709]      3
(0.116709, 0.210454]                 15
(0.210454, 0.304199]                 64
(0.304199, 0.397944]                152
(0.397944, 0.491689]                254
(0.491689, 0.585434]                217
(0.585434, 0.679179]                173
(0.679179, 0.772924]                 86
(0.772924, 0.866669]                 30
(0.866669, 0.960414]                  6
dtype: int64

That looks good to me. Now I try to resample, but it does not work as expected:

from sklearn.utils import resample
df_resampled = resample(df,replace=True,n_samples=1000, stratify=df['S1_range'])
df_resampled.groupby('S1_range').size()
S1_range
(0.022025899999999998, 0.116709]      3
(0.116709, 0.210454]                 15
(0.210454, 0.304199]                 64
(0.304199, 0.397944]                152
(0.397944, 0.491689]                254
(0.491689, 0.585434]                217
(0.585434, 0.679179]                173
(0.679179, 0.772924]                 86
(0.772924, 0.866669]                 30
(0.866669, 0.960414]                  6
dtype: int64

So it does not work: I get the same distribution in the output as in the input...

Can you help me? Maybe this is not a good way to do it?

Thanks!!

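【Question comments】:

  • Have you tried looking at existing libraries for balancing your data set, such as imbalanced-learn?
  • Yes, but I could not find out how to use it for a regression problem; all the examples are about classification.
  • Try adding labels=list(range(BINS_NUMBER)) to your calls to pd.cut. When you use resample, it treats the S1_range values as the stratification labels. But your values contain Interval objects. That may cause problems, because they might all be treated as distinct objects.
  • It's the same. I found another way that works. Your first comment helped me: I realized that, even for a regression problem, resampling can be seen as a classification problem in which the target is the bin of each row.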
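
The idea in the last comment, treating each row's bin as a class label, can be sketched with plain pandas. sklearn.utils.resample with stratify keeps the bin proportions of its input by design, which is why the question's output matched its input; drawing the same number of rows from every bin flattens the distribution instead. The data below is a hypothetical stand-in for the question's S1, and the column name S1_bin is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the question's data: one bell-shaped feature in [0, 1].
df = pd.DataFrame({"S1": np.clip(rng.normal(0.5, 0.15, 1000), 0.0, 1.0)})

# Integer bin codes (0..BINS_NUMBER-1) play the role of class labels.
BINS_NUMBER = 10
df["S1_bin"] = pd.cut(df["S1"], bins=BINS_NUMBER, labels=False, include_lowest=True)

# Oversample every non-empty bin, with replacement, up to the size of the largest bin.
target = df["S1_bin"].value_counts().max()
balanced = df.groupby("S1_bin").sample(n=target, replace=True, random_state=0)

# Every bin now holds the same number of rows.
print(balanced["S1_bin"].value_counts().sort_index())
```

Rows in the sparse tail bins are duplicated many times, so this flattens the marginal distribution of S1 without adding new information about the tails.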

Tags: python machine-learning scikit-learn regression imbalanced-data


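【Solution 1】:

Rather than writing code from scratch to resample continuous data, you should use a library built for resampling regression data.

While popular libraries (imbalanced-learn, etc.) focus on categorical (classification) targets, there is a recent Python library called resreg (RESampling for REGression) that lets you resample continuous data (resreg GitHub page).

In addition, rather than bootstrapping, you may want to generate synthetic data points in the tails of the normally distributed variables, since that may give better results (see this paper). Just as SMOTE does for classification by interpolating between features, you can use SMOTER (SMOTE for Regression) from the resreg package to generate synthetic values for regression/continuous data.

Here is an example of how to resample with resreg in a few lines of code: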


import numpy as np
import resreg


# X (features) and y (target) are assumed to be your existing NumPy arrays
cl = np.percentile(y, 10)  # Oversample values less than the 10th percentile
ch = np.percentile(y, 90)  # Oversample values greater than the 90th percentile


# Assign relevance scores to indicate which samples in your dataset are
# to be resampled. Values below cl and above ch are assigned a relevance
# value above 0.5, other values are assigned a relevance value below 0.5

relevance = resreg.sigmoid_relevance(y, cl=cl, ch=ch)  # takes the target values only, not X


# Resample the relevant values (i.e relevance >= 0.5) by interpolating 
# between nearest k-neighbors (k=5). By setting over='balance', the 
# relevant values are oversampled so that the number of relevant and
# irrelevant values are equal

X_res, y_res = resreg.smoter(X, y, relevance=relevance, relevance_threshold=0.5, k=5, over='balance', random_state=0)
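
To make the interpolation idea behind SMOTER concrete, here is a minimal NumPy sketch. It is not resreg's implementation: the function name smoter_like_oversample, its parameters, and the synthetic X and y are all assumptions for illustration, and it simplifies the target by reusing the same interpolation weight for y as for the features.

```python
import numpy as np


def smoter_like_oversample(X, y, rare_mask, n_new, k=5, seed=0):
    """Create n_new synthetic (x, y) pairs by interpolating between rare samples.

    Hypothetical helper: a simplified, SMOTE-style sketch for regression data.
    """
    rng = np.random.default_rng(seed)
    X_rare, y_rare = X[rare_mask], y[rare_mask]
    k = min(k, len(X_rare) - 1)

    # Pairwise Euclidean distances among rare samples; exclude self-matches.
    dist = np.linalg.norm(X_rare[:, None, :] - X_rare[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a rare sample, then one of its k nearest rare neighbors.
    base = rng.integers(0, len(X_rare), size=n_new)
    mate = neighbors[base, rng.integers(0, k, size=n_new)]

    # Place the synthetic point at a random position on the segment between them.
    t = rng.random(n_new)
    X_new = X_rare[base] + t[:, None] * (X_rare[mate] - X_rare[base])
    # Simplification: interpolate the target with the same weight t.
    y_new = y_rare[base] + t * (y_rare[mate] - y_rare[base])
    return X_new, y_new


# Synthetic demo data (an assumption, not the question's data set).
rng = np.random.default_rng(1)
X = rng.random((500, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(0, 0.1, 500)

# Treat both tails of the target as the rare region, as in the resreg example.
rare = (y < np.percentile(y, 10)) | (y > np.percentile(y, 90))
X_new, y_new = smoter_like_oversample(X, y, rare, n_new=200)
X_aug, y_aug = np.vstack([X, X_new]), np.concatenate([y, y_new])
```

Unlike bootstrapping, the added rows are new points lying between existing rare rows rather than exact duplicates.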


【Comments】:

    【Solution 2】:

    My solution:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import KBinsDiscretizer


    def create_sampled_data_set(n_samples_by_bin=1000,
                                n_bins=10,
                                replace=True,
                                save_csv=True):
        """In order to have the same distribution for S1..S3 between training
        set and test set, this function will generate a new
        training set resampled
    
        Return: (X_train, y_train)
        """
        def stratified_sample_df_(df, col, n_samples, replace=True):
            if replace:
                n = n_samples
            else:
                n = min(n_samples, df[col].value_counts().min())
    
            df_ = df.groupby(col).apply(lambda x: x.sample(n, replace=replace))
            df_.index = df_.index.droplevel(0)
            return df_
    
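        # load_data_for_train() is a project-specific loader (not shown here)
        # that returns the features and the target as two DataFrames
        X_train, y_train = load_data_for_train()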
    
        # merge the dataframe for the sampling. Target will be removed after
        X_train = pd.merge(
            X_train, y_train[['Target']], left_index=True, right_index=True)
        del y_train
    
        # build a categorical feature, from S1..S3 distribution
        disc = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='kmeans')
        disc.fit(X_train[['S1', 'S2', 'S3']])
        y_bin = disc.transform(X_train[['S1', 'S2', 'S3']])
        del disc
        # np.int was removed in NumPy 1.24; cast with the built-in int instead
        y_bin = y_bin.astype(int)
    
        y_concat = []
        for i in range(len(y_bin)):
            a = y_bin[i, 0].astype('str')
            b = y_bin[i, 1].astype('str')
            c = y_bin[i, 2].astype('str')
            y_concat.append(a + ';' + b + ';' + c)
        del y_bin
    
        X_train['S_Class'] = y_concat
        del y_concat
    
        X_train_resampled = stratified_sample_df_(
            X_train, 'S_Class', n_samples_by_bin)
        del X_train
        y_train_resampled = X_train_resampled[['Target']].copy()
        y_train_resampled.rename(
            columns={y_train_resampled.columns[0]: 'Target'}, inplace=True)
    
        X_train_resampled = X_train_resampled.drop(['S_Class', 'Target'], axis=1)
    
        # save in file for further usage
        if save_csv:
            X_train_resampled.to_csv(
                "./data/training_input_resampled.csv", sep=",")
            y_train_resampled.to_csv(
                "./data/training_output_resampled.csv", sep=",")
    
        return(X_train_resampled,
               y_train_resampled)
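
The function above depends on a data loader that is not shown, so here is a smaller self-contained sketch of the same idea: bin S1..S3, join the three bin codes into one class label per row, and draw the same number of rows from each class. It swaps in equal-width bins via np.digitize in place of the k-means bins of KBinsDiscretizer, and the synthetic data, n_bins=4, and n=50 are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Hypothetical training set: three bell-shaped features, one uniform feature, one target.
X = pd.DataFrame({
    "S1": np.clip(rng.normal(0.5, 0.15, n), 0.0, 1.0),
    "S2": np.clip(rng.normal(0.5, 0.15, n), 0.0, 1.0),
    "S3": np.clip(rng.normal(0.5, 0.15, n), 0.0, 1.0),
    "Unif": rng.random(n),
})
X["Target"] = X[["S1", "S2", "S3"]].sum(axis=1) + rng.normal(0, 0.05, n)

# Equal-width bins on [0, 1]; np.digitize returns codes 0..n_bins-1.
n_bins = 4
inner_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
codes = np.column_stack(
    [np.digitize(X[col].to_numpy(), inner_edges) for col in ["S1", "S2", "S3"]]
)

# One combined label per row, e.g. "0;2;1".
X["S_Class"] = [";".join(map(str, row)) for row in codes]

# Draw the same number of rows, with replacement, from every occupied class.
resampled = X.groupby("S_Class").sample(n=50, replace=True, random_state=0)

X_res = resampled.drop(columns=["S_Class", "Target"])
y_res = resampled[["Target"]]
```

Only combinations of bins that contain at least one row can be sampled, so cells of the 3-D grid that are empty in the training set stay empty after resampling.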
    

    【Comments】:
