Scikit-learn balanced subsampling: answers

【Question title】: Scikit-learn balanced subsampling
【Posted】: 2014-06-20 18:23:41
【Question】:

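I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas, or do I have to implement it myself? Any pointers to code that does this?

These subsamples should be random and may overlap, since I feed each one to a separate classifier in a very large ensemble of classifiers.

In Weka there is a tool called spreadsubsample. Is there an equivalent in sklearn? http://wiki.pentaho.com/display/DATAMINING/SpreadSubsample

(I know about weighting, but that's not what I'm looking for.)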

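【Question discussion】:

Do you just want to split the dataset into N equally sized subsets, or do you really just want to perform cross-validation? See cross_validation, and specifically K-Fold.
I know about the cross-validation functionality; the problem is that the test size cannot be zero (it raises an error). I'm using a huge ensemble (tens of thousands of classifiers), so it has to be fast. Surprisingly, there seems to be no such function, so I think I'll have to implement a custom one.
FYI, there is now an sklearn-contrib package for learning from and dealing with imbalanced-class data: github.com/scikit-learn-contrib/imbalanced-learn
@eickenberg, you should also post that comment as an answer. Answers are easier to find than comments, and I would say that using an existing library is probably the best answer to my original question.

Tags: python pandas scikit-learn subsampling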

【Solution 1】:

There is now a full-blown Python package for dealing with imbalanced data. It is available as an sklearn-contrib package at https://github.com/scikit-learn-contrib/imbalanced-learn

【Discussion】:

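【Solution 2】:

Here is my first version, which seems to work fine. Feel free to copy it or to suggest how it could be made more efficient (I have fairly long experience with programming in general, but not with Python or NumPy).

This function creates a single random balanced subsample.

Edit: subsample_size now scales down the minority-class count (every class is cut to that fraction of the smallest class). This should probably be changed.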

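import numpy as np

def balanced_subsample(x, y, subsample_size=1.0):
    # Split the rows of x by class and find the size of the smallest class.
    class_xs = []
    min_elems = None

    for yi in np.unique(y):
        elems = x[(y == yi)]  # boolean indexing returns a copy of the rows
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    # Rows kept per class: the smallest class size, optionally scaled down.
    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs = []
    ys = []

    for ci, this_xs in class_xs:
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)  # shuffles the copy in place

        x_ = this_xs[:use_elems]
        y_ = np.full(use_elems, ci)  # keeps the label's dtype, so string labels work too

        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)

    return xs, ys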

For anyone trying to make the above work with a pandas DataFrame, you need to make a couple of changes:

Replace the np.random.shuffle line with

this_xs = this_xs.reindex(np.random.permutation(this_xs.index))

and replace the np.concatenate lines with

xs = pd.concat(xs)
ys = pd.Series(data=np.concatenate(ys), name='target')
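The question asks for N random subsamples that may overlap, while the function above builds one. A minimal sketch of one way to get N of them is to draw row indices instead of copying rows, so each subsample is just an index array into the original data. The name balanced_subsample_indices and its parameters are hypothetical, not part of the answer or of any library.

```python
import numpy as np

def balanced_subsample_indices(y, n_subsamples, n_per_class=None, seed=None):
    """Yield n_subsamples index arrays, each with the same number of rows per class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    # Row positions of every class, computed once and reused for each draw.
    class_rows = [np.flatnonzero(y == label) for label in np.unique(y)]
    smallest = min(len(rows) for rows in class_rows)
    if n_per_class is None:
        n_per_class = smallest
    if n_per_class > smallest:
        raise ValueError("n_per_class exceeds the size of the smallest class")
    for _ in range(n_subsamples):
        # Each draw is independent, so different subsamples may overlap.
        picked = [rng.choice(rows, size=n_per_class, replace=False) for rows in class_rows]
        yield rng.permutation(np.concatenate(picked))

# Toy data: 90 rows of class 0 and 10 rows of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)
subsamples = list(balanced_subsample_indices(y, n_subsamples=5, seed=0))
X_first, y_first = X[subsamples[0]], y[subsamples[0]]
```

Keeping only indices avoids copying the feature matrix N times, which may matter for an ensemble with many members.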

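【Discussion】:

How would you extend this to balance samples with custom classes, i.e. not just 1 or 0 but, say, "no_region" and "region" (binary non-numeric classes), or even when x and y are multi-class?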
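One possible answer to this comment, as a sketch: np.unique and the comparison y == label work for string labels as well as numbers, and for any number of classes, so selecting rows by index keeps the labels untouched. The function name below is made up for illustration.

```python
import numpy as np

def balanced_subsample_any_labels(x, y, seed=None):
    """Undersample every class to the size of the smallest one; labels may be strings."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    labels, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == label), size=n_keep, replace=False)
        for label in labels
    ])
    # Index x and y with the same positions so rows and labels stay aligned.
    return x[keep], y[keep]

x = np.arange(12).reshape(6, 2)
y = np.array(["region", "no_region", "no_region", "no_region", "region", "no_region"])
xs, ys = balanced_subsample_any_labels(x, y, seed=1)
```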

【Solution 3】:

A version for a pandas Series:

import numpy as np

def balanced_subsample(y, size=None):

    subsample = []

    if size is None:
        n_smp = y.value_counts().min()
    else:
        n_smp = int(size / len(y.value_counts().index))

    for label in y.value_counts().index:
        samples = y[y == label].index.values
        index_range = range(samples.shape[0])
        indexes = np.random.choice(index_range, size=n_smp, replace=False)
        subsample += samples[indexes].tolist()

    return subsample
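A short usage sketch may help here: the function returns index labels of the Series rather than integer positions, so the matching rows are selected with .loc. The function is repeated in condensed form only so that the example runs on its own; the toy DataFrame and its column names are assumptions.

```python
import numpy as np
import pandas as pd

def balanced_subsample(y, size=None):
    # Same logic as the answer above, condensed so this example is self-contained.
    counts = y.value_counts()
    n_smp = counts.min() if size is None else int(size / len(counts))
    subsample = []
    for label in counts.index:
        labels_of_class = y[y == label].index.values
        subsample += np.random.choice(labels_of_class, size=n_smp, replace=False).tolist()
    return subsample

df = pd.DataFrame(
    {"feature": range(10), "target": ["a"] * 7 + ["b"] * 3},
    index=[f"row{i}" for i in range(10)],
)
picked = balanced_subsample(df["target"])  # index labels such as "row4"
df_balanced = df.loc[picked]               # .loc, because these are labels, not positions
print(df_balanced["target"].value_counts())
```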

【Discussion】:

【Solution 4】:

I found the best solutions here

This is the one I think is the simplest.

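import pandas as pd

dataset = pd.read_csv("data.csv")
X = dataset.iloc[:, 1:12].values
y = dataset.iloc[:, 12].values

# imbalanced-learn releases before 0.4
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X, y)

Then you can use the X_rus, y_rus data.

For version 0.4 and later, where fit_sample was renamed to fit_resample and return_indices was deprecated:

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler()
X_rus, y_rus = rus.fit_resample(X, y)

The indices of the randomly selected samples can then be reached through the rus.sample_indices_ attribute.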

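【Discussion】:

【Solution 5】:

This type of data splitting is not provided among the built-in data splitting techniques exposed in sklearn.cross_validation.

What seems similar to your needs is sklearn.cross_validation.StratifiedShuffleSplit, which can generate subsamples of any size while retaining the structure of the whole dataset, i.e. meticulously enforcing the same imbalance as in your main dataset. While this is not what you are looking for, you may be able to use the code therein and change the imposed ratio to always be 50/50.

(This would probably be a very good contribution to scikit-learn if you feel up to it.)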

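【Discussion】:

It should be simple to implement, i.e. divide the data into classes, shuffle, and then just take the first N elements of each set. I'll see if I can contribute it easily after I have implemented it.
I posted a first implementation as an answer.
I'm not sure if you are still interested, but while I agree that there is no dedicated function for this in sklearn, in my answer below I suggested a way to use existing sklearn functions to the same effect.
The OP is not looking for stratified methods, which keep the proportions of labels in the folds. Your answer and mine do stratification. The difference is that in your choice the folds cannot overlap. That may be wanted in some cases, but the OP explicitly allowed overlap here.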

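【Solution 6】:

Below is my Python implementation for creating a balanced copy of the data. Assumptions: 1. the target variable (y) is a binary class (0 vs. 1); 2. 1 is the minority.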

from numpy import unique
from numpy import random 

def balanced_sample_maker(X, y, random_seed=None):
    """ return a balanced data set by oversampling minority class
        current version is developed on assumption that the positive
        class is the minority.

    Parameters:
    ===========
    X: {numpy.ndarray}
    y: {numpy.ndarray}
    """
    uniq_levels = unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        random.seed(random_seed)

    # find observation index of each class levels
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx

    # oversampling on observations of positive label
    sample_size = uniq_counts[0]
    over_sample_idx = random.choice(groupby_levels[1], size=sample_size, replace=True).tolist()
    balanced_copy_idx = groupby_levels[0] + over_sample_idx
    random.shuffle(balanced_copy_idx)

    return X[balanced_copy_idx, :], y[balanced_copy_idx]

【Discussion】:

【Solution 7】:

Here is a version of the above code that works for multi-class groups (in my test case, groups 0, 1, 2, 3, 4).

import numpy as np
def balanced_sample_maker(X, y, sample_size, random_seed=None):
    """ return a balanced data set by drawing sample_size rows,
        with replacement, from every class.

    Parameters:
    ===========
    X: {numpy.ndarray}
    y: {numpy.ndarray}
    sample_size: number of rows to draw from each class
    random_seed: optional seed for numpy's global random state
    """
    uniq_levels = np.unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        np.random.seed(random_seed)

    # find observation index of each class levels
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx
    # sampling with replacement on observations of each label
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():  # .iteritems() exists only in Python 2
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx += over_sample_idx
    np.random.shuffle(balanced_copy_idx)

    return (X[balanced_copy_idx, :], y[balanced_copy_idx], balanced_copy_idx)

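It also returns the indices, so they can be used for other datasets and to keep track of how frequently each data point was used (helpful for training).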
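As a sketch of the usage tracking mentioned here, the returned indices can be accumulated with np.bincount over several draws. The per-class sampling loop below is a stand-in for repeated calls to the function above, and the data and the number of rounds are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 10 + [2] * 5)
class_rows = [np.flatnonzero(y == label) for label in np.unique(y)]

usage = np.zeros(len(y), dtype=int)  # how many times each row has been drawn so far
for _ in range(20):                  # 20 hypothetical training rounds
    idx = np.concatenate([rng.choice(rows, size=5, replace=True) for rows in class_rows])
    usage += np.bincount(idx, minlength=len(y))

# Every class contributes 5 draws per round, so the totals match across classes,
# while individual rows of the small classes are reused far more often.
per_class_total = [int(usage[rows].sum()) for rows in class_rows]
```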

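【Discussion】:

【Solution 8】:

A short, Pythonic solution to balance a pandas DataFrame either by subsampling (uspl=True) or by oversampling (uspl=False), balanced by a specified column in that DataFrame that has two or more values.

For uspl=True, this code takes a random sample without replacement, of size equal to the smallest stratum, from every stratum. For uspl=False, it takes a random sample with replacement, of size equal to the largest stratum, from every stratum.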

import pandas as pd

def balanced_spl_by(df, lblcol, uspl=True):
    datas_l = [ df[df[lblcol]==l].copy() for l in list(set(df[lblcol].values)) ]
    lsz = [f.shape[0] for f in datas_l ]
    return pd.concat([f.sample(n = (min(lsz) if uspl else max(lsz)), replace = (not uspl)).copy() for f in datas_l ], axis=0 ).sample(frac=1)

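This only works with a pandas DataFrame, but that seems to be a common application, and as far as I can tell restricting it to pandas DataFrames shortens the code significantly.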
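For readers who find the one-liner dense, here is a sketch of the same idea spread over several lines, with both modes exercised on toy data. The function and parameter names (balance_by_column, undersample, random_state) are illustrative and do not come from the answer.

```python
import pandas as pd

def balance_by_column(df, label_col, undersample=True, random_state=None):
    """Resample every group of label_col to the smallest (or largest) group size."""
    groups = [group for _, group in df.groupby(label_col)]
    sizes = [len(group) for group in groups]
    n = min(sizes) if undersample else max(sizes)
    sampled = [
        # Oversampling needs replacement, undersampling does not.
        group.sample(n=n, replace=not undersample, random_state=random_state)
        for group in groups
    ]
    # Shuffle the concatenated groups so the classes are interleaved.
    return pd.concat(sampled).sample(frac=1, random_state=random_state)

df = pd.DataFrame({"value": range(12), "label": ["x"] * 8 + ["y"] * 3 + ["z"]})
down = balance_by_column(df, "label", undersample=True, random_state=0)
up = balance_by_column(df, "label", undersample=False, random_state=0)
```

Here down holds one row per label and up holds eight rows per label, with rows of the smaller groups repeated.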

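【Discussion】:

Exactly what I was hoping to find: using False upsampled my DataFrame perfectly instead of downsampling it. Thanks!

【Solution 9】:

A slight modification to the top answer by mikkom.

If you want to preserve the order of the larger class's data, i.e. you don't want to shuffle: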

Instead of

        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)

do this

        if len(this_xs) > use_elems:
            ratio = len(this_xs) // use_elems  # integer division: a slice step must be an int
            this_xs = this_xs[::ratio]
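A small worked example of the stride idea, under the assumption that the thinned array is still truncated to use_elems afterwards, as the original function does with this_xs[:use_elems]:

```python
import numpy as np

this_xs = np.arange(10)  # stand-in for the rows of the larger class, already in order
use_elems = 3

ratio = len(this_xs) // use_elems  # integer stride: 10 // 3 == 3
thinned = this_xs[::ratio]         # array([0, 3, 6, 9]), one element more than needed
kept = thinned[:use_elems]         # array([0, 3, 6]) after the final truncation
```

The stride keeps the original order but can leave a few extra elements, and rows near the end of the class may never be selected.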

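【Discussion】:

【Solution 10】:

Just select 100 rows from each class, with duplicates (sampling with replacement), using the following code. activity is my class column (the labels of the dataset).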

balanced_df=Pdf_train.groupby('activity',as_index = False,group_keys=False).apply(lambda s: s.sample(100,replace=True))
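Assuming pandas 1.1 or later, a similar result can be sketched with DataFrameGroupBy.sample, which avoids apply and keeps the grouping column in the result. The toy data below is made up; only the names Pdf_train and activity come from the answer.

```python
import pandas as pd

Pdf_train = pd.DataFrame({
    "activity": ["walk"] * 6 + ["run"] * 2,
    "value": range(8),
})

# Draw 100 rows per activity, with replacement so small classes can be repeated.
balanced_df = Pdf_train.groupby("activity").sample(n=100, replace=True, random_state=0)
print(balanced_df["activity"].value_counts())
```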

【Discussion】:

【Solution 11】:

Here are my 2 cents. Assume we have the following imbalanced dataset:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Category': np.random.choice(['A','B','C'], size=1000, replace=True, p=[0.3, 0.5, 0.2]),
                   'Sentiment': np.random.choice([0,1], size=1000, replace=True, p=[0.35, 0.65]),
                   'Gender': np.random.choice(['M','F'], size=1000, replace=True, p=[0.70, 0.30])})
print(df.head())

First rows:

  Category  Sentiment Gender
0        C          1      M
1        B          0      M
2        B          0      M
3        B          0      M
4        A          0      M

Now assume we want to get a dataset balanced by Sentiment:

df_grouped_by = df.groupby(['Sentiment'])

df_balanced = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))

df_balanced = df_balanced.droplevel(['Sentiment'])
df_balanced
print(df_balanced.head())

First rows of the balanced dataset:

  Category  Sentiment Gender
0        C          0      F
1        C          0      M
2        C          0      F
3        C          0      M
4        C          0      M

Let's verify that it is balanced in terms of Sentiment:

df_balanced.groupby(['Sentiment']).size()

We get:

Sentiment
0    369
1    369
dtype: int64

As we can see, we ended up with 369 positive and 369 negative Sentiment labels.

【Discussion】:

【Solution 12】:

My version of a subsampler; hope it helps.

import random

def subsample_indices(y, size):
    indices = {}
    target_values = set(y)  # was set(y_train), a name that is not defined here
    for t in target_values:
        indices[t] = [i for i in range(len(y)) if y[i] == t]
    # keep at most `size` indices per class, and never more than the smallest class has
    min_len = min(size, min([len(indices[t]) for t in indices]))
    for t in indices:
        if len(indices[t]) > min_len:
            indices[t] = random.sample(indices[t], min_len)
    return indices

x = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, -1]
j = subsample_indices(x, 2)
print(j)
print([x[t] for t in j[-1]])
print([x[t] for t in j[1]])

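【Discussion】:

Can you explain in your answer how this is better than the current accepted answer?

【Solution 13】:

Although this has already been answered, I stumbled upon your question while looking for something similar. After some research, I believe sklearn.model_selection.StratifiedKFold can be used for this purpose: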

from sklearn.model_selection import StratifiedKFold

X = samples_array
y = classes_array # subsamples will be stratified according to y
n = desired_number_of_subsamples

skf = StratifiedKFold(n, shuffle = True)

for _, batch in skf.split(X, y):
    do_something(X[batch], y[batch])

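Adding the _ is important: since skf.split() is used to create stratified folds for K-fold cross-validation, it returns two lists of indices: train ((n - 1)/n of the elements) and test (1/n of the elements).

Please note that this is as of sklearn 0.18. In sklearn 0.17 the same function can be found in the cross_validation module.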

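【Discussion】:

I just noticed this answer. If it works as assumed, this is probably exactly the answer I was looking for when I asked the question. Thanks for the late reply! Edit: this is NOT the answer I was looking for, as this is stratified. For an ensemble of 1000 classifiers the sample size needs to be huge.
Stratified sampling means that the distribution of classes in a sample reflects the distribution of classes in the original dataset. In other words, if your dataset has 90% class 0 and 10% class 1, your sample will have 90% class 0 and 10% class 1. The classes will still be imbalanced.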
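The difference described in this comment can be illustrated with a small NumPy sketch on made-up data with a 90/10 class split: a stratified sample keeps those proportions, while a balanced sample forces 50/50.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 900 + [1] * 100)  # 90% class 0, 10% class 1
rows0, rows1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)

# Stratified sample of 100 rows: keeps the 90/10 proportions of the full data.
stratified = np.concatenate([rng.choice(rows0, 90, replace=False),
                             rng.choice(rows1, 10, replace=False)])
# Balanced sample of 100 rows: the same number of rows from each class.
balanced = np.concatenate([rng.choice(rows0, 50, replace=False),
                           rng.choice(rows1, 50, replace=False)])

print(np.bincount(y[stratified]))  # [90 10]
print(np.bincount(y[balanced]))    # [50 50]
```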

【Solution 14】:

This is my solution, which can be tightly integrated into an existing sklearn pipeline:

from sklearn.model_selection import RepeatedKFold
import numpy as np


class DownsampledRepeatedKFold(RepeatedKFold):

    def split(self, X, y=None, groups=None):
        for i in range(self.n_repeats):
            np.random.seed()
            # get index of major class (negative)
            idxs_class0 = np.argwhere(y == 0).ravel()
            # get index of minor class (positive)
            idxs_class1 = np.argwhere(y == 1).ravel()
            # get length of minor class
            len_minor = len(idxs_class1)
            # subsample of major class of size minor class
            # replace=False so that no major-class row is drawn twice
            idxs_class0_downsampled = np.random.choice(idxs_class0, size=len_minor, replace=False)
            original_indx_downsampled = np.hstack((idxs_class0_downsampled, idxs_class1))
            np.random.shuffle(original_indx_downsampled)
            splits = list(self.cv(n_splits=self.n_splits, shuffle=True).split(original_indx_downsampled))

            for train_index, test_index in splits:
                yield original_indx_downsampled[train_index], original_indx_downsampled[test_index]

    def __init__(self, n_splits=5, n_repeats=10, random_state=None):
        self.n_splits = n_splits
        super(DownsampledRepeatedKFold, self).__init__(
            n_splits=n_splits, n_repeats=n_repeats, random_state=random_state
        )

Use it as usual:

for train_index, test_index in DownsampledRepeatedKFold(n_splits=5, n_repeats=10).split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

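【Discussion】:

【Solution 15】:

Here is a solution which is:

simple
fast (besides one for loop, pure NumPy)
free of external dependencies other than NumPy
very cheap for generating new balanced random samples (just call np.random.choice()), which is useful for generating different shuffled and balanced samples between training epochs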

import numpy as np

def stratified_random_sample_weights(labels):
    # labels: one-hot (or multi-hot) array of shape (num_samples, n_classes)
    num_samples, n_classes = labels.shape
    sample_weights = np.zeros(num_samples)
    for class_i in range(n_classes):
        class_indices = np.where(labels[:, class_i] == 1)[0]  # row indices where class_i is 1
        num_samples_class_i = len(class_indices)
        assert num_samples_class_i > 0, f"No samples found for class index {class_i}"

        sample_weights[class_indices] = 1.0 / num_samples_class_i  # note: samples with no classes present will get weight=0

    return sample_weights / sample_weights.sum()  # sum(weights) == 1

Then you reuse these weights over and over to generate balanced indices with np.random.choice():

sample_weights = stratified_random_sample_weights(labels)
chosen_indices = np.random.choice(list(range(num_samples)), size=sample_size, replace=True, p=sample_weights)

Full example:

# generate data
import numpy as np
from sklearn.preprocessing import OneHotEncoder

num_samples = 10000
n_classes = 10
ground_truth_class_weights = np.logspace(1,3,num=n_classes,base=10,dtype=float)  # exponentially growing
ground_truth_class_weights /= ground_truth_class_weights.sum()  # sum to 1
labels = np.random.choice(list(range(n_classes)), size=num_samples, p=ground_truth_class_weights)
labels = labels.reshape(-1, 1)  # turn each element into a list
labels = OneHotEncoder().fit_transform(labels).toarray()  # dense one-hot array, independent of the sparse/sparse_output keyword


print(f"original counts: {labels.sum(0)}")
# [  38.   76.  127.  191.  282.  556.  865. 1475. 2357. 4033.]

sample_weights = stratified_random_sample_weights(labels)
sample_size = 1000
chosen_indices = np.random.choice(list(range(num_samples)), size=sample_size, replace=True, p=sample_weights)

print(f"rebalanced counts: {labels[chosen_indices].sum(0)}")
# [104. 107.  88. 107.  94. 118.  92.  99. 100.  91.]
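When the labels are plain integers rather than one-hot rows, the same weighting can be sketched without the loop by using np.bincount. This is a variation on the answer, not its code: the function name is hypothetical, and it assumes the labels are integers 0..K-1 with one label per sample.

```python
import numpy as np

def balanced_sample_weights(int_labels):
    """Weight each sample by 1 / (size of its class), normalised to sum to 1."""
    counts = np.bincount(int_labels)    # class sizes, indexed by label
    weights = 1.0 / counts[int_labels]  # every class gets the same total weight
    return weights / weights.sum()

rng = np.random.default_rng(0)
int_labels = rng.choice(3, size=6000, p=[0.8, 0.15, 0.05])  # imbalanced toy labels
weights = balanced_sample_weights(int_labels)

# The weights are computed once; each epoch only needs a new weighted draw.
for epoch in range(3):
    chosen = rng.choice(len(int_labels), size=3000, replace=True, p=weights)
    print(epoch, np.bincount(int_labels[chosen], minlength=3))
```

The printed counts vary from draw to draw, but each class carries one third of the total weight, so the counts should be roughly equal.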

【Discussion】: