Drawing equal samples from each class in stratified sampling: answers

【Question title】: Drawing equal samples from each class in stratified sampling
【Posted】: 2021-01-20 21:45:25
【Question】:

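I have 1000 samples of class 1 and 2500 samples of class 2. The natural choice is sklearn's train_test_split(test_size=200, stratify=y), but that gives me an imbalanced test set, because stratification preserves the class distribution of the original dataset. What I want instead is a test set with 100 samples of class 1 and 100 samples of class 2.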
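To make the imbalance concrete, here is a minimal sketch that reproduces the class sizes from the question. The feature matrix `X` is a placeholder and `random_state=0` is an arbitrary choice; neither comes from the original post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same class sizes as in the question: 1000 of class 1, 2500 of class 2.
y = np.array([1] * 1000 + [2] * 2500)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, for illustration only

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=200, stratify=y, random_state=0
)

# Per-class counts in the test set follow the original 1000:2500 ratio,
# so class 1 ends up with far fewer than 100 rows.
classes, counts = np.unique(y_test, return_counts=True)
test_counts = dict(zip(classes.tolist(), counts.tolist()))
print(test_counts)
```

With `stratify=y`, the 200 test rows are divided roughly in proportion to 1000:2500, which is why the equal 100/100 split cannot be obtained this way.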

How can I do that? Any suggestions would be appreciated.

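【Comments】:

The title is a bit misleading. Consider changing it to "Drawing an equal number of samples from each class in stratified sampling".

Tags: python machine-learning scikit-learn classification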

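【Solution 1】:

Manual split

A manual solution is not that scary. The main steps:

1. Isolate the row indexes of class 1 and class 2.
2. Use np.random.permutation() to randomly select n1 test samples from class 1 and n2 test samples from class 2.
3. Use df.index.difference() to perform the inverse selection, which gives the training samples.

The code generalizes easily to any number of classes and any number of test samples per class (put n1/n2, idx1/idx2, etc. into lists and loop over them), but that is beyond the scope of the question.

Code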

import numpy as np
import pandas as pd

# data
df = pd.DataFrame(
    data={
        "label": np.array([1]*1000 + [2]*2500),
        # label 1 has value > 0, label 2 has value < 0
        "value": np.hstack([np.random.uniform(0, 1, 1000),
                            np.random.uniform(-1, 0, 2500)])
    }
)
df = df.sample(frac=1).reset_index(drop=True)

# sampling number for each class
n1 = 100
n2 = 100

# 1. get indexes and lengths for the classes respectively
idx1 = df.index.values[df["label"] == 1]
idx2 = df.index.values[df["label"] == 2]
len1 = len(idx1)  # 1000
len2 = len(idx2)  # 2500

# 2. draw index for test dataset
draw1 = np.random.permutation(len1)[:n1]  # keep the first n1 entries to be selected
idx1_test = idx1[draw1]
draw2 = np.random.permutation(len2)[:n2]
idx2_test = idx2[draw2]
# combine the drawn indexes
idx_test = np.hstack([idx1_test, idx2_test])

# 3. derive index for train dataset
idx_train = df.index.difference(idx_test)

# split
df_train = df.loc[idx_train, :]  # optional: .reset_index(drop=True)
df_test = df.loc[idx_test, :]
# len(df_train) == 3300
# len(df_test) == 200

# verify that no row is missing or duplicated across the two sets
idx_merged = np.hstack([df_train.index.values, df_test.index.values])
assert len(np.unique(idx_merged)) == 3500
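The generalization to any number of classes and any per-class test count could be sketched as below. The helper name `split_fixed_per_class` and its parameters are hypothetical, and the sketch assumes the DataFrame index is unique. It draws with `numpy.random.Generator.choice` rather than `np.random.permutation`, which should be equivalent for this purpose.

```python
import numpy as np
import pandas as pd


def split_fixed_per_class(df, label_col, n_test, seed=None):
    """Hypothetical helper: put n_test[label] rows of each class into the
    test set and all remaining rows into the training set."""
    rng = np.random.default_rng(seed)
    test_parts = []
    for label, n in n_test.items():
        # index labels of all rows that belong to this class
        idx = df.index[df[label_col] == label].to_numpy()
        if n > len(idx):
            raise ValueError(f"class {label!r} has only {len(idx)} rows, need {n}")
        # draw n index labels without replacement
        test_parts.append(rng.choice(idx, size=n, replace=False))
    idx_test = np.concatenate(test_parts)
    # inverse selection: everything that was not drawn
    idx_train = df.index.difference(idx_test)
    return df.loc[idx_train], df.loc[idx_test]


df = pd.DataFrame({"label": [1] * 1000 + [2] * 2500, "value": np.arange(3500)})
df_train, df_test = split_fixed_per_class(df, "label", {1: 100, 2: 100}, seed=0)
```

Passing the per-class counts as a dict keeps the call site readable and makes unequal counts (for example `{1: 50, 2: 150}`) just as easy to request.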

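【Comments】:

I also wondered whether sklearn/scipy/pandas has a built-in function for this. Unfortunately, my search of the official docs and Google turned up nothing.
I went through several docs as well, with no luck. I just thought that doing this through a package would be more "reliable". Thanks for the suggestion anyway.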
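One candidate worth checking is pandas' `DataFrameGroupBy.sample`, available from pandas 1.1 onward, which draws a fixed number of rows per group. The sketch below assumes the same `label` column as the answer and a unique index; `random_state=0` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"label": [1] * 1000 + [2] * 2500, "value": np.arange(3500)})

# Draw 100 rows from each label group (requires pandas >= 1.1).
df_test = df.groupby("label").sample(n=100, random_state=0)

# Everything that was not drawn goes to the training set.
df_train = df.drop(df_test.index)
```

This only handles the same count for every class; unequal per-class counts would still need a loop like the manual solution.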