Stratified Sampling to Mimic Population Distribution: Answers

【Question title】: Stratified Sampling to mimic Population Distribution
【Posted】: 2020-06-17 04:12:33
【Question description】:

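I am new to R. I recently used stratified sampling for my train/test split so that the target label has the same proportion in both sets. Now I want to downsample the training data so that the distribution of the new, smaller sample stays similar to the population/training distribution.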

The reason I want to downsample is that I have 11 million rows and 56 columns, and parameter tuning via grid/random/Bayesian search takes days.

I am using XGBoost, and this is a binary classification problem.

I would appreciate it if anyone could help.

Below is my code:

    library(caTools)  # provides sample.split()
    # Stratified split: sample.split() preserves the label ratios in both parts
    train_rows <- sample.split(df$ModelLabel, SplitRatio = 0.7)
    train <- df[train_rows, ]
    test  <- df[!train_rows, ]

【Question discussion】:

Tags: r machine-learning random sampling downsampling

【Solution 1】:

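The simplest way to do this is to compute the ratio between the two classes. Suppose that out of the 11 million rows, 3 million are class 0 and 8 million are class 1, so the 0:1 ratio is 3:8. To downsample to 1 million rows, randomly select 1 million rows while keeping that 3:8 ratio. That works out to roughly 270,000 class-0 rows (1,000,000 × 3/11 ≈ 272,727) and roughly 730,000 class-1 rows (1,000,000 × 8/11 ≈ 727,273); you can compute the exact numbers for your own data. You can then use the pandas DataFrame.sample() function to draw the downsampled data. The code for this is written in Python.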

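    import pandas as pd

    # Split the data by class; `target` is the label column
    df_class_0 = df[df.target == 0]
    df_class_1 = df[df.target == 1]
    # Draw about 270,000 class-0 rows and 730,000 class-1 rows (the 3:8 ratio)
    df_class_0_under = df_class_0.sample(n=270_000)
    df_class_1_under = df_class_1.sample(n=730_000)
    df_test_under = pd.concat([df_class_0_under, df_class_1_under], axis=0)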
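
If you would rather not work out the per-class counts by hand, sampling the same fraction from every class gives the same result. The following is only a minimal sketch, assuming pandas 1.1 or later (for groupby sampling). The function name stratified_downsample, its parameters, and the toy data are hypothetical and only for illustration; they do not come from the question.

```python
import pandas as pd


def stratified_downsample(df, label_col, n_total, seed=0):
    """Return about n_total rows with the same class proportions as df."""
    # Using one sampling fraction for every class keeps the class ratio intact.
    frac = n_total / len(df)
    return df.groupby(label_col).sample(frac=frac, random_state=seed)


# Toy data with the 3:8 class ratio from the answer (11,000 rows in total).
toy = pd.DataFrame({"target": [0] * 3000 + [1] * 8000, "x": range(11000)})
small = stratified_downsample(toy, "target", n_total=1100)

# Compare class proportions before and after downsampling.
print(toy["target"].value_counts(normalize=True).sort_index())
print(small["target"].value_counts(normalize=True).sort_index())
```

Both printed distributions should match, at about 0.27 for class 0 and 0.73 for class 1. On the real data you would pass the training frame, the label column name, and n_total=1_000_000 instead of the toy values.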

【Discussion】: