Scikit-Learn GroupShuffleSplit is not grouping by specified groups

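【Question title】: Scikit-Learn GroupShuffleSplit is not grouping by specified groups
【Posted】: 2020-02-17 17:42:22
【Question】:

I am trying to split a time series of farm data collected daily over 8 years. I want to split the data so that the training and test sets each contain samples from different farms, with no farm appearing in both sets. I created a column in the DataFrame holding a unique FarmID that indicates which farm each sample came from.

Here is roughly what the dataset looks like: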

df

╔════════╦════════════╦═══════════╦═════╦═══════════╗
║ FarmID ║  datetime  ║ Feature_1 ║ ... ║ Feature_n ║
╠════════╬════════════╬═══════════╬═════╬═══════════╣
║ 0      ║ 2009-01-01 ║ 45.76     ║ ... ║ 15.12     ║
║ ...    ║ ...        ║ ...       ║ ... ║ ...       ║
║ 3668   ║ 2017-12-31 ║ 12.12     ║ ... ║ 15.75     ║
╚════════╩════════════╩═══════════╩═════╩═══════════╝
6702142 rows × 35 columns


df[df.FarmID==0]

╔════════╦════════════╦═══════════╦═════╦═══════════╗
║ FarmID ║  datetime  ║ Feature_1 ║ ... ║ Feature_n ║
╠════════╬════════════╬═══════════╬═════╬═══════════╣
║ 0      ║ 2009-01-01 ║ 35.31     ║ ... ║ 67.41     ║
║ ...    ║ ...        ║ ...       ║ ... ║ ...       ║
║ 0      ║ 2017-12-31 ║ 2.15      ║ ... ║ 5.21      ║
╚════════╩════════════╩═══════════╩═════╩═══════════╝
1096 rows x 35 columns


# Note: Not all farms contain the same number of samples as some farms didn't submit data in some years.

To split the dataset, this is the code I used:

from sklearn.model_selection import GroupShuffleSplit

df = df.sort_values('FarmID')

# `seed` is assumed to be defined earlier in the session (not shown here).
def group_split(df, test_size=.80, seed=seed):
    # n_splits=1: a single train/test partition, grouped by FarmID.
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)

    for test_indices, train_indices in gss.split(df, groups=df.FarmID):
        train = df.loc[train_indices]
        test = df.loc[test_indices]

    return train, test

train, test = group_split(df)

When I check the unique farms in the train and test splits, I find that some farms appear in both sets.

In: train.FarmID.unique()

Out: array([2.000e+00, 4.000e+00, 8.000e+00, ..., 2.245e+03, 2.229e+03,
            2.575e+03])


In: test.FarmID.unique()

Out: array([0.000e+00, 1.000e+00, 1.300e+01, ..., 2.245e+03, 2.229e+03,
            2.575e+03])


In: n = 2245
    df[df.FarmID==n].shape
    train[train.FarmID==n].shape
    test[test.FarmID==n].shape

Out: (1826, 35)
     (1225, 35)
     (601, 35)

However, some farms were split correctly.

In: n = 3668
    df[df.FarmID==n].shape
    train[train.FarmID==n].shape
    test[test.FarmID==n].shape

Out: (705, 35)
     (705, 35)
     (0, 35)

In addition, 995 of the 3669 farms overlap between the train and test sets.

In: train_FarmIDs = train.FarmID.unique()
    test_FarmIDs = test.FarmID.unique()
    len(set(train_FarmIDs).intersection(set(test_FarmIDs)))

Out: 995

I'm very confused as to why sklearn's GroupShuffleSplit isn't splitting by the groups I specified. If anyone can help me with this, I'd really appreciate it!

【Question discussion】:

Tags: python-3.x pandas scikit-learn data-science train-test-split

【Solution 1】:

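Just a guess, but I think gss converts your DataFrame to an ndarray and returns positional indices into that ndarray. You sort df, which scrambles the order of its index labels, and then use .loc[], which selects by index label rather than by position. Try .iloc[] instead, or convert df to a NumPy array before using gss and slice the NumPy array rather than the DataFrame.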
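
A minimal sketch of the .iloc[] suggestion, run on a hypothetical toy frame rather than the real farm data. The toy values, the seed of 0, and the 0.20 test size are assumptions made for illustration and are not taken from the question. It also unpacks the split as (train, test), which is the order GroupShuffleSplit.split yields.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the farm data (hypothetical values): 6 farms, 5 rows each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "FarmID": rng.permutation(np.repeat(np.arange(6), 5)),
    "Feature_1": rng.normal(size=30),
})

# Sorting keeps the original index labels, so labels no longer match positions.
df = df.sort_values("FarmID")


def group_split(df, test_size=0.20, seed=0):
    """Split df so that no FarmID appears in both train and test."""
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    # split() yields (train, test) arrays of positional indices.
    train_pos, test_pos = next(gss.split(df, groups=df["FarmID"]))
    # .iloc selects by position; .loc would select by index label instead.
    return df.iloc[train_pos], df.iloc[test_pos]


train, test = group_split(df)
overlap = set(train["FarmID"]) & set(test["FarmID"])
print(len(overlap))  # 0
```

Calling df.reset_index(drop=True) after the sort would be another way to make labels and positions agree again, in which case .loc[] and .iloc[] select the same rows.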

【Discussion】:

Still interested in this question? I did follow this answer; the problem might be "groups=df.FarmID". Does it take the FarmID values or the index of df?
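
One way to probe the question in this comment is to pass the same groups once as a Series and once as a plain array, then compare the results. The sketch below uses a hypothetical toy frame whose index labels deliberately differ from row positions; the values and the seed are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy frame: index labels run opposite to row positions.
df = pd.DataFrame(
    {"FarmID": [0, 0, 1, 1, 2, 2, 3, 3]},
    index=[70, 60, 50, 40, 30, 20, 10, 0],
)

gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)

# Same groups, once as a Series (carrying its index) and once as a plain array.
train_a, test_a = next(gss.split(df, groups=df["FarmID"]))
train_b, test_b = next(gss.split(df, groups=df["FarmID"].to_numpy()))

# Identical splits suggest the Series index plays no role in the grouping.
same_split = np.array_equal(train_a, train_b) and np.array_equal(test_a, test_b)
print(same_split)

# The returned values are row positions 0..len(df)-1, not index labels.
positions = sorted(np.concatenate([train_a, test_a]).tolist())
print(positions)
```

If both checks behave as expected, the FarmID values are what define the groups, and the mismatch in the question would come from feeding the returned positions to .loc[] rather than from the groups argument.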