Divide pandas data frame into test and train based on unique ID: answers

【Question title】: Divide pandas data frame into test and train based on unique ID
【Posted】: 2021-09-01 14:14:06
【Question】:

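I want to split my pandas data frame into two data frames (train and test) using the values in the id column. The split should be such that the first data frame holds 70% of the (unique) ids and the second holds the remaining 30%. The ids should be split at random.

Each id has multiple rows associated with it.

This is the script I am trying: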

Training_data, Test_data = sklearn.model_selection.train_test_split(data, data['ID_sample'].unique(), train_size=0.30, test_size=0.70, random_state=5)

【Comments】:

data_train, data_test = train_test_split(data, test_size=0.3, stratify=data['ID_sample'])
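@pratyaysengupta Welcome to the SO community. To get help and decent answers, you need to provide a minimal dataset that reproduces your setup; please show some of the data from your data frame.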

Tags: python-3.x pandas dataframe scikit-learn

【Solution 1】:

I solved the problem in the following way:

import sklearn.model_selection

# Split the unique IDs (not the rows) 70/30
samplelist = data["ID_sample"].unique()
training_samp, test_samp = sklearn.model_selection.train_test_split(samplelist, train_size=0.7, test_size=0.3, random_state=5, shuffle=True)

# Keep the rows whose ID falls in each group
training_data = data[data['ID_sample'].isin(training_samp)]
test_data = data[data['ID_sample'].isin(test_samp)]
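
To make the approach above easy to try, here is a minimal self-contained sketch. The toy data frame (10 IDs with 3 rows each, plus a `value` column) is invented for illustration and is not from the question; only the `ID_sample` column name comes from the original post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 IDs, 3 rows per ID.
data = pd.DataFrame({
    "ID_sample": [f"id{i}" for i in range(10) for _ in range(3)],
    "value": range(30),
})

# Split the unique IDs, not the rows.
sample_ids = data["ID_sample"].unique()
train_ids, test_ids = train_test_split(sample_ids, test_size=0.3, random_state=5)

# Select every row belonging to each group of IDs.
training_data = data[data["ID_sample"].isin(train_ids)]
test_data = data[data["ID_sample"].isin(test_ids)]

# Sanity check: no ID appears in both sets.
assert set(training_data["ID_sample"]).isdisjoint(test_data["ID_sample"])
```

Because the split is over IDs, the row counts of the two frames follow the 70/30 ratio only when every ID has about the same number of rows.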

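【Discussion】:

【Solution 2】:

I'm not a sklearn expert, but I know a little about it, and this is a question that almost every newcomer asks at some point.

Anyway, here is how you can solve it: import train_test_split from sklearn.model_selection and use it. I just created some random data and applied it to that.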

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(100, 2))
>>> df
           0         1
0  -1.214487  0.455726
1  -0.898623  0.268778
2   0.262315 -0.009964
3   0.612664  0.786531
4   1.249646 -1.020366
..       ...       ...
95 -0.171218  1.083018
96  0.122685 -2.214143
97 -1.420504  0.469372
98  0.061177  0.465881
99 -0.262667 -0.406031

[100 rows x 2 columns]
>>> from sklearn.model_selection import train_test_split
>>> train, test = train_test_split(df, test_size=0.3)

This is your first data frame, train:

>>> train
           0         1
26 -2.651343 -0.864565
17  0.106959 -0.763388
78 -0.398269 -0.501073
25  1.452795  1.290227
47  0.158705 -1.123697
..       ...       ...
29 -1.909144 -0.732514
7   0.641331 -1.336896
43  0.769139  2.816528
59 -0.683185  0.442875
11 -0.543988 -0.183677

[70 rows x 2 columns]

This is the second data frame, test:

>>> test
           0         1
30 -1.562427 -1.448936
24  0.638780  1.868500
70 -0.572035  1.615093
72  0.660455 -0.331350
82  0.295644 -0.403386
22  0.942676 -0.814718
15 -0.208757 -0.112564
45  1.069752 -1.894040
18  0.600265  0.599571
93 -0.853163  1.646843
91 -1.172471 -1.488513
10  0.728550  1.637609
36 -0.040357  2.050128
4   1.249646 -1.020366
60 -0.907925 -0.290945
34  0.029384  0.452658
38  1.566204 -1.171910
33 -1.009491  0.105400
62  0.930651 -0.124938
42  0.401900 -0.472175
80  1.266980 -0.221378
95 -0.171218  1.083018
74 -0.160058 -1.383118
28  1.257940  0.604513
87 -0.136468 -0.109718
27  1.909935 -0.712136
81 -1.449828 -1.823526
61  0.176301 -0.885127
53 -0.593061  1.547997
57 -0.527212  0.781028

In your case, it should ideally work as follows. Note that if you define test_size, you don't need to define train_size, and vice versa.

>>> train, test = train_test_split(data['ID_sample'], test_size=0.3)

or

>>> train, test = train_test_split(data['ID_sample'], test_size=0.3, random_state=5)

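or, to split the unique IDs themselves so that each ID ends up in only one of the two sets

This returns a list of arrays ...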

>>> train, test = train_test_split(data['ID_sample'].unique(), test_size=0.30, random_state=5)
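
The call above returns arrays of IDs rather than row subsets. As an alternative that is not part of the original answer, scikit-learn's `GroupShuffleSplit` can produce row indices directly while keeping all rows of one ID on the same side. This is only a sketch: the toy data frame is invented for illustration, and `test_size=0.3` here is a proportion of groups (IDs), not of rows.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 10 IDs, 3 rows per ID.
data = pd.DataFrame({
    "ID_sample": [f"id{i}" for i in range(10) for _ in range(3)],
    "value": range(30),
})

# One random split in which 30% of the IDs go to the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=5)
train_idx, test_idx = next(splitter.split(data, groups=data["ID_sample"]))

# The splitter yields positional indices, so use iloc.
training_data = data.iloc[train_idx]
test_data = data.iloc[test_idx]

# Sanity check: no ID appears in both sets.
assert set(training_data["ID_sample"]).isdisjoint(test_data["ID_sample"])
```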

【Discussion】: