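【Posted】: 2017-06-28 00:50:08
【Question】:
I'm trying to work out how to randomly split data in Azure ML (so an R solution is acceptable) based on a column, so that all records with any given value in that column end up on one side of the split or the other. For example: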
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 1234 | 1 | Foo | 1 |
| 5678 | 0 | Bar | 1 |
| 9101112 | 1 | Quack | 1 |
| 13141516 | 1 | Meep | 1 |
| 1234 | 0 | Boop | 2 |
| 5678 | 0 | Baa | 2 |
| 9101112 | 0 | Bleat | 2 |
| 13141516 | 1 | Maaaa | 2 |
| 1234 | 0 | Foo | 3 |
| 5678 | 0 | Bar | 3 |
| 9101112 | 1 | Quack | 3 |
| 13141516 | 1 | Meep | 3 |
| 1234 | 1 | Boop | 4 |
| 5678 | 1 | Baa | 4 |
| 9101112 | 0 | Bleat | 4 |
| 13141516 | 1 | Maaaa | 4 |
+------------+------+--------------------+------+
If I chose a 50/50 split and grouped on the Student ID column, an acceptable output would be two new datasets:
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 1234 | 1 | Foo | 1 |
| 1234 | 0 | Boop | 2 |
| 1234 | 0 | Foo | 3 |
| 1234 | 1 | Boop | 4 |
| 9101112 | 1 | Quack | 1 |
| 9101112 | 0 | Bleat | 2 |
| 9101112 | 1 | Quack | 3 |
| 9101112 | 0 | Bleat | 4 |
+------------+------+--------------------+------+
and
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 5678 | 0 | Bar | 1 |
| 5678 | 0 | Baa | 2 |
| 5678 | 0 | Bar | 3 |
| 5678 | 1 | Baa | 4 |
| 13141516 | 1 | Meep | 1 |
| 13141516 | 1 | Maaaa | 2 |
| 13141516 | 1 | Meep | 3 |
| 13141516 | 1 | Maaaa | 4 |
+------------+------+--------------------+------+
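Now, as far as I understand it, this is essentially the opposite of a stratified split, which would produce a random sample in which every student is represented on both sides.
I would prefer an Azure ML function that does this, but I think that is unlikely, so is there an R function or library that provides this kind of functionality? I could only find questions about stratification, which obviously doesn't help me much.
【Comments】: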
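- Just sample() the unique() student IDs and subset the rows with %in%.
- @alistaire That makes sense, and I feel silly for not having thought of it :/ If you want to add it as an answer, I'll accept it.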
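The suggestion in the comment above could be sketched in base R roughly as follows. This is a minimal illustration, not a tested Azure ML recipe: the data frame `df`, the column name `student_id`, and the 50/50 fraction are assumptions chosen to mirror the example table, and `some_other_feature` is left out for brevity.

```r
# Toy data mirroring the question's table (names are illustrative assumptions).
df <- data.frame(
  student_id = rep(c(1234, 5678, 9101112, 13141516), times = 4),
  pass       = c(1, 0, 1, 1,  0, 0, 0, 1,  0, 0, 1, 1,  1, 1, 0, 1),
  week       = rep(1:4, each = 4)
)

set.seed(42)  # only so the sketch is reproducible

# Draw the split over distinct IDs, not over rows.
ids     <- unique(df$student_id)
n_train <- floor(0.5 * length(ids))

# Sample positions rather than the IDs themselves: sample(x, ...) treats a
# single numeric x as 1:x, which would misbehave if only one ID were present.
train_ids <- ids[sample(length(ids), n_train)]

# Every row of a given student lands on exactly one side.
train <- df[df$student_id %in% train_ids, ]
test  <- df[!(df$student_id %in% train_ids), ]
```

Note that the fraction applies to the number of distinct IDs, so when students have different numbers of rows the row counts of the two sets will only be approximately 50/50.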
Tags: r random azure-machine-learning-studio