【Posted】: 2026-01-18 20:25:01
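【Question】:
I have a very large dataset on product returns. To build an explanatory model, I need a subset in which half of the products were returned (1) and half were not (0); the return status is stored as a binary variable. How can I randomly draw such a subset from the data?
Below is part of the dataset: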
> dput(head(dat, 100))
structure(list(data5.order_id = c(24409499, 24409499, 37018675,
49812254, 72349794, 121649820, 121649820, 123680104, 123680104,
123680104, 156423543, 156423543, 156423543, 156423543, 156423543,
156423543, 156423543, 156423543, 156423543, 156423543, 156423543,
156423543, 156423543, 156423543, 156423543, 156423543, 156423543,
156423543, 156423543, 156423543, 156423543, 156423543, 169218518,
169218518, 169218518, 169218518, 169218518, 169218518, 169218518,
169218518, 169218518, 169218518, 169218518, 169218518, 169218518,
169218518, 169218518, 169218518, 169218518, 169218518, 198566821,
198566821, 198566821, 198566821, 204651617, 204651617, 225070398,
244297553, 244297553, 244297553, 244297553, 244297553, 244297553,
264159404, 286533497, 302587170, 302587170, 302587170, 302587170,
302587170, 302587170, 302587170, 302587170, 302587170, 302587170,
302587170, 302587170, 302587170, 302587170, 302587170, 302587170,
302587170, 302587170, 302587170, 302587170, 302587170, 302587170,
302587170, 302587170, 302587170, 302587170, 308442395, 308442395,
308442395, 312804245, 318656210, 360581093, 360581093, 381985214,
381985214), data5.returnyesno = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
0, 1, 0, 0, 0, 1, 1), data5.customer_id = c(3150040285, 3150040285,
1437583473, 319353305, 620027539, 3023138737, 3023138737, 2519171220,
2519171220, 2519171220, 4599523733, 4599523733, 4599523733, 4599523733,
4599523733, 4599523733, 4599523733, 4599523733, 4599523733, 4599523733,
4599523733, 4599523733, 4599523733, 4599523733, 4599523733, 4599523733,
4599523733, 4599523733, 4599523733, 4599523733, 4599523733, 4599523733,
1816785895, 1816785895, 1816785895, 1816785895, 1816785895, 1816785895,
1816785895, 1816785895, 1816785895, 1816785895, 1816785895, 1816785895,
1816785895, 1816785895, 1816785895, 1816785895, 1816785895, 1816785895,
1131020953, 1131020953, 1131020953, 1131020953, 2335167491, 2335167491,
1327858307, 330788549, 330788549, 330788549, 330788549, 330788549,
330788549, 3230395728, 3888591660, 1158650034, 1158650034, 1158650034,
1158650034, 1158650034, 1158650034, 1158650034, 1158650034, 1158650034,
1158650034, 1158650034, 1158650034, 1158650034, 1158650034, 1158650034,
1158650034, 1158650034, 1158650034, 1158650034, 1158650034, 1158650034,
1158650034, 1158650034, 1158650034, 1158650034, 1158650034, 908821356,
908821356, 908821356, 1155228355, 684878789, 3389325926, 3389325926,
1808359289, 1808359289)), row.names = c(NA, 100L), class = "data.frame")
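【Discussion】:
-
Please provide enough code so that others can better understand or reproduce the problem.
-
What would you like to see in order to understand it? My dataset is too large, so I can't provide the whole thing.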
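One possible way to draw the balanced subset the question asks for is sketched below in base R. It is only an illustration, not a confirmed answer from the thread: the toy data frame stands in for the real `dat`, only the column name `data5.returnyesno` is taken from the question, and the seed and variable names (`idx_returned`, `idx_kept`, `n_per_class`, `balanced`) are assumptions. The idea is to sample the same number of rows, without replacement, from each class, with that number capped by the smaller class.

```r
set.seed(42)  # assumption: fixed seed so the draw is reproducible

# Toy stand-in for the real data; replace with your own `dat`.
dat <- data.frame(
  data5.order_id    = 1:1000,
  data5.returnyesno = rbinom(1000, 1, 0.3)
)

# Row indices of returned (1) and non-returned (0) products
idx_returned <- which(dat$data5.returnyesno == 1)
idx_kept     <- which(dat$data5.returnyesno == 0)

# Each half can be at most as large as the smaller class
n_per_class <- min(length(idx_returned), length(idx_kept))

# Index through sample.int() rather than calling sample(idx, n):
# sample(x, n) treats a length-one numeric x as 1:x.
balanced_idx <- c(
  idx_returned[sample.int(length(idx_returned), n_per_class)],
  idx_kept[sample.int(length(idx_kept), n_per_class)]
)

# Shuffle so the two classes are not stored in two blocks
balanced <- dat[balanced_idx[sample.int(length(balanced_idx))], ]

# Check: both classes should have the same count
table(balanced$data5.returnyesno)
```

If a fixed subset size is wanted instead, `n_per_class` could be set to any value no larger than the smaller class. Note that this samples individual rows, so items from the same order may end up split between the subset and the remaining data.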