【Question title】: Subset part of data based on values
【Posted】: 2026-01-18 20:25:01
【Question description】:

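I have a very large dataset on returned products. To build an explanatory model, I need a subset in which half of the products were returned (1) and half were not returned (0); the return status is stored as a binary variable. How can I randomly draw such a subset from the data?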

Here is part of the dataset:

> dput(head(dat, 100))
structure(list(data5.order_id = c(24409499, 24409499, 37018675, 
49812254, 72349794, 121649820, 121649820, 123680104, 123680104, 
123680104, 156423543, 156423543, 156423543, 156423543, 156423543, 
156423543, 156423543, 156423543, 156423543, 156423543, 156423543, 
156423543, 156423543, 156423543, 156423543, 156423543, 156423543, 
156423543, 156423543, 156423543, 156423543, 156423543, 169218518, 
169218518, 169218518, 169218518, 169218518, 169218518, 169218518, 
169218518, 169218518, 169218518, 169218518, 169218518, 169218518, 
169218518, 169218518, 169218518, 169218518, 169218518, 198566821, 
198566821, 198566821, 198566821, 204651617, 204651617, 225070398, 
244297553, 244297553, 244297553, 244297553, 244297553, 244297553, 
264159404, 286533497, 302587170, 302587170, 302587170, 302587170, 
302587170, 302587170, 302587170, 302587170, 302587170, 302587170, 
302587170, 302587170, 302587170, 302587170, 302587170, 302587170, 
302587170, 302587170, 302587170, 302587170, 302587170, 302587170, 
302587170, 302587170, 302587170, 302587170, 308442395, 308442395, 
308442395, 312804245, 318656210, 360581093, 360581093, 381985214, 
381985214), data5.returnyesno = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 
1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 
0, 1, 0, 0, 0, 1, 1), data5.customer_id = c(3150040285, 3150040285, 
1437583473, 319353305, 620027539, 3023138737, 3023138737, 2519171220, 
2519171220, 2519171220, 4599523733, 4599523733, 4599523733, 4599523733, 
4599523733, 4599523733, 4599523733, 4599523733, 4599523733, 4599523733, 
4599523733, 4599523733, 4599523733, 4599523733, 4599523733, 4599523733, 
4599523733, 4599523733, 4599523733, 4599523733, 4599523733, 4599523733, 
1816785895, 1816785895, 1816785895, 1816785895, 1816785895, 1816785895, 
1816785895, 1816785895, 1816785895, 1816785895, 1816785895, 1816785895, 
1816785895, 1816785895, 1816785895, 1816785895, 1816785895, 1816785895, 
1131020953, 1131020953, 1131020953, 1131020953, 2335167491, 2335167491, 
1327858307, 330788549, 330788549, 330788549, 330788549, 330788549, 
330788549, 3230395728, 3888591660, 1158650034, 1158650034, 1158650034, 
1158650034, 1158650034, 1158650034, 1158650034, 1158650034, 1158650034, 
1158650034, 1158650034, 1158650034, 1158650034, 1158650034, 1158650034, 
1158650034, 1158650034, 1158650034, 1158650034, 1158650034, 1158650034, 
1158650034, 1158650034, 1158650034, 1158650034, 1158650034, 908821356, 
908821356, 908821356, 1155228355, 684878789, 3389325926, 3389325926, 
1808359289, 1808359289)), row.names = c(NA, 100L), class = "data.frame")

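【Question comments】:

  • Please provide enough code so that others can better understand or reproduce the problem.
  • What would you like to see in order to understand it? My dataset is too large for me to provide all of it.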

Tags: r subset


【Solution 1】:

You didn't give an example input/output, so I created one.

### create some fake data ###
library(data.table)
n = 10000

df1 = data.table(
    returned = as.logical(sample(c(0,1), replace=TRUE, size=n)), 
    some_other_variable = rnorm(n)
    )

## the size of the sampled dataset you want to create ##
sample_size = 1000

### pick sample_size/2 random row indices from each class ###
### (sample() raises an error if a class has fewer than sample_size/2 rows) ###
true_rows = sample(which(df1$returned), sample_size/2)
false_rows = sample(which(!df1$returned), sample_size/2)

## subset these rows from the original
df2 = df1[c(true_rows, false_rows)]

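【Comments】:

  • Sorry, I did forget that, but my dataset also has 32 other columns whose data I need to keep. The problem now, however, is that the TRUE rows make up less than half, so they cannot fill the full size set by sample_size/2, which produces an error.
  • @Estelle Could you use dput() to provide a small sample of the data with the same proportions as the real data? Thanks.
  • That gives a very large output, because the data contains 157262 observations. Is there any way to show only part of it?
  • Yes, try dput(head(data, 100))
  • I've included it! Hope it helps.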
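
One possible way around the error raised in the comments is to cap the per-class size at the number of rows the smaller class actually has. The sketch below is not from the original answer. It uses base R and the column name data5.returnyesno from the question, but the data frame `dat` is simulated here, and `other_column`, `requested_per_class` and the helper `pick` are hypothetical names introduced only for illustration. Because whole rows are selected, every other column is carried along unchanged.

```r
# Sketch: balanced 50/50 subset, capped by the size of the smaller class.
set.seed(42)  # only to make this illustration reproducible

# Simulated stand-in for the real data (assumption): about 30% returned rows.
n <- 1000
dat <- data.frame(
  data5.order_id    = sample(1e6, n, replace = TRUE),
  data5.returnyesno = rbinom(n, size = 1, prob = 0.3),
  other_column      = rnorm(n)  # placeholder for the remaining columns
)

# Row indices of each class.
returned_idx     <- which(dat$data5.returnyesno == 1)
not_returned_idx <- which(dat$data5.returnyesno == 0)

# Use the requested size, or fewer if one class cannot supply that many rows.
requested_per_class <- 500
per_class <- min(requested_per_class,
                 length(returned_idx),
                 length(not_returned_idx))

# Sample positions within idx, so a class with a single row is handled correctly.
pick <- function(idx, k) idx[sample.int(length(idx), k)]

# Subsetting rows keeps all columns of the original data frame.
balanced <- dat[c(pick(returned_idx, per_class),
                  pick(not_returned_idx, per_class)), ]

# Both classes should now have the same count.
table(balanced$data5.returnyesno)
```

With real data, the simulated `dat` would be replaced by the actual data frame. If the full requested size matters more than using distinct rows, sampling the smaller class with replace = TRUE is an alternative, at the cost of duplicated rows.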