How to create stratified sampling for multiple columns in R

【Question title】: How to create Stratified Sampling for multiple columns in R
【Posted】: 2020-04-04 02:02:18
【Question】:

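My dataset has 821049 observations (rows) and 18 columns. I want to use 9 of the columns for stratified sampling: "BASKETS_NZ", "PIS", "PIS_AP", "PIS_DV", "PIS_PL", "PIS_SDV", "PIS_SHOPS", "PIS_SR" and "QUANTITY". My stratification variable is ID = 1:821049. How do I choose intervals for my variables? How do I set the sample size?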

dput(rbind(head(WKA_ohneJB, 10), tail(WKA_ohneJB, 10)))

structure(list(X = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
821039L, 821040L, 821041L, 821042L, 821043L, 821044L, 821045L,
821046L, 821047L, 821048L), BASKETS_NZ = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
LOGONS = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), PIS = c(71L, 39L, 50L, 4L,
13L, 4L, 30L, 65L, 13L, 31L, 111L, 33L, 3L, 46L, 11L, 8L,
17L, 68L, 65L, 15L), PIS_AP = c(14L, 2L, 4L, 0L, 0L, 0L,
1L, 0L, 2L, 1L, 13L, 0L, 0L, 2L, 1L, 0L, 3L, 8L, 0L, 1L),
PIS_DV = c(3L, 19L, 4L, 1L, 0L, 0L, 6L, 2L, 2L, 3L, 38L,
8L, 0L, 5L, 2L, 0L, 1L, 0L, 3L, 2L), PIS_PL = c(0L, 5L, 8L,
2L, 0L, 0L, 0L, 24L, 0L, 6L, 32L, 8L, 0L, 0L, 4L, 0L, 0L,
0L, 0L, 0L), PIS_SDV = c(18L, 0L, 11L, 0L, 0L, 0L, 0L, 0L,
0L, 1L, 6L, 0L, 0L, 13L, 0L, 0L, 1L, 15L, 1L, 0L), PIS_SHOPS = c(3L,
24L, 13L, 3L, 0L, 0L, 6L, 28L, 2L, 11L, 71L, 16L, 2L, 5L,
6L, 0L, 1L, 0L, 3L, 2L), PIS_SR = c(19L, 0L, 14L, 0L, 0L,
0L, 2L, 23L, 0L, 3L, 6L, 0L, 0L, 20L, 0L, 0L, 3L, 32L, 1L,
0L), QUANTITY = c(13L, 2L, 18L, 1L, 14L, 1L, 4L, 2L, 5L,
1L, 5L, 2L, 2L, 4L, 1L, 3L, 2L, 8L, 17L, 8L), WKA = c(1L,
1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L,
0L, 0L, 1L, 1L), NEW_CUST = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), EXIST_CUST = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), WEB_CUST = c(1L, 0L, 0L, 0L, 1L, 1L, 0L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L), MOBILE_CUST = c(0L,
1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 1L, 0L), TABLET_CUST = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L),
LOGON_CUST_STEP2 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 821039L, 821040L, 821041L,
821042L, 821043L, 821044L, 821045L, 821046L, 821047L, 821048L
), class = "data.frame")

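【Comments】:

Does this answer your question? stackoverflow.com/questions/57924068/…
@Dave2e I think it is heading in the right direction. Where in the function would I insert the group by? My task is to identify users' online behavior. The variables represent the page impressions per product page and the number of baskets. The descriptive statistics and plots show that the variables are right-skewed. How do I account for the uneven distribution of the variables within the intervals, and how do I choose the intervals and the sample size?

Tags: r cluster-analysis sampling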

【Solution 1】:

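Here is a solution for performing stratified sampling based on multiple columns. Before implementing it, consider that your data is continuous and large enough that simple random sampling alone may be sufficient.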
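For comparison, a simple random sample needs only base R. This is a minimal sketch on made-up data: the data frame `df`, its columns and the 10% rate are assumptions for illustration, not values from the question.

```r
# Assumption: `df` is a small stand-in for the real data (hypothetical columns)
set.seed(42)
df <- data.frame(id = 1:1000, PIS = rpois(1000, lambda = 20))

# Draw a 10% simple random sample of rows, without replacement
n_sample <- floor(0.10 * nrow(df))
srs <- df[sample.int(nrow(df), size = n_sample), ]

# Quick sanity check: compare the sample mean with the full-data mean
c(full = mean(df$PIS), sample = mean(srs$PIS))
```

If summaries of the sample stay close to those of the full data across repeated draws, stratification may add little.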

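The way to approach this is to draw a sample from each group (stratum). Potential ways to group the data are to paste the 9 columns together into a single key, or to use dplyr's group_by function.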
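The "paste the columns together" route can be sketched in base R as follows. The toy data frame, the column names `y` and `z`, the `stratum` helper column and the 0.8 rate are all assumptions for illustration.

```r
set.seed(1)
# Hypothetical toy data with two stratification columns
df <- data.frame(
  x = rnorm(100),
  y = sample(c("A", "B"), 100, replace = TRUE),
  z = sample(c("D", "E", "F"), 100, replace = TRUE)
)

# Build one stratum label per row by pasting the columns together
df$stratum <- paste(df$y, df$z, sep = "_")

# Within each stratum, keep 80% of the row indices (rounded down)
rows_by_stratum <- split(seq_len(nrow(df)), df$stratum)
idx <- unlist(lapply(rows_by_stratum, function(rows) {
  rows[sample.int(length(rows), size = floor(0.8 * length(rows)))]
}), use.names = FALSE)
stratified <- df[idx, ]

# Rows kept per stratum
table(stratified$stratum)
```

Indexing with `rows[sample.int(...)]` rather than `sample(rows, ...)` avoids the base R behavior where `sample()` on a single number `k` samples from `1:k`.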

This uses the solution from the question How to get around error "factor has new levels" in cross-validation glm?, updated to dplyr style.

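The dplyr_stratified function takes the desired sampling fraction and any number of columns, and returns a data frame containing the sampled rows. See the example below, which uses 2 columns.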

set.seed(1)
x <- rnorm(n = 100)
y <- rep(x = c("A", "B"), times = c(50, 50))
z <- rep(x = c("D", "E", "F"), times = c(33, 33, 34))
data <- data.frame(x, y = sample(y, replace = TRUE), z = sample(z, replace = TRUE))

library(dplyr)
# Optional: tag each row for later identification
data$rowid <- seq_len(nrow(data))

# Sample the same fraction of rows from every combination of the given columns
dplyr_stratified <- function(df, percent, ...) {
  columns <- enquos(...)
  # Group, then sample row positions within each group (size is rounded down)
  df %>%
    group_by(!!!columns) %>%
    slice(sample.int(n(), size = floor(percent * n())))
}

testgroup <- dplyr_stratified(data, 0.8, z, y)
testgroup

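Note: this assumes that each group has enough rows to select a representative sample. If a group is too small, this approach may not behave as intended: the per-group sample size is rounded down, so at a rate of 0.8 a group with a single row contributes no rows at all.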
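One possible way to keep very small groups represented is to enforce a minimum number of rows per group. The sketch below is a variant of the function above, not part of the original answer; the name `stratified_min`, the `min_n` argument and the toy data are assumptions for illustration.

```r
library(dplyr)

set.seed(1)
# Hypothetical toy data: group "C" has only one row
df <- data.frame(
  x = rnorm(11),
  g = c(rep("A", 5), rep("B", 5), "C")
)

# Variant that keeps at least `min_n` rows per group, capped at the group size
stratified_min <- function(df, percent, ..., min_n = 1) {
  columns <- enquos(...)
  df %>%
    group_by(!!!columns) %>%
    slice(sample.int(n(), size = min(n(), max(min_n, floor(percent * n()))))) %>%
    ungroup()
}

out <- stratified_min(df, 0.8, g)
# Rows kept per group
count(out, g)
```

Forcing a minimum over-represents small groups relative to their true share, so any estimate computed from such a sample would need to be weighted accordingly.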

【Comments】: