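【Question title】: random subset of fixed length such that each group is present at least N times
【Posted】: 2018-02-17 19:21:59
【Question description】:

I want to select 5 rows for each value of column1 in df, such that the output contains at least one row for each unique value of column2. The output should also not contain any duplicates.

Edit: there should be no duplicates among the (column1, column3) pairs, i.e. for each value of column1, all values of column3 should be unique.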

column1 = rep(c("a","b"), each = 12)
column2 = rep(c(1,2,3), each = 4)
column3 = c("x1","x2","x3","x4","x5","x3","x6","x7","x8","x1","x9","x5","x6","x2","x3","x4","x7","x5","x6","x1","x4","x1","x6","x9")

df = data.frame(column1, column2, column3)

Here is one valid solution:

sample_output_1 = data.frame(column1 = rep(c("a","b"), each = 5),
                         column2 = c(1,1,2,2,3,1,1,2,2,3),
                         column3 = c("x1","x2","x5","x3","x8","x6","x2","x5","x1","x9"))
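
The three requirements above (5 rows per column1 value, every column2 value present, no repeated column3 value within a column1 value) can be turned into a small check. This is only a sketch: `is_valid_sample` is a hypothetical helper name, not something from the original post, and it assumes the column names used in the question.

```r
# Example data from the question
column1 = rep(c("a","b"), each = 12)
column2 = rep(c(1,2,3), each = 4)
column3 = c("x1","x2","x3","x4","x5","x3","x6","x7","x8","x1","x9","x5",
            "x6","x2","x3","x4","x7","x5","x6","x1","x4","x1","x6","x9")
df = data.frame(column1, column2, column3)

sample_output_1 = data.frame(column1 = rep(c("a","b"), each = 5),
                             column2 = c(1,1,2,2,3,1,1,2,2,3),
                             column3 = c("x1","x2","x5","x3","x8","x6","x2","x5","x1","x9"))

# Hypothetical helper: TRUE if `out` satisfies all three requirements
# relative to the full data `full`
is_valid_sample = function(out, full, n = 5){
    parts = split(out, out$column1)
    all(vapply(names(parts), function(g){
        sub = parts[[g]]
        # column2 values that exist for this column1 value in the full data
        needed = unique(full$column2[full$column1 == g])
        NROW(sub) == n &&                      # exactly n rows
            all(needed %in% sub$column2) &&    # every column2 value present
            !any(duplicated(sub$column3))      # column3 unique within the group
    }, logical(1)))
}

is_valid_sample(sample_output_1, df)
```

The check does not verify that the rows of `out` actually occur in `full`; that would need an extra comparison if it matters.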

【Question discussion】:

    Tags: r random


    【Solution 1】:

    Try this:

    foo = function(a_df){
        inds = seq_len(NROW(a_df))
        # Every value of the 2nd column must appear in the sample
        groups = unique(a_df[[2]])
        # Sample 5 indices along the rows of a_df
        my_inds = sample(inds, 5)
        # Redraw the whole sample (plain rejection sampling) while the subset
        # has duplicated (column1, column3) pairs
        # or is missing a value of the 2nd column.
        # Assumes at least one valid sample exists, otherwise this never ends.
        while(any(duplicated(a_df[my_inds, c(1, 3)])) ||
            !all(groups %in% a_df[my_inds, 2])){
                my_inds = sample(inds, 5)
        }
        return(a_df[my_inds,])
    }
    
    set.seed(42)
    do.call(rbind, lapply(split(df, df$column1), function(a) foo(a_df = a)))
    # Output as originally posted. The rows drawn depend on the seed and the
    # R version; x3 repeats within a and x6 within b here, which the loop
    # condition above rejects, so a current run prints different rows.
         # column1 column2 column3
    # a.11       a       3      x9
    # a.12       a       3      x5
    # a.3        a       1      x3
    # a.8        a       2      x7
    # a.6        a       2      x3
    # b.19       b       2      x6
    # b.21       b       3      x4
    # b.14       b       1      x2
    # b.18       b       2      x5
    # b.23       b       3      x6
    

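    【Discussion】:

    • This does not ensure at least 1 value for each value of column2. Try: set.seed(200) do.call(rbind, lapply(split(df, df$column1), function(a) foo(a_df = a)))
    • This works. Thanks. I will keep the question open for a few days in case someone knows of a package that implements this kind of problem quickly.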
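
Regarding the first comment: one way to guarantee coverage of column2 by construction, rather than by checking afterwards, is to first pick one row per column2 value and then fill the remaining slots from rows whose column3 value is still unused. The following is only a sketch under assumptions: `sample_covering` is a hypothetical name, the column names are those of the question, and the draw is not uniform over all valid subsets.

```r
# Example data from the question
column1 = rep(c("a","b"), each = 12)
column2 = rep(c(1,2,3), each = 4)
column3 = c("x1","x2","x3","x4","x5","x3","x6","x7","x8","x1","x9","x5",
            "x6","x2","x3","x4","x7","x5","x6","x1","x4","x1","x6","x9")
df = data.frame(column1, column2, column3)

# Hypothetical helper: draw n rows of a_df covering every column2 value,
# with no repeated column3 value
sample_covering = function(a_df, n = 5, max_tries = 100){
    groups = unique(a_df$column2)
    stopifnot(n >= length(groups))
    for (attempt in seq_len(max_tries)){
        picked = integer(0)
        ok = TRUE
        # Step 1: one row per column2 value (in random order),
        # skipping rows whose column3 value is already used
        for (g in groups[sample.int(length(groups))]){
            cand = which(a_df$column2 == g &
                         !(a_df$column3 %in% a_df$column3[picked]))
            if (length(cand) == 0){ ok = FALSE; break }
            picked = c(picked, cand[sample.int(length(cand), 1)])
        }
        if (!ok) next
        # Step 2: fill the remaining slots from rows with unused column3 values
        rest = setdiff(seq_len(NROW(a_df)), picked)
        rest = rest[!(a_df$column3[rest] %in% a_df$column3[picked])]
        rest = rest[sample.int(length(rest))]            # shuffle
        rest = rest[!duplicated(a_df$column3[rest])]     # one row per column3 value
        need = n - length(picked)
        if (length(rest) < need) next
        return(a_df[c(picked, head(rest, need)), ])
    }
    stop("no valid sample found after ", max_tries, " attempts")
}

set.seed(1)
res = do.call(rbind, lapply(split(df, df$column1), sample_covering))
```

Indices are drawn with `sample.int` rather than `sample(x)` because `sample` treats a single number `x` as `1:x`, which would give wrong rows when only one candidate is left.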