【Question title】: Randomly draw rows from dataframe based on unique values and column values
【Posted】: 2018-03-10 11:34:35
【Question description】:

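I have a data frame with several descriptor variables (trt, individual, session). I want to randomly select a subset of the possible trt x individual combinations while controlling for the session variable, so that no two randomly drawn rows share the same session number. This is what my data frame looks like: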

trt <- c(rep(c(rep("A", 3), rep("B", 3), rep("C", 3)), 9))
individual <- rep(c("Bob", "Nancy", "Tim"), 27)
session <- rep(1:27, each = 3)
data <- rnorm(81, mean = 4, sd = 1)
df <- data.frame(trt, individual, session, data)
df
   trt individual session             data
1    A        Bob       1 3.72013685581385
2    A      Nancy       1 3.97225419000673
3    A        Tim       1 4.44714175686225
4    B        Bob       2 5.00024599458127
5    B      Nancy       2 3.43615965145765
6    B        Tim       2  6.7920094635501
7    C        Bob       3 4.36315054477571
8    C      Nancy       3 5.07117348146375
9    C        Tim       3 4.38503325758969
10   A        Bob       4 4.30677162933005
11   A      Nancy       4 1.89311687510669
12   A        Tim       4 3.09084920968413
13   B        Bob       5 3.10436190897144
14   B      Nancy       5 3.59454992439722
15   B        Tim       5 3.40778069131207
16   C        Bob       6 4.00171937800892
17   C      Nancy       6 0.14578811080644
18   C        Tim       6 4.20754733296227
19   A        Bob       7 3.69131009783284
20   A      Nancy       7  4.7025756891679
21   A        Tim       7 4.46196017363017
22   B        Bob       8 3.97573281432736
23   B      Nancy       8  4.5373185942686
24   B        Tim       8 2.40937847038141
25   C        Bob       9 4.57519884980087
26   C      Nancy       9 5.19143914630448
27   C        Tim       9 4.83144732833874
28   A        Bob      10 3.01769965527235
29   A      Nancy      10 5.17300616827746
30   A        Tim      10 4.65432284571663
31   B        Bob      11 4.50892032922527
32   B      Nancy      11 3.38082717995663
33   B        Tim      11 4.92022245677209
34   C        Bob      12 4.54149796547394
35   C      Nancy      12 3.21992774137179
36   C        Tim      12 3.74507360931023
37   A        Bob      13 3.39524949548056
38   A      Nancy      13 4.17518916890901
39   A        Tim      13 3.02932375225388
40   B        Bob      14 3.59660910672907
41   B      Nancy      14 2.08784850191654
42   B        Tim      14 3.98446125755258
43   C        Bob      15 4.01837496797085
44   C      Nancy      15 3.40610126858125
45   C        Tim      15 4.57107635588582
46   A        Bob      16 3.15839276840723
47   A      Nancy      16 2.19932140340504
48   A        Tim      16 4.77588798035668
49   B        Bob      17  4.3524768657397
50   B      Nancy      17 4.49071625925856
51   B        Tim      17 4.02576463486266
52   C        Bob      18 3.74783360762117
53   C      Nancy      18 2.84123227236184
54   C        Tim      18  3.2024114782253
55   A        Bob      19 4.93837445490921
56   A      Nancy      19  4.7103051496802
57   A        Tim      19 6.22083635045134
58   B        Bob      20  4.5177747677824
59   B      Nancy      20 1.78839270771153
60   B        Tim      20 5.07140678136995
61   C        Bob      21 3.47818616035335
62   C      Nancy      21 4.28526474048439
63   C        Tim      21 4.22597602946575
64   A        Bob      22 1.91700925257901
65   A      Nancy      22 2.96317997587458
66   A        Tim      22 2.53506974227672
67   B        Bob      23 5.52714403395316
68   B      Nancy      23  3.3618513551059
69   B        Tim      23 4.85869007113978
70   C        Bob      24  3.4367068543959
71   C      Nancy      24 4.47769879000349
72   C        Tim      24 5.77340483757836
73   A        Bob      25 4.78524317734622
74   A      Nancy      25 3.55373702554664
75   A        Tim      25 2.88541465503637
76   B        Bob      26 4.62885302019139
77   B      Nancy      26 3.59430293369092
78   B        Tim      26 2.29610255924296
79   C        Bob      27 4.38433001299722
80   C      Nancy      27 3.77825207859976
81   C        Tim      27 2.12163194694365

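How can I draw 2 rows from each trt x individual combination such that every drawn row has a unique session number? Here is an example of what I would like the resulting data frame to look like: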

       trt individual session             data
    1    A        Bob       1 3.72013685581385
    5    B      Nancy       2 3.43615965145765
    7    C        Bob       3 4.36315054477571
    12   A        Tim       4 3.09084920968413
    15   B        Tim       5 3.40778069131207
    17   C      Nancy       6 0.14578811080644
    19   A        Bob       7 3.69131009783284
    29   A      Nancy      10 5.17300616827746
    31   B        Bob      11 4.50892032922527
    34   C        Bob      12 4.54149796547394
    39   A        Tim      13 3.02932375225388
    40   B        Bob      14 3.59660910672907
    47   A      Nancy      16 2.19932140340504
    51   B        Tim      17 4.02576463486266
    54   C        Tim      18  3.2024114782253
    59   B      Nancy      20 1.78839270771153
    71   C      Nancy      24 4.47769879000349
    81   C        Tim      27 2.12163194694365

I have tried a few things, with no luck.

I tried randomly selecting two rows per trt x individual combination, but ended up with duplicated session values:

library(data.table)
setDT(df)
df[ , .SD[sample(.N, 2)] , keyby = .(trt, individual)]
    trt individual session             data
 1:   A        Bob      25  2.7560788894668
 2:   A        Bob      19 4.12040841647523
 3:   A      Nancy       4 5.35362338127901
 4:   A      Nancy      19 5.51636882737692
 5:   A        Tim      19 5.10553640201998
 6:   A        Tim       1 2.77380671625473
 7:   B        Bob      23 3.50585105164409
 8:   B        Bob       8 3.58167259470814
 9:   B      Nancy      23 2.85301307507985
10:   B      Nancy       8 2.85179395539781
11:   B        Tim      26 2.40666507132474
12:   B        Tim      20 3.31276311351286
13:   C        Bob      24 3.19076007024549
14:   C        Bob       3 3.59146613276121
15:   C      Nancy       9 4.46606667880457
16:   C      Nancy      15 2.25405252536256
17:   C        Tim      12 4.43111661206133
18:   C        Tim      27 4.23868848646589

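I also tried randomly selecting one row per session number and then drawing 2 rows per trt x individual combination, but because the random selection does not yield an equal number of rows for each trt x individual combination, it usually returns an error: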

ind <- sapply( unique(df$session ) , function(x) sample( which(df$session == x) , 1) )
df.unique <- df[ind, ]
df.sub <- df.unique[, .SD[sample(.N, 2)] , by = .(trt, individual)]
Error in `[.data.frame`(df.unique, , .SD[sample(.N, 2)], by = .(trt, individual)) : 
  unused argument (by = .(trt, individual))

Thanks in advance for your help!

【Question comments】:

    Tags: r random data.table subset


    【Solution 1】:

    There may be a clever way to do the sampling, but here is a simple idea to get you started in the meantime:

    setDT(df)
    setkey(df, session)
    
    usedsessions = 0 # some value that's not a session number
    df[, {
           res = .SD[!.(usedsessions)][sample(.N, 2)]
           usedsessions = c(usedsessions, res$session)
           res
         }
       , by = .(trt, individual)]
    #    trt individual session     data
    # 1:   A        Bob       7 4.256668
    # 2:   A        Bob      25 2.431821
    # 3:   A      Nancy      16 4.785859
    # 4:   A      Nancy      19 4.865248
    # 5:   A        Tim       4 3.303689
    # 6:   A        Tim      13 3.550261
    # 7:   B        Bob      26 3.987136
    # 8:   B        Bob      17 3.283055
    # 9:   B      Nancy      14 3.177226
    #10:   B      Nancy       2 3.639542
    #11:   B        Tim       8 2.168447
    #12:   B        Tim       5 3.521123
    #13:   C        Bob      21 3.284245
    #14:   C        Bob      12 5.773098
    #15:   C      Nancy      24 4.624428
    #16:   C      Nancy       9 3.235467
    #17:   C        Tim      18 4.001395
    #18:   C        Tim      27 5.002110
    

    You may need to add handling for corner cases (e.g. when no such sample exists).
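One way such corner-case handling could look is sketched below in base R. This is not the answerer's code: `sample_unique_sessions` is a hypothetical helper, and `toy` is a rebuilt copy of the question's data layout. The helper walks through the trt x individual groups one at a time, draws only from sessions that earlier groups have not taken, and stops with an explicit message when a group has too few unused sessions left.

```r
# Hypothetical helper, not part of the original answer: draw `n` rows per
# trt x individual group so that no session number is used twice overall.
sample_unique_sessions <- function(df, n = 2) {
  # Row indices for each trt x individual combination that actually occurs
  groups <- split(seq_len(nrow(df)), list(df$trt, df$individual), drop = TRUE)
  used <- c()            # sessions already drawn by earlier groups
  picked <- integer(0)   # row indices selected so far
  for (name in names(groups)) {
    rows <- groups[[name]]
    # Sessions in this group that no earlier group has taken
    avail <- setdiff(df$session[rows], used)
    if (length(avail) < n) {
      stop("group '", name, "' has only ", length(avail),
           " unused session(s); cannot draw ", n)
    }
    chosen <- avail[sample.int(length(avail), n)]
    # First row of the group for each chosen session
    picked <- c(picked, rows[match(chosen, df$session[rows])])
    used <- c(used, chosen)
  }
  df[sort(picked), ]
}

# Same layout as the data in the question
set.seed(42)  # only to make the draw reproducible
toy <- data.frame(
  trt = rep(rep(c("A", "B", "C"), each = 3), 9),
  individual = rep(c("Bob", "Nancy", "Tim"), 27),
  session = rep(1:27, each = 3),
  data = rnorm(81, mean = 4, sd = 1)
)
res <- sample_unique_sessions(toy, n = 2)
anyDuplicated(res$session)  # 0 means every session is distinct
```

Because the draw is greedy, it can stop with an error even when a valid selection exists for a different group order. With the question's layout this cannot happen, since each trt has 9 sessions and only 6 are needed across its three individuals.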

    【Comments】:

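    • I get an error when I try to run this: Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, : x.'session' is a factor column being joined to i.'V1' which is type 'double'. Factor columns must join to factor or character columns.
    • @broch As the error says, change the class of the session column to numeric: df[, session := as.numeric(session)]. Or change usedsessions to a character: usedsessions = "0"
    • Great, thank you! I had tried as.numeric(df$session) on its own rather than within the data.table context. My silly mistake.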
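
A side note on the conversion suggested in the comments, offered as a hedged sketch rather than a correction: `as.numeric()` applied directly to a factor returns the internal level codes, which match the printed session numbers only when the levels happen to be ordered that way. Going through `as.character()` first keeps the labels. The table `dt` and its session labels below are made up for illustration and are not from the question.

```r
library(data.table)

# Illustrative data: a factor whose labels are not in numeric order
dt <- data.table(session = factor(c("10", "2", "30")))

# Direct conversion yields the level codes (levels sort as "10", "2", "30")
codes <- as.numeric(dt$session)
codes  # 1 2 3

# Converting via character keeps the labels as numbers
dt[, session := as.numeric(as.character(session))]
dt$session  # 10 2 30
```

If the factor was built from integers 1 to 27, both routes give the same result, so the comment's suggestion works for data shaped like the question's example.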