Use outer() and which().
set.seed(1)
n_rows <- 10
my.data <- data.frame(A = rnorm(n_rows), B = rnorm(n_rows))
# TRUE cells of the comparison matrix: rows index A, columns index B
mat <- which(outer(my.data[['A']], my.data[['B']], '>='), arr.ind = TRUE)
colnames(mat) <- c('index2', 'index1')
mat[, c('index1', 'index2')]
index1 index2
[1,] 1 4
[2,] 2 4
[3,] 2 7
[4,] 2 8
[5,] 2 9
[6,] 3 2
[7,] 3 4
[8,] 3 5
[9,] 3 7
... a total of 39 rows
I did not include index3 because it is a constant. If it is always -1, it does not carry much information.
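To see how the outer()/which() pairing produces the index pairs, here is a minimal sketch on two hand-picked vectors. The values and the names a, b, cmp, pairs and pairs3 are illustrative assumptions, not part of the original data. outer() builds a logical matrix whose [r, c] cell is a[r] >= b[c], and which(..., arr.ind = TRUE) lists the row and column positions of the TRUE cells. If the constant column is needed after all, it can be appended with cbind().

```r
a <- c(1, 5, 3)   # plays the role of column A
b <- c(2, 4, 0)   # plays the role of column B

# cmp[r, c] is TRUE when a[r] >= b[c]
cmp <- outer(a, b, ">=")

# One row per TRUE cell: "row" is the position in a, "col" the position in b
pairs <- which(cmp, arr.ind = TRUE)
colnames(pairs) <- c("index2", "index1")
pairs <- pairs[, c("index1", "index2"), drop = FALSE]

# Append the constant column only if downstream code expects it
pairs3 <- cbind(pairs, index3 = -1)
```

Every row of pairs satisfies a[index2] >= b[index1], which is the same orientation as the mat output above.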
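By adding an ID column to the original data.frame and using lapply, I was able to speed the loop up considerably. This lets me skip the which() call, and I do not have to worry about preallocating Z.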
my.data$ID <- seq_len(nrow(my.data))
do.call(rbind
, lapply(seq_len(nrow(my.data))
, function (i) {
interim_result <- my.data[['ID']][my.data[i, "A"] >= my.data[['B']]]
if (length(interim_result) != 0) {
cbind(index1 = i,index2 = interim_result,index3 = -1)
}
}
)
)
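The loop above leans on two base R behaviours: an if without an else evaluates to NULL when its condition is FALSE, and rbind() ignores NULL arguments, so iterations with no match simply contribute no rows. A minimal sketch of that mechanism, using a hypothetical helper make_rows and made-up inputs:

```r
make_rows <- function(i, matches) {
  # Returns a matrix when there are matches, NULL otherwise
  if (length(matches) != 0) {
    cbind(index1 = i, index2 = matches, index3 = -1)
  }
}

pieces <- list(
  make_rows(1, c(4L, 7L)),   # two matches
  make_rows(2, integer(0)),  # no match: NULL
  make_rows(3, 2L)           # one match
)

# NULL entries are skipped, so the result has three rows
combined <- do.call(rbind, pieces)
```

One consequence worth keeping in mind: if no iteration finds a match, the whole result is NULL rather than a zero-row matrix, so calling code may need to handle that case.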
Finally, if you are into data.table, you can use a non-equi join.
library(data.table)
dt <- as.data.table(my.data)
dt[, ID := seq_len(.N)]
dt[dt
, on = .(A >= B)
, .(index1 = i.ID, index2 = ID, index3 = -1)
, allow.cartesian = TRUE
, nomatch = NULL # drop B rows with no A >= B instead of returning an NA row
]
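To make the direction of the non-equi join easier to follow, here is a small sketch on a three-row table, cross-checked against the outer()/which() approach. It assumes a reasonably recent data.table is installed; the table toy, its values, and the names joined, base_pairs and same are hypothetical. In x[i, on = .(A >= B)], the A on the left of the condition comes from x and the B on the right comes from i, and the x. and i. prefixes pick the matching ID in j.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  toy <- data.table(ID = 1:3, A = c(0.5, 2, -1), B = c(0, 1, 3))

  # Self-join: for each row of i (supplying B), find rows of x with A >= B
  joined <- toy[toy,
                on = .(A >= B),
                .(index1 = i.ID, index2 = x.ID),
                nomatch = NULL,          # B = 3 has no match and is dropped
                allow.cartesian = TRUE]
  setorder(joined, index1, index2)

  # Same pairs from base R: matrix column = row of B, matrix row = row of A
  base_pairs <- which(outer(toy$A, toy$B, ">="), arr.ind = TRUE)
  same <- all(joined$index1 == base_pairs[, "col"]) &&
    all(joined$index2 == base_pairs[, "row"])
}
```

Sorting the join result before comparing avoids relying on the row order the join happens to return.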
Performance
10-row data.frame:
Unit: microseconds
            expr     min         lq       mean     median         uq       max neval
   original_loop 12607.5 12687.6510 13420.3960 12843.0520 13260.4010 17939.301    20
      optim_loop   412.5   439.4515   695.5263   451.2510   462.0020  5345.802    20
          dt_way  3053.0  3140.7510  3269.0610  3268.9010  3351.2010  3667.601    20
 outer_statement    48.5    53.9005    65.7108    70.6505    72.7515    75.701    20
100-row data.frame:
Unit: microseconds
            expr       min        lq      mean     median        uq       max neval
   original_loop 42241.600 43560.001 48111.291 46051.7515 48297.301 79910.601    20
      optim_loop  3888.601  4010.551  4775.211  4107.6010  4299.400  9010.601    20
          dt_way  3356.902  3595.601  3857.906  3752.8505  3966.701  5330.101    20
 outer_statement   304.901   312.401   344.661   332.5005   348.701   473.000    20
1,000 rows (original loop dropped):
Unit: milliseconds
            expr     min       lq     mean   median       uq     max neval
      optim_loop 55.0290 58.18355 60.50015 60.08140 62.47300 66.6332    20
          dt_way 29.1114 29.66050 32.19182 30.00790 30.88125 45.7993    20
 outer_statement 24.2323 24.44935 26.87686 24.64055 27.48775 35.9967    20
10,000 rows:
Unit: seconds
            expr      min       lq     mean   median       uq      max neval
      optim_loop 2.233144 2.277568 2.401055 2.382523 2.496764 2.615275     5
          dt_way 3.622701 3.638953 3.660230 3.639226 3.649577 3.750691     5
 outer_statement 3.250272 3.353263 3.369732 3.375544 3.409773 3.459810     5
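My computer crashed after that. To my surprise, the optimized loop starts to pull ahead at this size: at 10,000 rows it has the lowest median (about 2.4 s, versus about 3.4 s for outer() and 3.6 s for data.table).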
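One plausible contributor to that crossover, offered as an assumption rather than something the benchmarks measure, is memory: outer() has to materialize the full n x n logical matrix before which() can scan it, whereas the lapply loop only compares one value against a length-n vector at a time. The sketch below estimates the matrix size, assuming 4 bytes per logical element; the names outer_matrix_bytes, sizes, estimate_mb and measured are hypothetical.

```r
# Bytes needed for the n x n logical matrix that outer() returns,
# assuming 4 bytes per logical element
outer_matrix_bytes <- function(n) n^2 * 4

sizes <- c(10, 100, 1000, 10000)
estimate_mb <- outer_matrix_bytes(sizes) / 1e6   # megabytes, 1 MB = 1e6 bytes

# Empirical check at a size that is cheap to allocate
measured <- as.numeric(object.size(matrix(TRUE, 1000, 1000)))
```

Under that assumption the matrix grows from about 4 MB at 1,000 rows to about 400 MB at 10,000 rows, and it would reach roughly 40 GB at 100,000 rows, before counting the index pairs that every approach has to return.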