R fast cosine distance between consecutive rows of a data.table: answers

【Question title】: R fast cosine distance between consecutive rows of a data.table
【Posted】: 2021-08-18 08:16:48
【Question】:

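How can I efficiently compute the distance between (almost) consecutive rows of a large data.table (about 4 million rows)? I have outlined my current approach below, but it is very slow. My real data has up to a few hundred columns. I need the lags and leads for later use anyway, so I create them as columns and then use them to compute the distances.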

library(data.table)
library(proxy)

set_shift_col <- function(df, shift_dir, shift_num, data_cols, byvars = NULL){
  df[, (paste0(data_cols, "_", shift_dir, shift_num)) := shift(.SD, shift_num, fill = NA, type = shift_dir), byvars, .SDcols = data_cols]
}

set_shift_dist <- function(dt, shift_dir, shift_num, data_cols){
  stopifnot(shift_dir %in% c("lag", "lead"))
  shift_str <- paste0(shift_dir, shift_num)
  dt[, (paste0("dist", "_", shift_str)) := as.numeric(
    proxy::dist(
      rbindlist(list(
        .SD[,data_cols, with=FALSE], 
        .SD[, paste0(data_cols, "_" , shift_str), with=FALSE]
      ), use.names = FALSE), 
      method = "cosine")
  ), 1:nrow(dt)]
}

n <- 10000
test_data <- data.table(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))

cols <- c("a", "b", "c", "d")

set_shift_col(test_data, "lag", 1, cols)
set_shift_col(test_data, "lag", 2, cols)
set_shift_col(test_data, "lead", 1, cols)
set_shift_col(test_data, "lead", 2, cols)

set_shift_dist(test_data, "lag", 1, cols)

I'm sure this is a very inefficient way to do it, so any suggestions would be much appreciated!

【Comments】:

Tags: r data.table

【Solution 1】:

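You are not taking advantage of vectorisation in proxy::dist: because of the by = 1:nrow(dt) grouping, the function is called once per row. Instead, you can pass the original columns and the shifted columns as two tables and get all the distances you need from a single call with pairwise = TRUE.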

Try this replacement function and compare the speed:

set_shift_dist2 <- function(dt, shift_dir, shift_num, data_cols){
  stopifnot(shift_dir %in% c("lag", "lead"))
  shift_str <- paste0(shift_dir, shift_num)
  dt[, (paste0("dist2", "_", shift_str)) := proxy::dist(
    .SD[,data_cols, with=FALSE], 
    .SD[, paste0(data_cols, "_" , shift_str), with=FALSE], 
    method = "cosine", 
    pairwise = TRUE
  )]
}

You can also do it in one step, without storing shifted copies of the data in the table:

test_data[, dist_lag1 := proxy::dist(
  .SD, 
  as.data.table(shift(.SD, 1)), 
  pairwise = TRUE, 
  method = 'cosine'
  ), .SDcols = c('a', 'b', 'c', 'd')]
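
To illustrate what the single vectorised call computes, here is a minimal sketch that works out the row-wise cosine distance directly with rowSums, without proxy. It assumes cosine distance is defined as 1 minus the cosine similarity; row_cosine_dist and the dist_lag1_manual column are hypothetical names, not part of the original answer. proxy::dist may convert similarity to distance differently, so compare the two on a sample of your data before substituting one for the other.

```r
library(data.table)

# Hypothetical helper (not from the original answer): row-wise cosine
# distance between two equally shaped numeric tables, defined here as
# 1 - cosine similarity. This definition is an assumption.
row_cosine_dist <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  1 - rowSums(x * y) / sqrt(rowSums(x^2) * rowSums(y^2))
}

set.seed(1)
n <- 1000
test_data <- data.table(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
cols <- c("a", "b", "c", "d")

# shift() on .SD returns a list of lagged columns. The first row of the
# lagged table is NA, so the first distance is NA as well.
test_data[, dist_lag1_manual := row_cosine_dist(.SD, as.data.table(shift(.SD, 1))),
          .SDcols = cols]
```

Replacing shift(.SD, 1) with shift(.SD, 2) or shift(.SD, 1, type = "lead") should give the other offsets mentioned in the question.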

【Discussion】: