Millions of tiny matches in R: need performance (answers)

【Question title】: Millions of tiny matches in R: need performance
【Posted】: 2024-01-10 15:00:01
【Question】:

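I have a vector of one million words called WORDS. I have a list of 9 million objects called SENTENCES. Each object in the list is a sentence, represented as a vector of 10 to 50 words. Here is an example: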

head(WORDS)
[1] "aba" "accra" "ada" "afrika" "afrikan" "afula" "aggamemon"

SENTENCES[[1]]
[1] "how" "to" "interpret" "that" "picture"

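I want to convert every sentence in the list into a numeric vector whose elements are the positions of the sentence's words in the big WORDS vector. I already know how to do it with this code: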

convert <- function(sentence){
  return(which(WORDS %in% sentence))
}

SENTENCES_NUM <- lapply(SENTENCES, convert)

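The problem is that it takes far too long. My RStudio blows up even though the computer has 16 GB of RAM. So the question is: do you have any ideas for speeding up the computation?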

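【Question comments】:

Have you tried mclapply?
Thanks, but no: I am on Windows and I only have one core.
Also, have you tried pmatch instead of which(..%in%..)?
I noticed that you have not accepted any answer to any of the questions you have asked. Accepting an answer is not mandatory, but if one of the answers worked for you, doing so is considered good practice. It gives future readers a clue about the value of the solution. See also this help page: What should I do when someone answers my question?
Sorry, I did not know that, Jaap. It's OK.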

Tags: r performance join position match

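【Solution 1】:

fastmatch, a small package written by a member of R core, hashes the lookup table, so the initial search and especially the subsequent searches are faster.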

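What you are really doing is making a factor with predefined levels that are common to every sentence. The slow step in the C code is sorting the factor levels, which you can avoid by supplying the (unique) list of factor levels to the package's fast version of the factor function.

If you only want the integer positions, you can easily convert from factor to integer: many people do this inadvertently.

In fact you do not need a factor at all for what you want, just match. Your code also builds a logical vector and then recomputes positions from it, whereas match goes straight to the positions.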
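To make the difference concrete, here is a minimal sketch on toy data (the six-word `WORDS` and the `sentence` vector are made up for illustration, not the poster's data). It suggests that `which(WORDS %in% sentence)` and `match(sentence, WORDS)` are not interchangeable: the former returns positions sorted by their order in `WORDS` and silently drops duplicates and unknown words, while the latter returns one position per word in sentence order.

```r
# Toy stand-ins for the poster's objects (assumed values, for illustration only).
WORDS <- c("aba", "accra", "ada", "afrika", "afrikan", "afula")
sentence <- c("afula", "ada", "zzz", "ada")

# which(%in%): builds a logical vector as long as WORDS, then scans it.
# The result follows the order of WORDS; duplicates and unknown words vanish.
pos_which <- which(WORDS %in% sentence)

# match(): one position per word of the sentence, in sentence order;
# a word that is absent from WORDS yields NA.
pos_match <- match(sentence, WORDS)

# A factor with predefined levels, converted to its integer codes,
# gives the same positions as match().
pos_factor <- as.integer(factor(sentence, levels = WORDS))

print(pos_which)
print(pos_match)
print(pos_factor)
```

If the downstream code relies on word order within a sentence, or on repeated words, the `match` form is probably the one to keep.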

library(fastmatch)
library(microbenchmark)

WORDS <- read.table("https://dotnetperls-controls.googlecode.com/files/enable1.txt", stringsAsFactors = FALSE)[[1]]

words_factor <- as.factor(WORDS)

# generate 100 sentences of between 5 and 15 words
# (the length is drawn inside the function so that it varies per sentence):
SENTENCES <- lapply(seq_len(100), function(i) sample(WORDS, size = sample(5:15, size = 1)))

bench_fun <- function(fun)
  lapply(SENTENCES, fun)

# poster's slow solution:
hg_convert <- function(sentence)
  return(which(WORDS %in% sentence))

jw_convert_match <- function(sentence) 
  match(sentence, WORDS)

jw_convert_match_factor <- function(sentence) 
  match(sentence, words_factor)

jw_convert_fastmatch <- function(sentence) 
  fmatch(sentence, WORDS)

jw_convert_fastmatch_factor <- function(sentence)
  fmatch(sentence, words_factor)

message("starting benchmark one")
print(microbenchmark(bench_fun(hg_convert),
                     bench_fun(jw_convert_match),
                     bench_fun(jw_convert_match_factor),
                     bench_fun(jw_convert_fastmatch),
                     bench_fun(jw_convert_fastmatch_factor),
                     times = 10))

# now again with big samples
# generating the SENTENCES is quite slow...
SENTENCES <- lapply(seq_len(1e6), function(i) sample(WORDS, size = sample(5:15, size = 1)))
message("starting benchmark two, compare with factor vs vector of words")
print(microbenchmark(bench_fun(jw_convert_fastmatch),
                     bench_fun(jw_convert_fastmatch_factor),
                     times = 10))

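I put this at https://gist.github.com/jackwasey/59848d84728c0f55ef11

The results are not nicely formatted. Suffice it to say that fastmatch, with or without factor input, is dramatically faster.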

# starting benchmark one
Unit: microseconds
                                   expr         min          lq         mean      median          uq         max neval
                  bench_fun(hg_convert)  665167.953  678451.008  704030.2427  691859.576  738071.699  777176.143    10
            bench_fun(jw_convert_match)  878269.025  950580.480  962171.6683  956413.486  990592.691 1014922.639    10
     bench_fun(jw_convert_match_factor) 1082116.859 1104331.677 1182310.1228 1184336.810 1198233.436 1436600.764    10
        bench_fun(jw_convert_fastmatch)     203.031     220.134     462.1246     289.647     305.070    2196.906    10
 bench_fun(jw_convert_fastmatch_factor)     251.474     300.729    1351.6974     317.439     362.127   10604.506    10

# starting benchmark two, compare with factor vs vector of words
Unit: seconds
                                   expr      min       lq     mean   median       uq      max neval
        bench_fun(jw_convert_fastmatch) 3.066001 3.134702 3.186347 3.177419 3.212144 3.351648    10
 bench_fun(jw_convert_fastmatch_factor) 3.012734 3.149879 3.281194 3.250365 3.498593 3.563907    10

So I would not go to the trouble of a parallel implementation just yet.

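【Comments】:

Oh, thanks. Let's suppose I only want to map words to integers, whatever those integers are, because I do not actually care about using word positions to turn sentences into numeric vectors. Do you see something simpler?
I am assuming you want the same word to be represented by the same number in every sentence. If that is not the case, it does simplify the problem a little, but I doubt that is what you are after.
Even if you do not care about the order of the words within each sentence, R will store them that way (since it has no equivalent of C++ std::unordered_set).
Yes, of course, same word to same integer. I just tried it. It is really impressive. I really need to understand how to get from 700K sec to 3 sec!
I am not a computer scientist, but I have had to learn a few performance tricks. As I understand it, hashing is a way of deriving, from the element itself, the bucket that holds that element of the set. en.wikipedia.org/wiki/Hash_table is a good place to start. The only thing I would add is that it is worth looking at the source code, especially when something is slower than you need. You need to know exactly what operations the CPU performs on the data in order to speed things up. In your case, each lookup moves a lot of data from main memory to the CPU, which is probably the bottleneck.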
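The hashing idea described in the last comment can be sketched in plain R with a hashed environment. This is only an illustration under stated assumptions: the toy `WORDS` vector and the helper `lookup_positions` are hypothetical, and this is not how fastmatch is implemented. The point is that the table is built once, after which each lookup goes to the bucket for its key instead of scanning the whole vocabulary.

```r
# Toy vocabulary; a stand-in for the poster's WORDS vector (assumed values).
WORDS <- c("aba", "accra", "ada", "afrika", "afrikan", "afula")

# Build the hash table once: each word is a key, its position is the value.
index <- new.env(hash = TRUE, size = length(WORDS))
for (i in seq_along(WORDS)) {
  assign(WORDS[i], i, envir = index)
}

# Hypothetical helper: look every word of a sentence up in the prebuilt table.
lookup_positions <- function(sentence) {
  vapply(sentence, function(w) {
    pos <- index[[w]]                      # NULL when the key is absent
    if (is.null(pos)) NA_integer_ else pos
  }, integer(1), USE.NAMES = FALSE)
}

result <- lookup_positions(c("afula", "ada", "zzz"))
print(result)
```

An R-level loop like this will likely be much slower than `fmatch`, which does the same kind of work in C. It is shown only to make the "build once, look up many times" pattern visible.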

【Solution 2】:

This will not be any faster, but it is the tidy way of going about things.

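library(dplyr)
library(tidyr)

# one row per sentence, then unnest to one row per word
# (tibble() replaces the deprecated data_frame())
sentence = 
  tibble(word.name = SENTENCES,
         sentence.ID = seq_along(SENTENCES)) %>%
  unnest(word.name)

# lookup table: each word with its position in WORDS
word = tibble(
  word.name = WORDS,
  word.ID = seq_along(WORDS))

# attach word.ID to every word of every sentence
sentence__word = 
  sentence %>%
  left_join(word, by = "word.name")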
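The same long-format idea can be sketched in base R, without dplyr or tidyr: flatten all sentences into one vector, do a single vectorised lookup, and split the result back per sentence. The toy `WORDS` and `SENTENCES` below are made-up stand-ins, and this is a sketch of the approach rather than a tuned implementation.

```r
# Toy stand-ins for the poster's objects (assumed values, for illustration only).
WORDS <- c("aba", "accra", "ada", "afrika", "afrikan", "afula")
SENTENCES <- list(c("ada", "aba"), c("afula", "accra", "ada"))

# Long format: one entry per word, plus the ID of the sentence it came from.
word_name   <- unlist(SENTENCES)
sentence_id <- rep(seq_along(SENTENCES), lengths(SENTENCES))

# A single vectorised lookup plays the role of the join on word.name.
word_id <- match(word_name, WORDS)

# Back to one integer vector per sentence; the explicit factor levels keep
# a slot for any sentence that happens to be empty.
SENTENCES_NUM <- unname(split(word_id,
                              factor(sentence_id, levels = seq_along(SENTENCES))))
print(SENTENCES_NUM)
```

Calling `match` once on the flattened vector avoids the per-sentence function call overhead of `lapply`, which may matter with millions of short sentences.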

【Comments】: