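【Posted】: 2021-07-16 17:12:48
【Question】:
I have two large tables, each with a "sentence" column that holds a string of words. For every pair of records, I want a TRUE/FALSE output saying whether any word in a sentence from one column also appears in a sentence from the other column. My tables are very large, and the code below can take a long time to run. Is there a faster way to do this?
Thanks!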
# Determine if any "words" in either column of sentences match.
# Packages
library(tidyverse)
# Help functions
helper_in_2 <- function(b, a){
  return(any(b %in% a))
}
helper_in <- function(a, b){
  return(lapply(b, helper_in_2, a))
}
# Sample columns
sentence_col_a <- c("This is an example sentence.", "Here is another sample sentence?", "One more sentence that is not complicated.", "Last sentence to show an example!")
sentence_col_b <- c("Short string A.", "Another longer string.", "Final string example!")
# Extract words from each column
list_col_a <- str_to_lower(sentence_col_a) %>%
  str_extract_all("[:alpha:]+")
list_col_b <- str_to_lower(sentence_col_b) %>%
  str_extract_all("[:alpha:]+")
# Check for matches.
# (Code after first line isn't actually used in my code - it's just to show matches)
sapply(lapply(list_col_a, helper_in, list_col_b), as.numeric) %>%
  t() %>%
  as.data.frame() %>%
  rename_at(vars(names(.)), function(x) sentence_col_b) %>%
  mutate(rownames = sentence_col_a) %>%
  tibble::column_to_rownames(var = "rownames")
Output:
| Sentences | Short string A. | Another longer string. | Final string example! |
|---|---|---|---|
| This is an example sentence. | 0 | 0 | 1 |
| Here is another sample sentence? | 0 | 1 | 0 |
| One more sentence that is not complicated. | 0 | 0 | 0 |
| Last sentence to show an example! | 0 | 0 | 1 |
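One direction that might be worth benchmarking on the real tables is to avoid comparing every pair of sentences and instead join the two columns on the words themselves. The following is a minimal base R sketch under that assumption, not a tested solution for large data. The names `split_words`, `long_a`, `long_b`, `pairs`, and `match_mat` are hypothetical and do not appear in the code above.

```r
# Sample data copied from the question.
sentence_col_a <- c("This is an example sentence.",
                    "Here is another sample sentence?",
                    "One more sentence that is not complicated.",
                    "Last sentence to show an example!")
sentence_col_b <- c("Short string A.", "Another longer string.",
                    "Final string example!")

# Hypothetical helper: unique lower-case words per sentence (base R only).
split_words <- function(x) {
  x <- tolower(x)
  lapply(regmatches(x, gregexpr("[[:alpha:]]+", x)), unique)
}
words_a <- split_words(sentence_col_a)
words_b <- split_words(sentence_col_b)

# Long tables: one row per (sentence id, distinct word).
long_a <- data.frame(id_a = rep(seq_along(words_a), lengths(words_a)),
                     word = as.character(unlist(words_a)))
long_b <- data.frame(id_b = rep(seq_along(words_b), lengths(words_b)),
                     word = as.character(unlist(words_b)))

# Join on the word; each remaining row is a pair of sentences sharing a word.
pairs <- unique(merge(long_a, long_b, by = "word")[, c("id_a", "id_b")])

# Fill a logical matrix from the matching (row, column) pairs.
match_mat <- matrix(FALSE, nrow = length(words_a), ncol = length(words_b),
                    dimnames = list(sentence_col_a, sentence_col_b))
match_mat[cbind(pairs$id_a, pairs$id_b)] <- TRUE
match_mat
```

On the sample data this marks the same three pairs as the table above. Whether it is actually faster depends on the data: very common words produce many joined rows, and the dense result matrix still has one cell per pair of sentences, so keeping only `pairs` may be preferable when both tables are large.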
Update after Ronak's answer:
library(microbenchmark)
microbenchmark("Original method:" = sapply(lapply(list_col_a, helper_in, list_col_b), as.numeric),
"Ronak's method:" = sapply(list_col_a, function(x) as.integer(grepl(sprintf('\\b(%s)\\b', paste0(x, collapse = '|')), list_col_b))))
# Unit: microseconds
#             expr   min     lq    mean median    uq    max neval
# Original method:  72.9  76.65  88.082  82.35  86.1  173.9   100
#  Ronak's method: 262.1 277.40 354.741 286.40 348.6 3724.3   100
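To make it easier to see what the regex version in the benchmark is doing, here is a small sketch of the pattern it builds and the text it searches. It assumes the word lists look like the ones extracted above. `words_a1`, `list_b`, and `haystack` are illustrative names only.

```r
# Words of the first sentence in column a, and two entries from column b.
words_a1 <- c("this", "is", "an", "example", "sentence")
list_b   <- list(c("short", "string", "a"), c("final", "string", "example"))

# The alternation pattern built by sprintf() / paste0().
pattern <- sprintf("\\b(%s)\\b", paste0(words_a1, collapse = "|"))
pattern

# grepl() coerces a list to character, so each word vector is searched
# as its deparsed text rather than as separate words.
haystack <- as.character(list_b)
haystack

# One logical per entry of list_b.
grepl(pattern, haystack)
```

Because the search runs over deparsed text such as `c("short", "string", "a")`, the leading `c` is part of the string being matched, which could give a false positive if a sentence in column a contained the single-letter word "c". The timings above also come from the 4 x 3 sample, so they may not reflect how either method behaves on the full tables.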
【Comments】:
Tags: r string performance