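R: how to speed up pattern matching using vectors (answers)

【Question title】: R how to speed up pattern matching using vectors
【Posted】: 2022-01-11 03:18:59
【Question description】:

I have a column in a data frame that contains city and state names: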

ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")

ac <- as.data.frame(ac)

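I want to search for the values of ac$ac in a column of another data frame, df$description, and return the value of the id column wherever there is a match.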

dput(df)
structure(list(month = c(202110L, 201910L, 202005L, 201703L, 
201208L, 201502L), id = c(100559687L, 100558763L, 100558934L, 
100558946L, 100543422L, 100547618L), description = c("residential local telephone service local with more san francisco ca flat rate with eas package plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95", 
"digital video programming service multilatino ultra bensalem pa service includes digital economy multilatino digital preferred tier and certain additonal digital channels coaxial cable transmission", 
"residential all distance telephone service  unlimited  voice only harrisburg pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking coaxial cable transmission", 
"residential all distance telephone service  unlimited voice only pittsburgh pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking", 
"local spot advertising 30 second advertisement austin tx weekday 6 am 6 pm other audience demographic w18 49 number of rating points for daypart 0 29 average cpp 125", 
"residential public switched toll interstate manhattan ks ks plan area residence switched toll base period average revenue per minute 0 18 minute online"
)), row.names = c(1L, 1245L, 3800L, 10538L, 20362L, 50000L), class = "data.frame")

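I tried to do this by getting the row indices of the matches with the following approaches:

which(ac$ac %in% df$description) -- this returns integer(0).
grep(ac$ac, df$description, value = FALSE) -- this returns the first index, 1, but it is not vectorized.
str_detect(string = ac$ac, pattern = df$description) -- but this returns all FALSE, which is incorrect.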
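
A small sketch of why the first two attempts behave as described, using two shortened strings as stand-ins for the real data (ac_vec and descr are illustrative names, not from the question). %in% compares whole strings, and grep() uses only the first element of a multi-element pattern, so one grepl() call per pattern is needed:

```r
# Shortened, hypothetical stand-ins for ac$ac and df$description
ac_vec <- c("san francisco ca", "pittsburgh pa")
descr  <- c("local with more san francisco ca flat rate",
            "voice only pittsburgh pa flat rate")

# %in% tests whole-string equality, so a substring never counts as a match
exact <- ac_vec %in% descr

# grep() uses only the first pattern and emits a warning about the rest
first_only <- suppressWarnings(grep(ac_vec, descr))

# One grepl() call per pattern: rows are descriptions, columns are patterns
hits <- sapply(ac_vec, grepl, x = descr, fixed = TRUE)
```

Here fixed = TRUE is an assumption that the city strings are literal text rather than regular expressions.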

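My question: how can I search for ac$ac in df$description and, where there is a match, return the corresponding value of df$id? Note that the vectors have different lengths. I am looking for all matches, not just the first one. I would prefer something simple and fast, because the actual datasets I will be using each have over 100k rows, but any suggestions or ideas are welcome. Thanks.

Edit: Following Andre's initial answer below, the question title has been changed to reflect the change in the scope of the question.

Edit (12/7): Adding a bounty to generate additional interest and a fast, efficient, scalable solution.

Edit (12/8): Clarification: I would like to be able to add the id variable from df to the ac data frame, as in ac$id.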

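【Question comments】:

The question was changed after answers had been given, and the variable names were replaced. If you change an important part of the question, it is better to append a new block to the question; otherwise the people who volunteered to answer appear to have wasted their time, because their answers no longer make sense.
@asd-tm Fair point. I should have updated my question. I have edited it now. Hopefully that is enough.
My note relates specifically to my answer about the variable names.
I am asking because otherwise the results could be captured/collected in a vector instead of a list.
@javlenti I updated my answer. I hope this is what you expect now.

Tags: r string dataframe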

【Solution 1】:

Try this with sapply and grep.

df$id[ unlist( sapply( ac$ac, function(x) grep(x, df$description ) ) ) ]
[1] 100559687 100558946 100547618

Edit: try stri_detect_regex from stringi. It should be 2-5 times faster.

library(stringi)

df$id[ as.logical( rowSums( sapply( ac$ac, function(x) 
  stri_detect_regex( df$description, x ) ) ) ) ]
[1] 100559687 100558946 100547618
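
The sapply() call above builds one logical column per pattern, so its peak size grows with nrow(df) times length(ac$ac). A minimal base R sketch that instead accumulates the hits in a single logical vector, which may help if memory is the limit. The data are a hypothetical miniature, and fixed = TRUE assumes the patterns are literal strings:

```r
# Hypothetical miniature data; ids and descr stand in for df$id and df$description
ids      <- c(101L, 102L, 103L)
descr    <- c("more san francisco ca flat rate",
              "weekday austin tx 6 am",
              "voice only pittsburgh pa flat rate")
patterns <- c("san francisco ca", "pittsburgh pa", "new york ny")

# Keep a single logical vector and OR each pattern's hits into it,
# instead of building a length(descr) x length(patterns) matrix
hit <- logical(length(descr))
for (p in patterns) {
  hit <- hit | grepl(p, descr, fixed = TRUE)
}

matched_ids <- ids[hit]
```

Like the two snippets above, this returns the ids of descriptions that contain any pattern, without recording which pattern matched.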

Microbenchmark on an extended dataset with 1.728M rows:

Unless your system has less than 4 GB of total memory, memory should not be a problem.

nrow(df)
[1] 1728000

library(microbenchmark)

microbenchmark( 
  "grep1" = { res <- sapply(ac$ac, function(x) df$id[grep(x, df$description)]) },
  "grep2" = { res <- df$id[ unlist( sapply( ac$ac, function(x) grep(x, df$description ) ) ) ] },
  "stringi" = { res <- df$id[ as.logical( rowSums( sapply( ac$ac, function(x) stri_detect_regex( df$description, x ) ) ) ) ] }, times=10 )

Unit: seconds
   expr      min       lq      mean   median        uq       max neval cld
  grep1 96.90757 97.98706 100.13299 99.05837 101.99050 107.04312    10   b
  grep2 97.51382 97.66425 100.00610 99.20753 101.17921 106.86661    10   b
stringi 46.15548 46.65894  48.68073 47.29635  50.15713  53.50351    10  a

Memory footprint during the microbenchmark:
Path: /Library/Frameworks/R.framework/Versions/4.0/Resources/bin/exec/R
Physical footprint: 638.3M
Physical footprint (peak): 1.8G

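【Comments】:

This seems to work, but it is slow.
@asd-tm Thanks for your message! I was in the middle of editing when I saw the change, so all updates are in the answer.
@Andre Sorry, I mistakenly posted the comment on your answer instead of under the question!
@asd-tm No worries; for recent answers it is somewhat helpful to know whether the code still works. I know your wording was aimed at the OP :)
I like this solution because it is simple and readable, but it does not seem to scale. When I tried it, I got an error from R: cannot allocate vector of size 2 GB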

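【Solution 2】:

First, there is no c$c assignment in the code provided. All the data is assigned to a variable named c. This variable has no c member (c$c) of the kind you are trying to use.

Second, assigning data to a variable named after a base R function, as in c <- c(...), is very bad practice.

【Comments】: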

【Solution 3】:

Maybe this is an option?

ac$id <- sapply(ac$ac, function(x) df$id[grep(x, df$description)])
#                 ac        id
# 1 san francisco ca 100559687
# 2    pittsburgh pa 100558946
# 3  philadelphia pa          
# 4    washington dc          
# 5      new york ny          
# 6     aliquippa pa          
# 7   gainesville fl          
# 8     manhattan ks 100547618

【Comments】:

Applying fixed = TRUE will be a bit faster.
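
A sketch of that suggestion on a hypothetical two-row example (the ids and shortened descriptions are made up). fixed = TRUE makes grep() treat each pattern as a literal string, which is an assumption that suits plain city names. lapply() is used instead of sapply() so that the result is always a list column, even when some patterns have no match:

```r
# Hypothetical miniature versions of ac and df
ac <- data.frame(ac = c("san francisco ca", "philadelphia pa"),
                 stringsAsFactors = FALSE)
df <- data.frame(id = c(101L, 102L),
                 description = c("more san francisco ca flat rate",
                                 "weekday austin tx 6 am"),
                 stringsAsFactors = FALSE)

# fixed = TRUE treats each pattern as literal text, not as a regular expression
ac$id <- lapply(ac$ac, function(x) {
  df$id[grep(x, df$description, fixed = TRUE)]
})
```

The number of rows in ac is unchanged; patterns without a match get an empty integer vector.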

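【Solution 4】:

Checking with a single regular expression and inexpensive functions should be fast:

First, we build the pattern to check against: ac_regex <- paste(ac$ac, collapse = "|").

There are several ways to detect the matches in description and subset on them. Here are three: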

# 1 grep()
df[grep(ac_regex, df$description), ]["id"]
# 2 stringi::stri_detect_*()
df[stri_detect_regex(df$description, ac_regex), ]["id"]
# 3 stringr::str_detect() + tidy subsetting
df %>% filter(description %>% str_detect(ac_regex)) %>% select(id)

All three return the desired subset of df:

         id
1 100559687
2 100558946
3 100547618

(For options 2 and 3 you need the packages tidyverse and stringi.)
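
A single alternation tells you which descriptions matched, but not which city matched. A base R sketch, on hypothetical miniature data, of recovering the matched city with regexpr() and regmatches() and then looking up an id per city. It is only an approximation of the requested ac$id: it keeps the first pattern found in each description and the first description per city.

```r
# Hypothetical miniature data standing in for ac$ac, df$id and df$description
ac_vec <- c("san francisco ca", "pittsburgh pa", "new york ny")
ids    <- c(101L, 102L, 103L)
descr  <- c("more san francisco ca flat rate",
            "weekday austin tx 6 am",
            "voice only pittsburgh pa flat rate")

ac_regex <- paste(ac_vec, collapse = "|")

# regexpr() gives the position of the first match per description (-1 if none)
m <- regexpr(ac_regex, descr)
city <- rep(NA_character_, length(descr))
city[m > 0] <- regmatches(descr, m)

# For each city, the id of the first description whose first match was that city
id_for_ac <- ids[match(ac_vec, city)]
```

Cities that never appear get NA, so the result has the same length as ac_vec.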

Let's benchmark (using the bench package):

bench::mark(
  base_grep = df[grep(ac_regex, df$description), ]["id"],
  base_stringi = df[stringi::stri_detect_regex(df$description, ac_regex), ]["id"],
  tidy = df %>% filter(description %>% str_detect(ac_regex)) %>% select(id),
  check = F
)

  expression     median 
  <bch:expr>   <bch:tm>   
1 base_grep    146.61µs      
2 base_stringi  119.6µs     
3 tidy           1.99ms

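I would go with stringi!

【Comments】:

For some reason I get an invalid regular expression error when using this on the whole data frame. There is also a warning: In grep(ac_regex, df$description): TRE pattern compilation error 'Out of memory'. I have plenty of RAM, so I do not see how I could be running out of memory.
That is because paste0() needs ac to be a vector. I forgot to include this in my answer. Corrected.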
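
If the combined pattern is still too large for the regex engine after that fix, one possible workaround is to split the patterns into chunks, build one alternation per chunk, and OR the results together. This is a sketch on hypothetical miniature data; chunk_size is an illustrative name and value, not something from the answer:

```r
# Hypothetical miniature data
descr    <- c("more san francisco ca flat rate",
              "weekday austin tx 6 am",
              "switched toll interstate manhattan ks ks plan")
patterns <- c("san francisco ca", "pittsburgh pa", "new york ny", "manhattan ks")
chunk_size <- 2L  # illustrative; tune to what the regex engine can compile

# Split the patterns into groups of at most chunk_size elements
chunks <- split(patterns, ceiling(seq_along(patterns) / chunk_size))

# One alternation per chunk, then combine the logical vectors with |
hit <- Reduce(`|`, lapply(chunks, function(ch) {
  grepl(paste(ch, collapse = "|"), descr)
}))
```

Each grepl() call then compiles a much smaller pattern, at the cost of scanning the descriptions once per chunk.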

【Solution 5】:

The simplest solution is usually the fastest! Here is my suggestion:

str = paste0(ac, collapse="|")
df$id[grep(str, df$description)]

But you can also do it this way

df$id[as.logical(rowSums(!is.na(sapply(ac, function(x) stringr::str_match(df$description, x)))))]

or this way

df$id[grepl(str, df$description, perl=T)]

However, these need to be compared. By the way, I added the suggestions from @Andre Wildberg and @Martina C. Arnolda. The benchmark is below.

str = paste0(ac, collapse="|")
fFiolka1 = function() df$id[grep(str, df$description)]
fFiolka2 = function() df$id[as.logical(rowSums(!is.na(sapply(ac, function(x) stringr::str_match(df$description, x)))))]
fFiolka3 = function() df$id[grepl(str, df$description, perl=T)]

fWildberg1 = function() df$id[unlist(sapply(ac, function(x) grep(x, df$description)))]
fWildberg2 = function() df$id[as.logical(rowSums(sapply(ac, function(x) stri_detect_regex(df$description, x))))]

fArnolda1 = function() df[grep(str, df$description), ]["id"]
fArnolda2 = function() df[stringi::stri_detect_regex(df$description, str), ]["id"]
fArnolda3 = function() df %>% filter(description %>% str_detect(str)) %>% select(id)

library(microbenchmark)
ggplot2::autoplot(microbenchmark(
  fFiolka1(), fFiolka2(), fFiolka3(),
  fWildberg1(), fWildberg2(),
  fArnolda1(), fArnolda2(), fArnolda3(),
  times=100))

Note that, for simplicity, I kept ac as a vector!

ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")

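Special update for @jvalenti

OK. Now I understand better what you want to achieve. However, to fully demonstrate the best solution, I modified your data slightly. Here it is: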

library(tidyverse)

ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")
ac = tibble(ac = ac)

df = structure(list(
  month = c(202110L, 201910L, 202005L, 201703L, 201208L, 201502L), 
  id = c(100559687L, 100558763L, 100558934L, 100558946L, 100543422L, 100547618L), 
  description = c(
    "residential local telephone pittsburgh pa local with more san francisco ca flat rate with eas philadelphia pa plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95",
    "digital video san francisco ca pittsburgh pa  multilatino ultra bensalem pa service includes digital economy multilatino digital preferred tier and certain additonal digital channels coaxial cable transmission",
    "residential all distance telephone pittsburgh pa unlimited voice only harrisburg pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking coaxial cable transmission",
    "residential all distance telephone pittsburgh pa unlimited voice philadelphia pa san francisco ca pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking",
    "local spot advertising 30 second advertisement austin tx weekday 6 am 6 pm other audience demographic w18 49 number of rating points for daypart 0 29 average cpp 125",
    "residential public switched toll pittsburgh pa manhattan ks ks plan area residence switched toll base san philadelphia pa ca average revenue per minute 0 18 minute online"
  )), row.names = c(1L, 1245L, 3800L, 10538L, 20362L, 50000L), class = "data.frame")

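Below you will find four different solutions: one based on a for loop, two based on functions from the dplyr package, and one based on a function from the collapse package.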

fSolition1 = function(){
  id = vector("list", nrow(ac))
  for(i in seq_along(ac$ac)){
    id[[i]] = df$id[grep(ac$ac[i], df$description)]
  }
  ac %>% mutate(id = id) %>% unnest(id)
}
fSolition1()

fSolition2 = function(){
  ac %>% group_by(ac) %>% 
  mutate(id = list(df$id[grep(ac, df$description)])) %>% 
  unnest(id)
}
fSolition2()

fSolition3 = function(){
  ac %>% rowwise(ac) %>% 
  mutate(id = list(df$id[grep(ac, df$description)])) %>% 
  unnest(id)
}
fSolition3()

fSolition4 = function(){
ac %>%  
  collapse::ftransform(id = lapply(ac, function(x) df$id[grep(x, df$description)])) %>% 
  unnest(id)
}
fSolition4()

Note that, for the given data, all of the functions return the following table as the result

# A tibble: 12 x 2
   ac                      id
   <chr>                <int>
 1 san francisco ca 100559687
 2 san francisco ca 100558763
 3 san francisco ca 100558946
 4 pittsburgh pa    100559687
 5 pittsburgh pa    100558763
 6 pittsburgh pa    100558934
 7 pittsburgh pa    100558946
 8 pittsburgh pa    100547618
 9 philadelphia pa  100559687
10 philadelphia pa  100558946
11 philadelphia pa  100547618
12 manhattan ks     100547618

Time to benchmark


library(microbenchmark)
ggplot2::autoplot(microbenchmark(
  fSolition1(), fSolition2(), fSolition3(), fSolition4(), times=100))

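It is probably no surprise to anyone that the collapse-based solution is the fastest. However, second place may come as a big surprise: the old solution based on a for loop came second! Does anyone still want to say that for is slow?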
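
One plausible reason, offered as an assumption rather than a measured fact: a for loop that writes into a preallocated list does the same per-pattern grep() work as lapply(), so the looping construct itself adds little. A base R sketch on hypothetical miniature data showing that the two produce identical results:

```r
# Hypothetical miniature data
patterns <- c("san francisco ca", "pittsburgh pa", "new york ny")
ids      <- c(101L, 102L, 103L)
descr    <- c("pittsburgh pa local with more san francisco ca",
              "voice only pittsburgh pa flat rate",
              "weekday austin tx 6 am")

# for loop writing into a preallocated list
res_for <- vector("list", length(patterns))
for (i in seq_along(patterns)) {
  res_for[[i]] <- ids[grep(patterns[i], descr, fixed = TRUE)]
}

# The same work expressed with lapply()
res_lapply <- lapply(patterns, function(p) ids[grep(p, descr, fixed = TRUE)])
```

The preallocation with vector("list", n) is what keeps the loop from growing the list on every iteration.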

Special update for @Gwang-Jin Kim

Operating on plain vectors does not change much. See below.

df_ac = ac$ac
df_decription = df$description
df_id = df$id
fSolition5 = function(){
  id = vector("list", length = length(df_ac))
  for(i in seq_along(df_ac)){
    id[[i]] = df_id[grep(df_ac[i], df_decription)]
  }
  ac %>% mutate(id = id) %>% unnest(id)
}
fSolition5()

library(microbenchmark)
ggplot2::autoplot(microbenchmark(
  fSolition1(), fSolition2(), fSolition3(), fSolition4(), fSolition5(), times=100))

However, the combination of for and ftransform may be surprising!!!

fSolition6 = function(){
  id = vector("list", nrow(ac))
  for(i in seq_along(ac$ac)){
    id[[i]] = df$id[grep(ac$ac[i], df$description)]
  }
  ac %>% collapse::ftransform(id = id) %>% unnest(id)
}
fSolition6()

library(microbenchmark)
ggplot2::autoplot(microbenchmark(
  fSolition1(), fSolition2(), fSolition3(), fSolition4(), fSolition5(), fSolition6(), times=100))

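Latest update for @jvalenti

Dear jvaleniti, in your question you wrote "I have a column in a data frame that contains city and state names" and then "the actual datasets I will be using each have over 100k rows". My conclusion is that a given city will very likely appear more than once in your description variable.

However, in a comment you wrote "I do not want to change the number of rows in ac". So what kind of result do you expect? Let's see what can be done.

Solution 1 - we return all ids as a list of vectors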

ac %>% collapse::ftransform(id = map(ac, ~df$id[grep(.x, df$description)])) 
# # A tibble: 8 x 2
# ac               id       
# * <chr>            <list>   
#   1 san francisco ca <int [3]>
#   2 pittsburgh pa    <int [5]>
#   3 philadelphia pa  <int [3]>
#   4 washington dc    <int [0]>
#   5 new york ny      <int [0]>
#   6 aliquippa pa     <int [0]>
#   7 gainesville fl   <int [0]>
#   8 manhattan ks     <int [1]>

Solution 2 - we return only the first id

ac %>% collapse::ftransform(id = map_int(ac, ~df$id[grep(.x, df$description)][1])) 
# # A tibble: 8 x 2
# ac                      id
# * <chr>                <int>
# 1 san francisco ca 100559687
# 2 pittsburgh pa    100559687
# 3 philadelphia pa  100559687
# 4 washington dc           NA
# 5 new york ny             NA
# 6 aliquippa pa            NA
# 7 gainesville fl          NA
# 8 manhattan ks     100547618

Solution 3 - we return only the last id

ac %>%
  collapse::ftransform(id = map_int(ac, function(x) {
    idx = grep(x, df$description)
    ifelse(length(idx)>0, df$id[idx[length(idx)]], NA)})) 
# # A tibble: 8 x 2
# ac                      id
# * <chr>                <int>
# 1 san francisco ca 100558946
# 2 pittsburgh pa    100547618
# 3 philadelphia pa  100547618
# 4 washington dc           NA
# 5 new york ny             NA
# 6 aliquippa pa            NA
# 7 gainesville fl          NA
# 8 manhattan ks     100547618

Solution 4 - or maybe you want to pick a random id from all the possible options

ac %>%
  collapse::ftransform(id = map_int(ac, function(x) {
    idx = grep(x, df$description)
    ifelse(length(idx)==0, NA, ifelse(length(idx)==1, df$id[idx], df$id[sample(idx, 1)]))})) 
# # A tibble: 8 x 2
# ac                      id
# * <chr>                <int>
# 1 san francisco ca 100558763
# 2 pittsburgh pa    100559687
# 3 philadelphia pa  100547618
# 4 washington dc           NA
# 5 new york ny             NA
# 6 aliquippa pa            NA
# 7 gainesville fl          NA
# 8 manhattan ks     100547618

Solution 5 - if you happen to want to see all ids while keeping the number of rows in ac

ac %>%
  collapse::ftransform(id = map(ac, function(x) {
    idx = grep(x, df$description)
    if(length(idx)==0) tibble(id = NA, idn = "id1") else tibble(
      id = df$id[idx],
      idn = paste0("id",1:length(id)))})) %>% 
  unnest(id) %>% 
  pivot_wider(id_cols = ac, names_from = idn, values_from = id)
# # A tibble: 8 x 6
# ac                     id1       id2       id3       id4       id5
# <chr>                <int>     <int>     <int>     <int>     <int>
# 1 san francisco ca 100559687 100558763 100558946        NA        NA
# 2 pittsburgh pa    100559687 100558763 100558934 100558946 100547618
# 3 philadelphia pa  100559687 100558946 100547618        NA        NA
# 4 washington dc           NA        NA        NA        NA        NA
# 5 new york ny             NA        NA        NA        NA        NA
# 6 aliquippa pa            NA        NA        NA        NA        NA
# 7 gainesville fl          NA        NA        NA        NA        NA
# 8 manhattan ks     100547618        NA        NA        NA        NA

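Unfortunately, your description does not indicate which of the five solutions above would be acceptable to you. You will have to decide for yourself.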
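
For readers without the tidyverse or collapse packages, the "first id, NA when nothing matches" variant can also be sketched in base R with vapply(). The data below are a hypothetical miniature, and fixed = TRUE assumes the city names are literal strings:

```r
# Hypothetical miniature versions of ac and df
ac <- data.frame(ac = c("san francisco ca", "washington dc", "pittsburgh pa"),
                 stringsAsFactors = FALSE)
df <- data.frame(id = c(101L, 102L, 103L),
                 description = c("pittsburgh pa local with more san francisco ca",
                                 "digital video pittsburgh pa",
                                 "weekday austin tx 6 am"),
                 stringsAsFactors = FALSE)

# One integer per row of ac: the id of the first matching description, else NA
ac$id <- vapply(ac$ac, function(x) {
  idx <- grep(x, df$description, fixed = TRUE)
  if (length(idx) > 0L) df$id[idx[1L]] else NA_integer_
}, integer(1), USE.NAMES = FALSE)
```

vapply() with integer(1) guarantees a plain integer column of length nrow(ac), so the number of rows in ac never changes.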

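【Comments】:

I need to add the id column to my original ac data frame. Since the two have different lengths, how would this work?
What if you use unique(ac$ac)?
Keeping it as a vector versus working with a data frame will certainly affect the speed.
This is good, but it does not return the original data frame, only the matches. Is it possible to return the original data frame ac with its original number of rows, with the id variable added and blanks or NA in the rows with no match? I do not want to change the number of rows in ac. Sorry for the confusion.
Thank you very much for your help, Marek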

【Solution 6】:

You can use regex_inner_join from the fuzzyjoin package

> library(fuzzyjoin)

> regex_inner_join(df, ac, by = c(description = "ac"))
   month        id
1 202110 100559687
2 201703 100558946
3 201502 100547618

                                                              description
1 residential local telephone service local with more san francisco ca flat rate with eas package plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95
2               residential all distance telephone service  unlimited voice only pittsburgh pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking
3                                                                                                                                                                                      residential public switched toll interstate manhattan ks ks plan area residence switched toll base period average revenue per minute 0 18 minute online
                ac
1 san francisco ca
2    pittsburgh pa
3     manhattan ks
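
Because this is an inner join, cities without any match are dropped from the result. A base R sketch of the same idea that keeps unmatched cities with NA; it is an emulation on hypothetical miniature data, not fuzzyjoin's implementation, and it assumes literal patterns (fixed = TRUE):

```r
# Hypothetical miniature versions of ac and df
ac <- data.frame(ac = c("san francisco ca", "pittsburgh pa", "new york ny"),
                 stringsAsFactors = FALSE)
df <- data.frame(id = c(101L, 102L, 103L),
                 description = c("pittsburgh pa local with more san francisco ca",
                                 "voice only pittsburgh pa flat rate",
                                 "weekday austin tx 6 am"),
                 stringsAsFactors = FALSE)

# Logical matrix: rows are descriptions, columns are cities
hits <- sapply(ac$ac, grepl, x = df$description, fixed = TRUE)

# Every (description, city) pair that matched
pairs <- which(hits, arr.ind = TRUE)
matched <- data.frame(ac = ac$ac[pairs[, "col"]],
                      id = df$id[pairs[, "row"]],
                      stringsAsFactors = FALSE)

# Left join back onto ac so that cities without a match are kept with NA
res <- merge(ac, matched, by = "ac", all.x = TRUE)
```

Note that merge() sorts the result by the key column, and that a city matching several descriptions appears once per match.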

【Comments】: