How to extract specific string patterns in a case_when statement using regular expressions? Answers

【Question title】: How to extract specific string patterns in a case_when statement using regular expressions?
【Posted】: 2026-02-15 22:05:01
【Question】:

Consider the following reproducible dataset, which I created based on the Donald Trump Tweets dataset (linked from the original post):

library(dplyr)  # provides tibble(), mutate(), case_when() and %>%

df <- tibble(target = c(rep("jeb-bush", 2), rep("jeb-bush-supporters", 2),
                        "jeb-staffer", rep("the-media", 5)),
             tweet_id = seq(1, 10, 1))

It consists of two columns: target, the group the tweet is aimed at, and tweet_id:

# A tibble: 10 x 2
   target              tweet_id
   <chr>                  <dbl>
 1 jeb-bush                   1
 2 jeb-bush                   2
 3 jeb-bush-supporters        3
 4 jeb-bush-supporters        4
 5 jeb-staffer                5
 6 the-media                  6
 7 the-media                  7
 8 the-media                  8
 9 the-media                  9
10 the-media                 10

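Goal:

Whenever an element of target starts with jeb, I want to extract the string pattern after the -. Whenever an element starting with jeb contains more than one -, I want to extract the string pattern after the last - (in this example dataset that only applies to jeb-bush-supporters). For every element that does not start with jeb, I simply want to create the string other. In the end it should look like this: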

# A tibble: 10 x 3
   target              tweet_id new_var   
   <chr>                  <dbl> <chr>     
 1 jeb-bush                   1 bush      
 2 jeb-bush                   2 bush      
 3 jeb-bush-supporters        3 supporters
 4 jeb-bush-supporters        4 supporters
 5 jeb-staffer                5 staffer   
 6 the-media                  6 other     
 7 the-media                  7 other     
 8 the-media                  8 other     
 9 the-media                  9 other     
10 the-media                 10 other

What I have tried:

I did manage to create the desired output with the following code:

df %>% 
    mutate(new_var = case_when(
        str_detect(target, "^jeb-[a-z]+$") ~
            str_extract(target, "(?<=[a-z]{3}-)[a-z]+"),
        str_detect(target, "^jeb-[a-z]+-[a-z]+") ~
            str_extract(target, "(?<=[a-z]{3}-[a-z]{4}-)[a-z]+"),
        TRUE ~ "other"))

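But here is the problem:

In the second str_extract call, I have to specify the exact number of letters ([a-z]{4}) inside the positive lookbehind. Otherwise R complains that the lookbehind needs a "bounded maximum length". But what if I do not know the exact length of the pattern, or it varies from element to element?

Alternatively, I tried using capture groups instead of lookarounds. So I tried str_match, to define what I want to extract rather than what I do not want to extract: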

df %>% 
    mutate(new_var = case_when(str_detect(target, "^jeb-[a-z]+$") ~
                             str_match(target, "jeb-([a-z]+)"),
                           str_detect(target, "^jeb-[a-z]+-[a-z]+") ~
                             str_match(target, "jeb-[a-z]+-([a-z]+)"),
                           TRUE ~ "other"))

But then I get this error message:

Error: Problem with `mutate()` input `new_var`.
x `str_detect(target, "^jeb-[a-z]+$") ~ str_match(target, "jeb-([a-z]+)")`, `str_detect(target, "^jeb-[a-z]+-[a-z]+") ~ str_match(target, 
    "jeb-[a-z]{4}-([a-z]+)")` must be length 10 or one, not 20.
i Input `new_var` is `case_when(...)`.

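Question:

Ultimately, I would like to know whether there is a concise way to extract specific string patterns inside a case_when statement when I can use neither lookarounds (because I cannot define a bounded maximum length) nor capture groups (because str_match returns a result of length 20 instead of the original length 10, or 1).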
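The "length 20" in the error message comes from the shape of the str_match result. The following is a minimal sketch, assuming the stringr package is installed; the names m and first_group are made up for illustration:

```r
library(stringr)

# Same target values as the example dataset in the question.
target <- c(rep("jeb-bush", 2), rep("jeb-bush-supporters", 2),
            "jeb-staffer", rep("the-media", 5))

# str_match() returns a character matrix: column 1 holds the full match,
# the following columns hold the capture groups.
m <- str_match(target, "jeb-([a-z]+)")
dim(m)     # 10 rows, 2 columns
length(m)  # 20 elements in total, which case_when() rejects

# Selecting the capture-group column yields a plain vector of length 10.
first_group <- m[, 2]
length(first_group)
```

Indexing the matrix with [, 2] is therefore one way to make a capture group usable on the right-hand side of a case_when formula.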

【Discussion】:

Tags: r regex stringr

【Solution 1】:

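One option is to check, inside case_when, whether the target column has the substring 'jeb-' at the start (^) of the string, and if so extract the run of characters that are not a - ([^-]+) at the end ($) of the string; otherwise (TRUE), return 'other'.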

library(dplyr)
library(stringr)
df %>% 
    mutate(new_var = case_when(
        str_detect(target, '^jeb-') ~ str_extract(target, '[^-]+$'),
        TRUE ~ 'other'))

-output

# A tibble: 10 x 3
#   target              tweet_id new_var   
#   <chr>                  <dbl> <chr>     
# 1 jeb-bush                   1 bush      
# 2 jeb-bush                   2 bush      
# 3 jeb-bush-supporters        3 supporters
# 4 jeb-bush-supporters        4 supporters
# 5 jeb-staffer                5 staffer   
# 6 the-media                  6 other     
# 7 the-media                  7 other     
# 8 the-media                  8 other     
# 9 the-media                  9 other     
#10 the-media                 10 other
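To see why this works, the condition and the extracted value can be looked at separately. This is only an illustrative sketch on a shortened vector, assuming dplyr and stringr are installed; starts_with_jeb and last_piece are hypothetical names:

```r
library(dplyr)
library(stringr)

target <- c("jeb-bush", "jeb-bush-supporters", "jeb-staffer", "the-media")

# Condition: does the string start with "jeb-"?
starts_with_jeb <- str_detect(target, "^jeb-")

# Value: the run of non-hyphen characters at the end of the string.
# This is computed for every element, including "the-media" ("media").
last_piece <- str_extract(target, "[^-]+$")

# case_when() keeps last_piece only where the condition is TRUE
# and falls back to "other" everywhere else.
new_var <- case_when(starts_with_jeb ~ last_piece, TRUE ~ "other")
new_var
```

Because [^-]+$ is anchored at the end of the string, it does not matter how many hyphens come before the last piece, so no lookbehind with a fixed length is needed.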

We can also simplify this by using str_match and coalesce:

df %>% 
   mutate(new_var = coalesce(str_match(target, '^jeb-.*?([^-]+)$')[,2], 'other'))
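The str_match and coalesce variant can be unpacked step by step. A small sketch under the same assumptions (dplyr and stringr installed, a shortened target vector, hypothetical names m and captured):

```r
library(dplyr)
library(stringr)

target <- c("jeb-bush", "jeb-bush-supporters", "jeb-staffer", "the-media")

# The pattern only matches strings starting with "jeb-";
# the capture group is the last hyphen-free piece.
m <- str_match(target, "^jeb-.*?([^-]+)$")

# Column 2 holds the capture group and is NA where nothing matched.
captured <- m[, 2]

# coalesce() replaces each NA with the fallback value.
new_var <- coalesce(captured, "other")
new_var
```

Here the NA that str_match produces for non-matching rows plays the role of the TRUE ~ 'other' branch, which is why case_when can be dropped.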

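【Comments】:

Thanks! One question: what does the ? in the regular expression inside coalesce() do?
@N1loon It is related to laziness (non-greedy matching); see the link in the original comment.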
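The effect of the ? can be checked by comparing the lazy pattern with a greedy one. A minimal sketch, assuming stringr is installed; the greedy pattern is added here only for comparison and is not part of the answer:

```r
library(stringr)

x <- "jeb-bush-supporters"

# Lazy .*? consumes as little as possible, leaving the whole last piece
# for the capture group.
lazy <- str_match(x, "^jeb-.*?([^-]+)$")[, 2]

# Greedy .* consumes as much as possible and gives back only the single
# character that [^-]+ needs in order to match.
greedy <- str_match(x, "^jeb-.*([^-]+)$")[, 2]

lazy    # "supporters"
greedy  # "s"
```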