Passing a single column in dplyr mutate

【Question title】: Passing a single column in dplyr mutate
【Posted】: 2018-01-21 22:56:13
【Question】:

I am trying to use stringr and dplyr to extract the characters surrounding vowels. When I run the code below, the str_match call triggers an error:

Error in mutate_impl(.data, dots) : 
  Column `near_vowel` must be length 150 (the number of rows) or one, not 450

Minimal example code:

library(tidyverse)
library(magrittr)
library(stringr)
iris %>%
  select(Species) %>%
  mutate(name_length = str_length(Species),
         near_vowel = str_match(Species, "(.)[aeiou](.)"))

For "virginica", for example, I would expect it to extract "vir", "gin", "nic".

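【Comments】:

It is not that simple, because the patterns you want to extract overlap: gin and nic, for example, share the letter n, and a regex does not return overlapping matches. Also, what do you expect for "abaaac"? ab, baa, aaa, aac?
You should use str_extract_all instead of str_match. If you want "consonant - vowel - consonant", your regex should probably look something like [^aeiou][aeiou][^aeiou]. But as others have pointed out, overlap is a problem: "setosa", for example, contains both "set" and "tos".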
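
The overlap issue raised in these comments can be worked around without relying on the regex engine to return overlapping matches. The following is only a sketch, not something proposed in the thread: it slides a three-character window over the word and keeps the windows that look like consonant-vowel-consonant. The helper name overlapping_cvc is hypothetical, and it assumes lowercase input, since [^aeiou] treats anything that is not a lowercase vowel as a "consonant".

```r
# Hypothetical helper (not from the thread): collect every
# consonant-vowel-consonant window, including overlapping ones.
overlapping_cvc <- function(word) {
  word <- as.character(word)
  n <- nchar(word)
  # Missing or too-short input cannot contain a 3-character window
  if (is.na(n) || n < 3) return(character(0))
  starts <- seq_len(n - 2)
  # All 3-character windows: characters 1-3, 2-4, 3-5, ...
  windows <- substring(word, starts, starts + 2)
  # Keep only windows shaped consonant-vowel-consonant
  windows[grepl("^[^aeiou][aeiou][^aeiou]$", windows)]
}

overlapping_cvc("virginica")  # "vir" "gin" "nic"
overlapping_cvc("setosa")     # "set" "tos"
```

Because the helper works on one word at a time, it would have to be applied per row (for example with lapply) to build a list column inside mutate.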

Tags: r dplyr tidyverse stringr

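【Solution 1】:

There are a few things you need to address but, given what you have provided in the question, I will propose a tidy approach.

The main problem is that you are returning multiple values per row for near_vowel; we can solve this by nesting the results. Second, you need rowwise processing for your mutate to be sensible... Third (as @Psidom noted), your regex will not produce the output you want. Addressing the first two, which are the core of your question...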

library(dplyr)
library(stringr)

df <- iris %>%
  select(Species) %>%
  mutate(
    name_length = str_length(Species),
    near_vowel = str_extract_all(Species, "[^aeiou][aeiou][^aeiou]")
  )

head(df)

#   Species name_length near_vowel
# 1  setosa           6        set
# 2  setosa           6        set
# 3  setosa           6        set
# 4  setosa           6        set
# 5  setosa           6        set
# 6  setosa           6        set

head(df[df$Species == "virginica", ]$near_vowel)

# [[1]]
# [1] "vir" "gin"
# 
# [[2]]
# [1] "vir" "gin"
# 
# [[3]]
# [1] "vir" "gin"
# 
# [[4]]
# [1] "vir" "gin"
# 
# [[5]]
# [1] "vir" "gin"
# 
# [[6]]
# [1] "vir" "gin"
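
To see where the "450" in the original error comes from, it may help to compare the shapes of the two return values directly. This is a small sketch using the built-in iris data; the variable names species, m and l are placeholders introduced here.

```r
library(stringr)

species <- as.character(iris$Species)

# str_match() returns a character matrix: one row per input string,
# one column for the full match plus one per capture group.
m <- str_match(species, "(.)[aeiou](.)")
dim(m)      # 150 rows, 3 columns, i.e. 450 values

# str_extract_all() returns a list with one element per input string,
# so it has exactly one entry per row and can be stored as a list column.
l <- str_extract_all(species, "[^aeiou][aeiou][^aeiou]")
length(l)   # 150
```

The matrix holds 150 x 3 = 450 values, which is the length mutate complained about, whereas the list has the 150 elements mutate expects.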

Edit: updated to use the str_extract_all approach suggested by @neilfws, which has the added benefit of removing the need for the rowwise operation.
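
If a list column is awkward for later steps, it can be flattened. The sketch below is an addition, not part of the original answer. It shows two possible options: tidyr::unnest to get one row per match (the cols argument assumes tidyr 1.0 or later), and paste to collapse each row's matches into a single string. The names long and flat are placeholders, and distinct() is used only to keep the example small.

```r
library(dplyr)
library(stringr)
library(tidyr)

df <- iris %>%
  select(Species) %>%
  distinct() %>%                      # one row per species, for brevity
  mutate(
    near_vowel = str_extract_all(as.character(Species),
                                 "[^aeiou][aeiou][^aeiou]")
  )

# Option 1: one row per extracted match
long <- df %>% unnest(cols = near_vowel)

# Option 2: keep one row per species and collapse the matches
# into a single comma-separated string
flat <- df %>%
  mutate(near_vowel = vapply(near_vowel, paste, character(1), collapse = ", "))
```

Which form is preferable depends on what comes next: the long form suits counting or joining on individual matches, while the collapsed form is mainly useful for display.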

【Comments】: