Use the str_detect function to conditionally create a new column in an R data frame? Answers

【Question title】: Use str_detect function to conditionally create a new column in R dataframe?
【Posted】: 2020-04-29 15:48:44
【Question description】:

I have a data frame in which column A contains these values:

**Channel**
Direct
Paid social
Organic social

What I want to do: create a new column named groupedChannel, where str_detect searches the strings in column A to decide which value goes into groupedChannel.

Condition:
IF row in Column A matches regex "direct" THEN Column B value = "Direct" ELSE
IF row in Column A matches regex "social" THEN Column B value = "Social"

AFAIK, str_detect only returns TRUE/FALSE. How can I use that TRUE/FALSE to assign values in column B?

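【Question discussion】:

Tags: r

【Solution 1】:

A solution using base R regex functions that also handles the case where neither "direct" nor "social" is found in the Channel column: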

# Dummy data
data <- data.frame(Channel = c("Direct Paid", "Social", "Organic", "Social Organic"),
                   stringsAsFactors = FALSE)

# Use sapply to iterate through each value in the 'Channel' column of the data frame above
data$groupChannel <- sapply(data$Channel, FUN = function(x){
  # Use base R regex functions for the conditions, and return the value for the new column
  if (grepl("direct", tolower(x))){
    return("Direct")
  }else if (grepl("social", tolower(x))){
    return("Social")
  }else{
    return("Direct or Social Not Found")
  }
})

head(data)
         Channel               groupChannel
1    Direct Paid                     Direct
2         Social                     Social
3        Organic Direct or Social Not Found
4 Social Organic                     Social
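
Since grepl is already vectorized, the same logic can probably be written without the sapply loop. The following is only a sketch under that assumption, reusing the dummy data from this answer; ignore.case = TRUE stands in for the tolower() call.

```r
# Same dummy data as in the answer above
data <- data.frame(Channel = c("Direct Paid", "Social", "Organic", "Social Organic"),
                   stringsAsFactors = FALSE)

# grepl() works on the whole column at once, so nested ifelse() calls
# can express the if / else if / else chain without an explicit loop.
data$groupChannel <- ifelse(
  grepl("direct", data$Channel, ignore.case = TRUE), "Direct",
  ifelse(grepl("social", data$Channel, ignore.case = TRUE), "Social",
         "Direct or Social Not Found")
)

print(data)
```

Unlike the sapply version, this returns a plain unnamed character vector, which tends to be easier to work with later.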

【Discussion】:

Hi Jamie. Thanks, that works. Is there a dplyr equivalent of base R's grepl function?
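
One way to approach the question in this comment: stringr::str_detect is the usual tidyverse counterpart of grepl, and dplyr::case_when can turn its TRUE/FALSE results into labels. The sketch below assumes the channel values from the question plus a made-up "Email" row (not in the original data) to show what happens when nothing matches.

```r
library(dplyr)
library(stringr)

# "Email" is a hypothetical extra value, added only to show the fallback branch
data <- tibble(Channel = c("Direct", "Paid social", "Organic social", "Email"))

result <- data %>%
  mutate(
    groupedChannel = case_when(
      # Conditions are checked in order; the first TRUE one wins
      str_detect(Channel, regex("direct", ignore_case = TRUE)) ~ "Direct",
      str_detect(Channel, regex("social", ignore_case = TRUE)) ~ "Social",
      # Fallback when neither pattern matches
      TRUE ~ NA_character_
    )
  )

print(result)
```

The order of the case_when branches mirrors the IF ... ELSE IF order in the question, so a value containing both words would be labelled "Direct".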

【Solution 2】:

I have a data.table solution based on conditional replacement. It uses grepl, but you can use stringr::str_detect if you prefer:

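library(data.table)

# Example data taken from the question; colA holds the channel names
df <- data.frame(colA = c("Direct", "Paid social", "Organic social"))
setDT(df)

# Default every row to "Social"
df[, groupedChannel := "Social"]

# Conditional replacement; ignore.case = TRUE so that "Direct" matches "direct"
df[grepl("direct", colA, ignore.case = TRUE), groupedChannel := "Direct"]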

(Solution untested)
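
For reference, a possible stringr::str_detect variant of the same idea is sketched below. It is an illustration, not the answerer's code: the table name dt and the extra "Email" row are assumptions, and rows that match neither pattern are left as NA instead of defaulting to "Social".

```r
library(data.table)
library(stringr)

# Hypothetical table; "Email" is added only to show the non-matching case
dt <- data.table(colA = c("Direct", "Paid social", "Organic social", "Email"))

# Start with NA so that unmatched rows are easy to spot
dt[, groupedChannel := NA_character_]

# Assign "Social" first, then "Direct", so that "direct" takes precedence
# when a value happens to contain both words
dt[str_detect(colA, regex("social", ignore_case = TRUE)), groupedChannel := "Social"]
dt[str_detect(colA, regex("direct", ignore_case = TRUE)), groupedChannel := "Direct"]

print(dt)
```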

【Discussion】:

【Solution 3】:

What you want is to match your regex, not just detect it.

library(dplyr)
library(stringr)

tibble(
  colA = c("**Channel**", "Direct", "Paid social", "Organic social")
) %>% 
  mutate(
    colB = str_match(colA, "[Ss]ocial|[Dd]irect")[,1],
    colB = str_to_lower(colB)
  )
#> # A tibble: 4 x 2
#>   colA           colB  
#>   <chr>          <chr> 
#> 1 **Channel**    <NA>  
#> 2 Direct         direct
#> 3 Paid social    social
#> 4 Organic social social

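Created on 2020-04-29 by the reprex package (v0.3.0)

stringr::str_match returns a matrix whose first column is the full match and whose remaining columns are the capture groups, so we need to add [,1] at the end of the call. Because the pattern matches both capitalized and lowercase spellings, we then convert all matches to lowercase.

Alternatively, you can use str_extract like this: colB = str_extract(colA, "[Ss]ocial|[Dd]irect"), without the [,1].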
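
The str_extract variant could look roughly like the sketch below. Two things here are my own substitutions and not part of the answer: regex(ignore_case = TRUE) replaces the [Ss]/[Dd] character classes, and str_to_title replaces str_to_lower so that the labels come out as "Direct" and "Social", as asked in the question.

```r
library(dplyr)
library(stringr)

result <- tibble(colA = c("Direct", "Paid social", "Organic social")) %>%
  mutate(
    # str_extract() returns the first match as a character vector (NA if none),
    # so no [, 1] indexing is needed
    colB = str_extract(colA, regex("social|direct", ignore_case = TRUE)),
    # Normalize the case of whatever was matched
    colB = str_to_title(colB)
  )

print(result)
```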

【Discussion】:

【Solution 4】:

Here is a base R solution, assuming you have a well-defined set of Channel_group values.

Data:

data <- data.frame(Channel = c("Direct", "Paid social", "Organic social"),
                   stringsAsFactors = F)

You can define the patterns for your Channel_group values in a vector a:

a <- c("(S|s)ocial", "(D|d)irect")

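Now use sub to replace the Channel values with the Channel_group values; \\U (which requires perl = TRUE) ensures the values are returned as uppercase strings (use \\L if you prefer lowercase):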

data$Channel_group <- sub(paste0(".*\\b(", paste(a, collapse = "|"),")\\b.*"), "\\U\\1", data$Channel, perl = T)

Result:

data
         Channel Channel_group
1         Direct        DIRECT
2    Paid social        SOCIAL
3 Organic social        SOCIAL
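
One caveat worth noting: when the pattern does not match, sub returns the input string unchanged, so a channel outside the defined groups would be copied as-is into Channel_group. A minimal sketch of one way to guard against that, assuming a made-up "Email" value that is not in the original data:

```r
# "Email" is a hypothetical value that belongs to neither group
data <- data.frame(Channel = c("Direct", "Paid social", "Email"),
                   stringsAsFactors = FALSE)

a <- c("(S|s)ocial", "(D|d)irect")
pattern <- paste0(".*\\b(", paste(a, collapse = "|"), ")\\b.*")

# Check which rows match before substituting
matched <- grepl(pattern, data$Channel, perl = TRUE)

# Substitute only where the pattern matched; everything else becomes NA
data$Channel_group <- ifelse(matched,
                             sub(pattern, "\\U\\1", data$Channel, perl = TRUE),
                             NA_character_)

print(data)
```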

【Discussion】: