Split strings into separate rows excluding some pattern matches

【Question title】: Split strings into separate rows excluding some pattern matches
【Posted】: 2021-12-22 09:17:03
【Question description】:

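I have a data.frame and would like to split the IV column into separate rows, one for each piece of text delimited by a comma ",", while leaving intact any pieces that contain commas inside parentheses, e.g. ", text (string, string, string),".

Example of the current data: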

structure(list(Article.Title = "Random title", 
    Sample = "Sample information", 
    IV = "Union voice, HRM practices (participation, teams, incentives, development, recruitment), implict contracts, Crisis impact, dominant individual or family owner, no dominant individual or family owner, market growth, no market growth,", 
    Moderator = NA_character_, Mediator = NA_character_, DV = "Performance"), row.names = c(NA, 
-1L), class = c("tbl_df", "tbl", "data.frame"))

Expected result:

structure(list(Article.Title = c("Random title", "Random title", 
"Random title", "Random title", "Random title", "Random title", 
"Random title", "Random title"), Sample = c("Sample information", 
"Sample information", "Sample information", "Sample information", 
"Sample information", "Sample information", "Sample information", 
"Sample information"), IV = c("Union voice", "HRM practices (participation, teams, incentives, development, recruitment)", 
"implict contracts", "Crisis impact", "dominant individual or family owner", 
"no dominant individual or family owner", "market growth", "no market growth"
), Moderator = c("NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA"
), Mediator = c("NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA"
), DV = c("Performance", "Performance", "Performance", "Performance", 
"Performance", "Performance", "Performance", "Performance")), class = "data.frame", row.names = c(NA, 
-8L))

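【Question discussion】:

Tags: r

【Solution 1】:

We can do this in base R with strsplit: split the "IV" column at each ",", SKIPping the characters inside parentheses, then replicate the rows of the data by the lengths of the list that strsplit returns.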

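# Split IV at each comma, skipping commas that sit inside parentheses
lst1 <- strsplit(df1$IV, "\\([^)]+(*SKIP)(*FAIL)|,\\s*", perl = TRUE)
# Repeat each row once per piece, attach the pieces, restore the column order
df2 <- transform(df1[setdiff(names(df1), "IV")][rep(seq_len(nrow(df1)), 
        lengths(lst1)),], IV = unlist(lst1))[names(df1)]

-output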

> df2
  Article.Title             Sample                                                                         IV Moderator Mediator          DV
1  Random title Sample information                                                                Union voice      <NA>     <NA> Performance
2  Random title Sample information HRM practices (participation, teams, incentives, development, recruitment)      <NA>     <NA> Performance
3  Random title Sample information                                                          implict contracts      <NA>     <NA> Performance
4  Random title Sample information                                                              Crisis impact      <NA>     <NA> Performance
5  Random title Sample information                                        dominant individual or family owner      <NA>     <NA> Performance
6  Random title Sample information                                     no dominant individual or family owner      <NA>     <NA> Performance
7  Random title Sample information                                                              market growth      <NA>     <NA> Performance
8  Random title Sample information                                                           no market growth      <NA>     <NA> Performance
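To make the two steps easier to follow, here is a minimal sketch on toy data. The names `x`, `pat`, `toy`, `idx` and `long` are made up for illustration and are not part of the answer. The pattern has two alternatives: the first consumes an opening parenthesis and everything up to the closing one and is then thrown away by `(*SKIP)(*FAIL)`, so only the commas matched by the second alternative act as delimiters.

```r
# Hypothetical toy input: one parenthetical with a comma between plain items.
x <- "a, b (c, d), e"

# Alternative 1: \([^)]+ consumes "(" and everything before the next ")";
#   (*SKIP)(*FAIL) rejects that match and resumes searching after it.
# Alternative 2: ,\s* is the real delimiter (comma plus optional whitespace).
pat <- "\\([^)]+(*SKIP)(*FAIL)|,\\s*"

pieces <- strsplit(x, pat, perl = TRUE)[[1]]
pieces
# [1] "a"        "b (c, d)" "e"

# Row replication on a hypothetical two-row data frame.
toy <- data.frame(id = 1:2, IV = c(x, "f, g"), stringsAsFactors = FALSE)
lst <- strsplit(toy$IV, pat, perl = TRUE)

# Each row index is repeated once per piece found in that row.
idx <- rep(seq_len(nrow(toy)), lengths(lst))
long <- data.frame(id = toy$id[idx], IV = unlist(lst),
                   stringsAsFactors = FALSE)
long
```

Because `[^)]+` stops at the first closing parenthesis, this sketch assumes the parentheses are not nested.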

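Or use the same regex in separate_rows (as suggested in the comments). Note that the trailing comma at the end of IV produces an empty ninth row here.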

library(tidyr)
separate_rows(df1, IV, sep = "\\([^)]+(*SKIP)(*FAIL)|,\\s*")

-output

# A tibble: 9 × 6
  Article.Title Sample             IV                                                                           Moderator Mediator DV         
  <chr>         <chr>              <chr>                                                                        <chr>     <chr>    <chr>      
1 Random title  Sample information "Union voice"                                                                <NA>      <NA>     Performance
2 Random title  Sample information "HRM practices (participation, teams, incentives, development, recruitment)" <NA>      <NA>     Performance
3 Random title  Sample information "implict contracts"                                                          <NA>      <NA>     Performance
4 Random title  Sample information "Crisis impact"                                                              <NA>      <NA>     Performance
5 Random title  Sample information "dominant individual or family owner"                                        <NA>      <NA>     Performance
6 Random title  Sample information "no dominant individual or family owner"                                     <NA>      <NA>     Performance
7 Random title  Sample information "market growth"                                                              <NA>      <NA>     Performance
8 Random title  Sample information "no market growth"                                                           <NA>      <NA>     Performance
9 Random title  Sample information ""                                                                           <NA>      <NA>     Performance
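If the empty last row is unwanted, one option is to drop the rows whose piece is empty after splitting. The following is only a sketch on made-up data (`toy`, `sep` and `long` are illustrative names), assuming tidyr is installed:

```r
library(tidyr)

# Hypothetical toy data; note the trailing comma, as in the question's IV.
toy <- data.frame(id = 1L, IV = "a, b (c, d), e,", stringsAsFactors = FALSE)
sep <- "\\([^)]+(*SKIP)(*FAIL)|,\\s*"

long <- separate_rows(toy, IV, sep = sep)

# Keep only rows whose IV piece is non-empty.
long <- long[nzchar(long$IV), , drop = FALSE]
long$IV
```

Removing the trailing comma first, for example with `sub(",\\s*$", "", toy$IV)`, should have the same effect.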

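【Discussion】:

tidyverse solution: tidyr::separate_rows(df1, IV, sep = "\\([^)]+(*SKIP)(*FAIL)|,\\s*")
Thanks @akrun & Phil for the answers, these were very helpful! I mostly work with stringr + regex and RStudio's cheat sheets, and had never heard of SKIP and FAIL before today. Do you have any recommended resources on these verbs?
Regarding my question, this is a good starting point: stackoverflow.com/questions/24534782/…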
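For readers who mostly work with stringr: `(*SKIP)` and `(*FAIL)` are PCRE backtracking verbs, and as far as I know the ICU engine behind stringr does not support them. A different technique that avoids the verbs is a negative lookahead. The sketch below is not from the answers above; `x`, `pat_lookahead` and `pieces` are illustrative names, and it assumes the parentheses are balanced and not nested.

```r
# Hypothetical toy input.
x <- "a, b (c, d), e"

# Split at a comma (plus optional whitespace) only when it is NOT followed
# by a ")" that is reached without passing another "(" or ")" first,
# i.e. only when the comma is outside a parenthetical.
pat_lookahead <- ",\\s*(?![^()]*\\))"

pieces <- strsplit(x, pat_lookahead, perl = TRUE)[[1]]
pieces
# [1] "a"        "b (c, d)" "e"
```

The lookahead is evaluated at every comma, whereas the SKIP/FAIL version jumps over each parenthetical in one step, so the two can behave differently on unbalanced input.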