Split into columns treating successive delimiters as one: answers

[Question title]: Split into column treating successive delimiters as one
[Posted]: 2020-02-25 07:07:39
[Question description]:

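Hi, I would like to split one column of a data.frame into multiple columns, but treating successive delimiters as one. My input is scraped from text files, so it is a bit messy: different delimiters are used, and sometimes the same delimiter is repeated several times. In the example below I have used a space, a comma, "and", or a dash as delimiters, but in reality I have more than 6 different delimiters, including a couple of words ("and" and "incl").

I would normally use tidyr::separate, but it doesn't have an option to combine successive delimiters. Trying to write an exhaustive list of possible combinations for the pattern gets ridiculous very quickly, especially since I may sometimes have 4 or 5 spaces or commas in a row.

I have provided a reprex below, along with the desired output (produced by manually changing the text, which is not feasible for my real data of 1000 rows).

Data: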

library(tidyr)

testdf <- data.frame(test = c("This string has single spaces",
                              "This  one  has  double  spaces",
                              "This, has, comma,or space,   or ,both",
                              "This,one-, space,- comma -,and-dash"))

This is the code I have tried so far:

separate(testdf, test, into = letters[1:12], sep = " |,|-|and", fill = "right")

#> Warning: Expected 12 pieces. Additional pieces discarded in 2 rows [3, 4].
#>      a      b   c      d      e     f      g    h      i     j    k    l
#> 1 This string has single spaces  <NA>   <NA> <NA>   <NA>  <NA> <NA> <NA>
#> 2 This        one           has       double      spaces  <NA> <NA> <NA>
#> 3 This        has               comma     or             space          
#> 4 This        one               space                    comma

#sort of starting to work but gets very extensive very fast
separate(testdf, test, into = letters[1:12], sep = "  |, |, | |and|,", fill = "right")

#>      a      b    c      d      e    f     g     h    i     j    k    l
#> 1 This string  has single spaces <NA>  <NA>  <NA> <NA>  <NA> <NA> <NA>
#> 2 This    one  has double spaces <NA>  <NA>  <NA> <NA>  <NA> <NA> <NA>
#> 3 This    has       comma     or            space         or      both
#> 4 This        one-  space      -      comma     -      -dash <NA> <NA>

Based on Gregor's answer, which was posted before I specified that I need word delimiters as well:


separate(testdf, test, into = letters[1:12], sep = "[ ,-]+", fill = "right")
#>      a      b        c      d      e     f    g    h    i    j    k    l
#> 1 This string      has single spaces  <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 This    one      has double spaces  <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 This    has andcomma     or    and space   or both <NA> <NA> <NA> <NA>
#> 4 This    one    space    and  comma   and dash <NA> <NA> <NA> <NA> <NA>


### Desired output:
```r
#>      a      b     c      d      e    f    g
#> 1 This string   has single spaces <NA> <NA>
#> 2 This    one   has double spaces <NA> <NA>
#> 3 This    has comma     or  space   or both
#> 4 This    one space  comma    dash <NA> <NA>
```

Created on 2019-10-30 by the reprex package (v0.3.0)

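[Comments on the question]:

\\s+|,|- only handles multiple spaces, not combinations or multiples of the other delimiters.
I'm confused by your update - if and is a delimiter, why is it still in your desired result (row 4)?
Oh, that's a typo; I wasn't thorough enough when writing out my desired output.

Tags: r regex string split tidyr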

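[Solution 1]:

The regex pattern [ ,\\-] matches a space, a comma, or a dash. Adding the + quantifier makes it match one or more spaces, commas, or dashes in a row. This is the pattern you should use. (We escape the dash - because inside brackets it can be a special character; for example, "[a-z]" matches all lowercase letters. Make sure you escape any other special regex characters in your pattern.)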

tidyr::separate(testdf, test, into = letters[1:12], sep = "[ ,\\-]+", fill = "right")
#      a      b     c      d      e    f    g    h    i    j    k    l
# 1 This string   has single spaces <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 2 This    one   has double spaces <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 3 This    has comma     or  space   or both <NA> <NA> <NA> <NA> <NA>
# 4 This    one space  comma    and dash <NA> <NA> <NA> <NA> <NA> <NA>
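
To see what the + quantifier changes, here is a minimal sketch in base R. It uses strsplit instead of tidyr::separate purely to stay dependency-free, and perl = TRUE is my own choice so that the escaped dash inside the brackets is handled unambiguously; neither is part of the answer above.

```r
# Row 4 of the example data
x <- "This,one-, space,- comma -,and-dash"

# Without "+": every single delimiter character splits on its own,
# so runs of delimiters leave empty strings between them
single <- strsplit(x, "[ ,\\-]", perl = TRUE)[[1]]

# With "+": a whole run of delimiters counts as one split point
multi <- strsplit(x, "[ ,\\-]+", perl = TRUE)[[1]]

sum(single == "")  # number of empty pieces created without "+"
multi
# "This" "one" "space" "comma" "and" "dash"
```

The empty pieces in the first result are what show up as blank columns in the question's first attempts.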

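"I would normally use tidyr::separate, but it doesn't have an option to combine successive delimiters"

Actually, the default sep does combine successive delimiters. The default pattern is [^[:alnum:]]+, which is one or more non-alphanumeric characters. For this sample data, everything that isn't a letter is a delimiter, so the default works fine. (Of course, your real data may be more complex and may contain punctuation that you don't want to split on, in which case the approach at the top is the one you want.)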

tidyr::separate(testdf, test, into = letters[1:12], fill = "right")
# same output as above
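
As a rough illustration of that caveat, the sketch below applies the default pattern and the explicit one to a made-up string containing a decimal number. The input string is hypothetical, and strsplit with perl = TRUE stands in for tidyr::separate here, so treat it as an approximation of how separate would behave rather than a guarantee.

```r
default_sep <- "[^[:alnum:]]+"  # tidyr::separate's default, as quoted above
custom_sep  <- "[ ,\\-]+"       # the explicit pattern from this answer

y <- "dose 3.5 mg"              # hypothetical input with a decimal point

# The default treats "." as a delimiter and breaks the number apart
with_default <- strsplit(y, default_sep, perl = TRUE)[[1]]
with_default
# "dose" "3" "5" "mg"

# The explicit pattern leaves "3.5" intact
with_custom <- strsplit(y, custom_sep, perl = TRUE)[[1]]
with_custom
# "dose" "3.5" "mg"
```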

If you want to get fancy, use stringr::str_count to count the maximum number of delimiters and define into accordingly:

my_pattern = "[ ,\\-]+"
max_delim = max(stringr::str_count(testdf$test, pattern = my_pattern))
tidyr::separate(testdf, test, into = letters[1:(max_delim + 1)],
  fill = "right", sep = my_pattern)
#      a      b     c      d      e    f    g
# 1 This string   has single spaces <NA> <NA>
# 2 This    one   has double spaces <NA> <NA>
# 3 This    has comma     or  space   or both
# 4 This    one space  comma    and dash <NA>
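
One thing to watch with letters: it only has 26 entries, so letters[27] is NA. If your real data could produce more pieces than that, a possible variation is to generate the names yourself. The sketch below counts delimiter runs in base R (gregexpr plus regmatches in place of stringr::str_count) and builds names with a "col" prefix; the prefix and the variable names are placeholders of mine, not anything from the question.

```r
testdf <- data.frame(test = c("This string has single spaces",
                              "This  one  has  double  spaces",
                              "This, has, comma,or space,   or ,both",
                              "This,one-, space,- comma -,and-dash"),
                     stringsAsFactors = FALSE)

my_pattern <- "[ ,\\-]+"

# Number of delimiter runs per row (0 when there is no match)
n_delim <- lengths(regmatches(testdf$test,
                              gregexpr(my_pattern, testdf$test, perl = TRUE)))

# One more piece than the largest number of delimiter runs
col_names <- paste0("col", seq_len(max(n_delim) + 1))
col_names
# "col1" "col2" "col3" "col4" "col5" "col6" "col7"
```

col_names can then be passed as the into argument in place of letters[1:(max_delim + 1)].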

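[Discussion]:

Thanks, this looks handy, but I should have mentioned that one of my delimiters is the word 'and' and another is 'incl'. Could a word be added as well? I will update my question to reflect this.
In that case, you can use a pattern like this: "( |,|-|and|incl)+"
Sweet, "( |,|-|and|incl)+" works. Do you know why this needs round brackets plus a plus sign, while the original answer used []? [] doesn't work with the words. If you amend your answer to include "( |,|-|and|incl)+", I will accept it.
Square brackets are for matching single characters. [xyz] is shorthand for (x|y|z), but it only allows single characters. Once you added the requirement of multi-character matches, we had to switch to the longer syntax.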
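
The difference between brackets and a group is easy to check in base R. The sketch below again uses strsplit with perl = TRUE as a stand-in for tidyr::separate. The last part, adding \\b word boundaries so that "and" is only treated as a delimiter when it is a whole word, is my own suggestion and was not discussed in the thread; whether you need it depends on your data.

```r
x <- "This,one-, space,- comma -,and-dash"

# Group with alternation: "and" and "incl" are matched as whole alternatives
word_pattern <- "( |,|-|and|incl)+"
grouped <- strsplit(x, word_pattern, perl = TRUE)[[1]]
grouped
# "This" "one" "space" "comma" "dash"

# Brackets: "[ and]" means "a space, or a, or n, or d", one character at a time,
# so letters inside ordinary words are eaten as well
bracketed <- strsplit("comma and dash", "[ and]+", perl = TRUE)[[1]]
bracketed
# "comm" "sh"

# Caveat: the plain alternation also matches "and" inside longer words
inside_words <- strsplit("sand and band", word_pattern, perl = TRUE)[[1]]
inside_words
# "s" "b"

# Possible refinement (an assumption, not from the thread): word boundaries
bounded_pattern <- "( |,|-|\\band\\b|\\bincl\\b)+"
bounded <- strsplit("sand and band", bounded_pattern, perl = TRUE)[[1]]
bounded
# "sand" "band"
```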