Python Regex Fails on 2 Edge Cases

【Question title】: Python Regex Fails on 2 Edge Cases
【Posted】: 2021-01-21 21:51:51
【Question】:

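I'm trying to write a regex that splits a string into what I call "terms" (e.g., words, numbers, and the surrounding whitespace) and "logical operators" (e.g., AND, OR, NOT, (, )). For this question, we can ignore the alternative symbols for AND, OR, and NOT, and grouping uses only '(' and ')'.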

For example:

Frank and Bob are nice AND NOT (Henry is good OR Sam is 102 years old)

should be split into this Python list:

["Frank and Bob are nice", "AND", "NOT", "(", "Henry is good", "OR", "Sam is 102 years old", ")"]

My code:

import re

# text holds the input string to split
pattern = r"(NOT|\-|\~)?\s*(\(|\[|\{)?\s*(NOT|\-|\~)?\s*([\w+\s*]*)\s+(AND|&|OR|\|)?\s+(NOT|\-|\~)?\s*([\w+\s*]*)\s*(\)|\]|\})?"
t = re.split(pattern, text)
raw_terms = list(filter(None, t))  # drop None and empty strings

The pattern works for the test case above, as well as for others such as:

NOT Frank is a good boy AND Joe
raw_terms=['NOT', 'Frank is a good boy', 'AND', 'Joe']

but not for these:

NOT Frank
raw_terms = ['NOT Frank']
NOT Frank is a good boy
raw_terms=['NOT Frank is a good boy']

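I tried changing the two \s+ to \s*, but then not all of the test cases passed. I'm not a regex expert (this is the most complex one I've attempted).

I'm hoping someone can help me understand why these two test cases fail, and how to fix the regex so that all of the test cases pass.

Thanks,

Mark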

【Question comments】:

Tags: python-3.x regex python-re

【Solution 1】:

Use

re.split(r'\s*(\b(?:AND|OR|NOT)\b|[()])\s*', string)

See the regex proof.

Explanation

--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      AND                      'AND'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      OR                       'OR'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      NOT                      'NOT'
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    [()]                     any character of: '(', ')'
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))

Python code:

import re
string = 'Frank and Bob are nice AND NOT (Henry is good OR Sam is 102 years old)'
output = re.split(r'\s*(\b(?:AND|OR|NOT)\b|[()])\s*', string)
output = list(filter(None, output))
print(output)

Result: ['Frank and Bob are nice', 'AND', 'NOT', '(', 'Henry is good', 'OR', 'Sam is 102 years old', ')']
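
The question also asked why the two short inputs fail with the original pattern. A minimal sketch of one way to check this: the original pattern contains `\s+(AND|&|OR|\|)?\s+`, so it can only match where an operator sits between whitespace or where two whitespace characters appear in a row. Neither failing input has that, so `re.split` finds nothing to split on. The names `ORIGINAL`, `ANSWER`, and `split_terms` are illustrative and not part of the original post.

```python
import re

# Pattern from the question (copied verbatim) and pattern from this answer.
ORIGINAL = r"(NOT|\-|\~)?\s*(\(|\[|\{)?\s*(NOT|\-|\~)?\s*([\w+\s*]*)\s+(AND|&|OR|\|)?\s+(NOT|\-|\~)?\s*([\w+\s*]*)\s*(\)|\]|\})?"
ANSWER = r"\s*(\b(?:AND|OR|NOT)\b|[()])\s*"


def split_terms(text):
    """Split text into terms and operators, dropping empty strings."""
    return [part for part in re.split(ANSWER, text) if part]


for text in ("NOT Frank", "NOT Frank is a good boy"):
    # No AND/OR between spaces and no double whitespace: no match at all.
    print(re.search(ORIGINAL, text))  # None
    # The answer's pattern treats each operator as its own delimiter.
    print(split_terms(text))
# ['NOT', 'Frank']
# ['NOT', 'Frank is a good boy']
```

Because the answer's pattern matches one operator at a time rather than a whole "term operator term" shape, it does not depend on how many terms the input contains.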

【Comments】:

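Nice solution! I don't fully understand how it works, though. If you group but don't capture AND/OR/NOT/(), how do they end up in the output list? Regex makes my brain hurt...
@user1045680 (\b(?:AND|OR|NOT)\b|[()]) is a capturing group, and re.split includes the text matched by capturing groups in the result.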
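
A small sketch of the behavior described in the reply above: `re.split` drops the delimiter when the pattern has no capturing group and keeps it when the delimiter is wrapped in one. The sample string and variable names are illustrative.

```python
import re

text = "Henry is good OR Sam is 102 years old"

# Non-capturing group only: the delimiter is consumed and dropped.
without_capture = re.split(r"\s*\b(?:AND|OR|NOT)\b\s*", text)
print(without_capture)
# ['Henry is good', 'Sam is 102 years old']

# Same pattern with an outer capturing group: the delimiter is kept.
# The \s* parts sit outside the group, so surrounding spaces are still dropped.
with_capture = re.split(r"\s*(\b(?:AND|OR|NOT)\b)\s*", text)
print(with_capture)
# ['Henry is good', 'OR', 'Sam is 102 years old']
```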
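I tried adding back the alternative logical operators (&, |, ~, -) and the alternative grouping characters ([]{}), but all of the tests that use the alternative characters failed. I tried \s*(\b(?:AND|OR|NOT|&|\||-|~)\b|[(){}[]])\s* What am I missing?
I experimented with your solution and was able to add the alternative characters back. The new regex is \s*(\b(?:AND|OR|NOT)\b|[()&\~\-\|{}\[\]])\s*
@user1045680 Use \s*(\b(?:AND|OR|NOT)\b|[][()&~|{}-])\s*, see proof.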
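
A sketch of why the attempt with the symbols inside the `\b...\b` group likely failed, and of the final pattern in use. `\b` only matches between a word character and a non-word character, so a symbol such as `&` surrounded by spaces has no boundary on either side. Moving the symbols into the character class avoids the problem. The test string and the `split_terms` helper are illustrative assumptions.

```python
import re

# "&" between spaces: both neighbours are non-word characters, so \b fails.
print(re.search(r"\b&\b", "Frank & Joe"))  # None
# "&" between letters: a boundary exists on both sides, so this matches.
print(re.search(r"\b&\b", "Frank&Joe") is not None)  # True

# Final pattern from the comment above: words keep \b, symbols go in the class.
EXTENDED = r"\s*(\b(?:AND|OR|NOT)\b|[][()&~|{}-])\s*"


def split_terms(text):
    """Split on word operators and symbol operators, dropping empty strings."""
    return [part for part in re.split(EXTENDED, text) if part]


print(split_terms("NOT [Frank & Joe] OR ~Sam"))
# ['NOT', '[', 'Frank', '&', 'Joe', ']', 'OR', '~', 'Sam']
```

One caveat with this pattern: because `-` is treated as an operator everywhere, a hyphenated term such as "well-known" would also be split at the hyphen.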