【Question Title】: Python Regex Fails on 2 Edge Cases
【Posted】: 2021-01-21 21:51:51
【Question】:

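I'm trying to write a regex that splits a string into what I call "terms" (e.g. words, numbers, and the surrounding whitespace) and "logical operators" (e.g. AND, OR, NOT). For this question, we can ignore the alternative symbols for AND, OR, and NOT, and grouping uses only '(' and ')'.

For example:

Frank and Bob are nice AND NOT (Henry is good OR Sam is 102 years old)

should be split into this Python list:

["Frank and Bob are nice", "AND", "NOT", "(", "Henry is good", "OR", "Sam is 102 years old", ")"]

My code: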

pattern = r"(NOT|\-|\~)?\s*(\(|\[|\{)?\s*(NOT|\-|\~)?\s*([\w+\s*]*)\s+(AND|&|OR|\|)?\s+(NOT|\-|\~)?\s*([\w+\s*]*)\s*(\)|\]|\})?"  
t = re.split(pattern, text)
raw_terms = list(filter(None, t))

The pattern works for the test case above, as well as for others such as:

NOT Frank is a good boy AND Joe
raw_terms=['NOT', 'Frank is a good boy', 'AND', 'Joe']

but not for these:

NOT Frank
raw_terms = ['NOT Frank']
NOT Frank is a good boy
raw_terms=['NOT Frank is a good boy']

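I tried changing the two \s+ to \s*, but then not all of the test cases passed. I'm not a regex expert (this is the most complex one I've attempted).

I'm hoping someone can help me understand why these two test cases fail, and how to fix the regex so that all test cases pass.

Thanks,

Mark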

【Question Discussion】:

    Tags: python-3.x regex python-re


    【Solution 1】:

    Use

    re.split(r'\s*(\b(?:AND|OR|NOT)\b|[()])\s*', string)
    

    regex proof

    Explanation

    --------------------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                               more times (matching the most amount
                               possible))
    --------------------------------------------------------------------------------
      (                        group and capture to \1:
    --------------------------------------------------------------------------------
        \b                       the boundary between a word char (\w)
                                 and something that is not a word char
    --------------------------------------------------------------------------------
        (?:                      group, but do not capture:
    --------------------------------------------------------------------------------
          AND                      'AND'
    --------------------------------------------------------------------------------
         |                        OR
    --------------------------------------------------------------------------------
          OR                       'OR'
    --------------------------------------------------------------------------------
         |                        OR
    --------------------------------------------------------------------------------
          NOT                      'NOT'
    --------------------------------------------------------------------------------
        )                        end of grouping
    --------------------------------------------------------------------------------
        \b                       the boundary between a word char (\w)
                                 and something that is not a word char
    --------------------------------------------------------------------------------
       |                        OR
    --------------------------------------------------------------------------------
        [()]                     any character of: '(', ')'
    --------------------------------------------------------------------------------
      )                        end of \1
    --------------------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                               more times (matching the most amount
                               possible))
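
The breakdown above marks the outer parentheses as a capturing group, and that is what makes the operators show up in the output: `re.split` returns the text matched by any capturing group alongside the pieces between matches. A minimal sketch of the difference, using a made-up sample string and the variable names `without_capture` and `with_capture` purely for illustration:

```python
import re

text = "Henry is good OR Sam is 102 years old"

# Only a non-capturing group: the operator is consumed as a delimiter and dropped.
without_capture = re.split(r"\s*\b(?:AND|OR|NOT)\b\s*", text)

# Outer capturing group: re.split also returns whatever the group matched.
with_capture = re.split(r"\s*(\b(?:AND|OR|NOT)\b)\s*", text)

print(without_capture)  # ['Henry is good', 'Sam is 102 years old']
print(with_capture)     # ['Henry is good', 'OR', 'Sam is 102 years old']
```

The `\s*` on either side sits outside the capturing group, so the surrounding whitespace is consumed but never returned.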
    

    Python code:

    import re
    string = 'Frank and Bob are nice AND NOT (Henry is good OR Sam is 102 years old)'
    output = re.split(r'\s*(\b(?:AND|OR|NOT)\b|[()])\s*', string)
    output = list(filter(None, output))
    print(output)
    

    Result: ['Frank and Bob are nice', 'AND', 'NOT', '(', 'Henry is good', 'OR', 'Sam is 102 years old', ')']
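
As for why the two edge cases from the question fail: as far as I can tell, the only mandatory parts of the original pattern are the two `\s+` around the optional `(AND|&|OR|\|)?`, so an input with no AND/OR/&/| would need two consecutive whitespace characters to match at all. With no match, `re.split` returns the whole string unchanged. The sketch below checks that, then runs this answer's pattern on the same inputs. `split_terms` is a hypothetical helper name, not something from the question.

```python
import re

# The asker's original pattern, copied from the question (split over two lines).
ORIGINAL = re.compile(
    r"(NOT|\-|\~)?\s*(\(|\[|\{)?\s*(NOT|\-|\~)?\s*([\w+\s*]*)\s+"
    r"(AND|&|OR|\|)?\s+(NOT|\-|\~)?\s*([\w+\s*]*)\s*(\)|\]|\})?"
)
# The pattern from this answer.
ANSWER = re.compile(r"\s*(\b(?:AND|OR|NOT)\b|[()])\s*")


def split_terms(text):
    """Hypothetical helper: split on operators/parentheses, drop empty strings."""
    return [part for part in ANSWER.split(text) if part]


failing = ["NOT Frank", "NOT Frank is a good boy"]

# Does the original pattern match anywhere in each failing input?
original_matches = [ORIGINAL.search(t) is not None for t in failing]
results = [split_terms(t) for t in failing]

print(original_matches)  # [False, False]
print(results)           # [['NOT', 'Frank'], ['NOT', 'Frank is a good boy']]
```

Because the answer's pattern describes only the delimiters rather than the whole expression, a leading NOT with a single term after it needs no special handling.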

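    【Discussion】:

    • Great solution! I don't fully understand how it works, though. If you group but don't capture AND/OR/NOT/(), how do they end up in the output list? Regex makes my brain hurt......
    • @user1045680 (\b(?:AND|OR|NOT)\b|[()]) is a capturing group, and re.split adds everything it matches to the result.
    • I tried adding back the alternative logical operators (&, |, ~, -) and the alternative grouping characters ([]{}), but all the tests that use the alternative characters failed. I tried \s*(\b(?:AND|OR|NOT|&|\||-|~)\b|[(){}[]])\s* What am I missing?
    • I played with your solution and was able to add the alternative characters back. The new regex is \s*(\b(?:AND|OR|NOT)\b|[()&\~\-\|{}\[\]])\s*
    • @user1045680 Use \s*(\b(?:AND|OR|NOT)\b|[][()&~|{}-])\s*, see proof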
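
The comments above settle on keeping the symbols out of the `\b...\b` alternation. A likely reason the first attempt failed is that `\b` only matches between a word character and a non-word character, so a `&` surrounded by spaces never has a boundary next to it; in addition, `[(){}[]]` is read as the class `[(){}[]` followed by a literal `]`. The sketch below uses an escaped spelling of the character class, which should match the same characters as `[][()&~|{}-]`; `split_terms_ext` and the sample inputs are made up for illustration.

```python
import re

# \b needs a word character on one side, so it cannot match next to a
# space-delimited '&'.
spaced = re.search(r"\b&\b", "Frank & Bob")   # no match
glued = re.search(r"\b&\b", "Frank&Bob")      # matches, word chars on both sides

# Word operators keep their \b guards; symbols go in a character class.
# '-' is placed last in the class so it is taken literally.
EXTENDED = re.compile(r"\s*(\b(?:AND|OR|NOT)\b|[()\[\]{}&|~-])\s*")


def split_terms_ext(text):
    """Hypothetical helper: split on words/symbols, drop empty strings."""
    return [part for part in EXTENDED.split(text) if part]


print(spaced is None, glued is not None)  # True True
print(split_terms_ext("~Frank & [Bob | Joe]"))
# ['~', 'Frank', '&', '[', 'Bob', '|', 'Joe', ']']
```

One caveat with this approach: treating `-` as an operator also splits hyphenated terms, so `well-known` comes back as `['well', '-', 'known']`.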