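【Question title】:regex - match words with a hyphen only if uppercase
【Posted】:2020-04-07 18:01:54
【Question】:

I am trying to match words that contain more than one letter and that are: all uppercase; a lowercase first letter followed by uppercase letters; or hyphenated in the middle, but only if all the letters are uppercase. Here is my code: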

import re

s = "ASCII, aSCII, AS-CII, AS-cii"

myset = set(re.findall(r"\b[a-z]?[A-Z]+\-?[A-Z]{1,}", s))

Out[555]: {'AS', 'AS-CII', 'ASCII', 'aSCII'}

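As you can see, "AS" should not be returned, because the word it comes from has lowercase letters after the hyphen. How can I fix this?

I also tried the following, which raises an error: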

myset = set(re.findall(r"\b[a-z]?[A-Z]+\-?[A-Z]+{1,}",s))

  File "<ipython-input-545-7bdc0c902553>"
    myset = set(re.findall(r"\b[a-z]?[A-Z]+\-?[A-Z]+{1,}",s))

  File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/re.py", line 222, in findall
    return _compile(pattern, flags).findall(string)

  File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)

  File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)

  File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/sre_parse.py", line 855, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)

  File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/sre_parse.py", line 416, in _parse_sub
    not nested and not items))

  File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/sre_parse.py", line 619, in _parse
    source.tell() - here + len(this))

error: multiple repeat
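
The `multiple repeat` error can be reproduced in isolation. In the second pattern, `{1,}` is applied directly to `[A-Z]+`, and Python's `re` module refuses to stack one quantifier on another. A minimal sketch (the variable names are illustrative only):

```python
import re

# "+" and "{1,}" are both quantifiers; "[A-Z]+{1,}" stacks one on the other
bad_pattern = r"\b[a-z]?[A-Z]+\-?[A-Z]+{1,}"

try:
    re.compile(bad_pattern)
    compile_error = None
except re.error as exc:
    compile_error = exc

print(compile_error)
```

Since `[A-Z]+` and `[A-Z]{1,}` mean the same thing, only one of the two quantifiers is needed.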

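【Comments】:

  • Would you accept two regexes? One for the all-uppercase and lowercase-then-uppercase words, and one for UPPER-UPPER?
  • @Skapin Yes. I need to take that into account in my task.

Tags: python regex string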


【Solution 1】:

You can use a conditional expression:

(...)?(?(1)if true then this|else this)

In your case, this could be

\b([a-z])?(?(1)[A-Z]+|[-A-Z]+[A-Z])(?!-)\b

a demo on regex101.com


Broken down, this reads:
\b               # a word boundary
([a-z])?         # match a lower case letter if it is there
(?(1)            # if the lower case letter is there, match this branch
    [A-Z]+
|
    [-A-Z]+[A-Z] # else this one
)
(?!-)\b          # do not break at a -, followed by another boundary
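
A minimal sketch of how this pattern might be applied to the sample string from the question. Because the conditional needs the capturing group `([a-z])`, `re.findall` would return only that group's contents, so the sketch uses `re.finditer` and `group(0)` instead. The variable names are illustrative.

```python
import re

s = "ASCII, aSCII, AS-CII, AS-cii"
pattern = re.compile(r"\b([a-z])?(?(1)[A-Z]+|[-A-Z]+[A-Z])(?!-)\b")

# findall() returns the contents of group 1, not the whole match
groups_only = pattern.findall(s)

# finditer() yields match objects, so group(0) is the full matched word
words = [m.group(0) for m in pattern.finditer(s)]

print(groups_only)  # ['', 'a', '']
print(words)        # ['ASCII', 'aSCII', 'AS-CII']
```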

【Comments】:

    【Solution 2】:

    Here we go:

    res = [x[0] for x in re.findall(r"(([a-z]{1}[A-Z]+)|([A-Z]+\-[A-Z]+))",s)]
    print(res)
    print(set(res))
    

    which gives

    ['aSCII', 'AS-CII']
    

    Let me know if this works for you. I split the pattern into two parts and joined them with OR logic (|).
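
The tuples come from the capturing groups: with more than one group, `re.findall` returns one tuple per match. A hedged sketch of the same two alternatives written with a non-capturing group `(?:...)`, so that `findall` returns plain strings. It keeps the behaviour of the pattern above, including that a plain all-uppercase word such as ASCII is not matched.

```python
import re

s = "ASCII, aSCII, AS-CII, AS-cii"

# (?:...) groups the alternatives without capturing, so findall()
# returns whole matches instead of tuples of groups
res = re.findall(r"(?:[a-z][A-Z]+|[A-Z]+-[A-Z]+)", s)

print(res)  # ['aSCII', 'AS-CII']
```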

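    【Comments】:

    • There is no need to escape - here. And what about ASCII?
    • Looks good... is there a way to do this without the tuples? Jan's comment is also valid.
    • I don't think so, but it doesn't matter; the list comprehension gives you the output you want.
    【Solution 3】:

    The following regex matches all the criteria mentioned: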

    \b[a-z]*[A-Z]+[\-A-Z]+[A-Z]+\b
    

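    See it here: https://regex101.com/r/JNC4kN/1/

    However, this fails for input such as aTH-THTH (a lowercase letter followed by uppercase letters and a hyphen), which it still matches. If hyphenated words should only be UPPER-UPPER, use this regex instead: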

    \b[a-z]{0,1}(?<!\-)[A-Z]+\b(?!\-)|\b[A-Z]+\-[A-Z]+\b
    

    Check it here
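
A short sketch comparing the two patterns from this answer on the question's sample string and on aTH-THTH. The names `first`, `second`, and the result variables are illustrative only.

```python
import re

s = "ASCII, aSCII, AS-CII, AS-cii"

first = re.compile(r"\b[a-z]*[A-Z]+[\-A-Z]+[A-Z]+\b")
second = re.compile(r"\b[a-z]{0,1}(?<!\-)[A-Z]+\b(?!\-)|\b[A-Z]+\-[A-Z]+\b")

# Neither pattern has capturing groups, so findall() returns whole matches
first_on_sample = first.findall(s)
second_on_sample = second.findall(s)

# The first pattern accepts a lowercase prefix on a hyphenated word
first_on_mixed = first.findall("aTH-THTH")
second_on_mixed = second.findall("aTH-THTH")

print(first_on_sample)   # ['ASCII', 'aSCII', 'AS-CII']
print(second_on_sample)  # ['ASCII', 'aSCII', 'AS-CII']
print(first_on_mixed)    # ['aTH-THTH']
print(second_on_mixed)   # []
```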

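    【Comments】:

      【Solution 4】:

      You can use the following regex, which also covers edge cases where a word is preceded or followed by a hyphen (as shown at the link below):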

      (?<!\w|(?<=\w)-)(?:[a-zA-Z][A-Z]+|[A-Z]{2,}|[A-Z]+-[A-Z]+)(?!\w|-(?=\w))
      

      Demo

      Python's regex engine performs the following operations.

      (?<!              # begin a negative lookbehind
        \w              # match word char
        |               # or
        (?<=\w)         # match a word char in a positive lookbehind
        -               # match '-'
      )                 # end negative lookbehind
      (?:               # begin non-cap grp
        [a-zA-Z][A-Z]+  # match any letter (lc or uc) then 1+ uc letters
        |               # or
        [A-Z]{2,}       # match 2+ uc letters
        |               # or
        [A-Z]+-[A-Z]+   # match 1+ uc letters, '-', then 1+ uc letters
      )                 # end non-cap grp
      (?!               # begin negative lookahead
        \w              # match word char
        |               # or
        -               # match '-'
        (?=\w)          # match a word char in a positive lookahead
      )                 # end negative lookahead
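
A minimal sketch of this pattern on the question's sample string and on two hyphen edge cases. The strings pre-ASCII and ASCII-art are made-up test inputs, not taken from the question.

```python
import re

pattern = re.compile(
    r"(?<!\w|(?<=\w)-)(?:[a-zA-Z][A-Z]+|[A-Z]{2,}|[A-Z]+-[A-Z]+)(?!\w|-(?=\w))"
)

s = "ASCII, aSCII, AS-CII, AS-cii"
sample_matches = pattern.findall(s)

# Hypothetical inputs: an uppercase word glued by a hyphen to a lowercase
# word on either side is rejected; only the clean UPPER-UPPER word remains
edge_matches = pattern.findall("pre-ASCII, ASCII-art, AS-CII")

print(sample_matches)  # ['ASCII', 'aSCII', 'AS-CII']
print(edge_matches)    # ['AS-CII']
```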
      

      【Comments】:
