Replacement for redundant regex call in Python if-else statement: answers

【Question title】: Replacement for redundant regex call in Python if-else statement
【Posted】: 2020-11-16 08:02:45
【Question description】:

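This is switch-case style code, and it works as expected.

What I want to do is:

sort the candidates by matching them against multiple regexes, AND
parse their components into pieces via regex groups.
The regex syntax and patterns are fine, so it would be best to focus on how to optimise (or replace) the if clause itself.

I suspect that calling exactly the same re.match() twice on every branch is very inefficient.

Is there any alternative, or a more sophisticated way, in Python to "reuse" the re.match object used in the if statement?

I searched for best practices and read the manual, but came up empty.

I can't assign the re.match() value up front or just use re.compile() as suggested here, because I have elif clauses:

Redundant If Statement and Regex

I can see that from Python 3.8 a variable can be assigned inside an if statement, but I'm on Python 3.7.

How to assign a variable in an IF condition, and then return it?

Any help would be appreciated.

Thanks in advance.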

import re

candidates = [
  'WTI CRUDE FUTURE Jul20',
  'Crude Oil Option C31',
  'O-CLK20_C43.00',
  'AMZN US 01/17/20 P1440',
  # ...
]

for item in candidates:

  if re.match(r'([\w ]+) FUTURE (\w{3})(\d{2})', item):
     # The same pattern is matched a second time just to get at the groups
     redundant_call = re.match(r'([\w ]+) FUTURE (\w{3})(\d{2})', item)
     # Do something with .group(1), .group(2) ...

  elif re.match(r'([\w ]+) Option (P|C)([\d\.]+)', item):
     redundant_call = re.match(r'([\w ]+) Option (P|C)([\d\.]+)', item)
     # Do something with .group(1), .group(2) ...

  elif re.match(r'O-(\w{2,3})([F-Z])(\d{2})_(P|C)([\d.]+)', item):
     redundant_call = re.match(r'O-(\w{2,3})([F-Z])(\d{2})_(P|C)([\d.]+)', item)
     # Do something with .group(1), .group(2) ...

...

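【Question discussion】:

The lack of options here is exactly why 3.8 introduced the walrus operator. There is no good solution without it. The best you can do is a pile of nested if/else cases, which gets ugly fast because you have to keep indenting.

Tags: python regex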

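【Solution 1】:

This is why the walrus operator was added; there is no good solution without it. The only way to keep the short-circuiting and avoid re-testing is to nest if/else blocks, which leads to ugly "arrow pattern" code: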

for item in candidates:
    m = re.match(r'([\w ]+) FUTURE (\w{3})(\d{2})', item)
    if m:
        pass  # Do something with m.group(1), m.group(2) ...
    else:
        m = re.match(r'([\w ]+) Option (P|C)([\d\.]+)', item)
        if m:
            pass  # Do something with m.group(1), m.group(2) ...
        else:
            m = re.match(r'O-(\w{2,3})([F-Z])(\d{2})_(P|C)([\d.]+)', item)
            if m:
                pass  # Do something with m.group(1), m.group(2) ...

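Either do that, or upgrade to 3.8. The only simplification available is when you take the same action for any match (the groups are interchangeable), in which case a single inner loop over the patterns is enough, but that does not look like the case here.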
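
For readers who can move to Python 3.8 or later, here is a minimal sketch of the walrus-operator form mentioned above. The function name `classify`, the pattern constants, and the tuple return shape are assumptions made for illustration; the original post does not say what each branch should do with the groups.

```python
import re

# Patterns taken from the question; compiled once, outside the loop.
FUTURE = re.compile(r'([\w ]+) FUTURE (\w{3})(\d{2})')
OPTION = re.compile(r'([\w ]+) Option (P|C)([\d\.]+)')
LISTED = re.compile(r'O-(\w{2,3})([F-Z])(\d{2})_(P|C)([\d.]+)')

def classify(item):
    """Return (kind, *groups) for the first matching pattern, else None.

    Hypothetical helper: each pattern is tried at most once, and later
    patterns are skipped as soon as an earlier one matches.
    """
    if m := FUTURE.match(item):
        return ('future',) + m.groups()
    elif m := OPTION.match(item):
        return ('option',) + m.groups()
    elif m := LISTED.match(item):
        return ('listed_option',) + m.groups()
    return None
```

With the sample data from the question, `classify('WTI CRUDE FUTURE Jul20')` gives `('future', 'WTI CRUDE', 'Jul', '20')`, while `'AMZN US 01/17/20 P1440'` matches none of the three patterns shown and gives `None`.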

【Discussion】:

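【Solution 2】:

Assign the matches to variables at the start of each iteration. Then check whether each variable holds a match result, using the same if structure you already have.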

for item in candidates:
    future = re.match(r'([\w ]+) FUTURE (\w{3})(\d{2})', item)
    option = re.match(r'([\w ]+) Option (P|C)([\d\.]+)', item)
    other = re.match(r'O-(\w{2,3})([F-Z])(\d{2})_(P|C)([\d.]+)', item)

    if future:
        print(future)
        # Do what you want here
    elif option:
        print(option)
        # Do what you want here
    elif other:
        print(other)
        # Do what you want here

This way each pattern is matched only once per iteration.

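【Discussion】:

The problem is that you are now testing all three regexes unconditionally, even when the first or second has already succeeded; if the first is the likely hit, you are doing three times the work in that case.
Good point @ShadowRanger, that did occur to me! Checking the way you have it would be cleaner, especially on larger datasets.
Another possibility is to combine the three regexes into one and then check which group matched. If performance matters, test with both re and re2 to see which is fastest.
If it comes to that, the combined regex may well perform better than three separate regexes, so that in itself could be a reason to prefer this variant.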

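【Solution 3】:

Another option in this case is to merge the three regexes into one. Depending on the data, this may even be more efficient (especially with re2); only measuring with real data can answer that. There is then a single .match call, and the if/elif statements check which groups are not None.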

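for item in candidates:
  m = re.match(
    r'(([\w ]+) (FUTURE (\w{3})(\d{2})|Option (P|C)([\d\.]+)))|(O-(\w{2,3})([F-Z])(\d{2})_(P|C)([\d.]+))',
    item
  )

  if m is None:
     continue  # none of the three alternatives matched this item
  # group(3) is set for both FUTURE and Option, so test a group unique to each
  elif m.group(4) is not None:
     pass  # FUTURE: do something with m.group(2), m.group(4) ...
  elif m.group(6) is not None:
     pass  # Option: do something with m.group(2), m.group(6) ...
  elif m.group(8) is not None:
     pass  # O-...: do something with m.group(9), m.group(10) ...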

As with the other solutions, you will want to compile the regex before entering the loop for performance.
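
Counting group numbers in a merged pattern is easy to get wrong, so one possible variation is to name the groups. The sketch below is an assumption-laden illustration rather than the answerer's code: the group names (`future`, `f_name`, `o_side`, and so on) and the function `parse_combined` are invented here, and Python's `re` requires every group name to be unique, which is why the name and strike fields carry a prefix per alternative.

```python
import re

# One compiled pattern with a named wrapper group per alternative.
COMBINED = re.compile(
    r'(?P<future>(?P<f_name>[\w ]+) FUTURE (?P<f_month>\w{3})(?P<f_year>\d{2}))'
    r'|(?P<option>(?P<o_name>[\w ]+) Option (?P<o_side>P|C)(?P<o_strike>[\d.]+))'
    r'|(?P<listed>O-(?P<l_root>\w{2,3})(?P<l_month>[F-Z])(?P<l_year>\d{2})'
    r'_(?P<l_side>P|C)(?P<l_strike>[\d.]+))'
)

def parse_combined(item):
    """Match once, then branch on which wrapper group took part in the match."""
    m = COMBINED.match(item)
    if m is None:
        return None  # no alternative matched
    if m.group('future') is not None:
        return ('future', m.group('f_name'), m.group('f_month'), m.group('f_year'))
    if m.group('option') is not None:
        return ('option', m.group('o_name'), m.group('o_side'), m.group('o_strike'))
    # Only the third alternative is left at this point.
    return ('listed_option', m.group('l_root'), m.group('l_month'),
            m.group('l_year'), m.group('l_side'), m.group('l_strike'))
```

Groups belonging to an alternative that did not take part in the match are `None`, which is what the `is not None` checks rely on.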

【Discussion】:

【Solution 4】:

A horrible (but workable) hack is to create a helper function that takes a mutable object as an argument:

import re

def match(regex, item, groups):
    m = re.match(regex, item)
    if m:
        groups[:] = m.groups()  # update the groups argument in-place
    return m

def process_data(candidates):
    groups = []
    for item in candidates:

        if match(r'([\w ]+) FUTURE (\w{3})(\d{2})', item, groups):
            pass  # Do something with groups[0], groups[1] ...

        elif match(r'([\w ]+) Option (P|C)([\d\.]+)', item, groups):
            pass  # Do something with groups[0], groups[1] ...

        elif match(r'O-(\w{2,3})([F-Z])(\d{2})_(P|C)([\d.]+)', item, groups):
            pass  # Do something with groups[0], groups[1] ...

For performance you would want to compile the regexes, so the match function would actually use regex.match(item) rather than re.match(regex, item).
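
A minimal sketch of that compiled variant, assuming module-level pattern constants (`FUTURE`, `OPTION` and `LISTED` are names chosen here, not taken from the answer) and a `results` list that is added only so the function returns something that can be inspected:

```python
import re

# Compiled once at import time instead of on every call.
FUTURE = re.compile(r'([\w ]+) FUTURE (\w{3})(\d{2})')
OPTION = re.compile(r'([\w ]+) Option (P|C)([\d\.]+)')
LISTED = re.compile(r'O-(\w{2,3})([F-Z])(\d{2})_(P|C)([\d.]+)')

def match(pattern, item, groups):
    """Match a compiled pattern; on success, copy its groups into `groups`."""
    m = pattern.match(item)
    if m:
        groups[:] = m.groups()  # update the caller's list in-place
    return m

def process_data(candidates):
    results = []  # hypothetical output, for illustration only
    groups = []
    for item in candidates:
        if match(FUTURE, item, groups):
            results.append(('future', tuple(groups)))
        elif match(OPTION, item, groups):
            results.append(('option', tuple(groups)))
        elif match(LISTED, item, groups):
            results.append(('listed_option', tuple(groups)))
        # items that match nothing are silently skipped in this sketch
    return results
```

The `groups` list is overwritten only when a match succeeds, so it should be read only inside the branch whose `match(...)` call returned a match.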

【Discussion】:

【Solution 5】:

If the processing can reasonably be done by the same code, or factored out into separate functions, we can also use a for loop:

dispatch_table = [
  (r'([\w ]+) FUTURE (\w{3})(\d{2})', handle_future),
  (r'([\w ]+) Option (P|C)([\d\.]+)', handle_option),
  (r'O-(\w{2,3})([F-Z])(\d{2})_(P|C)([\d.]+)', handle_other),
]

for item in candidates:
  for regex, handler in dispatch_table:
    m = re.match(regex, item)
    if m:
      handler(*m.groups())
      break
  else:
    raise ValueError("Unrecognised item: %s" % item)
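
The snippet above leaves `handle_future`, `handle_option` and `handle_other` undefined. A self-contained sketch of the same dispatch-table idea might look like the following; the handler bodies, their parameter names, and the wrapper function `parse_item` are placeholders invented for illustration, and the patterns are compiled up front as the other answers recommend.

```python
import re

# Hypothetical handlers: each receives the captured groups as positional arguments.
def handle_future(name, month, year):
    return {'kind': 'future', 'name': name, 'month': month, 'year': year}

def handle_option(name, side, strike):
    return {'kind': 'option', 'name': name, 'side': side, 'strike': strike}

def handle_other(root, month, year, side, strike):
    return {'kind': 'other', 'root': root, 'month': month,
            'year': year, 'side': side, 'strike': strike}

DISPATCH_TABLE = [
    (re.compile(r'([\w ]+) FUTURE (\w{3})(\d{2})'), handle_future),
    (re.compile(r'([\w ]+) Option (P|C)([\d\.]+)'), handle_option),
    (re.compile(r'O-(\w{2,3})([F-Z])(\d{2})_(P|C)([\d.]+)'), handle_other),
]

def parse_item(item):
    """Try each pattern in order and call the handler of the first match."""
    for pattern, handler in DISPATCH_TABLE:
        m = pattern.match(item)
        if m:
            return handler(*m.groups())
    raise ValueError("Unrecognised item: %s" % item)
```

Each handler's parameter count has to equal the number of capture groups in its pattern, since `*m.groups()` passes them positionally. Note that `'AMZN US 01/17/20 P1440'` from the question matches none of the three patterns shown, so it raises `ValueError` here.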

【Discussion】: