What's the most efficient way to match strings in Python?

【Question title】: What's the most efficient way to match strings in Python?
【Posted】: 2015-07-14 10:37:24
【Question】:

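I need to write some user-defined functions (UDFs) in Python for a Pig data transformation job. To describe the scenario: the data is parsed and fed in, and the Pig script calls this Python UDF for every data field in a column.

Most of the UDFs are similar in nature: I essentially need to match a string against "something + a wildcard". I know regex and have been using it so far, but before going any further I want to make sure it is an efficient way to match strings, since the script will iterate and call the UDFs thousands of times.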

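An example: say we have a field that needs to match sales. The field could hold almost any value, because the source data may go wonky in the future, append something random, and spit out saleslol. Other possible values are sales., salessales, and sales.yes.

Whatever comes after sales doesn't matter; if the value starts with sales, I want to catch it.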

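So is the following approach efficient? The word variable is the input, that is, the value of the sales column. The first line is there for the Pig script.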

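import re

# outputSchema is the decorator Pig uses to declare the UDF's return schema
@outputSchema("num:int")
def rule2(word):
  # re.match only matches at the start of the string
  sales_match = re.match('sales', word, flags=re.IGNORECASE)

  if sales_match:
    return 1
  else:
    return 0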

I have another case where I need to match four exact, known strings. Is this efficient too?

@outputSchema("num:int")
def session1(word):
  if word in ['first', 'second', 'third', 'fourth']:
    return 1
  else:
    return 0

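【Comments】:

Why are you asking whether it's efficient? Have you tried testing it?
.startswith() is your friend...
@jonrsharpe Yes, I have tested it. But I don't know enough about this kind of thing, and I don't know what other ways there are to do this matching, which is why I'm asking.

Tags: python regex string-matching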

【Answer 1】:

You can use str.startswith():

>>> [s for s in 'saleslol. Other possible values are sales. salessales sales.yes'.split() if s.lower().startswith('sales')]
['saleslol.', 'sales.', 'salessales', 'sales.yes']

You also don't need to do this in Python:

if word in ['first', 'second', 'third', 'fourth']:
    return 1
else:
    return 0

Instead, it's better to do this:

def session1(word):
    return word in {'first', 'second', 'third', 'fourth'}

(Note the set literal rather than a list; the same syntax works with a list.)
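One thing to watch: the set-based session1 above returns True or False, while the question's version returned 1 or 0 under a num:int schema. A minimal sketch that keeps the integer result is below. The name KNOWN_SESSIONS is made up for illustration, the Pig decorator is left out so the snippet runs outside Pig, and whether Pig needs the explicit int() is an assumption rather than something confirmed here.

```python
# Hypothetical module-level constant: the set is built once, not on every call.
KNOWN_SESSIONS = frozenset(["first", "second", "third", "fourth"])

def session1(word):
    """Return 1 if word is one of the known strings, else 0."""
    # int(True) == 1 and int(False) == 0, matching the original return values.
    return int(word in KNOWN_SESSIONS)
```

Calling session1("third") gives 1 and session1("fifth") gives 0.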

For a prefix test, your function would be:

def f(word):
    return word.startswith('sales')    # returns True or False
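Note that f above is case-sensitive, whereas the regex in the question used re.IGNORECASE. The sketch below shows one way to keep the case-insensitive behaviour with startswith, next to a precompiled regex for comparison. The names PREFIX, starts_with_sales, and starts_with_sales_regex are illustrative, and the two are only assumed to agree for plain ASCII input like the examples in the question.

```python
import re

PREFIX = "sales"
# Compiled once at import time instead of on every UDF call.
_SALES_RE = re.compile(PREFIX, flags=re.IGNORECASE)

def starts_with_sales(word):
    """Return 1 if word begins with 'sales' in any letter case, else 0."""
    return int(word.lower().startswith(PREFIX))

def starts_with_sales_regex(word):
    """Same check using the regex approach from the question."""
    return int(_SALES_RE.match(word) is not None)

# Both approaches agree on these sample values.
for sample in ["sales", "SalesLOL", "sales.yes", "presales", ""]:
    assert starts_with_sales(sample) == starts_with_sales_regex(sample)
```

Either function can be timed against the other on real field values before settling on one.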

If you want to test several possible words, use any:

>>> def test(tgt, words):
...    return any(word.startswith(tgt) for word in words)
>>> test('sales', {'boom', 'blast', 'saleslol'})
True
>>> test('boombang', {'sales', 'boom', 'blast'})
False

Conversely, if you want to test several prefixes, use the tuple form of startswith:

>>> 'tenthhaha'.startswith(('first', 'second', 'third', 'fourth'))
False
>>> 'firstlol'.startswith(('first', 'second', 'third', 'fourth'))
True

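【Comments】:

【Answer 2】:

Actually, for some reason function A seems to be faster. I ran 1 million iterations of each function and, if my measurements are correct, the first one is about 20% faster.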


from pythonbenchmark import compare, measure

def session1_A(word):
  if word in ['first', 'second', 'third', 'fourth']:
    return 1
  else:
    return 0

def session1_B(word):
    return word in {'first', 'second', 'third', 'fourth'}

compare(session1_A, session1_B, 1000000, "fourth")

https://github.com/Karlheinzniebuhr/pythonbenchmark/
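For anyone who wants to repeat the comparison without a third-party package, here is a rough sketch using the standard library's timeit module. The helper best_time and the sample words are illustrative choices, not part of the original benchmark. Results depend on the interpreter version and the machine, and in some interpreter versions the set literal may be rebuilt on every call, so treat the numbers as indicative only.

```python
import timeit

def session1_A(word):
    # List-literal membership, as in the question.
    if word in ['first', 'second', 'third', 'fourth']:
        return 1
    else:
        return 0

def session1_B(word):
    # Set-literal membership, as in Answer 1.
    return word in {'first', 'second', 'third', 'fourth'}

def best_time(func, arg, number=100000, repeat=5):
    """Best total time in seconds for `number` calls of func(arg)."""
    return min(timeit.repeat(lambda: func(arg), number=number, repeat=repeat))

if __name__ == "__main__":
    # Try a hit at the front of the list, a hit at the end, and a miss.
    for word in ("first", "fourth", "missing"):
        a = best_time(session1_A, word)
        b = best_time(session1_B, word)
        print("%-8s list: %.4fs  set: %.4fs" % (word, a, b))
```

Taking the minimum over several repeats reduces noise from other processes; raise number to 1000000 to match the loop count used above.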

【Comments】: