Regular expression to pull out a variable string: answers

【Question title】: Regular expression to pull out variable string
【Posted】: 2015-05-16 15:43:00
【Question description】:

I have this list of strings in Python 2.7:

list_a = [
    'temp_52_head sensor, uploaded by TS',
    'crack in the left quadrant, uploaded by AB, Left in 2hr sunlight',
    'FSL_pressure, uploaded by RS, no reported vacuum',
    'art 9943_mercury, Uploaded by DY, accelerated, hurst potential too low',
    'uploaded by KKP, Space 55',
    'avogadro reading level, uploaded by HB, started mini counter, pulled lever',
    'no comment yesterday, Uploaded to TFG, level 1 escape but temperature stable, pressure lever north',
]

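Every list item contains a string of the form

uploaded by SOMEONE

and I need to extract SOMEONE.

However, as you can see, SOMEONE:

1. Changes from one list item to the next.
2. Can be 2 or 3 characters long (letters only, no digits).
3. Appears at a different position in each string.

In addition:

4. "uploaded" also occurs as "Uploaded".
5. "uploaded" sometimes appears before any comma.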

This is what I need to extract:

someone_names = ['TS','AB','RS','DY','KKP','HB','TFG']

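I am thinking of using a regular expression, but the problems I face come from points 2 and 3 above.

Is there a way to extract these characters from the list?

【Question comments】:

Tags: regex string python-2.7 substring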

【Solution 1】:

You can apply a regular expression inside a list comprehension.

>>> import re
>>> list_a = [
...     'temp_52_head sensor, uploaded by TS',
...     'crack in the left quadrant, uploaded by AB, Left in 2hr sunlight',
...     'FSL_pressure, uploaded by RS, no reported vacuum',
...     'art9943_mercury, Uploaded by DY, accelerated, hurst potential too low',
...     'uploaded by KKP, Space 55',
...     'avogadro reading level, uploaded by HB, started mini counter, pulled lever',
...     'no comment yesterday, Uploaded to TFG, level 1 escape but temperature stable,pressure lever north'
... ]
>>> regex = re.compile(r'(?i)\buploaded\s*(?:by|to)\s*([a-z]{2,3})')
>>> names = [m.group(1) for x in list_a for m in [regex.search(x)] if m]
>>> names
['TS', 'AB', 'RS', 'DY', 'KKP', 'HB', 'TFG']

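【Comments】:

Hi, this works, but I have very little experience with re.compile(). Could you explain those two lines, especially the first one?
I have one last question: the method still works even if I use [a-z] instead of [A-Z]. Why did you use capital letters?
Thank you very much! All my questions have been answered. I found this link very helpful for re.compile(): diveintopython3.net/regular-expressions.html.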
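
For readers with the same question about those two lines, here is a minimal sketch that spells the pattern out with re.VERBOSE and unrolls the list comprehension into a plain loop. The verbose layout and the loop are a restatement for illustration, not the answerer's code, though the behaviour should be the same. It also bears on the [a-z] versus [A-Z] question: with (?i) or re.IGNORECASE set, either character class matches both upper and lower case.

```python
import re

list_a = [
    'temp_52_head sensor, uploaded by TS',
    'crack in the left quadrant, uploaded by AB, Left in 2hr sunlight',
    'FSL_pressure, uploaded by RS, no reported vacuum',
    'art9943_mercury, Uploaded by DY, accelerated, hurst potential too low',
    'uploaded by KKP, Space 55',
    'avogadro reading level, uploaded by HB, started mini counter, pulled lever',
    'no comment yesterday, Uploaded to TFG, level 1 escape but temperature stable,pressure lever north',
]

# Same pattern as in the answer, written in verbose mode so each part
# can carry a comment. Whitespace inside the pattern is ignored here.
regex = re.compile(r'''
    \b uploaded      # the word "uploaded", starting at a word boundary
    \s*              # optional whitespace
    (?:by|to)        # "by" or "to" (non-capturing group)
    \s*              # optional whitespace
    ([a-z]{2,3})     # group 1: two or three letters
''', re.IGNORECASE | re.VERBOSE)

# The list comprehension, unrolled: search each string and keep
# group 1 only when there was a match.
names = []
for x in list_a:
    m = regex.search(x)
    if m:
        names.append(m.group(1))

print(names)  # ['TS', 'AB', 'RS', 'DY', 'KKP', 'HB', 'TFG']
```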

【Solution 2】:

Not a single regular expression, but a more verbose approach might look like this:

import re
# x must be an integer index into list_a (for example 1), not a string
name = re.search(re.escape("uploaded by ")+"(.*?)"+re.escape(","),list_a[x]).group(1)

【Comments】:

^^^^^ I get this error message: TypeError: list indices must be integers, not str.
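
That TypeError is what Python 2.7 raises when a list is indexed with a string, so presumably x held one of the strings rather than a position. A minimal sketch that avoids the index altogether by iterating over the items directly is shown below. The loop and the None guard are additions for illustration. Note that the pattern as written needs a lower-case "uploaded by " and a comma after the name, so it leaves three of the seven items unmatched.

```python
import re

list_a = [
    'temp_52_head sensor, uploaded by TS',
    'crack in the left quadrant, uploaded by AB, Left in 2hr sunlight',
    'FSL_pressure, uploaded by RS, no reported vacuum',
    'art9943_mercury, Uploaded by DY, accelerated, hurst potential too low',
    'uploaded by KKP, Space 55',
    'avogadro reading level, uploaded by HB, started mini counter, pulled lever',
    'no comment yesterday, Uploaded to TFG, level 1 escape but temperature stable,pressure lever north',
]

# Same pattern as in the answer, compiled once.
pattern = re.compile(re.escape("uploaded by ") + "(.*?)" + re.escape(","))

names = []
for item in list_a:            # iterate over the strings, no index needed
    m = pattern.search(item)
    # search() returns None when nothing matches, so guard before .group()
    names.append(m.group(1) if m else None)

print(names)  # [None, 'AB', 'RS', None, 'KKP', 'HB', None]
```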

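【Solution 3】:

A regular expression like this looks like it meets your requirements, unless I'm missing something:

/[Uu]ploaded by ([A-Z]{2,3}),/

Also, judging from your example, you could split each string on commas, pick the element that contains the string "ploaded by" (which sidesteps the upper/lower-case "u"), split that element on spaces, and take the last element of the resulting array.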
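
The split-based idea could be sketched in Python roughly as follows. The function name extract_uploader is hypothetical, and checking for "ploaded to" as well as "ploaded by" is an assumption added so that the "Uploaded to TFG" item from the question is covered too.

```python
def extract_uploader(item):
    """Return the last word of the comma-separated part that mentions the upload."""
    for part in item.split(','):
        # "ploaded" avoids caring about the case of the leading "u";
        # the "to" variant is an addition to cover "Uploaded to TFG".
        if 'ploaded by' in part or 'ploaded to' in part:
            return part.split()[-1]
    return None  # no upload marker found in this item


list_a = [
    'temp_52_head sensor, uploaded by TS',
    'crack in the left quadrant, uploaded by AB, Left in 2hr sunlight',
    'FSL_pressure, uploaded by RS, no reported vacuum',
    'art9943_mercury, Uploaded by DY, accelerated, hurst potential too low',
    'uploaded by KKP, Space 55',
    'avogadro reading level, uploaded by HB, started mini counter, pulled lever',
    'no comment yesterday, Uploaded to TFG, level 1 escape but temperature stable,pressure lever north',
]

someone_names = [extract_uploader(item) for item in list_a]
print(someone_names)  # ['TS', 'AB', 'RS', 'DY', 'KKP', 'HB', 'TFG']
```

This relies on the name being the last word before the next comma, which holds for the sample data but is not checked.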

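【Comments】:

【Solution 4】:

This regular expression hits all of them, and it keeps working if the number of letters in the uploader's initials changes. It matches whether or not the two or three letters are followed by a comma or a single quote, and it captures all the data you are looking for: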

import re

m = re.compile('uploaded ((by)|(to)) ([a-z]+)', flags=re.IGNORECASE)

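You can then call the search() method of the compiled pattern object m on each string to pull out the match. In each iteration, the 4th group of the match is the data you are looking for.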
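
As a minimal sketch of that usage (the loop and variable names other than m are illustrative), each string is searched once and group(4) is collected. The nested parentheses are why the initials land in group 4: group 1 is the whole "by"/"to" alternative, groups 2 and 3 are its two branches.

```python
import re

list_a = [
    'temp_52_head sensor, uploaded by TS',
    'crack in the left quadrant, uploaded by AB, Left in 2hr sunlight',
    'FSL_pressure, uploaded by RS, no reported vacuum',
    'art9943_mercury, Uploaded by DY, accelerated, hurst potential too low',
    'uploaded by KKP, Space 55',
    'avogadro reading level, uploaded by HB, started mini counter, pulled lever',
    'no comment yesterday, Uploaded to TFG, level 1 escape but temperature stable,pressure lever north',
]

m = re.compile('uploaded ((by)|(to)) ([a-z]+)', flags=re.IGNORECASE)

names = []
for item in list_a:
    match = m.search(item)     # first match in this string, or None
    if match:
        names.append(match.group(4))

# All four groups for the first item: the unused "to" branch is None.
first_groups = m.search(list_a[0]).groups()

print(names)         # ['TS', 'AB', 'RS', 'DY', 'KKP', 'HB', 'TFG']
print(first_groups)  # ('by', 'by', None, 'TS')
```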

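【Comments】:

Hi, this seems to be the simplest answer, but re.IGNORECASE says module object has no attribute IGNORECASE. Doesn't this work in Python 2.7?
Ah, it needs flags=. Fixed.
I know very little Python, but I wonder whether you could join the array of strings into one long string, call the search() function on it, and then use the .group function. I believe the result would be in group(4).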
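
A rough sketch of that idea, with the assumptions flagged: search() returns only the first match in a string, so on the joined string finditer() is used to collect every match. Joining with a newline is a choice made here so that initials at the end of one item cannot run into the first word of the next (with an empty separator, [a-z]+ would read "TScrack").

```python
import re

list_a = [
    'temp_52_head sensor, uploaded by TS',
    'crack in the left quadrant, uploaded by AB, Left in 2hr sunlight',
    'FSL_pressure, uploaded by RS, no reported vacuum',
    'art9943_mercury, Uploaded by DY, accelerated, hurst potential too low',
    'uploaded by KKP, Space 55',
    'avogadro reading level, uploaded by HB, started mini counter, pulled lever',
    'no comment yesterday, Uploaded to TFG, level 1 escape but temperature stable,pressure lever north',
]

m = re.compile('uploaded ((by)|(to)) ([a-z]+)', flags=re.IGNORECASE)

# One long string; the newline keeps neighbouring items apart.
joined = '\n'.join(list_a)

# search() stops at the first match, so it yields only the first name.
first = m.search(joined).group(4)

# finditer() walks through every non-overlapping match.
names = [match.group(4) for match in m.finditer(joined)]

print(first)  # TS
print(names)  # ['TS', 'AB', 'RS', 'DY', 'KKP', 'HB', 'TFG']
```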