Iterate over list of strings to pull out substrings: answers

【Question title】: Iterate over list of strings to pull out substrings
【Posted】: 2021-04-14 11:28:30
【Question】:

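I have a long list of different strings that all contain some information about specific ports around the world. However, each port name is different and appears at a different position in its string. What I want to do is iterate over all the strings, find the word 'Port', and then store the next two substrings after 'Port'. For example: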

'Strong winds may disrupt operations at the Port of Rotterdam on July 5'

Having found 'Port', I would like to append 'of Rotterdam' to it as one complete string, e.g. 'Port of Rotterdam'. I think I can split each longer string with parts = my_str.split(' '). Then:

for i in parts:
    if i == 'Port':
        new_str = i

However, I'm not sure how to add the next two substrings. Any ideas?

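【Comments on the question】:

Is it actually always two substrings?
No, not necessarily. But I think it should be most of the time. I'm not sure how I would handle the cases where it isn't.
You could also match Port followed by one "word" and optionally a second one, e.g. \bPort\s+\S+(?:\s+\S+)? regex101.com/r/OatSlU/1
I've updated my answer further with a better regex solution.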

Tags: python python-3.x string

【Solution 1】:

Take a look at list.index:

parts = my_str.split(' ')
try:
    port_index = parts.index('Port')
except ValueError:
    pass # Port name not found
else:
    # Take three tokens: 'Port' itself plus the next two words
    port_name = ' '.join(parts[port_index:port_index + 3])

You can of course do fancier processing. For example, grab a run of capitalized words, optionally preceded by an 'of':

def find_name(sentence):
    """
    Get the port name or None.
    """
    parts = sentence.split(' ')
    try:
        start = parts.index('Port')
    except ValueError:
        return None
    else:
        if start == len(parts) - 1:
            return None

    end = start + 1
    if parts[end] == 'of':
        end = end + 1
    while end < len(parts) and parts[end][0].isupper():
        end += 1

    if end == start + 1 or (end == start + 2 and parts[start + 1] == 'of'):
        return None

    return ' '.join(parts[start:end])

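Of course, you can do the same thing with a regular expression:

import re

pattern = re.compile(r'Port(?:\s+of)?(\s+[A-Z]\S+)+')
match = pattern.search(my_str)
if match:  # search() returns None when no port name is found
    print(match.group())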

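This regex will not correctly match non-Latin uppercase letters. You may want to look into Unicode-aware matching to handle foreign port names adequately.

Both find_name and the regex above handle the following test cases: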

'Strong winds may disrupt operations at the Port of Rotterdam on July 5'
'Strong winds may disrupt operations at the Port of Fos-sur-Mer on July 5'
'Strong winds may disrupt operations at Port Said on July 5'

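You can probably refine the search further, but this should give you the tools for a good start. At some point, if the sentences get complicated enough, you may want to use some form of natural language processing. For example, check out the nltk package.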
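On the note about non-Latin capitals: one possible workaround, sketched here rather than taken from the answer, is to stay with the token-based approach and test the first character with str.isupper(), which follows Unicode case rules. The function name find_port_name and the sample sentences are made up for illustration.

```python
def find_port_name(sentence):
    """Return 'Port ...' with its capitalized name words, or None."""
    words = sentence.split()  # split() with no argument never yields empty tokens
    try:
        start = words.index("Port")
    except ValueError:
        return None

    end = start + 1
    if end < len(words) and words[end] == "of":
        end += 1
    name_start = end

    # str.isupper() is Unicode-aware, so capitals outside A-Z count as well
    while end < len(words) and words[end][0].isupper():
        end += 1

    if end == name_start:
        return None  # 'Port' was not followed by any capitalized word
    return " ".join(words[start:end])


print(find_port_name("Delays are expected at the Port of Świnoujście on Monday"))
print(find_port_name("Fog closed Port Одеса overnight"))
```

Like find_name, this only looks at the first standalone 'Port' token, so a sentence where 'Port' is glued to punctuation would need extra cleanup first.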

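【Comments】:

Your last regex solution is very neat. Is there a way to modify it so that the port name can contain dashes? For example, Port of Fos-sur-Mer.
@EliTurasky. Sure. I've edited the answer. You can add built-in classes to a hand-built class: [\w-] is perfectly valid. I chose to be less restrictive and simply changed \w to \S.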
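To see what the choice between \w, [\w-] and \S means in practice, here is a small comparison. It is only a sketch: the sample sentence is invented, and the capture group is written as non-capturing for brevity.

```python
import re

sentence = "Ships are waiting at the Port of Fos-sur-Mer, officials said"

# Same pattern three times; only the class used for the rest of each name word differs
variants = {
    r"\w": re.compile(r"Port(?:\s+of)?(?:\s+[A-Z]\w+)+"),
    r"[\w-]": re.compile(r"Port(?:\s+of)?(?:\s+[A-Z][\w-]+)+"),
    r"\S": re.compile(r"Port(?:\s+of)?(?:\s+[A-Z]\S+)+"),
}

results = {label: p.search(sentence).group() for label, p in variants.items()}
for label, found in results.items():
    print(f"{label:6} -> {found!r}")
```

With \w the match stops at the first dash ('Port of Fos'), [\w-] keeps the whole hyphenated name, and \S keeps it too but also swallows the trailing comma. So \S is the most permissive, while [\w-] is the tighter option if the sentences contain punctuation right after the name.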

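【Solution 2】:

.split() creates a list in which each item is one word of the string. Then iterate over the list and find the position of "Port". If "Port" is found, a new string is built.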

parts = 'Strong winds may disrupt operations at the Port of Rotterdam on July 5'
words = parts.split()
new_str = None

for i, word in enumerate(words):
    if word == "Port":
        # Slicing avoids an IndexError when fewer than two words follow "Port"
        new_str = " ".join(words[i + 1:i + 3])

if new_str:
    print(new_str)

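【Comments】:

【Solution 3】:

Another option is a pattern that matches Port followed by one "word" made of non-whitespace characters, and optionally a second word in case the second word is not always present.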

\bPort\s+\S+(?:\s+\S+)?

\bPort matches Port preceded by a word boundary
\s+\S+ matches 1+ whitespace characters followed by 1+ non-whitespace characters
(?:\s+\S+)? optionally matches a second word


Example code

import re
pattern = r"\bPort\s+\S+(?:\s+\S+)?"
s = "Strong winds may disrupt operations at the Port of Rotterdam on July 5"
print(re.findall(pattern, s))

Output

['Port of Rotterdam']

【Comments】:

This won't work for sentences like "Strong winds may disrupt operations at Port Said on July 5". Not a requirement of the OP, but still a nice option.
@MadPhysicist Why doesn't it work? regex101.com/r/tQ1Lcm/1 The question states store the next two substrings after 'Port'. So it does work.
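To make the disagreement concrete, this is what the pattern returns for both sentences. Whether the second result counts as "working" depends on whether you want exactly the next two words or just the port name.

```python
import re

pattern = re.compile(r"\bPort\s+\S+(?:\s+\S+)?")

sentences = [
    "Strong winds may disrupt operations at the Port of Rotterdam on July 5",
    "Strong winds may disrupt operations at Port Said on July 5",
]

# findall() returns every non-overlapping match in each sentence
matches = [pattern.findall(s) for s in sentences]
for found in matches:
    print(found)
```

The first sentence gives ['Port of Rotterdam'] and the second gives ['Port Said on'], since the optional group happily takes any second word.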

【Solution 4】:

You can use a list comprehension to grab the following tokens:

l = 'Strong winds may disrupt operations at the Port of Rotterdam on July 5 and Port of London is closed tomorrow'

tokens = l.split()
ports = [' '.join(tokens[i:i+3]) for i in range(len(tokens)) if tokens[i]=='Port']
print(ports)

['Port of Rotterdam', 'Port of London']

The benefit of this approach is that it finds multiple ports in the same sentence.
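Since the question is about a whole list of strings, the same comprehension could be wrapped in a helper and applied to each sentence. This is a rough sketch: ports_in, its following parameter and the extra sample sentences are not from the answer.

```python
def ports_in(sentence, following=2):
    """Return every 'Port' token joined with up to `following` words after it."""
    tokens = sentence.split()
    return [
        " ".join(tokens[i:i + 1 + following])  # slicing simply stops at the end of the list
        for i, token in enumerate(tokens)
        if token == "Port"
    ]


sentences = [
    "Strong winds may disrupt operations at the Port of Rotterdam on July 5 and Port of London is closed tomorrow",
    "Vessels are being diverted to Port Said",
    "No harbour is mentioned here",
]

results = [ports_in(s) for s in sentences]
for found in results:
    print(found)
```

Sentences without 'Port' give an empty list, and a 'Port' near the end of a sentence just yields a shorter string instead of raising an error.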

【Comments】: