【Question title】: How to split a string and match its substrings to a list of substrings? - Python
【Posted】: 2013-03-07 13:27:42
【Question】:

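I need to split a string in all possible ways without changing the order of its characters. I know this task can be seen as tokenization or lemmatization in NLP, but I am approaching it from a pure string-search perspective, which is simpler and more robust. Given: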

dictionary = ['train','station', 'fire', 'a','trainer','in']
str1 = "firetrainstation"

Task 1: How do I generate all possible substrings, so that I get:

all_possible_substrings = [['f','iretrainstation'],
['fi','retrainstation'], ...
['firetrainstatio','n'],
['f','i','retrainstation'], ... , ...
['fire','train','station'], ... , ...
['fire','tr','a','instation'], ... , ...
['fire','tr','a','in','station'], ... , ...
['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']]

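Task 2: Then, from all_possible_substrings, how do I check which set of substrings is made up entirely of elements from the dictionary and report it as the correct output? The desired output is the list of substrings that matches the most characters from the dictionary, from left to right. The desired output satisfies: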

"".join(desire_substring_list) == str1 and \
len([i for i in desire_substring_list if i in dictionary]) == len(desire_substring_list)
#(let's assume, the above condition can be true for any input string since my english
#language dictionary is very big and all my strings are human language 
#just written without spaces)

Desired output:

'fire','train','station'

What have I done?

For Task 1, I have done this, but I know it will not give me all possible whitespace insertions:

all_possible_substrings.append(" ".join(str1))
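For context, `" ".join(str1)` puts a space between every pair of characters, so it produces only the finest split. A string of length n has n - 1 gaps, and each gap is either cut or not, which gives 2 ** (n - 1) splits in total. The following is a minimal sketch of enumerating them by choosing cut positions with `itertools.combinations`; the helper name `all_splits` is an assumption made for illustration and is not part of the original post.

```python
from itertools import combinations

def all_splits(text):
    """Yield every way to cut text into contiguous, non-empty pieces."""
    n = len(text)
    # A split is fully determined by which of the n - 1 gaps receive a cut.
    for k in range(n):  # k = number of cuts, from 0 up to n - 1
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            yield [text[a:b] for a, b in zip(bounds, bounds[1:])]

splits = list(all_splits("fire"))
print(len(splits))  # 8, i.e. 2 ** (4 - 1)
```

The count doubles with every extra character, so for a 16-character string such as "firetrainstation" this already yields 32768 lists. That is workable for short inputs but unlikely to scale to long strings.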

I have also done this, but it only covers Task 2:

import re
seed = ['train','station', 'fire', 'a','trainer','in']
str1 = "firetrainstation"
all_possible_string = [['f','iretrainstation'],
['fi','retrainstation'],
['firetrainstatio','n'],
['f','i','retrainstation'], 
['fire','train','station'], 
['fire','tr','a','instation'], 
['fire','tr','a','in','station'], 
['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']]
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
highest_match = ""
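for parts in all_possible_string:
  # \b matches whole words here because the pieces are joined with spaces
  x = pattern.findall(" ".join(parts))
  # Keep a split only if its dictionary words alone rebuild the whole string
  if "".join(x) == str1 and len([w for w in x if w in seed]) == len(x):
    print(" ".join(x))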

【Comments】:

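  • Note that your dictionary is actually a list.
  • Also, I'm pretty sure you need to explain more. Why is `'foo','bar','bar','str'` the desired output?
  • Updated the desired output.
  • Is it clearer now?
  • How do you get str1 from dictionary? And I may be misunderstanding, but isn't "the list of substrings that matches the most characters from the dictionary, from left to right" always str1 minus the last letter? (Assuming you don't want the whole string.)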

Tags: python string dictionary substring string-matching


【Solution 1】:

For the first part, you could write a recursive generator similar to this:

>>> def all_substr(string):
    for i in range(len(string)):

        if i == len(string) - 1:
            yield string

        first_part = string[0:i+1]
        second_part = string[i+1:]

        for j in all_substr(second_part):
            yield ','.join([first_part, j])


>>> for x in all_substr('apple'):
    print(x)


a,p,p,l,e
a,p,p,le
a,p,pl,e
a,p,ple
a,pp,l,e
a,pp,le
a,ppl,e
a,pple
ap,p,l,e
ap,p,le
ap,pl,e
ap,ple
app,l,e
app,le
appl,e
apple
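For the second part, one possible approach is to skip generating every split and only follow prefixes that are dictionary words. The following is a minimal sketch under that assumption, not a definitive implementation; the name `segmentations` and the use of a `set` for the word list are illustrative choices that do not appear in the question.

```python
def segmentations(text, words):
    """Yield every split of text whose pieces are all members of words."""
    if not text:
        # Nothing left to split: one valid (empty) segmentation.
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in words:
            # Only recurse on the remainder when the prefix is a known word.
            for rest in segmentations(text[end:], words):
                yield [head] + rest

words = {'train', 'station', 'fire', 'a', 'trainer', 'in'}
print(list(segmentations("firetrainstation", words)))
# [['fire', 'train', 'station']]
```

Because branches that start with a non-word are dropped immediately, this visits far fewer candidates than filtering the full list of splits. It can still take exponential time on inputs with many overlapping dictionary words, so memoizing on the remaining suffix may be worth considering for longer strings.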
