Looking for Word Sequences Inside of a String: Answers

【Question title】: Looking for Word Sequences Inside of a String
【Posted】: 2021-05-06 13:44:18
【Question description】:

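I haven't seen anyone solve a similar problem without using regular expressions. I have text = "I have three apples because yesterday I bought three apples", a list of keywords words = ['I have', 'three', 'apples', 'yesterday'], and k = 2. I am writing a function that finds and returns the word sequences made of >= k 'words' (by a word I mean a word combination that is a single element of the list; for example, 'I have' counts as one word).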

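In this case it should return ['I have three apples', 'three apples']. Even though 'yesterday' is in the string, it is a single word on its own, which is fewer than k, so it is not returned.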

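I assume a stack is needed to keep track of the size of the sequence. I started writing the code, but 1) it does not work in this case, because I try checking "I have", then "I have three", and so on, so it cannot recognise "three apples" on its own; 2) I don't know how to proceed. The code is below: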

text = "I have three apples because yesterday I bought three apples"
words = ['I have', 'three', 'apples', 'yesterday']
k = 2

check1 = []


def search(text, words, k):
    for i in words:
        finding = text.count(i)
        if finding != 0:
            check1.append(i)
            check2 = ' '.join(check1)
            
            occurrences = text.count(check2)
            if occurrences > 0:
                #i want to check if the previous number of occurrences was the same 
                #that's why I think I need a stack. if it is, i keep going
                #if it's not, i append the previous phrases to the list if they're >= k and keep 
                #checking
                pass
            else:
                #the next word doesn't belong to the sequence, so we finish the process
                #by adding the right number of word sequences >= k to the resulting list
                pass
        else:
            #the word is not in the list and I need to add check2 to the list 
            #considering all word sequences
            pass

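A different approach to this problem, or any ideas at all, would be much appreciated, because I have been trying to solve it this way and I don't know how to implement it.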

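【Comments on the question】:

Does order matter? For example, what should the result be for "How many apples do I have?"?
@PM77-1 Yes, unfortunately it does. So it can only be 'three apples', not 'apples three'. The result for your example would be an empty list, because it only contains 'apples' and 'I have', each of which is shorter than k.
Sorry. Here is a better example: "Do I only have three apples?".
@PM77-1 Here it would return ['three apples']. If it were "Is three apples all I have three", it would return ['three apples', 'I have three'].
You could turn the word list into a graph of directed edges that indicate which words are allowed to follow which. Then you just traverse this graph according to the input string and record the allowed paths you walk.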
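The graph idea from the last comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the thread: the name find_sequences, the helper match_at, and the edge rule (a keyword may be followed only by keywords that come later in the list) are all hypothetical choices, and the text is assumed to be split on whitespace.

```python
def find_sequences(text, words, k):
    """Return runs of at least k consecutive keywords found in text."""
    tokens = text.split()
    # Pre-split each keyword so multi-word entries such as 'I have' can match.
    keys = [w.split() for w in words]
    # Directed edges (assumption): keyword i may be followed by any keyword j > i.
    followers = {i: range(i + 1, len(keys)) for i in range(len(keys))}

    def match_at(pos, candidates):
        # Index of the first candidate keyword that starts at tokens[pos], else None.
        for j in candidates:
            if keys[j] and tokens[pos:pos + len(keys[j])] == keys[j]:
                return j
        return None

    results = []
    pos = 0
    while pos < len(tokens):
        node = match_at(pos, range(len(keys)))  # any keyword may start a path
        if node is None:
            pos += 1                            # not a keyword: move on one token
            continue
        path = []
        while node is not None:                 # follow edges while the text allows
            path.append(words[node])
            pos += len(keys[node])
            node = match_at(pos, followers[node])
        if len(path) >= k:
            results.append(' '.join(path))
    return results


words = ['I have', 'three', 'apples', 'yesterday']
print(find_sequences("I have three apples because yesterday I bought three apples", words, 2))
# ['I have three apples', 'three apples']
print(find_sequences("Is three apples all I have three", words, 2))
# ['three apples', 'I have three']
```

Because a failed extension does not consume the token that caused it, that token can still start a new path, so repeated keywords such as "I have I have apples" are not lost.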

Tags: python python-3.x string list

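【Solution 1】:

I found a solution that walks through the text and notes down the keywords that appear in the correct order. However, the cost of this algorithm grows quickly with the length of the word list and the length of the text. Depending on the application, you may want to take a different approach: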

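def walk(t, w, k):
    t += ' '             # sentinel space so every word is followed by one
    node = -1            # index in w of the last keyword matched (-1: none yet)
    current = []         # keywords of the sequence being built
    collection = []      # finished sequences with at least k keywords
    while len(t) > 1:
        elong = False    # did this pass extend the current sequence?
        for i in range(len(w)):
            # only keywords listed after the last match may extend the sequence
            if i > node and t[:len(w[i])] == w[i]:
                node = i
                t = t[len(w[i]) + 1:]   # consume the keyword and the space after it
                current.append(w[i])
                elong = True
        if not elong or len(t) < 2:
            # sequence ended (or text exhausted): skip one word and start over
            t = t[t.find(' ') + 1:]
            if len(current) >= k:
                collection.append(' '.join(current))
            current = []
            node = -1
    return collection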

This function handles the cases you mentioned in the question as follows:

#Input:
print(walk("I have three apples because yesterday I bought three apples",
           ['I have', 'three', 'apples', 'yesterday'],
           2))

#Output:
['I have three apples', 'three apples']

#Input:
print(walk("Is three apples all I have three",
           ['I have', 'three', 'apples', 'yesterday'],
           2))

#Output:
['three apples', 'I have three']

It relies heavily on spaces separating the words and does not handle punctuation well. You may want to include some preprocessing.
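One possible preprocessing step is sketched below. The helper name preprocess is hypothetical, and the sketch assumes that ASCII punctuation never matters for matching, so it is simply replaced by spaces before whitespace is collapsed. Letter case is left untouched.

```python
import string


def preprocess(text):
    """Replace ASCII punctuation with spaces and collapse runs of whitespace."""
    # Assumption: punctuation carries no meaning here; note that an apostrophe
    # is punctuation too, so "don't" becomes "don t".
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(text.translate(table).split())


print(preprocess("Is three apples,  all I have? Three!"))
# Is three apples all I have Three
```

The cleaned string can then be passed on as the text argument. If keywords should match regardless of case, lowercasing both the text and the word list would be a further assumption to make explicit.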

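【Comments】:

Thanks for your answer! It helped a lot with my problem. I fixed an issue with repetitions: if the input is text = 'I have apples but no oranges I I I have I have apples and pears', words = ['I have', 'apples', 'and', 'pears'], k = 2, the output is ['I have apples', 'apples and pears'], so it loses the second 'I have'. In short, I fixed it by checking, right after while len(t)>1, that if len(current) == 1 then current[0] is not equal to the next word in t. Thanks again!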
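The commenter's exact change is not shown, so the following is only one possible way to address the same repetition problem, not necessarily what they did. The name walk_retry is hypothetical. The only difference from walk is that a word is skipped only when nothing matched at the current position. When a sequence has just ended, the text is left as it is, so the word that ended it can start a new sequence.

```python
def walk_retry(t, w, k):
    """Variant of walk that re-scans the current position after a sequence ends."""
    t += ' '
    node = -1
    current = []
    collection = []
    while len(t) > 1:
        elong = False
        for i in range(len(w)):
            if i > node and t[:len(w[i])] == w[i]:
                node = i
                t = t[len(w[i]) + 1:]
                current.append(w[i])
                elong = True
        if not elong or len(t) < 2:
            if len(current) >= k:
                collection.append(' '.join(current))
            if not current:
                # Nothing matched at this position: skip one word.
                t = t[t.find(' ') + 1:]
            # Otherwise t is kept, so the word that ended the sequence
            # gets a chance to start a new one.
            current = []
            node = -1
    return collection


print(walk_retry('I have apples but no oranges I I I have I have apples and pears',
                 ['I have', 'apples', 'and', 'pears'],
                 2))
# ['I have apples', 'I have apples and pears']
```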