Removing custom stop words from a phrase in Python

【Question title】: Removing custom stop words from a phrase in Python
【Posted】: 2015-05-30 00:34:45
【Question】:

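I am trying to remove certain phrases and words from user input before processing it further. When I attempt this I keep hitting an "index out of range" error and I am completely stuck. How can I fix it? I take the input phrase as a string, convert it to a list so I can compare each word, and keep the stop words in a predefined list.
Example inputs: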
["well", "you", "know", "that", "weather", "is", "terrible"]
["you", "know", "what", "i", "mean", "so", "just", "turn", "the", "lights", "on"]

#Gets user input and removes the selected stop words from it and returns a filtered phrase back.    
def stop_word_remover(phrase_list):

    stop_words_lst = ["yo", "so", "well", "um", "a", "the","you know", "i mean"]

    #initialize clean phrase string
    clean_input_phrase= ""

    #copying phrase_list into a new variable for stopword removal.
    Copy_phrase_list = list(phrase_list)

    #Cleanup loop

    for i in range(1,len(phrase_list)):
        has_stop_words = False

        for x in range(len(stop_words_lst)):
            has_stop_words = False

            #if the current pair of adjacent words matches one of the stop phrases the flag is raised.
            if (phrase_list[i-1]+" "+phrase_list[i]) == stop_words_lst[x].strip():
                has_stop_words = True

            #if the flag is raised, both words of the matched pair are removed from the copy
            if has_stop_words == True:
                Copy_phrase_list.remove(Copy_phrase_list[i-1])
                Copy_phrase_list.remove(Copy_phrase_list[i-1])

    #this loop takes the individual words of the phrase one at a time until the whole phrase has been processed
    for i in range(len(Copy_phrase_list)):
        #flag initialized for marking stop words
        has_stop_words = False

        #inner loop compares every stop word to the word passed on by the outer loop to check for a stop word
        for x in range(len(stop_words_lst)):
            #if one of the stop words matches the word passed by the outer loop the flag is raised.
            if Copy_phrase_list[i] == stop_words_lst[x].strip():
                has_stop_words = True

        #this if statement adds the word of the phrase only if the flag is not raised, making sure all the stop words are filtered out
        if has_stop_words == False:
            clean_input_phrase += str(Copy_phrase_list[i]) +" "

    return clean_input_phrase

【Comments】:

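Your indentation is wrong. Could you correct it and provide an example input for the function and the expected output?
@Marcin The input to the function can be any kind of phrase or command. It is only meant to strip these words out of the input before further analysis. I did fix the indentation and added some example phrases, though.
I tried your code and did not get any errors. It seems to work for me.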
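It runs but returns the wrong output. Instead of taking "you know what i mean so just turn the lights on", removing "you know", "i mean", "so" and "the", and returning "what just turn lights on", it returns "what i mean turn lights on" @Marcin. It also seems to work for some inputs and not for others. Something like ["you","know","lock","my","computer","yo","man","you","know"] does not seem to run at all.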
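
The failing input in the last comment points at a likely cause: the first loop takes its index `i` from the original `phrase_list` while deleting from `Copy_phrase_list`, so the copy ends up shorter than the range being indexed. A minimal sketch that reproduces this with the loop reduced to its essentials (the helper name `remove_pairs_buggy` is made up for illustration and is not part of the question):

```python
def remove_pairs_buggy(phrase_list, stop_phrases):
    """Reduced version of the question's first loop (hypothetical helper name)."""
    copy = list(phrase_list)
    for i in range(1, len(phrase_list)):
        if phrase_list[i - 1] + " " + phrase_list[i] in stop_phrases:
            # Two deletions shrink `copy`, but `i` still follows the
            # positions of the original, longer list.
            copy.remove(copy[i - 1])
            copy.remove(copy[i - 1])
    return copy


words = ["you", "know", "lock", "my", "computer", "yo", "man", "you", "know"]
try:
    remove_pairs_buggy(words, ["you know", "i mean"])
    outcome = "no error"
except IndexError:
    outcome = "IndexError"
print(outcome)
# prints: IndexError
```

When the index does stay in range, the shifted positions mean the wrong words get deleted, which would explain why "i mean" survives in the other example.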

Tags: python python-2.7 nlp stop-words

【Solution 1】:

Use the regular expression substitution function, `re.sub`, and replace every match with an empty string.

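import re

stop_words_lst = ['yo', 'so', 'well', 'um', 'a', 'the', 'you know', 'i mean']
s = "you know what i mean so just turn the lights on"

# Remove each stop word or phrase as a whole word; re.escape guards
# against entries that contain regex metacharacters.
for w in stop_words_lst:
    pattern = r'\b' + re.escape(w) + r'\b'
    s = re.sub(pattern, '', s)

# Collapse the extra spaces left behind by the removals.
s = ' '.join(s.split())
print(s)
# prints: what just turn lights on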
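
If the stop list is going to grow, one variation on the same idea is to build a single alternation pattern, longest entries first, and apply it in one pass. This is a sketch rather than part of the original answer: the function name `remove_stop_words` is made up here, and it assumes lowercase input with space-separated words.

```python
import re


def remove_stop_words(text, stop_words):
    """Strip every stop word or phrase from text (hypothetical helper)."""
    if not stop_words:
        # Nothing to remove; only normalise the whitespace.
        return " ".join(text.split())
    # Longest entries first, so a phrase wins over any shorter entry
    # that could match at the same position.
    ordered = sorted(stop_words, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b"
    )
    # Remove the matches, then collapse the whitespace they leave behind.
    return " ".join(pattern.sub(" ", text).split())


stop_words_lst = ["yo", "so", "well", "um", "a", "the", "you know", "i mean"]
print(remove_stop_words("you know what i mean so just turn the lights on",
                        stop_words_lst))
# prints: what just turn lights on
```

Because of the `\b` anchors, "yo" does not match inside "you", so new entries can be appended to the list without touching the function.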

【Comments】:

【Solution 2】:

You need to split your stop list in two: one list for single words and another for phrases.

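single_word_lst = ["yo", "so", "well", "um", "a", "the"]
# Named stop_phrase_lst so it does not shadow the function's phrase_list argument.
stop_phrase_lst = ["you know", "i mean"]

kept = []
skip_next = False
for index, word in enumerate(Copy_phrase_list):
    if skip_next:
        # This word is the second half of a phrase that was already dropped.
        skip_next = False
        continue
    # Check the two-word phrase first; the slice is safe on the last word.
    pair = " ".join(Copy_phrase_list[index:index + 2])
    if pair in stop_phrase_lst:
        skip_next = True
    elif word not in single_word_lst:
        kept.append(word)
return " ".join(kept)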

Then you need to join the remaining words back into a string and return it.
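
The same list-based idea can be generalised so that single words and phrases of any length live in one stop list, which would make later additions easier. A minimal sketch under that assumption; `strip_stop_phrases` and its greedy longest-match rule are illustrative choices, not code from this answer:

```python
def strip_stop_phrases(words, stop_entries):
    """Drop stop words and multi-word stop phrases from a list of words.

    Hypothetical helper: each entry is split on whitespace, and at each
    position the longest matching entry is skipped.
    """
    stop_tuples = {tuple(entry.split()) for entry in stop_entries}
    max_len = max([len(t) for t in stop_tuples] or [0])
    kept = []
    i = 0
    while i < len(words):
        # Try the longest window first, down to a single word.
        for size in range(min(max_len, len(words) - i), 0, -1):
            if tuple(words[i:i + size]) in stop_tuples:
                i += size  # skip the whole matched entry
                break
        else:
            kept.append(words[i])  # no entry starts here, keep the word
            i += 1
    return " ".join(kept)


stop_words_lst = ["yo", "so", "well", "um", "a", "the", "you know", "i mean"]
print(strip_stop_phrases(
    ["you", "know", "lock", "my", "computer", "yo", "man", "you", "know"],
    stop_words_lst))
# prints: lock my computer man
```

Building a new list instead of deleting from the one being scanned avoids the index drift that causes the "index out of range" error in the question.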

【Comments】:

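Remove all of your for loops and add the loop from this answer instead. That should do it.
It does not seem to remove two-word phrases such as "you know" and "i mean" @geekpradd
Okay, I did not see that... let me hack something together. Give me a moment.
If you can come up with something or improve mine, that would be great. The goal is for me to be able to add more multi-word phrases or single words to the stop word list as I revise this project, and have the code still remove them from the input the user gives.
Let us continue this discussion in chat.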