Stop words are not being removed using Python (answers)

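【Question title】: Stop words are not being removed using python
【Posted】: 2020-05-25 03:18:54
【Question description】:

I am trying to remove stop words from a list of tokens that I have. However, the words do not seem to be removed. What could be the problem? Thanks.

What I tried: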

from nltk.corpus import stopwords
from stop_words import get_stop_words

Trans = []
with open('data.txt', 'r') as myfile:
    file = myfile.read()
    # rewind to the start of the file before iterating over its lines
    myfile.seek(0)
    for row in myfile:
        split = row.split()
        Trans.append(split)
    myfile.close()

stop_words = list(get_stop_words('en'))
nltk_words = list(stopwords.words('english'))
stop_words.extend(nltk_words)

output = [w for w in Trans if not w in stop_words]


Input:

[['Apparent', 'magnitude', 'is', 'a', 'measure', 'of', 'the',
  'brightness', 'of', 'a', 'star', 'or', 'other']]

Output:

It returns the same words as the input.

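【Question comments】:

This is probably related to the double brackets in your input. The first and only element of Trans is itself a list of words. A list is never equal to any string in stop_words, so the condition in the list comprehension is always true and nothing is filtered out.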
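A quick way to see this is to test the membership check directly. The sketch below uses a small hand-written stop-word list as a stand-in for the combined stop_words list from the question (an assumption, so the example runs without any downloads):

```python
# Small stand-in for the combined stop-word list (assumption for illustration).
stop_words = ['is', 'a', 'of', 'the', 'or']

# Same shape as in the question: one inner list per line of the file.
Trans = [['Apparent', 'magnitude', 'is', 'a', 'measure']]

# Each w is a whole list, and a list never equals any string in stop_words.
checks = [w in stop_words for w in Trans]
print(checks)  # [False]

# So the filter keeps every inner list unchanged.
output = [w for w in Trans if w not in stop_words]
print(output == Trans)  # True
```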

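Tags: python nlp stop-words

【Solution 1】:

I think Trans.append(split) should be Trans.extend(split), because split() returns a list. With append, the whole list is stored as a single element; with extend, each word is added to Trans individually.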
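To illustrate the difference, here is a minimal sketch. The sample sentence and the tiny stop-word set are assumptions made for the example, not the data or the lists from the question:

```python
row_tokens = 'Apparent magnitude is a measure'.split()

appended = []
appended.append(row_tokens)   # stores the whole list as one element

extended = []
extended.extend(row_tokens)   # stores each token as its own element

print(appended)  # [['Apparent', 'magnitude', 'is', 'a', 'measure']]
print(extended)  # ['Apparent', 'magnitude', 'is', 'a', 'measure']

# With a flat list, the original comprehension compares strings to strings.
stop_words = {'is', 'a', 'of', 'the', 'or'}  # stand-in set (assumption)
filtered = [w for w in extended if w not in stop_words]
print(filtered)  # ['Apparent', 'magnitude', 'measure']
```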

【Comments】:

【Solution 2】:

For better readability, create a function. For example:

import string
from nltk.corpus import stopwords

def drop_stopwords(row):
    stop_words = set(stopwords.words('english'))
    return [word for word in row if word not in stop_words and word not in set(string.punctuation)]

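Also, with open() does not need close(). Build a list of strings (sentences) and apply the function to each one. For example, assuming Trans is a pandas Series of sentence strings, split each sentence into words first so that the function receives a list of tokens:

Trans = Trans.str.split().apply(drop_stopwords)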

This is applied to every sentence. Other functions can be added for lemmatization and so on; there is a very clear example (code) here: https://github.com/SamLevinSE/job_recommender_with_NLP/blob/master/job_recommender_data_mining_JOBS.ipynb
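If pandas is not in use, the same idea can be sketched with plain lists. In the example below the stop-word set is a small hand-written stand-in (an assumption; NLTK's set(stopwords.words('english')) could be passed instead), and the stop_words parameter and sample sentences are added for illustration only:

```python
import string

# Stand-in stop-word set (assumption); in practice this could be
# set(stopwords.words('english')) from NLTK.
STOP_WORDS = {'is', 'a', 'of', 'the', 'or'}
PUNCTUATION = set(string.punctuation)

def drop_stopwords(row, stop_words=STOP_WORDS):
    """Return the tokens of one row that are neither stop words nor punctuation."""
    return [word for word in row
            if word not in stop_words and word not in PUNCTUATION]

sentences = [
    'Apparent magnitude is a measure',
    'of the brightness of a star , or other',
]

# Tokenize each sentence on whitespace, then filter it row by row.
cleaned = [drop_stopwords(sentence.split()) for sentence in sentences]
print(cleaned)
# [['Apparent', 'magnitude', 'measure'], ['brightness', 'star', 'other']]
```

Unlike flattening everything into one list, this keeps one result list per sentence.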

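【Comments】:

【Solution 3】:

Because the input is a list of lists, you need to iterate over the outer list and then over the elements of each inner list. After that you get the correct output: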

output = [j for w in Trans for j in w if j not in stop_words]
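As a worked example, here is that comprehension run on the input from the question. The stop-word set is a small stand-in (an assumption) rather than the combined lists from the question, and the second comprehension is an optional variation, not part of the answer above:

```python
# Stand-in for the combined stop-word list (assumption for illustration).
stop_words = {'is', 'a', 'of', 'the', 'or'}

Trans = [['Apparent', 'magnitude', 'is', 'a', 'measure', 'of', 'the',
          'brightness', 'of', 'a', 'star', 'or', 'other']]

# Outer loop walks the rows, inner loop walks the words of each row.
output = [j for w in Trans for j in w if j not in stop_words]
print(output)
# ['Apparent', 'magnitude', 'measure', 'brightness', 'star', 'other']

# Optional: compare in lowercase so capitalized stop words such as 'The'
# are dropped too (NLTK's English list is lowercase).
mixed = [['The', 'star', 'is', 'bright']]
output_ci = [j for w in mixed for j in w if j.lower() not in stop_words]
print(output_ci)  # ['star', 'bright']
```

Using a set instead of a list for stop_words also makes each membership test faster, which may matter on larger files.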

【Comments】: