TypeError: unhashable type: 'list' when using a Python set of strings

【Question title】: TypeError: unhashable type: 'list' when using a Python set of strings
【Posted】: 2018-06-07 12:06:45
【Question】:

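I know there are several very similar answers here about this exact problem, but none of them really answers my question.

I am trying to remove a set of stop words and punctuation from a list of words in order to do some basic natural language processing.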

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation

text = "Hello there. I am currently typing Python. "
custom_stopwords = set(stopwords.words('english') + list(punctuation))

# tokenize the text into sentences
sentences = sent_tokenize(text)

# tokenize each sentence into a list of words
words = [word_tokenize(sentence) for sentence in sentences]
filtered_words = [word for word in words if word not in custom_stopwords]
print(filtered_words)

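This raises TypeError: unhashable type: 'list' on the filtered_words line. Why is this error thrown? I am not providing a list as the collection at all; I am providing a set.

Note: I have read the SO post on this exact error, but I still have the same question. The accepted answer gives this explanation:

Sets require their items to be hashable. Out of the types predefined by Python, only the immutable ones, such as strings, numbers, and tuples, are hashable. Mutable types, such as lists and dicts, are not hashable because a change of their contents would change the hash and break the lookup code.

I am providing a set of strings here, so why is Python still complaining?

EDIT: After reading more of this SO post (which suggests using tuples), I edited my collection object: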

custom_stopwords = tuple(stopwords.words('english'))

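I also realized that I had to flatten my list, since calling word_tokenize(sentence) for each sentence creates a list of lists, and punctuation would not be filtered out properly (a list object will never be in custom_stopwords, which is a collection of strings).

However, this still raises the question: why does Python consider tuples hashable but not sets of strings? And why does the TypeError say list?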

【Comments】:

Try this post.

Tags: python list nltk

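【Solution 1】:

words is a list of lists, because word_tokenize() returns a list of words.

When you run [word for word in words if word not in custom_stopwords], each word is actually of type list. To evaluate the set-membership test word not in custom_stopwords, Python has to hash word, and that fails because lists are mutable containers and are not hashable in Python. The set of strings itself is fine; the error refers to the list being looked up in it.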
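The failure can be reproduced without NLTK. The sketch below is only illustrative: the nested token lists and the small stopword set are made-up stand-ins for the output of word_tokenize() and for custom_stopwords, and membership_error is a hypothetical helper. It also suggests why switching to a tuple makes the error go away: a tuple is searched element by element using equality, so the item being looked up is never hashed.

```python
# Hypothetical stand-ins for the word_tokenize() output and custom_stopwords.
words = [["Hello", "there", "."], ["I", "am", "currently", "typing", "Python", "."]]
stopword_set = {"i", "am", "there", "."}
stopword_tuple = tuple(stopword_set)


def membership_error(item, container):
    """Return the name of the exception raised by `item in container`, or None."""
    try:
        item in container
    except TypeError as exc:
        return type(exc).__name__
    return None


# A string is hashable, so looking it up in a set works.
string_result = membership_error("there", stopword_set)

# A list is not hashable, so the set lookup raises TypeError.
list_in_set_result = membership_error(words[0], stopword_set)

# A tuple is scanned with ==, so nothing is hashed and nothing is raised;
# the list simply never compares equal to any of the strings.
list_in_tuple_result = membership_error(words[0], stopword_tuple)
list_found_in_tuple = words[0] in stopword_tuple

print(string_result, list_in_set_result, list_in_tuple_result, list_found_in_tuple)
```

With the tuple, the code runs but filters nothing, because each item being tested is still a whole sentence's token list rather than a single word.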

Posts explaining what "hashable" means, and why mutable containers are not hashable, may help here.
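As a rough illustration of what "hashable" means in practice, the sketch below calls the built-in hash() on a few built-in values. The is_hashable helper and the sample values are assumptions made for this example, not part of the original answer.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


# Sample values chosen for illustration only.
examples = {
    "str": "python",
    "int": 42,
    "tuple of str": ("a", "b"),
    "frozenset": frozenset({"a", "b"}),
    "list": ["a", "b"],
    "dict": {"a": 1},
    "set": {"a", "b"},
    "tuple holding a list": (["a"], "b"),
}

results = {name: is_hashable(value) for name, value in examples.items()}
for name, hashable in results.items():
    print(f"{name}: {hashable}")
```

Note that a set is itself unhashable, but that only matters when a set is used as an element of another set or as a dict key. In the question the set is the container being searched, so only the hashability of the item on the left-hand side of `in` is relevant.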

【Discussion】:

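Got it. It turns out the list needs to be flattened:

words = [word_tokenize(sentence) for sentence in sentences]
flattened_words = [item for sublist in words for item in sublist]
filtered_words = [word for word in flattened_words if word not in custom_stopwords]

This turns words into a list of strings instead of a list of lists. Once I flattened the list into a list of strings, I could use whatever collection I wanted for the filtering (set, tuple, etc.).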
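The flattening step can also be sketched with itertools.chain.from_iterable, which is equivalent to the nested comprehension. The token lists and the three-word stopword list below are hypothetical stand-ins so the example runs without NLTK. The .lower() call is an added assumption rather than part of the original code: NLTK's English stopwords are lowercase, so a capitalized token such as "I" would otherwise pass through the filter.

```python
from itertools import chain
from string import punctuation

# Hypothetical stand-in for [word_tokenize(s) for s in sentences].
words = [["Hello", "there", "."], ["I", "am", "currently", "typing", "Python", "."]]

# Hypothetical short stopword list in place of stopwords.words('english').
custom_stopwords = set(["i", "am", "there"] + list(punctuation))

# Flatten the list of lists into a single list of string tokens.
flattened_words = list(chain.from_iterable(words))

# Each token is now a str, which is hashable, so the set lookup works.
filtered_words = [w for w in flattened_words if w.lower() not in custom_stopwords]

print(filtered_words)  # ['Hello', 'currently', 'typing', 'Python']
```

Keeping custom_stopwords as a set rather than a tuple means each lookup is a hash lookup instead of a linear scan, which matters more as the stopword list grows.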