Unicode warning when using NLTK stopwords with scikit-learn's TfidfVectorizer

【Question title】: Unicode Warning when using NLTK stopwords with TfidfVectorizer of scikit-learn
【Posted】: 2014-10-16 02:49:58
【Question】:

I am trying to use the TfidfVectorizer from scikit-learn with the Spanish stopwords from NLTK:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words=stopwords.words("spanish"))

The problem is that I get the following warning:

/home/---/.virtualenvs/thesis/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py:122: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
tokens = [w for w in tokens if w not in stop_words]

Is there a simple way to fix this?

【Question comments】:

Tags: python python-2.7 unicode scikit-learn nltk

【Solution 1】:

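This turned out to be easier to solve than I expected. The problem is that NLTK returns str objects (byte strings in Python 2) rather than unicode objects, so I needed to decode them from UTF-8 before using them: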

# Use a new name so the imported `stopwords` module is not overwritten
spanish_stopwords = [word.decode('utf-8') for word in stopwords.words('spanish')]
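
The same idea can be sketched in a form that also runs on Python 3, where str is already unicode: decode only the items that are byte strings. The helper name to_unicode and the hard-coded sample list are assumptions for illustration and are not part of NLTK or scikit-learn; in practice the input would be the list returned by stopwords.words('spanish').

```python
def to_unicode(words, encoding="utf-8"):
    """Return a list of unicode strings, decoding any byte strings.

    Hypothetical helper for illustration; not part of NLTK or scikit-learn.
    """
    return [w.decode(encoding) if isinstance(w, bytes) else w for w in words]


# Stand-in for the UTF-8 encoded byte strings NLTK returned under Python 2.
raw_stopwords = [b"de", b"tambi\xc3\xa9n", b"est\xc3\xa1"]

spanish_stopwords = to_unicode(raw_stopwords)
print(spanish_stopwords)
```

The resulting list can then be passed to the vectorizer through its stop_words argument, as in the question.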

【Comments】:

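You can't decode *to* utf-8; there is no such thing. You can encode a string that has a unicode representation to utf-8, or decode a utf-8 encoded string to its unicode representation.
@oztalha I fixed the wording here; it is correct now.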
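
The direction of the two operations discussed in the comments above can be illustrated with a short sketch. The sample word is an assumption chosen for illustration; any non-ASCII Spanish stopword would behave the same way.

```python
text = u"tambi\u00e9n"               # unicode string, 7 characters

encoded = text.encode("utf-8")       # unicode -> UTF-8 bytes (8 bytes)
decoded = encoded.decode("utf-8")    # UTF-8 bytes -> unicode

print(repr(encoded))
print(len(text), len(encoded))
print(decoded == text)
```

Under Python 2, comparing a byte string such as encoded directly with a unicode string such as text is the kind of comparison that produces the UnicodeWarning shown in the question, which is why the stopwords need to be decoded first.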