CountVectorizer() in Python returns all zeros when I pass a custom vocabulary list

【Question title】: CountVectorizer() in Python is returning all zeros when I pass a custom vocabulary list
【Posted】: 2021-12-24 11:42:22
【Question】:

from sklearn.feature_extraction.text import CountVectorizer

foo = ["the Cat is :", "is smart now"]
cv = CountVectorizer(vocabulary = foo)
new_list =["the Cat is : the most","is smart now"]
data = cv.fit_transform(new_list).toarray()
print(data)

The code returns the following:

[[0 0]
[0 0]]

But I expect it to return:

[[1 0]
[0 1]]

I tried adjusting the parameters passed to CountVectorizer(), but nothing seems to fix it. Any suggestions?

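【Comments on the question】:

Does this help? stackoverflow.com/questions/55573279/… - note how they define their vocabulary; it needs to be a dictionary, according to the docs.
From my reading of the docs, both vocab = ["the"] and vocab = {'the':0} work; I tested the code. However, if I add a space inside the string, as in vocab = ["the cat"] or vocab = {'the cat':0}, I get all zeros.
You're right, I read it as "callable" rather than "iterable", my apologies. As for the rest, it is caused by tokenization, as described in the first link. Tokenizing the strings gives [['the', 'cat', 'is', 'the', 'most'], ['is', 'smart', 'now']]. Your tokenizer must return the tokens you want in order to match the vocabulary. The link shows how to define your own tokenizer to do this. Otherwise, you may want to reconsider why you are using CountVectorizer for this task and whether it is the best or most suitable approach.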
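To see the tokenization described in the comment above, one can inspect the analyzer that CountVectorizer builds. This is a minimal sketch assuming default settings (lowercasing, unigrams only, word tokens of two or more characters); the names `docs`, `tokens` and `matches` are illustrative and not from the thread.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Analyzer with default settings: lowercases the text, keeps only word
# tokens of two or more characters, and emits unigrams.
analyzer = CountVectorizer().build_analyzer()

docs = ["the Cat is : the most", "is smart now"]
tokens = [analyzer(doc) for doc in docs]
print(tokens)

# No single token contains a space, so a multi-word vocabulary entry such
# as "the Cat is :" cannot be matched by any of these tokens.
vocab = ["the Cat is :", "is smart now"]
matches = [[tok for tok in doc_tokens if tok in vocab] for doc_tokens in tokens]
print(matches)
```

Because every vocabulary entry in the question contains spaces (and one contains uppercase letters and punctuation), none of the generated tokens can equal an entry, which is consistent with the all-zero matrix.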

Tags: python scikit-learn countvectorizer

【Solution 1】:

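from sklearn.feature_extraction.text import CountVectorizer

def my_preprocessor(text):
    # Identity preprocessor: bypasses the default lowercasing.
    return text

corpus = ["The cat is ! the most", "is smart now"]
vocab = ["The cat is !", "the most"]
# ngram_range=(1, 4) lets the multi-word vocabulary entries match;
# token_pattern keeps punctuation such as "!" as a token.
vectorizer = CountVectorizer(
    vocabulary=vocab, ngram_range=(1, 4), preprocessor=my_preprocessor,
    token_pattern=r"[a-zA-Z0-9$&+,:;=?@#|<>.^*()%!-]+",
)
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(vectorizer.get_feature_names_out())
print(X.toarray())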

I bypassed the preprocessor and set token_pattern to solve the problem. Thanks, fam-woodpecker.
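The same idea can be applied to the data from the question. The sketch below is one possible variant, not the answer's exact code. It assumes that disabling lowercasing with `lowercase=False` (instead of an identity preprocessor) and splitting on whitespace with `token_pattern=r"\S+"` are acceptable for the use case.

```python
from sklearn.feature_extraction.text import CountVectorizer

vocab = ["the Cat is :", "is smart now"]
docs = ["the Cat is : the most", "is smart now"]

cv = CountVectorizer(
    vocabulary=vocab,
    lowercase=False,       # assumption: keep "Cat" capitalised so it matches the vocabulary
    token_pattern=r"\S+",  # assumption: any run of non-whitespace is a token, so ":" survives
    ngram_range=(1, 4),    # the longest vocabulary entry is four tokens long
)
data = cv.fit_transform(docs).toarray()
print(data)
```

With these settings each vocabulary entry appears as an n-gram of exactly one document, so the result should be the matrix the question asked for, [[1 0] [0 1]]. The upper bound of ngram_range needs to be at least the number of tokens in the longest vocabulary entry, otherwise that entry is never generated.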

【Comments】: