How can I use sklearn CountVectorizer with multiple strings? Answers

【Question title】: How can I use sklearn CountVectorizer with multiple strings?
【Posted】: 2017-04-27 18:57:41
【Question description】:

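I have a list of strings (10,000 of them). Some of the strings consist of multiple words. I have another list containing sentences. I am trying to count how many times each string from the first list appears in each sentence.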

At the moment I am using sklearn's feature extraction tools, because they run very fast when looking up 10,000 strings in 10,000 sentences.

Here is a simplified version of my code.

import numpy as np
from sklearn import feature_extraction

sentences = ["hi brown cow", "red ants", "fierce fish"]

listOfStrings = ["brown cow", "ants", "fish"]

cv = feature_extraction.text.CountVectorizer(vocabulary=listOfStrings)
taggedSentences = cv.fit_transform(sentences).toarray()

taggedSentencesCutDown = taggedSentences > 0
# Here we get an array of tuples <sentenceIndex, stringIndexfromStringList>
taggedSentencesCutDown = np.column_stack(np.where(taggedSentencesCutDown))

If you run it as is, the output is:

In [2]: taggedSentencesCutDown
Out[2]: array([[1, 1], [2, 2]])

What I want is:

In [2]: taggedSentencesCutDown
Out[2]: array([[0,0], [1, 1], [2, 2]])

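The way I am currently using CountVectorizer, it does not look for multi-word strings. Is there another way to do this without resorting to long for loops? Efficiency and run time matter a lot for my application, because my lists have more than 10,000 entries.

Thanks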

【Question comments】:

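Take a look at the analyzer parameter. I don't think you can do what you want with sklearn's CountVectorizer out of the box, because it only supports word or character n-grams, not arbitrary multi-word strings. You can override this by passing your own callable, but then for each sentence you have to return not only the individual words but also all the combinations of them. Unless you have more specific constraints on the number of words or characters in the entries of listOfStrings, the problem will not be solved quickly.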
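The callable-analyzer idea from this comment could look roughly like the sketch below. The helper name contiguous_word_ngrams and its max_n parameter are made up for illustration, and the sketch assumes the phrases only need to match contiguous runs of whitespace-separated, lowercased words, which is narrower than "all combinations".

```python
from sklearn.feature_extraction.text import CountVectorizer


def contiguous_word_ngrams(doc, max_n=2):
    """Hypothetical analyzer: return every run of 1..max_n adjacent words."""
    tokens = doc.lower().split()  # assumption: plain whitespace tokenization
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


sentences = ["hi brown cow", "red ants", "fierce fish"]
listOfStrings = ["brown cow", "ants", "fish"]

# A callable analyzer replaces the built-in preprocessing and tokenization.
cv = CountVectorizer(vocabulary=listOfStrings, analyzer=contiguous_word_ngrams)
counts = cv.fit_transform(sentences).toarray()
```

Each row of counts corresponds to a sentence and each column to an entry of listOfStrings, in the order given.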

Tags: python numpy scikit-learn nltk

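【Solution 1】:

I managed to solve this by using the ngram_range parameter of CountVectorizer.

If I find the largest number of words in any single string in my word list, I can use that as the upper bound of the n-gram range. In the example above, that string is "brown cow", which has two words.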

cv = feature_extraction.text.CountVectorizer(vocabulary=listOfStrings,
       ngram_range=(1, 2))
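Putting this together with the code from the question, a minimal end-to-end sketch might look like the following. The names max_words and pairs are introduced here for illustration. It assumes the entries of listOfStrings are lowercase and separated by single spaces, so that they match the n-grams that CountVectorizer's default tokenizer produces. Reading the nonzero positions directly from the sparse matrix avoids building a dense 10,000 x 10,000 array.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["hi brown cow", "red ants", "fierce fish"]
listOfStrings = ["brown cow", "ants", "fish"]

# Upper bound of the n-gram range: the longest entry, counted in words.
max_words = max(len(s.split()) for s in listOfStrings)

cv = CountVectorizer(vocabulary=listOfStrings, ngram_range=(1, max_words))
counts = cv.fit_transform(sentences)  # sparse matrix: sentences x strings

# (sentenceIndex, stringIndex) pairs for every string found in a sentence.
rows, cols = counts.nonzero()
pairs = np.column_stack((rows, cols))
```

For the three sample sentences, pairs holds the (0, 0), (1, 1) and (2, 2) entries that the question asks for.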

【Comments】: