Scikit Learn Count Vectorizer does not find all tokens

[Question title]: Scikit Learn Count Vectorizer does not find all tokens
[Posted]: 2019-01-23 05:24:08
[Question description]:

I have a dataset of 129,013 files and want to encode them line by line, i.e. each line that occurs is one token. I used CountVectorizer from scikit-learn with

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(input='filename', token_pattern='.+')
dtm = vec.fit_transform(all_paths)  # all_paths is a list of all file paths
print(dtm.shape)  # (129013, 541107)

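The research paper that the dataset comes from mentions 545,333 distinct tokens, so my tokenizer is not capturing everything. To check that my dataset is complete, I ran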

for f in *; do cat "$f"; done | sort | uniq | wc -l
545333

in a bash shell, which shows that everything is there. What am I missing here?

[Question discussion]:

Tags: python scikit-learn countvectorizer

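[Solution 1]:

In case anyone else runs into a similar problem: CountVectorizer has a lowercase parameter that defaults to True, so lines that differ only in letter case are merged into a single token. Using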

vec = CountVectorizer(input='filename', token_pattern='.+', lowercase=False)

solved the problem.
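
The effect of the lowercase parameter can be reproduced on a small scale. The following is a minimal sketch, not the original poster's code: the two sample documents are hypothetical, and they are passed as strings (the default input='content') rather than as file names, so that the example is self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents; each line is meant to be one token.
docs = ["Open File\nopen file\n", "OPEN FILE\nclose file\n"]


def vocabulary_size(lowercase):
    """Fit a line-level CountVectorizer and return its vocabulary size."""
    # '.' does not match a newline, so '.+' yields one token per non-empty line.
    vec = CountVectorizer(token_pattern='.+', lowercase=lowercase)
    vec.fit(docs)
    return len(vec.vocabulary_)


# Default behaviour: text is lowercased before tokenizing,
# so the three spellings of "open file" collapse into one token.
size_default = vocabulary_size(True)

# Case-sensitive behaviour: every distinct line is its own token.
size_case_sensitive = vocabulary_size(False)

print(size_default, size_case_sensitive)  # 2 4
```

The gap between the two sizes here plays the same role as the gap between 541,107 and 545,333 in the question: the shell pipeline with sort and uniq compares lines case-sensitively, while the default vectorizer does not.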

[Discussion]:

>>>>Also mention the token_pattern you are using, since that is not the CountVectorizer default either.
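
To illustrate the point about token_pattern, here is a small sketch that applies both patterns with Python's re module instead of going through CountVectorizer. The default pattern string is the one given in the scikit-learn documentation; the sample text is a hypothetical file content.

```python
import re

# Default token_pattern documented for CountVectorizer:
# runs of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Pattern from the question: one token per non-empty line.
LINE_PATTERN = r".+"

# Hypothetical file content with two lines.
sample = "open /tmp/a.txt\nclose /tmp/a.txt"

word_tokens = re.findall(DEFAULT_PATTERN, sample)
line_tokens = re.findall(LINE_PATTERN, sample)

print(word_tokens)  # ['open', 'tmp', 'txt', 'close', 'tmp', 'txt']
print(line_tokens)  # ['open /tmp/a.txt', 'close /tmp/a.txt']
```

With the default pattern the lines are split into words and the single-character "a" is dropped, so line-level encoding needs both token_pattern='.+' and lowercase=False.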