Prepare a dataset for LDA topic models using CountVectorizer

【Question title】: Prepare dataset for the LDA topic models using CountVectorizer
【Posted】: 2018-04-08 16:05:44
【Question description】:

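I want to use CountVectorizer from scikit-learn to create a matrix that an LDA model can consume. However, my dataset is a sequence of coded terms, for example in the following form: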

(1-2252, 5-5588, 10-5478, 2-9632 ....)

How can I tell CountVectorizer to treat each pair (e.g. 1-2252) as a single word?

【Question discussion】:

Tags: python scikit-learn lda topic-modeling countvectorizer

【Solution 1】:

Fortunately, I found a helpful blog post that gave me the answer.

I tokenize the text with the following function, which splits only on commas so that each pair stays intact:

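import re

# Split on a comma followed by optional whitespace.
REGEX = re.compile(r",\s*")

def tokenize(text):
    # Each comma-separated item (e.g. "1-2252") becomes one token.
    return [tok.strip().lower() for tok in REGEX.split(text)]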

I then pass the tokenizer to CountVectorizer, as follows:

from sklearn.feature_extraction.text import CountVectorizer
tf = CountVectorizer(tokenizer=tokenize)
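
For context, CountVectorizer's default token_pattern is r"(?u)\b\w\w+\b", which treats the hyphen as a separator, so 1-2252 would be reduced to 2252. The custom tokenizer avoids that. Below is a minimal end-to-end sketch of how the pieces might fit together. The three documents, the choice of n_components=2, and the variable names are illustrative assumptions, not part of the original answer, and the sketch assumes each document is a comma-separated string without the surrounding parentheses.

```python
import re

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

REGEX = re.compile(r",\s*")

def tokenize(text):
    # Split on commas only, so a coded pair such as "1-2252" stays one token.
    return [tok.strip().lower() for tok in REGEX.split(text)]

# Hypothetical documents, each a comma-separated list of coded terms.
docs = [
    "1-2252, 5-5588, 10-5478, 2-9632",
    "1-2252, 2-9632, 2-9632",
    "5-5588, 10-5478, 7-1111",
]

# Build the document-term count matrix using the custom tokenizer.
# (Recent scikit-learn versions may warn that token_pattern is unused.)
vectorizer = CountVectorizer(tokenizer=tokenize)
X = vectorizer.fit_transform(docs)

# Vocabulary terms listed in column order.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Fit LDA on the counts; n_components=2 is an arbitrary choice for this sketch.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

print(vocab)
print(X.toarray())
print(doc_topics.shape)
```

Each column of X corresponds to one whole coded pair, which is the form of count matrix that LatentDirichletAllocation expects as input.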

【Discussion】: