Faster sklearn tf-idf vectorizer: answers

【Question title】: Faster sklearn tf-idf vectorizer
【Posted】: 2023-03-17 11:52:01
【Question】:

I am trying to use sklearn's TfidfVectorizer in a project, but the TfidfVectorizer seems to take a very long time...

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize_spacy(sentence):
    nlp = spacy.load('ja_core_news_lg')
    doc = nlp(sentence)
    return [w.text for w in doc]

def read_corpus(filename):
    corpus = []
    with open(filename, 'r', encoding='utf-8') as fin:
        for line in fin:
            line = line.rstrip('\n')
            corpus.append(line)
    return corpus

vectorizer = TfidfVectorizer(tokenizer=tokenize_spacy, ngram_range=(1, 4), stop_words=stop_words)
corpus = read_corpus(args.corpus)
matrix = vectorizer.fit_transform(corpus)

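The model 'ja_core_news_lg' is the spaCy Japanese pipeline, the corpus file is 2.7GB, and stop_words is an array with fewer than 100 entries. The vectorizer has been running for more than 48 hours, so I would like to know whether there is a more efficient way to fit the vectorizer, or whether there is a faster replacement.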

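I have 56 CPUs, but the program seems to run on only one of them. I have seen an answer suggesting HashingVectorizer, but since I need to call vectorizer.get_feature_names() afterwards, HashingVectorizer does not seem to fit my case.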

Any help would be appreciated, thanks a lot!

【Comments on the question】:

Tags: python machine-learning scikit-learn nlp spacy

【Solution 1】:

The vectorizer is not the problem; it is the tokenizer that is slowing you down.

Your tokenize function reloads the spaCy model for every single document, which obviously takes a lot of time. Instead, try loading the spaCy model only once:

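import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the spaCy model once at module level so every tokenizer call reuses it
nlp = spacy.load('ja_core_news_lg')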

def tokenize_spacy(sentence):
    doc = nlp(sentence)
    return [w.text for w in doc]

def read_corpus(filename):
    corpus = []
    with open(filename, 'r', encoding='utf-8') as fin:
        for line in fin:
            line = line.rstrip('\n')
            corpus.append(line)
    return corpus

vectorizer = TfidfVectorizer(tokenizer=tokenize_spacy, ngram_range=(1, 4), stop_words=stop_words)
corpus = read_corpus(args.corpus)
matrix = vectorizer.fit_transform(corpus)

Does that make a difference?
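
To make the difference concrete without needing the real model, here is a minimal sketch that only counts how often a loader gets called. `CountingLoader` is a hypothetical stand-in for `spacy.load(...)` (it just returns `str.split` as a fake tokenizer), and the two factory functions are illustrative names, not part of spaCy or sklearn.

```python
class CountingLoader:
    """Hypothetical stand-in for spacy.load(...): counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        # A real loader would read a large model from disk here.
        return str.split


def make_reloading_tokenizer(loader):
    """Mirrors the code in the question: the model is loaded on every call."""
    def tokenize(sentence):
        model = loader()
        return model(sentence)
    return tokenize


def make_cached_tokenizer(loader):
    """Mirrors the fix above: the model is loaded once and then reused."""
    model = loader()
    def tokenize(sentence):
        return model(sentence)
    return tokenize


sentences = ["a b", "c d", "e f"]

slow_loader = CountingLoader()
slow_tokenize = make_reloading_tokenizer(slow_loader)
for s in sentences:
    slow_tokenize(s)

fast_loader = CountingLoader()
fast_tokenize = make_cached_tokenizer(fast_loader)
for s in sentences:
    fast_tokenize(s)

print(slow_loader.calls, fast_loader.calls)
```

With the reloading version the number of loads grows with the number of documents, while the cached version loads exactly once no matter how large the corpus is.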

【Discussion】:

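Sorry, I should have mentioned this earlier: I already tried this approach, but it did not help :( It also ran for more than 45 hours.
Are you sure? I tested both versions with 500 dummy sentences: mine took 2.8 seconds, yours took at least 10 minutes (and was still running).
Yes :( I am quite sure! Given that my data has 60 million lines, your approach has already been running on my machine for more than 70 hours...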