【Posted】: 2021-05-29 11:13:14
【Question】:
I am following the documentation example here:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
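Suppose I already have a term-count matrix like the one returned by X.toarray(), but I did not use CountVectorizer to obtain it.
I want to apply TF-IDF to this matrix. Is there a way to take a count array plus a vocabulary and pass them to some kind of inverse constructor, so that I end up with the same object as the fit_transformed X?
I am looking for something like...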
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
>>> V = CountVectorizerConstructorPrime(array=(X.toarray()),
...     vocabulary=['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this'])
such that:
>>> V == X
True
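For context, here is a minimal sketch of one way this might be approached, assuming the goal is only to get TF-IDF weights from counts that already exist. As far as I know there is no CountVectorizerConstructorPrime in scikit-learn (the name above is hypothetical), but TfidfTransformer accepts a count matrix directly, so no fitted CountVectorizer is needed. The names `counts`, `vocabulary`, and `X_from_counts` are illustrative, and the comparison against TfidfVectorizer is just a sanity check that assumes default settings on both sides.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Counts and vocabulary obtained by some other means (illustrative data,
# copied from the output shown in the question).
vocabulary = ['and', 'document', 'first', 'is', 'one',
              'second', 'the', 'third', 'this']
counts = np.array([
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
])

# fit_transform on a CountVectorizer returns a SciPy sparse matrix, so
# wrapping the dense counts gives an object of the same kind.
X_from_counts = csr_matrix(counts)

# Sanity check: the wrapped counts hold the same values as CountVectorizer.
X_reference = CountVectorizer().fit_transform(corpus)
counts_match = np.array_equal(X_from_counts.toarray(), X_reference.toarray())

# TfidfTransformer works on the counts alone; it never sees the raw text.
tfidf = TfidfTransformer().fit_transform(X_from_counts)

# Sanity check against TfidfVectorizer run on the raw corpus.
tfidf_reference = TfidfVectorizer().fit_transform(corpus)
tfidf_match = np.allclose(tfidf.toarray(), tfidf_reference.toarray())

print(counts_match, tfidf_match)
```

In this sketch the vocabulary list is only needed to label the columns; TfidfTransformer itself does not use it. If new text later has to be mapped onto the same columns, CountVectorizer accepts a `vocabulary` argument that fixes the column order.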
【Comments】:
Tags: scikit-learn