【Question Title】: Computing top n word pair co-occurrences from document term matrix
【Posted】: 2018-12-12 02:36:53
【Question】:

I created a bag-of-words model using gensim. Although the real output is much longer, this is the format Gensim produces when it builds a bag-of-words document-term matrix from tokenized text:

from gensim import corpora

# Map each token to an integer id, then encode each document as a
# sparse bag-of-words: a list of (token_id, count) pairs
id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

[[(0, 2),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 11),
  (385, 1),
  (386, 2),
  (387, 3),
  (388, 1),
  (389, 1),
  (390, 1)],
 [(4, 31),
  (8, 2),
  (13, 2),
  (16, 2),
  (17, 2),
  (26, 1),
  (28, 4),
  (29, 1),
  (30, 1)]]

This is a sparse-matrix representation, and as far as I understand, other libraries represent document-term matrices in a similar way. If the document-term matrix were non-sparse (meaning the zero entries were present too), I know I would just need AᵀA, since A has dimensions (number of documents × number of terms), so multiplying the transpose by the matrix gives the term co-occurrences. Ultimately, I want the top n co-occurrences (i.e., the top n pairs of terms that appear together in the same texts). How would I achieve this? I am not tied to Gensim for creating the BOW model; if another library such as sklearn makes this easier, I am very open to that. I would appreciate any advice/help/code on this problem. Thanks!

【Question Discussion】:

    Tags: python matrix scikit-learn gensim text-analysis


    【Solution 1】:

    EDIT: Here is a way to do the matrix multiplication you asked about. Disclaimer: this may not be feasible for a very large corpus.

    Sklearn:

    from sklearn.feature_extraction.text import CountVectorizer
    
    Doc1 = 'Wimbledon is one of the four Grand Slam tennis tournaments, the others being the Australian Open, the French Open and the US Open.'
    Doc2 = 'Since the Australian Open shifted to hardcourt in 1988, Wimbledon is the only major still played on grass'
    docs = [Doc1, Doc2]
    
    # Instantiate CountVectorizer and apply it to docs
    cv = CountVectorizer()
    doc_cv = cv.fit_transform(docs)
    
    # Display the learned tokens
    cv.get_feature_names_out()  # cv.get_feature_names() on scikit-learn < 1.0
    
    # Display tokens (dict keys) and their numerical encodings (dict values)
    cv.vocabulary_
    
    # Term co-occurrence matrix: A.T @ A, where A is (documents x terms)
    token_mat = doc_cv.toarray().T @ doc_cv.toarray()
    
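    To pull out the top n pairs from token_mat, you can zero the diagonal (a term co-occurring with itself), keep one triangle of the symmetric matrix, and take the largest remaining entries. A minimal sketch along those lines (top_n and top_pairs are illustrative names, not library API):

    import numpy as np
    
    # Map numeric row/column indices back to tokens
    feature_names = cv.get_feature_names_out()
    
    # Drop self co-occurrence and keep only the upper triangle,
    # since the co-occurrence matrix is symmetric
    pair_mat = token_mat.copy()
    np.fill_diagonal(pair_mat, 0)
    pair_mat = np.triu(pair_mat)
    
    # Flatten, sort descending, and take the indices of the top n entries
    top_n = 10
    flat_idx = np.argsort(pair_mat, axis=None)[::-1][:top_n]
    rows, cols = np.unravel_index(flat_idx, pair_mat.shape)
    
    top_pairs = [((feature_names[i], feature_names[j]), int(pair_mat[i, j]))
                 for i, j in zip(rows, cols)]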

    Gensim:

    import gensim as gs
    import numpy as np
    
    cp = [[(0, 2),
      (1, 1),
      (2, 1),
      (3, 1),
      (4, 11),
      (7, 1),
      (11, 2),
      (13, 3),
      (22, 1),
      (26, 1),
      (30, 1)],
     [(4, 31),
      (8, 2),
      (13, 2),
      (16, 2),
      (17, 2),
      (26, 1),
      (28, 4),
      (29, 1),
      (30, 1)]]
    
    # The dense vectors must share one vocabulary size, i.e. the largest
    # token id across all documents plus one (the original used cp[0] for
    # both rows, which breaks if the documents cover different id ranges)
    vocab_size = max(idx for doc in cp for idx, _ in doc) + 1
    
    # Convert to a dense matrix and perform the matrix multiplication
    mat_1 = gs.matutils.sparse2full(cp[0], vocab_size).reshape(1, -1)
    mat_2 = gs.matutils.sparse2full(cp[1], vocab_size).reshape(1, -1)
    mat = np.vstack([mat_1, mat_2])
    mat_product = mat.T @ mat
    
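    If the dense matrices above are too big, the same product can stay sparse end to end. A minimal sketch, assuming the cp corpus defined above: gensim's matutils.corpus2csc builds a scipy CSC matrix of shape (num_terms, num_docs), so the co-occurrence product is computed without densifying anything (on the sklearn side, doc_cv from CountVectorizer is already sparse, so doc_cv.T @ doc_cv works the same way):

    from gensim import matutils
    
    # Sparse (terms x documents) matrix built from the bag-of-words corpus
    term_doc = matutils.corpus2csc(cp)
    
    # Sparse term co-occurrence matrix: (terms x docs) @ (docs x terms)
    cooc = term_doc @ term_doc.T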

    For words that occur consecutively, you can build a list of bigrams for a set of documents and then use Python's Counter to count the bigram occurrences. Here is an example using nltk.

    import nltk
    from nltk.util import ngrams
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import stopwords
    from collections import Counter
    
    stop_words = set(stopwords.words('english'))
    
    # Get the tokens from the built-in collection of presidential inaugural
    # speeches (requires the nltk 'inaugural', 'stopwords', and 'wordnet' data)
    tokens = nltk.corpus.inaugural.words()
    
    # Further text preprocessing; note the stopword check runs before
    # lowercasing, so capitalized stopwords such as 'I' slip through
    # (hence the ('i', 'shall') bigram in the output below)
    tokens = [t.lower() for t in tokens if t not in stop_words]
    word_l = WordNetLemmatizer()
    tokens = [word_l.lemmatize(t) for t in tokens if t.isalpha()]
    
    # Create bigram list and count bigrams
    bi_grams = list(ngrams(tokens, 2)) 
    counter = Counter(bi_grams)
    
    # Show the most common bigrams
    counter.most_common(5)
    Out[36]: 
    [(('united', 'state'), 153),
     (('fellow', 'citizen'), 116),
     (('let', 'u'), 99),
     (('i', 'shall'), 96),
     (('american', 'people'), 40)]
    
    # Query the occurrence of a specific bigram
    counter[('great', 'people')]
    Out[37]: 7
    

    【Discussion】:

    • Thanks for sharing! This only looks at words that occur consecutively, right? So if I understand correctly, it does not care whether words are mentioned often (but apart) within the same document.
    • I had not realized that at first! Thanks for your help!
    • I realize my original answer only covered consecutively occurring words, as you pointed out. See the updated answer for the matrix multiplication you asked about. If your corpus has millions of words, I am not sure how feasible this will be.