【Question Title】: Computing top n word pair co-occurrences from document term matrix
【Posted】: 2018-12-12 02:36:53
【Question】:

I created a bag-of-words model using gensim. Although the real output is much longer, this is the format Gensim produces when it builds a bag-of-words document-term matrix from tokenized text:

from gensim import corpora

# Map each token to an integer id, then encode each document as a
# sparse bag-of-words: a list of (token_id, count) pairs
id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

[[(0, 2),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 11),
  (385, 1),
  (386, 2),
  (387, 3),
  (388, 1),
  (389, 1),
  (390, 1)],
 [(4, 31),
  (8, 2),
  (13, 2),
  (16, 2),
  (17, 2),
  (26, 1),
  (28, 4),
  (29, 1),
  (30, 1)]]

This is a sparse-matrix representation, and as far as I understand, other libraries represent document-term matrices in a similar way. If the document-term matrix were non-sparse (meaning the zero entries were present too), I know I would just need AᵀA, since A has dimensions (number of documents × number of terms), so multiplying the transpose by the matrix gives the term co-occurrences. Ultimately, I want the top n co-occurrences (i.e., the top n pairs of terms that appear together in the same texts). How would I achieve this? I am not tied to Gensim for creating the BOW model; if another library such as sklearn makes this easier, I am very open to that. I would appreciate any advice/help/code on this problem. Thanks!

【Question Discussion】:

    Tags: python matrix scikit-learn gensim text-analysis


    【Solution 1】:

    EDIT: Here is a way to do the matrix multiplication you asked about. Disclaimer: this may not be feasible for a very large corpus.

    Sklearn:

    from sklearn.feature_extraction.text import CountVectorizer
    
    Doc1 = 'Wimbledon is one of the four Grand Slam tennis tournaments, the others being the Australian Open, the French Open and the US Open.'
    Doc2 = 'Since the Australian Open shifted to hardcourt in 1988, Wimbledon is the only major still played on grass'
    docs = [Doc1, Doc2]
    
    # Instantiate CountVectorizer and apply it to docs
    cv = CountVectorizer()
    doc_cv = cv.fit_transform(docs)
    
    # Display the learned tokens
    cv.get_feature_names_out()  # cv.get_feature_names() on scikit-learn < 1.0
    
    # Display tokens (dict keys) and their numerical encodings (dict values)
    cv.vocabulary_
    
    # Term co-occurrence matrix: A.T @ A, where A is (documents x terms)
    token_mat = doc_cv.toarray().T @ doc_cv.toarray()
    
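    To pull out the top n pairs from token_mat, you can zero the diagonal (a term co-occurring with itself), keep one triangle of the symmetric matrix, and take the largest remaining entries. A minimal sketch along those lines (top_n and top_pairs are illustrative names, not library API):

    import numpy as np
    
    # Map numeric row/column indices back to tokens
    feature_names = cv.get_feature_names_out()
    
    # Drop self co-occurrence and keep only the upper triangle,
    # since the co-occurrence matrix is symmetric
    pair_mat = token_mat.copy()
    np.fill_diagonal(pair_mat, 0)
    pair_mat = np.triu(pair_mat)
    
    # Flatten, sort descending, and take the indices of the top n entries
    top_n = 10
    flat_idx = np.argsort(pair_mat, axis=None)[::-1][:top_n]
    rows, cols = np.unravel_index(flat_idx, pair_mat.shape)
    
    top_pairs = [((feature_names[i], feature_names[j]), int(pair_mat[i, j]))
                 for i, j in zip(rows, cols)]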

    Gensim:

    import gensim as gs
    import numpy as np
    
    cp = [[(0, 2),
      (1, 1),
      (2, 1),
      (3, 1),
      (4, 11),
      (7, 1),
      (11, 2),
      (13, 3),
      (22, 1),
      (26, 1),
      (30, 1)],
     [(4, 31),
      (8, 2),
      (13, 2),
      (16, 2),
      (17, 2),
      (26, 1),
      (28, 4),
      (29, 1),
      (30, 1)]]
    
    # The dense vectors must share one vocabulary size, i.e. the largest
    # token id across all documents plus one (the original used cp[0] for
    # both rows, which breaks if the documents cover different id ranges)
    vocab_size = max(idx for doc in cp for idx, _ in doc) + 1
    
    # Convert to a dense matrix and perform the matrix multiplication
    mat_1 = gs.matutils.sparse2full(cp[0], vocab_size).reshape(1, -1)
    mat_2 = gs.matutils.sparse2full(cp[1], vocab_size).reshape(1, -1)
    mat = np.vstack([mat_1, mat_2])
    mat_product = mat.T @ mat
    
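    If the dense matrices above are too big, the same product can stay sparse end to end. A minimal sketch, assuming the cp corpus defined above: gensim's matutils.corpus2csc builds a scipy CSC matrix of shape (num_terms, num_docs), so the co-occurrence product is computed without densifying anything (on the sklearn side, doc_cv from CountVectorizer is already sparse, so doc_cv.T @ doc_cv works the same way):

    from gensim import matutils
    
    # Sparse (terms x documents) matrix built from the bag-of-words corpus
    term_doc = matutils.corpus2csc(cp)
    
    # Sparse term co-occurrence matrix: (terms x docs) @ (docs x terms)
    cooc = term_doc @ term_doc.T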

    For words that occur consecutively, you can build a list of bigrams for a set of documents and then use Python's Counter to count the bigram occurrences. Here is an example using nltk.

    import nltk
    from nltk.util import ngrams
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import stopwords
    from collections import Counter
    
    stop_words = set(stopwords.words('english'))
    
    # Get the tokens from the built-in collection of presidential inaugural
    # speeches (requires the nltk 'inaugural', 'stopwords', and 'wordnet' data)
    tokens = nltk.corpus.inaugural.words()
    
    # Further text preprocessing; note the stopword check runs before
    # lowercasing, so capitalized stopwords such as 'I' slip through
    # (hence the ('i', 'shall') bigram in the output below)
    tokens = [t.lower() for t in tokens if t not in stop_words]
    word_l = WordNetLemmatizer()
    tokens = [word_l.lemmatize(t) for t in tokens if t.isalpha()]
    
    # Create bigram list and count bigrams
    bi_grams = list(ngrams(tokens, 2)) 
    counter = Counter(bi_grams)
    
    # Show the most common bigrams
    counter.most_common(5)
    Out[36]: 
    [(('united', 'state'), 153),
     (('fellow', 'citizen'), 116),
     (('let', 'u'), 99),
     (('i', 'shall'), 96),
     (('american', 'people'), 40)]
    
    # Query the occurrence of a specific bigram
    counter[('great', 'people')]
    Out[37]: 7
    

    【Discussion】:

    • Thanks for sharing! This only looks at words that occur consecutively, right? So if I understand correctly, it does not care whether words are mentioned often (but apart) within the same document.
    • I had not realized that at first! Thanks for your help!
    • I realize my original answer only covered consecutively occurring words, as you pointed out. See the updated answer for the matrix multiplication you asked about. If your corpus has millions of words, I am not sure how feasible this will be.