【Question title】: Co-occurrence matrix from nested list of words
【Posted】: 2017-08-06 11:15:13
【Question】:

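I have a list of names, for example:

names = ['A', 'B', 'C', 'D']

I also have a list of documents, each of which mentions some of these names:

document = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]

I would like to get a co-occurrence matrix as output, for example: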

  A  B  C  D
A 0  2  1  1
B 2  0  2  1
C 1  2  0  1
D 1  1  1  0

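There is a solution to this problem in R (Creating co-occurrence matrix), but I couldn't get it working in Python. I was thinking of doing it with Pandas, but have made no progress so far!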
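
To pin down what the expected output means, here is a minimal pure-Python sketch. It is not taken from any of the answers, and `counts`, `present` and `matrix` are names made up for illustration. It assumes that a pair co-occurs once per document that mentions both names, and that strings outside `names` (such as K and Z) are ignored.

```python
from itertools import combinations

names = ['A', 'B', 'C', 'D']
document = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]

# Start with an all-zero table keyed by (row name, column name).
counts = {(a, b): 0 for a in names for b in names}

for doc in document:
    # Keep only the names of interest, once each, in the order of `names`.
    present = [n for n in names if n in doc]
    for a, b in combinations(present, 2):
        counts[a, b] += 1
        counts[b, a] += 1  # the matrix is symmetric

matrix = [[counts[a, b] for b in names] for a in names]
for name, row in zip(names, matrix):
    print(name, row)
```

This is quadratic in the number of names per document, so it is meant as a reference for checking results rather than as something to run on a large corpus.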

【Question comments】:

Tags: python pandas list matrix networkx


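【Solution 1】:

We can use NetworkX to simplify this considerably. Here names are the nodes we want to consider, and the lists in document contain the nodes to connect.

We can connect the nodes in each sublist by taking their length-2 combinations, and create a MultiGraph so that repeated pairs are kept as parallel edges and counted as co-occurrences: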

import networkx as nx
from itertools import combinations

G = nx.from_edgelist((c for n_nodes in document for c in combinations(n_nodes, r=2)),
                     create_using=nx.MultiGraph)
nx.to_pandas_adjacency(G, nodelist=names, dtype='int')

   A  B  C  D
A  0  2  1  1
B  2  0  2  1
C  1  2  0  1
D  1  1  1  0
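
To see why the MultiGraph yields counts, it can help to look at the edge list on its own. The sketch below uses only the standard library and is an illustration, not part of the answer: it materialises the same generator expression and counts repeated undirected pairs, which, as far as I understand, is what the summed parallel edges amount to in the adjacency output. `edges` and `edge_counts` are illustrative names.

```python
from collections import Counter
from itertools import combinations

document = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]

# Same generator expression as in the answer, materialised as a list of edges.
edges = [c for n_nodes in document for c in combinations(n_nodes, r=2)]

# Treat (u, v) and (v, u) as the same undirected edge and count the repeats.
edge_counts = Counter(frozenset(e) for e in edges)

print(len(edges))                          # total number of pairs generated
print(edge_counts[frozenset(('A', 'B'))])  # how often A and B were paired
print(edge_counts[frozenset(('B', 'C'))])  # how often B and C were paired
```

Pairs involving K or Z are generated too; in the answer they are dropped only at the end, by passing nodelist=names.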

【Discussion】:

    【Solution 2】:

    '''data_corpus is a series of text strings; words is the list of words the co-occurrence matrix is built over.
    The original note says "for a window of 2", but the code looks 5 tokens to either side.'''

    import pandas as pd
    from tqdm import tqdm

    window = 5

    # co_oc is the co-occurrence matrix; start from 0 (not NaN) so that += 1 works
    co_oc = pd.DataFrame(0, index=words, columns=words)

    for j in tqdm(data_corpus):
        k = j.split()
        for l in range(len(k)):
            if k[l] not in words:
                continue
            # clamp the window to the start and end of the token list
            for m in range(max(0, l - window), min(len(k), l + window + 1)):
                if m != l and k[m] in words:
                    co_oc.loc[k[m], k[l]] += 1
    print(co_oc.head())
    

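    【Discussion】:

      【Solution 3】:

      I had the same problem, so I used this code. It takes a context window into account and then builds the co-occurrence matrix.

      Hope this helps.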

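      import numpy as np

      def countOccurences(word, context_window):
          """
          This function returns the count of context word.
          """
          return context_window.count(word)

      def co_occurance(feature_dict, corpus, window=5):
          """
          This function returns co_occurance matrix for the given window size. Default is 5.
          feature_dict maps each word to its row/column index; corpus is assumed to be a list of tokens.
          """
          length = len(feature_dict)
          co_matrix = np.zeros([length, length])  # length is the number of distinct words

          corpus_len = len(corpus)
          # top_features was undefined in the original; assumed here to be the words in feature_dict
          top_features = list(feature_dict)
          for focus_word in top_features:

              for context_word in top_features[top_features.index(focus_word):]: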
                  # print(feature_dict[context_word])
                  if focus_word == context_word:
                      co_matrix[feature_dict[focus_word],feature_dict[context_word]] = 0
                  else:
                      start_index = 0
                      count = 0
                      while(focus_word in corpus[start_index:]):
      
                          # get the index of focus word
                          start_index = corpus.index(focus_word,start_index)
                          fi,li = max(0,start_index - window) , min(corpus_len-1,start_index + window)
      
                          count += countOccurences(context_word,corpus[fi:li+1])
                          # updating start index
                          start_index += 1
      
                      # update [Aij]
                      co_matrix[feature_dict[focus_word],feature_dict[context_word]] = count
                      # update [Aji]
                      co_matrix[feature_dict[context_word],feature_dict[focus_word]] = count
          return co_matrix
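
For comparison, the same windowed count can be written as a single pass over the tokens. This is a simplified sketch rather than the answer's code: `window_cooccurrence`, `tokens` and `vocab` are hypothetical names, and it assumes the corpus is already a flat list of tokens and that the diagonal should stay at zero.

```python
def window_cooccurrence(tokens, vocab, window=5):
    """For each pair of distinct vocab words, count how often the second
    appears within `window` tokens of an occurrence of the first."""
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    matrix = [[0] * n for _ in range(n)]
    for pos, focus in enumerate(tokens):
        if focus not in index:
            continue
        # Clamp the window to the bounds of the token list.
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            context = tokens[ctx_pos]
            if ctx_pos != pos and context in index and context != focus:
                matrix[index[focus]][index[context]] += 1
    return matrix


tokens = "the cat sat on the mat near the cat".split()
vocab = ["cat", "mat", "sat"]
result = window_cooccurrence(tokens, vocab, window=3)
for word, row in zip(vocab, result):
    print(word, row)
```

Because the window extends the same distance on both sides, the result is symmetric without having to mirror the upper triangle explicitly.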
      

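      【Discussion】:

        【Solution 4】:

        Another option is to use the constructor csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)]) from scipy.sparse.csr_matrix, where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k].

        The trick is to generate row_ind and col_ind by iterating over the documents and creating a list of tuples (doc_id, word_id). data is simply a vector of ones of the same length.

        Multiplying the transpose of the docs-words matrix by the matrix itself gives the co-occurrence matrix.

        In addition, this is efficient in both run time and memory usage, so it should also handle large corpora.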
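
To see why the matrix product gives co-occurrence counts, here is a small dense NumPy sketch on the data from the question. It is only an illustration of the idea; the answer itself uses a sparse matrix, and the names `X` and `cooc` are made up here.

```python
import numpy as np

names = ['A', 'B', 'C', 'D']
document = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]

# Binary docs-words matrix: X[i, j] == 1 if document i mentions names[j].
X = np.array([[int(name in doc) for name in names] for doc in document])

# (X.T @ X)[j, k] sums X[i, j] * X[i, k] over all documents i, which is
# the number of documents that mention both names[j] and names[k].
cooc = X.T @ X

# The diagonal holds the number of documents per name; zero it out.
np.fill_diagonal(cooc, 0)
print(cooc)
```

A dense array like this grows with documents times vocabulary, which is why a sparse representation is the better fit for large corpora.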

        import numpy as np
        import itertools
        from scipy.sparse import csr_matrix
        
        
        def create_co_occurences_matrix(allowed_words, documents):
            print(f"allowed_words:\n{allowed_words}")
            print(f"documents:\n{documents}")
            word_to_id = dict(zip(allowed_words, range(len(allowed_words))))
            documents_as_ids = [np.sort([word_to_id[w] for w in doc if w in word_to_id]).astype('uint32') for doc in documents]
            row_ind, col_ind = zip(*itertools.chain(*[[(i, w) for w in doc] for i, doc in enumerate(documents_as_ids)]))
            data = np.ones(len(row_ind), dtype='uint32')  # use unsigned int for better memory utilization
            max_word_id = max(itertools.chain(*documents_as_ids)) + 1
            docs_words_matrix = csr_matrix((data, (row_ind, col_ind)), shape=(len(documents_as_ids), max_word_id))  # efficient arithmetic operations with CSR * CSR
            words_cooc_matrix = docs_words_matrix.T * docs_words_matrix  # multiplying docs_words_matrix with its transpose matrix would generate the co-occurences matrix
            words_cooc_matrix.setdiag(0)
            print(f"words_cooc_matrix:\n{words_cooc_matrix.todense()}")
            return words_cooc_matrix, word_to_id 
        

        Example run:

        allowed_words = ['A', 'B', 'C', 'D']
        documents = [['A', 'B'], ['C', 'B', 'K'],['A', 'B', 'C', 'D', 'Z']]
        words_cooc_matrix, word_to_id = create_co_occurences_matrix(allowed_words, documents)
        

        Output:

        allowed_words:
        ['A', 'B', 'C', 'D']
        
        documents:
        [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]
        
        words_cooc_matrix:
        [[0 2 1 1]
         [2 0 2 1]
         [1 2 0 1]
         [1 1 1 0]]
        

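        【Discussion】:

          【Solution 5】:

          You can also use a matrix trick to find the co-occurrence matrix. Hopefully this works well when you have a larger vocabulary.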

          import scipy.sparse as sp
          voc2id = dict(zip(names, range(len(names))))
          rows, cols, vals = [], [], []
          for r, d in enumerate(document):
              for e in d:
                  if voc2id.get(e) is not None:
                      rows.append(r)
                      cols.append(voc2id[e])
                      vals.append(1)
          X = sp.csr_matrix((vals, (rows, cols)))
          

          Now you can find the co-occurrence matrix simply by multiplying X.T by X:

          Xc = (X.T * X) # coocurrence matrix
          Xc.setdiag(0)
          print(Xc.toarray())
          

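          【Discussion】:

          • I tried the solution you mentioned, but it adds new strings to the final matrix. I am only interested in the strings in the names list, not all the other strings in the documents.
          • Best solution!!
          【Solution 6】:

          Here is another solution, using itertools and the Counter class from the collections module.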

          import numpy
          import itertools
          from collections import Counter
          
          document =[['A', 'B'], ['C', 'B'],['A', 'B', 'C', 'D']]
          
          # Get all of the unique entries you have
          varnames = tuple(sorted(set(itertools.chain(*document))))
          
          # Get a list of all of the combinations you have
          expanded = [tuple(itertools.combinations(d, 2)) for d in document]
          expanded = itertools.chain(*expanded)
          
          # Sort the combinations so that A,B and B,A are treated the same
          expanded = [tuple(sorted(d)) for d in expanded]
          
          # count the combinations
          c = Counter(expanded)
          
          
          # Create the table
          table = numpy.zeros((len(varnames),len(varnames)), dtype=int)
          
          for i, v1 in enumerate(varnames):
              for j, v2 in enumerate(varnames[i:]):        
                  j = j + i 
                  table[i, j] = c[v1, v2]
                  table[j, i] = c[v1, v2]
          
          # Display the output
          for row in table:
              print(row)
          

          The output (which can easily be turned into a DataFrame) is:

          [0 2 1 1]
          [2 0 2 1]
          [1 2 0 1]
          [1 1 1 0]
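
A minimal sketch of that DataFrame conversion, assuming pandas is available. The values of `table` and `varnames` are copied from the output above so that the snippet runs on its own; `df` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Values copied from the output above so the snippet is self-contained.
varnames = ('A', 'B', 'C', 'D')
table = np.array([[0, 2, 1, 1],
                  [2, 0, 2, 1],
                  [1, 2, 0, 1],
                  [1, 1, 1, 0]])

# Label both rows and columns with the variable names.
df = pd.DataFrame(table, index=list(varnames), columns=list(varnames))
print(df)
```

With the labels in place, a single count can be looked up by name, for example df.loc['A', 'B'].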
          

          【Discussion】:

            【Solution 7】:
            from collections import OrderedDict
            
            document = [['A', 'B'], ['C', 'B'], ['A', 'B', 'C', 'D']]
            names = ['A', 'B', 'C', 'D']
            
            occurrences = OrderedDict((name, OrderedDict((name, 0) for name in names)) for name in names)
            
            # Find the co-occurrences:
            for l in document:
                for i in range(len(l)):
                    for item in l[:i] + l[i + 1:]:
                        occurrences[l[i]][item] += 1
            
            # Print the matrix:
            print(' ', ' '.join(occurrences.keys()))
            for name, values in occurrences.items():
                print(name, ' '.join(str(i) for i in values.values()))
            

            Output:

              A B C D
            A 0 2 1 1 
            B 2 0 2 1 
            C 1 2 0 1 
            D 1 1 1 0 
            

            【Discussion】:

              【Solution 8】:

              Obviously this can be extended for your purposes, but it performs the general operation:

              import math
              
              for a in 'ABCD':
                  for b in 'ABCD':
                      count = 0
              
                      for x in document:
                          if a != b:
                              if a in x and b in x:
                                  count += 1
              
                          else:
                              n = x.count(a)
                              if n >= 2:
                                  count += math.factorial(n) // math.factorial(n - 2) // 2  # n choose 2, kept as an int
              
                      print('{} x {} = {}'.format(a, b, count))
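
The factorial expression on the diagonal is the number of unordered pairs among n repeated mentions of the same name, that is, "n choose 2". The short check below is an illustration only and assumes Python 3.8 or later for math.comb; `by_factorials` and `pairs_in_three` are made-up names.

```python
import math

# n! / (n - 2)! / 2 simplifies to n * (n - 1) / 2, i.e. "n choose 2".
for n in range(2, 8):
    by_factorials = math.factorial(n) // math.factorial(n - 2) // 2
    print(n, by_factorials, math.comb(n, 2))

# Three mentions of the same name form three unordered pairs.
pairs_in_three = math.comb(3, 2)
```

For the documents in the question no name is repeated within a document, so the diagonal stays at zero either way.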
              

              【Discussion】:
