Python co-occurrence matrix of words and phrases: answers

【Question title】: Python Co-occurrence matrix of words and phrases
【Posted】: 2016-06-30 08:06:56
【Question】:

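I'm working with two text files. One contains a list of 58 words (L1) and the other contains 1173 phrases (L2). For every pair of indices i and j in range(len(L1)), I want to check whether L1[i] and L1[j] co-occur in the phrases of L2.

For example: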

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day']

for i in range(len(L1)):
    for j in range(len(L1)):
        for s in range(len(L2)):
            if L1[i] in L2[s] and L1[j] in L2[s]:
                output = L1[i], L1[j], L2[s]
                print output

Output (e.g., for 'be your self' from L2):

('b', 'b', 'be your self')
('b', 'e', 'be your self')
('b', 'y', 'be your self')
('e', 'b', 'be your self')
('e', 'e', 'be your self')
('e', 'y', 'be your self')
('y', 'b', 'be your self')
('y', 'e', 'be your self')
('y', 'y', 'be your self')

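The output shows what I want, but to visualize the data I also need to return the number of times L1[j] co-occurs with L1[i].

Should I use pandas or numpy to return this result?

I found this question about co-occurrence matrices, but I didn't find a specific answer there: efficient algorithm for finding co occurrence matrix of phrases

Thanks!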

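【Comments】:

Could you use a dictionary instead? {'bb': 1, 'be': 1, ... etc}
I'm confused: what do the counts in your matrix output correspond to? Does it not matter that 'be your self' has two e's? Do you want all the counts in one set, or one set per phrase?
It does matter, sorry. I just edited the question. I want one set per phrase.

Tags: python python-2.7 numpy pandas matrix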

【Solution 1】:

Well, why don't you try this?

from collections import defaultdict

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day', 'yes be your self']

# One dictionary of pair counts per phrase.
d = {}

for phrase in L2:
    d[phrase] = defaultdict(int)
    # Iterate over the characters of the phrase (not over L1) so that
    # a repeated letter is counted once per occurrence.
    for letter1 in phrase:
        for letter2 in phrase:
            if letter1 in L1 and letter2 in L1:
                output = letter1, letter2, phrase
                # print(...) with one argument works on Python 2.7 and Python 3.
                print(output)
                key = (letter1, letter2)
                d[phrase][key] += 1

print(d)

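To capture repeated values, you need to iterate over the phrase rather than over the list L1, and then check whether each letter of the phrase is in L1 (in other words, swap the operands of the in expression).

Output: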

{
'x men': defaultdict(<type 'int'>, {('e', 'e'): 1, ('e', 'x'): 1, ('x', 'x'): 1, ('x', 'e'): 1}),
'great zoo': defaultdict(<type 'int'>, {('t', 't'): 1, ('t', 'z'): 1, ('e', 'e'): 1, ('e', 'z'): 1, ('t', 'e'): 1, ('z', 'e'): 1, ('z', 't'): 1, ('e', 't'): 1, ('z', 'z'): 1}),
'the onion': defaultdict(<type 'int'>, {('e', 't'): 1, ('t', 'e'): 1, ('e', 'e'): 1, ('t', 't'): 1}),
'be your self': defaultdict(<type 'int'>, {('b', 'y'): 1, ('b', 'b'): 1, ('e', 'e'): 4, ('y', 'e'): 2, ('y', 'b'): 1, ('y', 'y'): 1, ('e', 'b'): 2, ('e', 'y'): 2, ('b', 'e'): 2}),
'corn day': defaultdict(<type 'int'>, {('d', 'd'): 1, ('y', 'd'): 1, ('d', 'y'): 1, ('y', 'y'): 1, ('y', 'c'): 1, ('c', 'c'): 1, ('c', 'y'): 1, ('c', 'd'): 1, ('d', 'c'): 1}),
'yes be your self': defaultdict(<type 'int'>, {('b', 'y'): 2, ('b', 'b'): 1, ('e', 'e'): 9, ('y', 'e'): 6, ('y', 'b'): 2, ('y', 'y'): 4, ('e', 'b'): 3, ('e', 'y'): 6, ('b', 'e'): 3})
}
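
For visualization, the dictionary of a single phrase can be unfolded into a square matrix whose rows and columns follow the order of L1. The following is a minimal sketch using plain lists only; the helper name to_matrix is hypothetical, and the counts dictionary is copied from the 'be your self' entry in the output above.

```python
L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']

# Counts for 'be your self', copied from the output above.
counts = {('b', 'y'): 1, ('b', 'b'): 1, ('e', 'e'): 4, ('y', 'e'): 2,
          ('y', 'b'): 1, ('y', 'y'): 1, ('e', 'b'): 2, ('e', 'y'): 2,
          ('b', 'e'): 2}


def to_matrix(pair_counts, labels):
    """Hypothetical helper: lay pair counts out as a square grid of lists."""
    # Pairs that never occurred default to 0.
    return [[pair_counts.get((row, col), 0) for col in labels]
            for row in labels]


matrix = to_matrix(counts, L1)

# Print a header row followed by one labelled row per item of L1.
print('  ' + ' '.join(L1))
for label, row in zip(L1, matrix):
    print(label + ' ' + ' '.join(str(n) for n in row))
```

A list of lists like this can be handed to numpy.array, or to pandas.DataFrame together with index=L1 and columns=L1, if labelled axes are wanted.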

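【Comments】:

Excellent. However, if a phrase contains an item from L1 twice, is it possible to get, for example: 'b' co-occurs with 'y' in 'yes be your self', so the value for y is 2?
OK, I've edited the answer. Are these the expected values?
Exactly. Thanks a lot, K. Menyah. This solution helped me!

【Solution 2】:

Here is a solution that uses itertools.product. It should perform considerably better than the accepted solution (if that is a concern).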

from functools import reduce  # a builtin on Python 2, must be imported on Python 3
from itertools import product
from operator import mul

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day']

phrase_map = {}

for phrase in L2:
    # Number of occurrences of each L1 item that appears in the phrase.
    word_count = {word: phrase.count(word) for word in L1 if word in phrase}

    occurrence_map = {}
    # Every ordered pair of items; its count is the product of the two counts.
    for perm in product(word_count, repeat=2):
        occurrence_map[perm] = reduce(mul, (word_count[key] for key in perm), 1)

    phrase_map[phrase] = occurrence_map

In my timings, this is 2-4 times faster in Python 3 (the improvement may be smaller in Python 2). Also, in Python 3, reduce has to be imported from functools.
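
The multiplication works because a nested loop over the characters of a phrase visits every pair of positions exactly once, so a pair (x, y) is seen count(x) * count(y) times. The sketch below checks that equivalence on the sample data. The helper names nested_loop_counts and product_counts are hypothetical, and the check assumes single-character entries in L1, as in the example.

```python
from collections import defaultdict
from itertools import product

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day']


def nested_loop_counts(phrase, words):
    """Count pairs by visiting every pair of characters in the phrase."""
    counts = defaultdict(int)
    for a in phrase:
        for b in phrase:
            if a in words and b in words:
                counts[(a, b)] += 1
    return dict(counts)


def product_counts(phrase, words):
    """Count pairs by multiplying the per-item counts."""
    word_count = {w: phrase.count(w) for w in words if w in phrase}
    return {(x, y): word_count[x] * word_count[y]
            for x, y in product(word_count, repeat=2)}


for phrase in L2:
    assert nested_loop_counts(phrase, L1) == product_counts(phrase, L1)
print('both approaches agree on all phrases')
```

With multi-character entries in L1 the two approaches are no longer comparable character by character, and phrase.count counts substring matches, so 'on' would be counted twice in 'onion'.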

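Edit: Note that while this implementation is relatively simple, it has obvious inefficiencies. For example, we know that the resulting output is symmetric, and this solution does not take advantage of that. Using combinations_with_replacement instead of product generates only the entries in the upper-triangular part of the output matrix. We can therefore improve the solution above as follows: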

from itertools import combinations_with_replacement

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day']

phrase_map = {}

for phrase in L2:
    # Number of occurrences of each L1 item that appears in the phrase.
    word_count = {word: phrase.count(word) for word in L1 if word in phrase}

    occurrence_map = {}
    # Visit each unordered pair once and fill in both orientations.
    for x, y in combinations_with_replacement(word_count, 2):
        occurrence_map[(x,y)] = occurrence_map[(y,x)] = \
            word_count[x] * word_count[y]

    phrase_map[phrase] = occurrence_map

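As expected, this version takes about half as long as the previous one. Note that this version relies on restricting itself to pairs of two elements, whereas the previous version does not.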
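
To illustrate that last point: the product-based version extends to larger tuples just by changing repeat, while the symmetric shortcut is written for pairs only. Below is a sketch for triples on a single phrase, assuming the same L1; the variable name triple_map is hypothetical.

```python
from functools import reduce
from itertools import product
from operator import mul

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
phrase = 'be your self'

word_count = {word: phrase.count(word) for word in L1 if word in phrase}

# repeat=3 yields every ordered triple; its count is the product of the
# three individual counts, exactly as for pairs.
triple_map = {
    combo: reduce(mul, (word_count[key] for key in combo), 1)
    for combo in product(word_count, repeat=3)
}

print(triple_map[('b', 'e', 'y')])  # 1 * 2 * 1
```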

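Note that the running time can be reduced by roughly another 15-20% if the line

occurrence_map[(x,y)] = occurrence_map[(y,x)] = ...

is changed to

occurrence_map[(x,y)] = ...

but this may be less than ideal, depending on how you use the map later.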
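
If only one orientation of each pair is stored, lookups have to try both orders, because the stored orientation follows the iteration order of word_count rather than the order in which a caller happens to name the pair. A minimal sketch under that assumption follows; the helper name pair_count is hypothetical.

```python
from itertools import combinations_with_replacement

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
phrase = 'be your self'

word_count = {word: phrase.count(word) for word in L1 if word in phrase}

# Store each unordered pair once (upper triangle only).
occurrence_map = {}
for x, y in combinations_with_replacement(word_count, 2):
    occurrence_map[(x, y)] = word_count[x] * word_count[y]


def pair_count(mapping, x, y):
    """Hypothetical helper: look a pair up regardless of its order."""
    if (x, y) in mapping:
        return mapping[(x, y)]
    return mapping.get((y, x), 0)


print(pair_count(occurrence_map, 'y', 'e'))  # same value as for ('e', 'y')
print(len(occurrence_map))                   # 6 entries instead of 9
```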

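【Comments】:

Very nice solution.
Thank you, Jared. Great solution. It works well in python 2.7.
@estebanpdl If you want to reduce the time further, I've improved the solution; see the code after the Edit.

【Solution 3】:

You can try the code below.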

import collections
import numpy as np

tokens = ['He', 'is', 'not', 'lazy', 'intelligent', 'smart']
sentences = [['He', 'is', 'not', 'lazy', 'He', 'is', 'intelligent', 'He', 'is', 'smart']]

a = np.zeros((len(tokens), len(tokens)))
for pos, token in enumerate(tokens):
    # Fill only the upper triangle: pair each token with the tokens after it.
    for j, token1 in enumerate(tokens[pos + 1:], start=pos + 1):
        count = 0
        for sentence in sentences:
            occurrences1 = [i for i, e in enumerate(sentence) if e == token1]
            occurrences2 = [i for i, e in enumerate(sentence) if e == token]
            # All pairwise position differences between the two tokens.
            new1 = np.repeat(occurrences1, len(occurrences2))
            new2 = np.asarray(occurrences2 * len(occurrences1))
            final_abs_diff = np.absolute(np.subtract(new1, new2))
            final_counts = collections.Counter(final_abs_diff)
            # Count position pairs that are at most two apart,
            # accumulating over all sentences.
            count += final_counts[0] + final_counts[1] + final_counts[2]
        a[pos][j] = count

# Mirror the upper triangle to obtain the full symmetric matrix.
final_mat = a.T + a
print(final_mat)

The output is:

[[0. 4. 2. 1. 2. 1.]
 [4. 0. 1. 2. 2. 1.]
 [2. 1. 0. 1. 0. 0.]
 [1. 2. 1. 0. 0. 0.]
 [2. 2. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0.]]
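
Reading the code above, a pair of tokens is counted once for every pair of their positions that lie at most two apart, so this matrix describes co-occurrence within a small window rather than within a whole phrase. Below is a pure-Python sketch of that counting rule for a single pair; the helper name window_count is hypothetical, and exposing the window size as a parameter is an assumption of this sketch.

```python
def window_count(sentence, token_a, token_b, window=2):
    """Hypothetical helper: count position pairs at most `window` apart."""
    positions_a = [i for i, tok in enumerate(sentence) if tok == token_a]
    positions_b = [i for i, tok in enumerate(sentence) if tok == token_b]
    return sum(1 for i in positions_a for j in positions_b
               if abs(i - j) <= window)


sentence = ['He', 'is', 'not', 'lazy', 'He', 'is', 'intelligent',
            'He', 'is', 'smart']

print(window_count(sentence, 'He', 'is'))     # 4, as in the matrix above
print(window_count(sentence, 'not', 'lazy'))  # 1
```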

【Comments】: