Comparing the similarity of nested lists: answers

【Question title】: Comparing the similarity of nested lists
【Posted】: 2018-11-09 15:25:38
【Question】:

I have a list containing 3 lists, and each of those contains an ID and one nested list.

data_set = [
    ['AB12345',['T','T','C','C','A','C','A','G','C','T','T','T','T','C']],
    ['AB12346',['T','T','C','C','A','C','C','G','C','T','C','T','T','C']],
    ['AB12347',['T','G','C','C','A','C','G','G','C','T','T','C','T','C']]
]

I have a compare method that gives me the similarity (as a percentage) of the lists containing the characters, not of the IDs.

def compare(_from, _to):
    similarity = 0
    length = len(_from)
    if len(_from) != len(_to):
        raise Exception("Cannot be compared due to different length.")
    for i in range(length):
        if _from[i] == _to[i]:
            similarity += 1
    return similarity / length * 100

compare(data_set[0][1], data_set[1][1])

Using the compare method, I use a for loop to compare list "a" with the others: "a" with "a", "a" with "b", and "a" with "c".

for i in range(len(data_set)):
    data_set[i].append(compare(data_set[0][1], data_set[i][1]))
    print(round(data_set[i][2], 2), end=", ")

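But after comparing the first list with itself and the other lists, how do I move on to the second and third lists and compare each of them with all the lists again to get their similarity? That is, ("b" with "a", "b" with "b", "b" with "c") and ("c" with "a", "c" with "b", "c" with "c").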

【Question comments】:

Tags: python python-3.x list comparison nested-lists

【Answer 1】:

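For future reference, it is best to include your input lists (a, b, c) in your code rather than in a screenshot, so people don't have to type out the whole list. I used some shorter versions for testing.

You can do the following to iterate over the list twice and compare each pair. This is cleaner than using for i in range(len(data_set)):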

# Make some test data
a= ["ID_A", ['T', 'G', 'A']]
b= ["ID_B", ['T', 'C', 'A']]
c= ["ID_C", ['C', 'A', 'A']]

data = [a,b,c]

# entry1 takes each of the values a,b,c in order, and entry2 will do the same,
# so you'll have all possible combinations.
for entry1 in data:
    for entry2 in data:
        score = compare(entry1[1], entry2[1])
        print("Compare ", entry1[0], " to ", entry2[0], "Score :", round(score))

Output:

Compare  ID_A  to  ID_A Score : 100
Compare  ID_A  to  ID_B Score : 67
Compare  ID_A  to  ID_C Score : 33
Compare  ID_B  to  ID_A Score : 67
Compare  ID_B  to  ID_B Score : 100
Compare  ID_B  to  ID_C Score : 33
Compare  ID_C  to  ID_A Score : 33
Compare  ID_C  to  ID_B Score : 33
Compare  ID_C  to  ID_C Score : 100

You might be better off storing the scores in a separate array rather than in the one that holds your lists.
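One way to follow that advice, as a minimal sketch: keep the scores in a dictionary keyed by the pair of IDs, so the input lists are never modified. The name `scores` and the tuple-key layout are assumptions made for illustration and are not part of the original answer. `compare` is a compact restatement of the asker's function so that the snippet runs on its own.

```python
def compare(_from, _to):
    """Return the percentage of positions at which the two sequences agree."""
    if len(_from) != len(_to):
        raise ValueError("Cannot be compared due to different length.")
    matches = sum(1 for x, y in zip(_from, _to) if x == y)
    return matches / len(_from) * 100

# Short test data, as in the answer above
data = [
    ["ID_A", ['T', 'G', 'A']],
    ["ID_B", ['T', 'C', 'A']],
    ["ID_C", ['C', 'A', 'A']],
]

# Hypothetical layout: (id1, id2) -> similarity score
scores = {}
for id1, seq1 in data:
    for id2, seq2 in data:
        scores[(id1, id2)] = compare(seq1, seq2)

# Look up any pair by its IDs
print(round(scores[("ID_A", "ID_B")]))
```

Because the scores live outside `data`, re-running the loop never grows the original entries, and a single pair can be looked up directly by ID.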

【Comments】:

Thanks for the help!! The result looks much tidier, and it helps me see the similarity for every comparison.

【Answer 2】:

Just use a second, nested loop like this:

for i in range(len(data_set)):
    for j in range(len(data_set)):
        data_set[i].append(compare(data_set[j][1], data_set[i][1]))
        print(round(data_set[i][2], 2), end=", ")

【Comments】:

The nested loop compares the first list with all 3 lists, including itself, and repeats that 3 times. So it gives the same 3 results, repeated.
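The repetition appears to come from `data_set[i][2]`: that index always refers to the first score appended to row `i`, so the same value is printed on every pass of the inner loop. A sketch of one possible fix, assuming it is acceptable to collect each row's scores in a separate list (the name `matrix` is hypothetical) instead of appending to `data_set`:

```python
data_set = [
    ['AB12345', ['T','T','C','C','A','C','A','G','C','T','T','T','T','C']],
    ['AB12346', ['T','T','C','C','A','C','C','G','C','T','C','T','T','C']],
    ['AB12347', ['T','G','C','C','A','C','G','G','C','T','T','C','T','C']],
]

def compare(_from, _to):
    """Return the percentage of positions at which the two sequences agree."""
    if len(_from) != len(_to):
        raise ValueError("Cannot be compared due to different length.")
    matches = sum(1 for x, y in zip(_from, _to) if x == y)
    return matches / len(_from) * 100

# matrix[i][j] holds the similarity between sequence i and sequence j
matrix = []
for id_i, seq_i in data_set:
    row = [compare(seq_i, seq_j) for _, seq_j in data_set]
    matrix.append(row)
    print(id_i, [round(score, 2) for score in row])
```

Each row now holds three distinct comparisons (with "a", "b" and "c"), and `data_set` itself is left unchanged.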

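【Answer 3】:

You could also use itertools.combinations to compare all of your sublists. In addition, in your compare() function you may want to consider returning a value that indicates the sublists are not comparable, rather than raising an exception, so that the loop is not cut short prematurely when you compare a larger set of sublists.

Here is an example. It includes a slightly simpler version of compare() that returns -1 when the lists cannot be compared because their lengths differ. It does not compare a list with itself, because the result would always be 100 in that case, which seems like a waste of performance.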

import itertools

data_set = [
    ['AB12345',['T','T','C','C','A','C','A','G','C','T','T','T','T','C']],
    ['AB12346',['T','T','C','C','A','C','C','G','C','T','C','T','T','C']],
    ['AB12347',['T','G','C','C','A','C','G','G','C','T','T','C','T','C']]
    ]

def compare(a, b):
    length = len(a) if len(a) == len(b) else 0
    similarity = sum(1 for i in range(length) if a[i] == b[i])
    return similarity / length * 100 if length else -1

for a, b in itertools.combinations(data_set, 2):
    compared = a[0] + ' and ' + b[0]
    result = compare(a[1], b[1])
    print(f'{compared}: {result}')

# OUTPUT
# AB12345 and AB12346: 85.71428571428571
# AB12345 and AB12347: 78.57142857142857
# AB12346 and AB12347: 71.42857142857143

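【Comments】:

Thanks for the help!! It makes it easier for me to notice which sublists are not comparable. Does itertools perform better when comparing all of the sublists?
@danny - From the docs, the itertools module standardizes a core set of fast, memory-efficient tools. In other words, high performance is one of its core purposes. That said, iterator performance depends heavily on the nature of the dataset, so a well-constructed for loop can certainly outperform itertools or other similar modules, depending on the situation. If you want to compare the speed of different approaches on your dataset, you should look at timeit.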
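As a rough sketch of what such a timeit comparison might look like: the two helper functions below (`with_nested_loops` and `with_combinations`) are hypothetical names introduced only for this illustration, and the repetition count of 10,000 is an arbitrary assumption. Timings vary by machine and dataset, so no particular result is implied.

```python
import itertools
import timeit

data_set = [
    ['AB12345', ['T','T','C','C','A','C','A','G','C','T','T','T','T','C']],
    ['AB12346', ['T','T','C','C','A','C','C','G','C','T','C','T','T','C']],
    ['AB12347', ['T','G','C','C','A','C','G','G','C','T','T','C','T','C']],
]

def compare(a, b):
    """Percentage of matching positions, or -1 if the lengths differ."""
    length = len(a) if len(a) == len(b) else 0
    similarity = sum(1 for i in range(length) if a[i] == b[i])
    return similarity / length * 100 if length else -1

def with_nested_loops():
    # Plain loops over each unordered pair (i < j)
    return [compare(a[1], b[1])
            for i, a in enumerate(data_set)
            for b in data_set[i + 1:]]

def with_combinations():
    # The same pairs, generated by itertools.combinations
    return [compare(a[1], b[1])
            for a, b in itertools.combinations(data_set, 2)]

# Time 10,000 calls of each approach (count chosen arbitrarily)
loops_time = timeit.timeit(with_nested_loops, number=10_000)
comb_time = timeit.timeit(with_combinations, number=10_000)
print(f"nested loops: {loops_time:.4f} s")
print(f"combinations: {comb_time:.4f} s")
```

With only three sublists the difference is likely to be negligible; a measurement like this becomes more informative on a dataset of realistic size.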