Algorithm for matching objects: answers

【Question title】: Algorithm for matching objects
【Posted】: 2014-12-05 15:44:56
【Question description】:

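I have 1,000 objects. Each object has 4 attribute lists: a list of words, a list of images, a list of audio files, and a list of video files.

I want to compare each object against:

1. A single object, Ox, out of the 1,000.
2. Every other object.

The comparison would be something like: sum(words in common + images in common + ...).

I want an algorithm that helps me find, say, the 5 objects closest to Ox, and a (different?) algorithm to find the 5 closest pairs of objects.

I have looked into cluster analysis and maximum matching, but they do not seem to fit this scenario exactly. I would rather not use them if a more suitable method exists. Does this look like a particular type of algorithm to anyone, or can anyone point me in the right direction for applying the algorithms I mentioned?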
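
As an illustration of the comparison described above, here is a minimal sketch that counts shared items with set intersections. It assumes each object is a dict of four lists (the dict layout and the function name similarity are made up for this example) and that items can be compared with ==, which will not hold for raw image, audio, or video data without some hashing step.

```python
def similarity(a, b):
    """Sum of shared items across the four attribute lists.

    `a` and `b` are hypothetical dicts mapping an attribute name to a list.
    """
    keys = ("words", "images", "audios", "videos")
    # set(...) & set(...) keeps only the items present in both lists.
    return sum(len(set(a[k]) & set(b[k])) for k in keys)


ox = {"words": ["cat", "dog"], "images": ["i1"], "audios": [], "videos": ["v1"]}
other = {"words": ["dog", "bird"], "images": ["i1", "i2"], "audios": ["a1"], "videos": []}
print(similarity(ox, other))  # 2: the word "dog" and the image "i1"
```

Unlike a position-by-position comparison, this does not require the lists to have the same length or the same ordering.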

【Discussion】:

When do two images count as having something in common?
When they have similar robust hashes, as measured by Hamming distance.
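
To make the Hamming-distance idea concrete, here is a small sketch. It assumes the robust hashes are already available as Python integers (how they are computed is outside this example), and the names hamming_distance, images_match, and the threshold max_distance are made up for illustration.

```python
def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two integer hashes differ."""
    # XOR leaves a 1 bit exactly where the two hashes disagree.
    return bin(hash_a ^ hash_b).count("1")


def images_match(hash_a, hash_b, max_distance=4):
    # max_distance is an assumed threshold; tune it for the hash in use.
    return hamming_distance(hash_a, hash_b) <= max_distance


print(hamming_distance(0b10110010, 0b10010011))  # 2
```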

Tags: python algorithm pattern-matching cluster-analysis data-mining

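【Solution 1】:

I made an example program showing how to solve your first question. You still have to implement how you want to compare images, audio, and video. I assume that, for each attribute, every object's list has the same length. Your second question can be answered in a similar way, but with a double loop.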

import numpy as np

class Thing:

    def __init__(self, words, images, audios, videos):
        self.words  = words
        self.images = images
        self.audios = audios
        self.videos = videos

    def compare(self, other):
        score = 0
        # Assuming the attribute lists have the same length for both objects
        # and that they are sorted in the same manner:
        for i in range(len(self.words)):
            if self.words[i] == other.words[i]:
                score += 1
        for i in range(len(self.images)):
            if self.images[i] == other.images[i]:
                score += 1
        # And so on for audio and video. You have to make sure you know
        # what method to use for determining when an image/audio/video are
        # equal.
        return score


N = 1000
things = []
words  = np.random.randint(5, size=(N,5))
images = np.random.randint(5, size=(N,5))
audios = np.random.randint(5, size=(N,5))
videos = np.random.randint(5, size=(N,5))
# For testing purposes I assign each attribute to a list (array) containing
# five random integers. I don't know how you actually intend to do it.
for i in range(N):
    things.append(Thing(words[i], images[i], audios[i], videos[i]))

# I will assume that object number 999 (i=999) is the Ox:
ox = 999
# Start every score at -1 so that Ox, which is skipped below, can never be
# chosen as its own best match:
scores = np.full(N, -1)
for i in range(N):
    if i != ox:
        scores[i] = things[ox].compare(things[i])

best = np.argmax(scores)
print("The most similar thing is thing number %d." % best)
print()
print("Ox attributes:")
print(things[ox].words)
print(things[ox].images)
print(things[ox].audios)
print(things[ox].videos)
print()
print("Best match attributes:")
print(things[best].words)
print(things[best].images)
print(things[best].audios)
print(things[best].videos)
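
The program above reports only the single best match, while the question asks for the 5 closest objects. One possible way to get them, sketched here with a hypothetical helper top_k that is not part of the original answer, is to sort the scores with np.argsort and keep the last k indices:

```python
import numpy as np


def top_k(scores, k, exclude=None):
    """Indices of the k largest scores, best first."""
    scores = np.asarray(scores, dtype=float).copy()
    if exclude is not None:
        scores[exclude] = -np.inf  # never return the query object itself
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    return order[:k]


# Toy scores; index 1 plays the role of Ox and is excluded.
print(top_k([3, 9, 1, 7, 9, 4, 8], 3, exclude=1))  # [4 6 3]
```

With the scores array from the program above, top_k(scores, 5, exclude=ox) would give the 5 candidates. The order among equal scores is not guaranteed.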

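Edit:

Here is the same program, slightly modified to answer your second question. It turned out to be quite simple. I basically only needed to add 4 lines:

1. Change scores to an (N, N) array instead of just (N).
2. Add for j in xrange(N): (for j in range(N): in Python 3), thereby creating a double loop.
3. if i == j:
4. break

Items 3 and 4 are only there to make sure that each pair of things is compared once rather than twice, and that nothing is compared with itself.

A few more lines of code are then needed to extract the indices of the 5 largest values in scores. I also reworked the printing so that it is easy to confirm by eye that the printed pairs really are very similar.

Here is the new code: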

import numpy as np

class Thing:

    def __init__(self, words, images, audios, videos):
        self.words  = words
        self.images = images
        self.audios = audios
        self.videos = videos

    def compare(self, other):
        score = 0
        # Assuming the attribute lists have the same length for both objects
        # and that they are sorted in the same manner:
        for i in range(len(self.words)):
            if self.words[i] == other.words[i]:
                score += 1
        for i in range(len(self.images)):
            if self.images[i] == other.images[i]:
                score += 1
        for i in range(len(self.audios)):
            if self.audios[i] == other.audios[i]:
                score += 1
        for i in range(len(self.videos)):
            if self.videos[i] == other.videos[i]:
                score += 1
        # You have to make sure you know what method to use for determining
        # when an image/audio/video are equal.
        return score


N = 1000
things = []
words  = np.random.randint(5, size=(N,5))
images = np.random.randint(5, size=(N,5))
audios = np.random.randint(5, size=(N,5))
videos = np.random.randint(5, size=(N,5))
# For testing purposes I assign each attribute to a list (array) containing
# five random integers. I don't know how you actually intend to do it.
for i in range(N):
    things.append(Thing(words[i], images[i], audios[i], videos[i]))


################################################################################
############################# This is the new part: ############################
################################################################################
scores = np.zeros((N, N))
# Scores will become a triangular matrix where scores[i, j]=value means that
# value is the number of attributes thing[i] and thing[j] have in common.
for i in range(N):
    for j in range(N):
        if i == j:
            # Break the loop here because:
            # * When i==j we would compare thing[i] with itself, and we don't
            #   want that.
            # * For every combination where j>i we would repeat all the
            #   comparisons for j<i and create duplicates. We don't want that.
            break
        scores[i, j] = things[i].compare(things[j])

# I want the 5 most similar pairs:
n = 5
# This list will contain a tuple for each of the n most similar pairs:
best_list = []
for k in range(n):
    ij = np.argmax(scores)  # Returns a single flat index: ij = i*N + j
    i = ij // N
    j = ij % N
    best_list.append((i, j))
    # Erase this score so that on the next iteration the second largest score
    # is found:
    scores[i, j] = 0

for k, (i, j) in enumerate(best_list):
    # The number 1 most similar pair is the BEST match of all.
    # The number n most similar pair is the WORST of the n pairs listed.
    print("The number %d most similar pair is thing number %d and %d."
          % (k + 1, i, j))
    print("Thing%4d:" % i,
          things[i].words, things[i].images, things[i].audios, things[i].videos)
    print("Thing%4d:" % j,
          things[j].words, things[j].images, things[j].audios, things[j].videos)
    print()
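
As a possible alternative to the double loop above, the whole score matrix can be computed with NumPy broadcasting. This is only a sketch under assumptions that are not in the original answer: each object's four lists are concatenated into one row of the hypothetical array features, equality is position by position as in compare, and N is kept small because the intermediate array has N * N * 20 entries.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
# Hypothetical stand-in data: one row of 20 integer attributes per object
# (4 lists of 5 items each, concatenated).
features = rng.integers(5, size=(N, 20))

# scores[i, j] = number of positions where objects i and j agree.
scores = (features[:, None, :] == features[None, :, :]).sum(axis=2)

# Keep only pairs with j < i, so no self-comparisons and no duplicates.
scores = np.tril(scores, k=-1)

n = 5
# Flat indices of the n largest scores, best first.
flat = np.argsort(scores, axis=None)[::-1][:n]
# Convert each flat index back into an (i, j) pair.
pairs = [tuple(int(x) for x in np.unravel_index(f, scores.shape)) for f in flat]
for i, j in pairs:
    print(i, j, scores[i, j])
```

np.unravel_index does the same job as the integer division and modulo used above to split the flat index into i and j.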

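【Discussion】:

If this answer is what you had in mind, I can modify it to find the 5 closest pairs of objects as well.
@schoon No problem. Is this enough for you, or should I extend it to fully answer the second question?
@schoon I have now edited the answer and added the second part.
A month after writing this, I found that I needed this algorithm in my own work. Doubly worth it!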

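【Solution 2】:

If your comparison amounts to "compute the sum of all features and find the objects with the closest sums", there is a simple trick for finding close objects:

1. Put all the objects into an array.
2. Compute all the sums.
3. Sort the array by sum.

If you pick any index, the objects closest to it will now also have nearby indices. So to find the 5 closest objects, you only need to look at index-5 through index+5 in the sorted array.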
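
The three steps above might look like this in Python. This is a sketch that assumes each object has already been reduced to one number in the list sums; the function name nearest_by_sum is made up for this example.

```python
def nearest_by_sum(sums, index, k=5):
    """Return the k object indices whose sums are closest to sums[index]."""
    # Object indices ordered by their sum.
    order = sorted(range(len(sums)), key=lambda i: sums[i])
    pos = order.index(index)
    # Only the k neighbours on each side of pos can be among the k closest.
    window = order[max(0, pos - k):pos] + order[pos + 1:pos + 1 + k]
    window.sort(key=lambda i: abs(sums[i] - sums[index]))
    return window[:k]


sums = [10, 42, 14, 3, 11, 30, 8, 13, 50, 20]
print(nearest_by_sum(sums, 0, k=3))  # [4, 6, 7]
```

The window holds up to 2k candidates because the closest objects may all lie on one side of the chosen index; the final sort by absolute difference picks the k best among them.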

【Discussion】: