What's the best technique for comparing images' similarity? Answers

【Question title】: What's the best technique for comparing images' similarity?
【Posted】: 2016-04-14 17:05:24
【Question description】:

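I have one picture, master.png, and more than 10,000 other pictures (slave_1.png, slave_2.png, ...). They all have:

The same dimensions (e.g. 100x50 pixels)
The same format (png)
The same image background

98% of the slaves are identical to the master, but 2% of the slaves have slightly different content:

New colors appear
New small shapes appear in the middle of the image

I need to spot those different slaves. I'm using Ruby, but I have no problem using a different technology.

I tried to File.binread both images and then compare them using ==. That worked for 80% of the slaves. For the other slaves it reported a difference even though the images were visually identical, so it is not reliable.

The alternatives are:

Count the number of colors present in each slave and compare with the master. It will work 100% of the time, but I don't know how to do it in Ruby in a "lightweight" way.
Use an image processing library such as RMagick or ruby-vips8 to compare by histogram. This should also work, but I need to consume as little CPU/memory as possible.
Write a C++/Go/Crystal program that reads pixel by pixel and returns the number of colors. I think we could get good performance out of this approach, but it is certainly the hard way.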
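
The first alternative, checking each slave for colors the master does not have, could look roughly like the sketch below. It is written in Python rather than Ruby and assumes the pixels have already been decoded by some image library into a flat list of (R, G, B) tuples; the helper names `color_set` and `has_new_colors` and the tiny sample images are hypothetical. It compares the sets of colors rather than just their counts, so a slave that swaps one color for another is still caught.

```python
def color_set(pixels):
    """Return the set of distinct colors in an iterable of (R, G, B) tuples."""
    return set(pixels)


def has_new_colors(master_pixels, slave_pixels):
    """True if the slave contains at least one color the master does not."""
    return not color_set(slave_pixels) <= color_set(master_pixels)


# Hypothetical 2x2 images, flattened row by row.
master = [(255, 255, 255), (255, 255, 255), (0, 0, 0), (0, 0, 0)]
same = list(master)
changed = [(255, 255, 255), (255, 0, 0), (0, 0, 0), (0, 0, 0)]

print(len(color_set(master)))           # number of distinct colors in the master
print(has_new_colors(master, same))     # identical copy
print(has_new_colors(master, changed))  # copy with one red pixel
```

Note that a check like this only reacts to new colors; a new shape drawn entirely in colors the master already uses would pass unnoticed.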

Any insights? Any suggestions?

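【Comments】:

See this question. Many options have already been discussed there.
Another note on comparing with File.binread: since you are only comparing file contents, and resource use and performance matter, it would be best to simply do this in bash. Look at diff, cmp, or md5.
If you need a classifier, this could be a job for TensorFlow.
When you say you want to do it in a lightweight way, do you really mean you don't want to use much CPU? Or do you mean you want the answer quickly, which might mean using all the CPUs for a while?
@MarkSetchell By "lightweight" I mean using as little CPU/RAM as possible.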
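
The byte-level comparison suggested in the comments (diff, cmp, md5) can be sketched in Python as below. This is only an illustration: SHA-256 is swapped in for md5, the helper names `file_digest` and `byte_identical` are hypothetical, and the demo files hold made-up bytes rather than real PNG data. Like File.binread with ==, it compares encoded bytes, so two PNG files that decode to the same pixels but were written with different compression settings or metadata would still be reported as different.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def byte_identical(master_path, other_paths):
    """Return the paths whose bytes match the master file exactly."""
    master = file_digest(master_path)
    return [path for path in other_paths if file_digest(path) == master]


# Stand-in files with made-up contents (not real PNG data), just to run the check.
with tempfile.TemporaryDirectory() as tmp:
    contents = {"master.png": b"abc", "slave_1.png": b"abc", "slave_2.png": b"abd"}
    for name, data in contents.items():
        with open(os.path.join(tmp, name), "wb") as handle:
            handle.write(data)
    slaves = [os.path.join(tmp, name) for name in ("slave_1.png", "slave_2.png")]
    matches = byte_identical(os.path.join(tmp, "master.png"), slaves)
    matching_names = [os.path.basename(path) for path in matches]

print(matching_names)
```

A digest check of this kind may still be useful as a cheap first pass: files that match byte for byte need no further work, and only the remainder has to be decoded and compared by content.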

Tags: ruby performance image-processing

【Solution 1】:

With ruby-vips, you can do it like this:

require 'vips'

# find normalised histogram of reference image
ref = VIPS::Image.new ARGV[0], :sequential => true
ref_hist = ref.hist.histnorm

# trigger a GC every few loops to keep memuse down
loop = 0

ARGV[1..-1].each do |filename|
    # find sample hist
    sample = VIPS::Image.new filename, :sequential => true
    sample_hist = sample.hist.histnorm

    # calculate sum of squares of differences, if it's over a threshold, print
    # the filename
    diff_hist = ref_hist.subtract(sample_hist).pow(2)
    diff = diff_hist.avg * diff_hist.x_size * diff_hist.y_size

    if diff > 100
        puts "#{filename}, #{diff}"
    end

    loop += 1
    if loop % 100 == 0
        GC.start
    end
end

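The occasional GC.start is needed to make Ruby free objects and keep memory from filling up. Even though it runs only once every 100 images, the script sadly still spends a lot of time in garbage collection.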

$ vips crop ~/pics/k2.jpg ref.png 0 0 100 50
$ for i in {1..10000}; do cp ref.png $i.png; done
$ time ../similarity.rb ref.png *.png
real    2m44.294s
user    7m30.696s
sys 0m20.780s
peak mem 270mb

If you are willing to consider Python, it is quite a bit faster, because it uses reference counting and does not need to scan all the time.

import sys
from gi.repository import Vips

# find normalised histogram of reference image
ref = Vips.Image.new_from_file(sys.argv[1], access = Vips.Access.SEQUENTIAL)
ref_hist = ref.hist_find().hist_norm()

for filename in sys.argv[2:]:
    # find sample hist
    sample = Vips.Image.new_from_file(filename, access = Vips.Access.SEQUENTIAL)
    sample_hist = sample.hist_find().hist_norm()

    # calculate sum of squares of difference, if it's over a threshold, print
    # the filename
    diff_hist = (ref_hist - sample_hist) ** 2
    diff = diff_hist.avg() * diff_hist.width * diff_hist.height

    if diff > 100:
        # formatted this way so the line runs under both Python 2 and Python 3
        print("%s, %s" % (filename, diff))

I see:

$ time ../similarity.py ref.png *.png
real    1m4.001s
user    1m3.508s
sys 0m10.060s
peak mem 58mb
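
The measure both scripts rely on, a sum of squared differences between two normalised histograms, can be illustrated without any image library. The sketch below is a minimal single-channel version in plain Python. It assumes 8-bit greyscale pixel values already decoded into a flat list, and it assumes that normalising means scaling the largest bin to the largest index (255); that normalisation is an approximation chosen for illustration and may not match what the vips calls do exactly. The function names and sample images are hypothetical.

```python
def histogram(values, bins=256):
    """Count how often each 8-bit value (0..bins-1) occurs."""
    counts = [0] * bins
    for value in values:
        counts[value] += 1
    return counts


def normalise(hist):
    """Scale the histogram so its largest bin equals the largest index.

    Assumption: this stands in for the normalisation step in the scripts above.
    """
    peak = max(hist)
    if peak == 0:
        return [0.0] * len(hist)
    top = len(hist) - 1
    return [count * top / peak for count in hist]


def hist_distance(ref_values, sample_values):
    """Sum of squared differences between the two normalised histograms."""
    ref = normalise(histogram(ref_values))
    sample = normalise(histogram(sample_values))
    return sum((r - s) ** 2 for r, s in zip(ref, sample))


# Hypothetical 100x50 greyscale images, flattened: a plain background,
# an identical copy, and a copy with a 20x10 shape in a new shade.
background = [200] * (100 * 50)
identical = list(background)
with_shape = [200] * (100 * 50 - 200) + [30] * 200

THRESHOLD = 100  # cut-off carried over from the scripts above
print(hist_distance(background, identical) > THRESHOLD)
print(hist_distance(background, with_shape) > THRESHOLD)
```

Because a histogram ignores where pixels are, this measure reacts to new colors and to shapes that shift the color balance, but not to content that merely moves around. The threshold of 100 is simply reused from the scripts; a suitable value depends on the image size and on how small the changes of interest are.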

【Discussion】: