Detect words and graphs in an image and slice the image into one image per word or graph

【Question title】: Detect words and graphs in image and slice image into 1 image per word or graph
【Posted】: 2020-01-05 23:28:53
【Question description】:

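I'm building a web app to help students learn math.

The app needs to display math content that comes from LaTeX files. These LaTeX files render (beautifully) to PDF, which I can convert cleanly to SVG with pdf2svg.

The image (SVG, PNG, or any other image format) looks something like this: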

 _______________________________________
|                                       |
| 1. Word1 word2 word3 word4            |
|    a. Word5 word6 word7               |
|                                       |
|   ///////////Graph1///////////        |
|                                       |
|    b. Word8 word9 word10              |
|                                       |
| 2. Word11 word12 word13 word14        |
|                                       |
|_______________________________________|

Real example: (image omitted)

The web app is meant to manipulate this content and add to it, producing something like this:

 _______________________________________
|                                       |
| 1. Word1 word2                        | <-- New line break
|_______________________________________|
|                                       |
| -> NewContent1                        |  
|_______________________________________|
|                                       |
|   word3 word4                         |  
|_______________________________________|
|                                       |
| -> NewContent2                        |  
|_______________________________________|
|                                       |
|    a. Word5 word6 word7               |
|_______________________________________|
|                                       |
|   ///////////Graph1///////////        |
|_______________________________________|
|                                       |
| -> NewContent3                        |  
|_______________________________________|
|                                       |
|    b. Word8 word9 word10              |
|_______________________________________|
|                                       |
| 2. Word11 word12 word13 word14        |
|_______________________________________|

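Example: (image omitted)

A single large image doesn't give me the flexibility for this kind of manipulation.

But if the image file were broken up into smaller files, each containing a single word or a single graph, I could do these manipulations.

I think what I need to do is detect the whitespace in the image and cut it into multiple sub-images, like this: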

 _______________________________________
|          |       |       |            |
| 1. Word1 | word2 | word3 | word4      |
|__________|_______|_______|____________|
|             |       |                 |
|    a. Word5 | word6 | word7           |
|_____________|_______|_________________|
|                                       |
|   ///////////Graph1///////////        |
|_______________________________________|
|             |       |                 |
|    b. Word8 | word9 | word10          |
|_____________|_______|_________________|
|           |        |        |         |
| 2. Word11 | word12 | word13 | word14  |
|___________|________|________|_________|

I'm looking for a way to do this. What do you think is the way to go?

Thanks for your help!

【Comments】:

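Vertical and horizontal projections. First split the whole image into lines, then split each line into columns.
Thanks Dan. I see. What tool would you use to do the vertical and horizontal projections? Can it be automated? Can it detect the rows and columns?
What you do is basically compute the average intensity of each row (e.g. using cv2.reduce). Use that to identify the white gaps between lines. Find the midpoints of the gaps. Use those as cut points to produce a set of images, one per line of text/graph. Then repeat the same thing per column.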

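Tags: opencv image-processing whitespace edge-detection

【Solution 1】:

I would use horizontal and vertical projections to first split the image into lines, and then split each line into smaller slices (e.g. words).

Start by converting the image to grayscale, then invert it, so that gaps contain zeros and any text/graphics are non-zero.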

img = cv2.imread('article.png', cv2.IMREAD_COLOR)
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img_gray_inverted = 255 - img_gray

Use cv2.reduce to compute the horizontal projection (the mean intensity of each row), and flatten it into a 1-D array.

row_means = cv2.reduce(img_gray_inverted, 1, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
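
To make the projection concrete, here is a minimal NumPy-only sketch on a tiny synthetic image. The array `inverted` is made-up test data, and `inverted.mean(axis=1)` is used as a stand-in for the per-row average that `cv2.reduce` computes above; it is an illustration, not part of the original answer.

```python
import numpy as np

# Hypothetical inverted grayscale image: 0 = background, 255 = ink.
inverted = np.zeros((6, 4), dtype=np.uint8)
inverted[1:3, 1:3] = 255   # first "text line" occupies rows 1-2
inverted[4, 0:4] = 255     # second "text line" occupies row 4

# Horizontal projection: mean intensity of every row.
row_means = inverted.mean(axis=1)

# Rows whose mean is exactly zero contain no ink at all, so they are gaps.
gap_rows = np.flatnonzero(row_means == 0)
print(gap_rows)  # [0 3 5]
```

Testing for an exact zero only works because the input is a clean, computer-generated render; a scanned page would need a small threshold instead.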

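Now find the row ranges of all contiguous gaps. You can use the function provided in this answer (https://stackoverflow.com/a/24892274/3962537, reproduced as zero_runs in the complete script).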

row_gaps = zero_runs(row_means)
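
As a quick illustration of what `zero_runs` returns, the sketch below runs it on a small hand-made array (`means` is hypothetical data). Each output row is a half-open range `[start, end)` of consecutive zeros.

```python
import numpy as np

def zero_runs(a):
    # 1 where a is 0, padded with a 0 on each end so that runs touching
    # the array boundaries are still delimited.
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    # Each run starts and ends where absdiff is 1.
    return np.where(absdiff == 1)[0].reshape(-1, 2)

# Hypothetical projection: zeros at indices 0-1, 4-6 and 8.
means = np.array([0, 0, 5, 7, 0, 0, 0, 3, 0], dtype=np.float32)
gaps = zero_runs(means)
print(gaps.tolist())  # [[0, 2], [4, 7], [8, 9]]
```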

Finally, compute the midpoints of the gaps, which we will use to cut the image.

row_cutpoints = (row_gaps[:,0] + row_gaps[:,1] - 1) // 2  # integer division: cut points are used as indices

You end up with this (gaps in pink, cut points in red):

The next step is to process each identified line.

bounding_boxes = []
for n,(start,end) in enumerate(zip(row_cutpoints, row_cutpoints[1:])):
    line = img[start:end]
    line_gray_inverted = img_gray_inverted[start:end]

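Compute the vertical projection (the mean intensity of each column), and find the gaps and cut points. In addition, compute the gap sizes, so that the small gaps between individual letters can be filtered out.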

    column_means = cv2.reduce(line_gray_inverted, 0, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
    column_gaps = zero_runs(column_means)
    column_gap_sizes = column_gaps[:,1] - column_gaps[:,0]
    column_cutpoints = (column_gaps[:,0] + column_gaps[:,1] - 1) // 2  # integer midpoints

Filter the cut points.

    filtered_cutpoints = column_cutpoints[column_gap_sizes > 5]

And create a list of bounding boxes, one per segment.

    for xstart,xend in zip(filtered_cutpoints, filtered_cutpoints[1:]):
        # Cast to plain ints so the points can be passed to cv2.rectangle.
        bounding_boxes.append(((int(xstart), int(start)), (int(xend), int(end))))

Now you get something like this (again, gaps in pink, cut points in red):

Now you can cut the image. I'll just visualize the bounding boxes that were found:
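
If the actual sub-images are wanted rather than a visualization, the cropping step could be sketched as below. This is an assumption about how the boxes would be consumed, not part of the original answer: `crop_boxes` is a hypothetical helper, `page` is synthetic data, and the boxes use the same `((x0, y0), (x1, y1))` layout as `bounding_boxes`.

```python
import numpy as np

def crop_boxes(img, bounding_boxes):
    """Return one sub-image per ((x0, y0), (x1, y1)) box (hypothetical helper)."""
    crops = []
    for (x0, y0), (x1, y1) in bounding_boxes:
        # NumPy images are indexed [row, column], i.e. [y, x].
        crops.append(img[int(y0):int(y1), int(x0):int(x1)].copy())
    return crops

# Synthetic 10x20 BGR image standing in for the rendered page.
page = np.zeros((10, 20, 3), dtype=np.uint8)
boxes = [((2, 1), (8, 5)), ((10, 1), (19, 5))]
crops = crop_boxes(page, boxes)
print([c.shape for c in crops])  # [(4, 6, 3), (4, 9, 3)]
```

Each crop could then be written to its own file, for example with `cv2.imwrite("slice_%03d.png" % i, crop)` inside a loop over `enumerate(crops)`.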

The complete script:

import cv2
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec


def plot_horizontal_projection(file_name, img, projection):
    fig = plt.figure(1, figsize=(12,16))
    gs = gridspec.GridSpec(1, 2, width_ratios=[3,1])

    ax = plt.subplot(gs[0])
    im = ax.imshow(img, interpolation='nearest', aspect='auto')
    ax.grid(which='major', alpha=0.5)

    ax = plt.subplot(gs[1])
    ax.plot(projection, np.arange(img.shape[0]), 'm')
    ax.grid(which='major', alpha=0.5)
    plt.xlim([0.0, 255.0])
    plt.ylim([-0.5, img.shape[0] - 0.5])
    ax.invert_yaxis()

    fig.suptitle("FOO", fontsize=16)
    gs.tight_layout(fig, rect=[0, 0.03, 1, 0.97])  

    fig.set_dpi(200)

    fig.savefig(file_name, bbox_inches='tight', dpi=fig.dpi)
    plt.clf() 

def plot_vertical_projection(file_name, img, projection):
    fig = plt.figure(2, figsize=(12, 4))
    gs = gridspec.GridSpec(2, 1, height_ratios=[1,5])

    ax = plt.subplot(gs[0])
    im = ax.imshow(img, interpolation='nearest', aspect='auto')
    ax.grid(which='major', alpha=0.5)

    ax = plt.subplot(gs[1])
    ax.plot(np.arange(img.shape[1]), projection, 'm')
    ax.grid(which='major', alpha=0.5)
    plt.xlim([-0.5, img.shape[1] - 0.5])
    plt.ylim([0.0, 255.0])

    fig.suptitle("FOO", fontsize=16)
    gs.tight_layout(fig, rect=[0, 0.03, 1, 0.97])  

    fig.set_dpi(200)

    fig.savefig(file_name, bbox_inches='tight', dpi=fig.dpi)
    plt.clf() 

def visualize_hp(file_name, img, row_means, row_cutpoints):
    row_highlight = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    row_highlight[row_means == 0, :, :] = [255,191,191]
    row_highlight[row_cutpoints, :, :] = [255,0,0]
    plot_horizontal_projection(file_name, row_highlight, row_means)

def visualize_vp(file_name, img, column_means, column_cutpoints):
    col_highlight = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    col_highlight[:, column_means == 0, :] = [255,191,191]
    col_highlight[:, column_cutpoints, :] = [255,0,0]
    plot_vertical_projection(file_name, col_highlight, column_means)


# From https://stackoverflow.com/a/24892274/3962537
def zero_runs(a):
    # Create an array that is 1 where a is 0, and pad each end with an extra 0.
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    # Runs start and end where absdiff is 1.
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return ranges


img = cv2.imread('article.png', cv2.IMREAD_COLOR)
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img_gray_inverted = 255 - img_gray

row_means = cv2.reduce(img_gray_inverted, 1, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
row_gaps = zero_runs(row_means)
row_cutpoints = (row_gaps[:,0] + row_gaps[:,1] - 1) // 2  # integer division: cut points are used as indices

visualize_hp("article_hp.png", img, row_means, row_cutpoints)

bounding_boxes = []
for n,(start,end) in enumerate(zip(row_cutpoints, row_cutpoints[1:])):
    line = img[start:end]
    line_gray_inverted = img_gray_inverted[start:end]

    column_means = cv2.reduce(line_gray_inverted, 0, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
    column_gaps = zero_runs(column_means)
    column_gap_sizes = column_gaps[:,1] - column_gaps[:,0]
    column_cutpoints = (column_gaps[:,0] + column_gaps[:,1] - 1) // 2

    # Keep only gaps wider than 5 px, so the spaces between letters are ignored.
    filtered_cutpoints = column_cutpoints[column_gap_sizes > 5]

    for xstart,xend in zip(filtered_cutpoints, filtered_cutpoints[1:]):
        # Cast to plain ints so the points can be passed to cv2.rectangle.
        bounding_boxes.append(((int(xstart), int(start)), (int(xend), int(end))))

    visualize_vp("article_vp_%02d.png" % n, line, column_means, filtered_cutpoints)

result = img.copy()

for bounding_box in bounding_boxes:
    cv2.rectangle(result, bounding_box[0], bounding_box[1], (255,0,0), 2)

cv2.imwrite("article_boxes.png", result)

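【Discussion】:

Thanks Dan, this is more than I expected!
If I understand correctly, OpenCV cannot load and write .svg files? SVG would allow perfect display at any scale. Can OpenCV handle any vector image format?
As far as I know, it can't. When you think about it, it isn't a raster image unless you render it, so the approach would probably have to be different. (TBH, I'd need to do some research to give you a good answer.) One possibility does come to mind, though it is only a quick idea: render it, find the bounding boxes with the current approach, then use the coordinates to find the corresponding SVG fragments.
That makes a lot of sense. I'll look into this direction (detecting bounding boxes with OpenCV and slicing the SVG). Thank you so much!
@CiprianTomoiaga Yes. It is sufficient in this case, because the input image is computer-generated (and therefore contains no noise).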
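
The idea raised in the comments above, mapping pixel bounding boxes back to SVG coordinates, might start from a plain scale conversion like the sketch below. Everything here is an assumption for illustration: `pixel_box_to_svg` is a hypothetical helper, and it presumes the PNG is a uniform render of the whole SVG with origin (0, 0) and no cropping or padding.

```python
def pixel_box_to_svg(box, png_size, svg_size):
    """Scale a pixel-space box into SVG user units (hypothetical helper).

    box is ((x0, y0), (x1, y1)); png_size and svg_size are (width, height).
    """
    (x0, y0), (x1, y1) = box
    sx = svg_size[0] / png_size[0]
    sy = svg_size[1] / png_size[1]
    return ((x0 * sx, y0 * sy), (x1 * sx, y1 * sy))

svg_box = pixel_box_to_svg(((100, 50), (300, 90)),
                           png_size=(1200, 1600),
                           svg_size=(600.0, 800.0))
print(svg_box)  # ((50.0, 25.0), (150.0, 45.0))
```

Finding which SVG elements fall inside the resulting box is a separate problem that this sketch does not address.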

【Solution 2】:

The image quality is first-rate: very clean, no skew, and the characters are well separated. A dream!

Start with binarization and blob detection (standard in OpenCV).

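Then cluster the characters by grouping those that overlap in ordinate (i.e. that face each other along a row). This naturally isolates the individual lines.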

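Now, within each line, sort the blobs from left to right and cluster them by proximity to isolate words. This is a delicate step, because the spacing between characters within a word is close to the spacing between different words. Don't expect perfect results. This should work better than projections.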
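
The two grouping steps can be sketched in plain Python on `(x, y, w, h)` boxes. This is one possible reading of the description, not the answerer's code: `group_lines`, `group_words`, and the `max_gap` threshold are hypothetical, and the sample boxes are made up.

```python
def group_lines(boxes):
    """Group (x, y, w, h) boxes whose vertical extents overlap into lines."""
    lines = []
    for box in sorted(boxes, key=lambda b: b[1]):
        x, y, w, h = box
        # Join the current line if this box starts above that line's bottom edge.
        if lines and y < lines[-1]["bottom"]:
            lines[-1]["boxes"].append(box)
            lines[-1]["bottom"] = max(lines[-1]["bottom"], y + h)
        else:
            lines.append({"bottom": y + h, "boxes": [box]})
    # Sorting the tuples orders each line's boxes left to right (by x).
    return [sorted(line["boxes"]) for line in lines]

def group_words(line, max_gap):
    """Split a left-to-right sorted line wherever the horizontal gap exceeds max_gap."""
    words = [[line[0]]]
    for prev, cur in zip(line, line[1:]):
        gap = cur[0] - (prev[0] + prev[2])
        if gap > max_gap:
            words.append([cur])
        else:
            words[-1].append(cur)
    return words

boxes = [(0, 0, 5, 10), (7, 1, 5, 9), (20, 0, 5, 10), (0, 20, 5, 10), (6, 22, 5, 8)]
lines = group_lines(boxes)
words = [group_words(line, max_gap=4) for line in lines]
print(len(lines), [len(w) for w in words])  # 2 [2, 1]
```

Choosing `max_gap` is the delicate part mentioned above; a fixed value is used here only to keep the sketch short.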

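Italics are worse, because the horizontal spacing is narrower still. You may also need to look at the "slanted distance", i.e. find the lines tangent to the characters in the direction of the italics. This can be achieved by applying a reverse shear transform.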
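
A reverse shear can be illustrated on point coordinates with NumPy alone. The helper `unshear_x` and the slant value are assumptions for illustration; `slant` is taken as dx/dy in image coordinates (y pointing down), so right-leaning italics have a negative slant.

```python
import numpy as np

def unshear_x(points, slant):
    """Apply the reverse shear x' = x - slant * y to (x, y) points (hypothetical helper)."""
    pts = np.asarray(points, dtype=float)
    out = pts.copy()
    out[:, 0] = pts[:, 0] - slant * pts[:, 1]
    return out

# A slanted stroke: x = 10 + slant * y, leaning right towards the top (y = 0).
slant = -0.25
stroke = [(10.0, 0.0), (9.0, 4.0), (8.0, 8.0)]
upright = unshear_x(stroke, slant)
print(upright[:, 0].tolist())  # [10.0, 10.0, 10.0]
```

After the reverse shear the stroke is vertical, so ordinary horizontal gaps between blobs can be measured as before. Estimating the slant itself is not covered here.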

Thanks to the grid, the graphs will show up as large blobs.

【Discussion】:

Thanks Yves, I'll look into it.