Text extraction - line-by-line: answers

【Question title】: Text extraction - line-by-line
【Posted】: 2017-02-22 12:06:41
【Question】:

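I am using the Google Vision API, primarily to extract text. It works fine, except for specific cases where I need the API to scan an entire line and output its text before moving on to the next line. Instead, the API appears to use some kind of logic that makes it scan the left side from top to bottom, then move to the right side and scan from top to bottom again. I would prefer it to read left to right, move down one line, and so on.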

For example, consider the image:

The API returns the text like this:

“ Name DOB Gender: Lives In John Doe 01-Jan-1970 LA ”

However, I would have expected something like this:

“ Name: John Doe DOB: 01-Jan-1970 Gender: M Lives In: LA ”

I guess there is a way to define a block size or margin setting (?) so that the image/scan is read line by line?

Thanks for your help. Alex

【Comments】:

Tags: google-cloud-vision google-vision

【Solution 1】:

This may be a late answer, but I am adding it for future reference. You can add a feature hint to the JSON request to get the desired result.

{
  "requests": [
    {
      "image": {
        "source": {
          "imageUri": "https://i.stack.imgur.com/TRTXo.png"
        }
      },
      "features": [
        {
          "type": "DOCUMENT_TEXT_DETECTION"
        }
      ]
    }
  ]
}
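
For anyone building this request from code rather than by hand, here is a minimal Python sketch that produces the same body. The helper name `build_request` is made up for illustration. Authentication and the actual HTTP call are left out on purpose, since they depend on your setup.

```python
import json

# Public REST endpoint for synchronous image annotation.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_request(image_uri, feature_type="DOCUMENT_TEXT_DETECTION"):
    """Return the request body shown above as a Python dict (hypothetical helper)."""
    return {
        "requests": [
            {
                "image": {"source": {"imageUri": image_uri}},
                "features": [{"type": feature_type}],
            }
        ]
    }


body = build_request("https://i.stack.imgur.com/TRTXo.png")
payload = json.dumps(body, indent=2)
print(payload)
# Sending is not shown: POST `payload` to VISION_ENDPOINT with your own credentials.
```

Swapping `feature_type` for `"TEXT_DETECTION"` gives the request without the hint, which makes it easy to compare the two results on the same image.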

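For text whose parts are very far apart, DOCUMENT_TEXT_DETECTION does not provide proper line segmentation either.

The following code does a simple line segmentation based on the character polygon coordinates.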

https://github.com/sshniro/line-segmentation-algorithm-to-gcp-vision
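
The idea of segmenting lines with polygon tests can be sketched in Python as follows. This is not the code from the repository above, only a simplified stand-in. It assumes each word comes with four vertices ordered top-left, top-right, bottom-right, bottom-left, stretches the box of a seed word across the page along its own slope, and collects every word whose center falls inside that band. The names `segment_lines`, `stretch_across_page` and the sample coordinates are invented for the example.

```python
def box(x0, y0, x1, y1):
    """Axis-aligned box as (TL, TR, BR, BL), the vertex order assumed here."""
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]


def center(vertices):
    return (sum(x for x, _ in vertices) / 4, sum(y for _, y in vertices) / 4)


def y_on_line(p, q, x):
    """y at horizontal position x on the straight line through p and q."""
    (px, py), (qx, qy) = p, q
    if qx == px:
        return py
    return py + (x - px) * (qy - py) / (qx - px)


def stretch_across_page(vertices, page_width):
    """Extend a word box to both page edges, following its top and bottom edges."""
    top_left, top_right, bottom_right, bottom_left = vertices
    return [
        (0, y_on_line(top_left, top_right, 0)),
        (page_width, y_on_line(top_left, top_right, page_width)),
        (page_width, y_on_line(bottom_left, bottom_right, page_width)),
        (0, y_on_line(bottom_left, bottom_right, 0)),
    ]


def point_in_polygon(point, polygon):
    """Ray-casting test: True if point lies inside the polygon."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        # Only edges that straddle the horizontal ray through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def segment_lines(words, page_width):
    """Group words into lines, top to bottom, each line ordered left to right."""
    remaining = sorted(words, key=lambda w: center(w["vertices"])[1])
    lines = []
    while remaining:
        seed = remaining[0]
        band = stretch_across_page(seed["vertices"], page_width)
        line, rest = [seed], []
        for word in remaining[1:]:
            if point_in_polygon(center(word["vertices"]), band):
                line.append(word)
            else:
                rest.append(word)
        line.sort(key=lambda w: center(w["vertices"])[0])
        lines.append(" ".join(w["text"] for w in line))
        remaining = rest
    return lines


# Made-up coordinates, listed column by column like the output in the question.
words = [
    {"text": "Name:", "vertices": box(10, 10, 60, 30)},
    {"text": "DOB:", "vertices": box(10, 50, 50, 70)},
    {"text": "John", "vertices": box(300, 10, 340, 30)},
    {"text": "Doe", "vertices": box(350, 10, 390, 30)},
    {"text": "01-Jan-1970", "vertices": box(300, 50, 400, 70)},
]
result = segment_lines(words, page_width=500)
print(result)
```

Because the band follows the slope of the seed word, a slightly tilted line is still collected as one line, as long as the seed box itself is tilted the same way.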

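【Comments】:

I saw this code and it is short to read, but I want to use it in Java. How can I convert it?
The syntax is roughly the same. The algorithm uses a polygon calculation library, so in Java you should use a similar library to determine whether a point lies inside a polygon.
Thanks, in Java I used: spatial overlap of two rectangles
This JavaScript code works for me, but can I get the same code for Python?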

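【Solution 2】:

Here is a simple piece of code that reads line by line. The y-axis identifies the lines and the x-axis orders the words within a line.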

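items = []
lines = {}  # top y of a line -> [(top y, bottom y), [(x, word), ...]]

# text_annotations[0] holds the whole text, so start from the second entry.
for text in response.text_annotations[1:]:
    top_x_axis = text.bounding_poly.vertices[0].x
    top_y_axis = text.bounding_poly.vertices[0].y
    bottom_y_axis = text.bounding_poly.vertices[3].y

    # Register a candidate line that starts at this word's top edge.
    if top_y_axis not in lines:
        lines[top_y_axis] = [(top_y_axis, bottom_y_axis), []]

    # Attach the word to the first line whose bottom edge lies below the word's top edge.
    for s_top_y_axis, s_item in lines.items():
        if top_y_axis < s_item[0][1]:
            lines[s_top_y_axis][1].append((top_x_axis, text.description))
            break

# Candidate lines that never received a word are skipped.
for _, item in lines.items():
    if item[1]:
        words = sorted(item[1], key=lambda t: t[0])
        items.append((item[0], ' '.join([word for _, word in words]), words))

print(items)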

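【Comments】:

For some reason, Google Vision splits totals apart. For example: 161.765,31. It splits it into five words [161, ., 765, ,, 31]. Am I missing some configuration?
It does not work if the image is slightly rotated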
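
Regarding the split totals: I do not know of a request setting that changes this, but the pieces can be glued back together after the words of a line have been collected. A rough sketch, assuming each token is given as `(x_left, x_right, text)` and that tokens belonging to one number sit closer together than ordinary words. Both `join_close_tokens` and the `max_gap` value are assumptions for illustration, not part of the API.

```python
def join_close_tokens(tokens, max_gap):
    """Join the tokens of ONE line; tokens closer than max_gap pixels get no space.

    tokens: list of (x_left, x_right, text) tuples.
    """
    pieces = []
    prev_right = None
    for x_left, x_right, text in sorted(tokens):
        if prev_right is not None and x_left - prev_right < max_gap:
            # Tiny gap: treat as part of the previous token (e.g. "161" + ".").
            pieces[-1] += text
        else:
            pieces.append(text)
        prev_right = x_right
    return " ".join(pieces)


# Made-up coordinates for the example from the comment above.
tokens = [
    (100, 130, "161"),
    (131, 134, "."),
    (135, 165, "765"),
    (166, 169, ","),
    (170, 190, "31"),
    (60, 90, "Total"),
]
joined = join_close_tokens(tokens, max_gap=5)
print(joined)  # Total 161.765,31
```

A fixed pixel gap is fragile across image resolutions; deriving it from the height of the word boxes is probably a better starting point.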

【Solution 3】:

You can also extract the text based on the bounds of each line: use boundingPoly and concatenate the texts that lie on the same line.

"boundingPoly": {
        "vertices": [
          {
            "x": 87,
            "y": 148
          },
          {
            "x": 411,
            "y": 148
          },
          {
            "x": 411,
            "y": 206
          },
          {
            "x": 87,
            "y": 206
          }
        ]
}

For example, these two words are on the same "line":

"description": "you",
      "boundingPoly": {
        "vertices": [
          {
            "x": 362,
            "y": 1406
          },
          {
            "x": 433,
            "y": 1406
          },
          {
            "x": 433,
            "y": 1448
          },
          {
            "x": 362,
            "y": 1448
          }
        ]
      }
    },
    {
      "description": "start",
      "boundingPoly": {
        "vertices": [
          {
            "x": 446,
            "y": 1406
          },
          {
            "x": 540,
            "y": 1406
          },
          {
            "x": 540,
            "y": 1448
          },
          {
            "x": 446,
            "y": 1448
          }
        ]
      }
    }
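
A minimal Python sketch of that idea, working directly on annotations in the JSON shape shown above. Words are put on the same line when their vertical ranges overlap enough. The function names and the `min_overlap` value of 0.5 are assumptions, and the `"Hello"` entry is invented so that the example contains a second line.

```python
def y_range(annotation):
    # .get(..., 0) guards against a coordinate key being absent.
    ys = [v.get("y", 0) for v in annotation["boundingPoly"]["vertices"]]
    return min(ys), max(ys)


def left_x(annotation):
    return min(v.get("x", 0) for v in annotation["boundingPoly"]["vertices"])


def overlap_ratio(a, b):
    """Vertical overlap of two (top, bottom) ranges, relative to the shorter one."""
    shorter = min(a[1] - a[0], b[1] - b[0])
    if shorter <= 0:
        return 0.0
    return max(0, min(a[1], b[1]) - max(a[0], b[0])) / shorter


def group_by_line(annotations, min_overlap=0.5):
    lines = []  # each entry: [(top, bottom) of the first word, [annotations]]
    for ann in sorted(annotations, key=lambda a: y_range(a)[0]):
        rng = y_range(ann)
        for line in lines:
            if overlap_ratio(line[0], rng) >= min_overlap:
                line[1].append(ann)
                break
        else:
            lines.append([rng, [ann]])
    return [
        " ".join(a["description"] for a in sorted(members, key=left_x))
        for _, members in lines
    ]


def word(text, x0, y0, x1, y1):
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    return {
        "description": text,
        "boundingPoly": {"vertices": [{"x": x, "y": y} for x, y in corners]},
    }


annotations = [
    word("start", 446, 1406, 540, 1448),
    word("you", 362, 1406, 433, 1448),
    word("Hello", 87, 148, 411, 206),  # invented text for the first box above
]
grouped = group_by_line(annotations)
print(grouped)  # ['Hello', 'you start']
```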

【Comments】:

Thanks, that is one possibility.

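【Solution 4】:

For Java users (full code)

I simply build paragraphs using a height threshold. For every text we take a lower vertex of its bounding rectangle (the code uses index 2, which in Vision's usual vertex order is the bottom-right corner) and determine which rectangles belong to the same paragraph (a horizontal band of predefined height).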

    // Lower vertex of the bounding rectangle -> description.
    // The first annotation holds the full page text, so it is skipped.
    Map<Vertex, String> vertexToText = new HashMap<>();
    result.getTextAnnotationsList()
        .stream()
        .skip(1)
        .forEach(annotation ->
            vertexToText.put(annotation.getBoundingPoly().getVerticesList().get(2),
                annotation.getDescription()));

Next we need to group the texts into paragraphs:

/**
 * Vertex to text is grouped by the defined paragraph height
 * The resulting map is paragraph grouped y -> map vertex x to text
 */
private Map<Integer, Map<Integer, String>> groupParagraphs(Map<Vertex, String> vertexToText) {

    // java.util.TreeMap keeps the buckets sorted by y, so lines come out top to bottom.
    Map<Integer, Map<Integer, String>> paragraphGroups = new TreeMap<>();
    vertexToText.forEach((k, v) -> {
        // This is the paragraph 'bucket'. PARAGRAPH_HEIGHT is a pixel constant
        // that has to be defined in the enclosing class.
        Integer key = k.getY() / PARAGRAPH_HEIGHT;
        paragraphGroups.computeIfAbsent(key, unused -> new HashMap<>()).put(k.getX(), v);
    });

    return paragraphGroups;

}

Then we can extract the result as a list of strings:

private List<String> extractParagraphs(AnnotateImageResponse result) {
    // Lower vertex of the bounding rectangle -> description.
    // The first annotation holds the full page text, so it is skipped.
    Map<Vertex, String> vertexToText = new HashMap<>();
    result.getTextAnnotationsList()
        .stream()
        .skip(1)
        .forEach(annotation ->
            vertexToText.put(annotation.getBoundingPoly().getVerticesList().get(2),
                annotation.getDescription()));

    Map<Integer, Map<Integer, String>> paragraphGroups = groupParagraphs(vertexToText);

    // Stream.toList() requires Java 16 or newer.
    return paragraphGroups
        .values()
        .stream()
        .map(this::orderedByX)
        .map(lst -> String.join(" ", lst))
        .toList();
}

The orderedByX method simply sorts the texts by their x coordinate so that we get a meaningful message:

/**
 * The x-to-text map holds essentially all the texts in that paragraph.
 * The problem is that they are shuffled, so we need to order them by x
 * to get a meaningful sentence.
 */
private List<String> orderedByX(Map<Integer, String> xToText) {
    return xToText
        .entrySet()
        .stream()
        .sorted(Map.Entry.comparingByKey())
        .map(Map.Entry::getValue)
        .toList();
}
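
Since the comments above also ask for Python, the same bucketing idea can be sketched there as well. This is a loose port and not a line-by-line translation: words are assumed to arrive as `(x, y, text)` tuples, where `(x, y)` is one lower corner of the word box, and the value 20 for `PARAGRAPH_HEIGHT` is only a placeholder.

```python
from collections import defaultdict

PARAGRAPH_HEIGHT = 20  # hypothetical band height in pixels; tune per document


def group_paragraphs(words, paragraph_height=PARAGRAPH_HEIGHT):
    """Bucket (x, y, text) tuples by y // paragraph_height, then order by x."""
    buckets = defaultdict(list)
    for x, y, text in words:
        buckets[y // paragraph_height].append((x, text))
    # Sorted bucket keys give top-to-bottom order; sorted tuples give left-to-right.
    return [
        " ".join(text for _, text in sorted(buckets[key]))
        for key in sorted(buckets)
    ]


# Made-up coordinates, listed column by column like the output in the question.
words = [
    (60, 30, "Name:"),
    (50, 70, "DOB:"),
    (340, 31, "John"),
    (390, 29, "Doe"),
    (400, 70, "01-Jan-1970"),
]
paragraphs = group_paragraphs(words)
print(paragraphs)  # ['Name: John Doe', 'DOB: 01-Jan-1970']

# Weak spot of fixed buckets: y=39 and y=41 are 2 pixels apart but land in
# different buckets (39 // 20 == 1, 41 // 20 == 2).
split = group_paragraphs([(10, 39, "a"), (50, 41, "b")])
print(split)  # ['a', 'b']
```

The second call shows why the bucket height matters: words that straddle a bucket boundary are separated even when they clearly sit on one line.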

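【Comments】:

【Solution 5】:

Inspired by Borislav's answer, I just wrote something for Python that also works for handwriting. It is messy and I am new to Python, but I think you can get the idea of how to do this.

A class to hold some extended data for each word, for example the average y position of a word, which I use to calculate the differences between words: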

import re
from operator import attrgetter

import numpy as np

class ExtendedAnnotation:
    def __init__(self, annotation):
        self.vertex = annotation.bounding_poly.vertices
        self.text = annotation.description
        self.avg_y = (self.vertex[0].y + self.vertex[1].y + self.vertex[2].y + self.vertex[3].y) / 4
        self.height = ((self.vertex[3].y - self.vertex[1].y) + (self.vertex[2].y - self.vertex[0].y)) / 2
        self.start_x = (self.vertex[0].x + self.vertex[3].x) / 2

    def __repr__(self):
        return '{' + self.text + ', ' + str(self.avg_y) + ', ' + str(self.height) + ', ' + str(self.start_x) + '}'

Create the objects from that data:

def get_extended_annotations(response):
    extended_annotations = []
    for annotation in response.text_annotations:
        extended_annotations.append(ExtendedAnnotation(annotation))

    # delete the first item, as it holds the whole text.
    del extended_annotations[0]
    return extended_annotations

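Calculate the threshold.
First, all words are sorted by their y position, defined as the average of all 4 corners of a word. The x position is not relevant at this point. Then the difference between each word and the word after it is calculated. For a perfectly straight line of words you would expect the difference in y position between any two consecutive words to be 0. Even for handwriting it should be around 1 ~ 10.
However, whenever there is a line break, the difference between the last word of the previous line and the first word of the new line is much larger than that, for example 50 or 60.
So, to decide whether there should be a line break between two words, the standard deviation of the differences is used.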

def get_threshold_for_y_difference(annotations):
    annotations.sort(key=attrgetter('avg_y'))
    differences = []
    for i in range(0, len(annotations)):
        if i == 0:
            continue
        differences.append(abs(annotations[i].avg_y - annotations[i - 1].avg_y))
    return np.std(differences)
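
To see what this threshold looks like on concrete numbers, here is a small worked example with invented `avg_y` values for two lines of text. It uses `statistics.pstdev` from the standard library, which computes the same population standard deviation as `np.std` with its default arguments, so the example runs without NumPy.

```python
from statistics import pstdev

# Invented average y positions: four words on one line, three on the next.
avg_ys = sorted([100, 102, 101, 103, 160, 158, 161])

# Difference between each word and the previous one, as in the function above.
differences = [abs(b - a) for a, b in zip(avg_ys, avg_ys[1:])]
threshold = pstdev(differences)

# Only differences above the threshold are treated as line breaks.
breaks = [d for d in differences if d > threshold]

print(differences)          # [1, 1, 1, 55, 2, 1]
print(round(threshold, 2))  # 20.05
print(breaks)               # [55]
```

One thing to watch: when the page holds a single line, all differences are small, so the standard deviation can end up below the normal jitter and ordinary words get split. A minimum threshold, for example a fraction of the word height, may be worth adding for that case.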

Once the threshold has been calculated, the list of all words is grouped into lines accordingly.

def group_annotations(annotations, threshold):
    annotations.sort(key=attrgetter('avg_y'))
    line_index = 0
    text = [[]]
    for i in range(0, len(annotations)):
        if i == 0:
            text[line_index].append(annotations[i])
            continue
        y_difference = abs(annotations[i].avg_y - annotations[i - 1].avg_y)
        if y_difference > threshold:
            line_index = line_index + 1
            text.append([])
        text[line_index].append(annotations[i])
    return text

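Finally, each line is sorted by x position so that its words are in the right order from left to right.
Then a small regular expression removes the whitespace in front of punctuation.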

def sort_and_combine_grouped_annotations(annotation_lists):
    grouped_list = []
    for annotation_group in annotation_lists:
        annotation_group.sort(key=attrgetter('start_x'))
        texts = (o.text for o in annotation_group)
        texts = ' '.join(texts)
        texts = re.sub(r'\s([-;:?.!](?:\s|$))', r'\1', texts)
        grouped_list.append(texts)
    return grouped_list
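
The effect of that regular expression is easiest to see on a couple of strings. The snippet below applies the same pattern in isolation; the helper name `tidy` and the sample strings are made up.

```python
import re

# Same pattern as above: a whitespace character, then one of - ; : ? . !
# followed by whitespace or the end of the string.
PUNCT_SPACE = re.compile(r'\s([-;:?.!](?:\s|$))')


def tidy(line):
    """Drop the space that joining words with ' ' leaves in front of punctuation."""
    return PUNCT_SPACE.sub(r'\1', line)


print(tidy("Name : John Doe ."))  # Name: John Doe.
print(tidy("Total , 31"))         # Total , 31  (comma is not in the character class)
```

Commas are left alone because they are not part of the character class, so extend the class if your text needs that.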

【Comments】: