Get lines and paragraphs, not symbols, from Google Vision API OCR on PDF

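【Question title】: Get Lines and Paragraphs, not symbols from Google Vision API OCR on PDF
【Posted】: 2025-12-19 11:20:19
【Question】:

I am trying to use the PDF/TIFF document text detection that the Google Cloud Vision API now supports. Using their sample code, I can submit a PDF and receive a JSON object with the extracted text. My problem is that the JSON file saved to GCS contains bounding boxes and text only for "symbols", i.e. every character of every word. This makes the JSON object very bulky and hard to use. I would like to get the text and bounding boxes for "LINES", "PARAGRAPHS" and "BLOCKS", but I cannot find a way to do that through the AsyncAnnotateFileRequest() method.

The sample code is as follows: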

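# Written against google-cloud-vision < 2.0 (vision.types / vision.enums).
import re

from google.cloud import storage
from google.cloud import vision
from google.protobuf import json_format


def async_detect_document(gcs_source_uri, gcs_destination_uri):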
    """OCR with PDF/TIFF as source files on GCS"""
    # Supported mime_types are: 'application/pdf' and 'image/tiff'
    mime_type = 'application/pdf'

    # How many pages should be grouped into each json output file.
    batch_size = 2

    client = vision.ImageAnnotatorClient()

    feature = vision.types.Feature(
        type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)

    gcs_source = vision.types.GcsSource(uri=gcs_source_uri)
    input_config = vision.types.InputConfig(
        gcs_source=gcs_source, mime_type=mime_type)

    gcs_destination = vision.types.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.types.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size)

    async_request = vision.types.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config,
        output_config=output_config)

    operation = client.async_batch_annotate_files(
        requests=[async_request])

    print('Waiting for the operation to finish.')
    operation.result(timeout=180)

    # Once the request has completed and the output has been
    # written to GCS, we can list all the output files.
    storage_client = storage.Client()

    match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
    bucket_name = match.group(1)
    prefix = match.group(2)

    # Pass the name positionally; the keyword name differs between releases.
    bucket = storage_client.get_bucket(bucket_name)

    # List objects with the given prefix.
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print('Output files:')
    for blob in blob_list:
        print(blob.name)

    # Process the first output file from GCS.
    # Since we specified batch_size=2, the first response contains
    # the first two pages of the input file.
    output = blob_list[0]

    json_string = output.download_as_string()
    response = json_format.Parse(
        json_string, vision.types.AnnotateFileResponse())

    # The actual response for the first page of the input file.
    first_page_response = response.responses[0]
    annotation = first_page_response.full_text_annotation

    # Here we print the full text from the first page.
    # The response contains more information:
    # annotation/pages/blocks/paragraphs/words/symbols
    # including confidence scores and bounding boxes
    print(u'Full text:\n{}'.format(
        annotation.text))

【Question comments】:

*.com/questions/42391009/…

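Tags: python google-cloud-platform google-cloud-vision

【Solution 1】:

Unfortunately, with the DOCUMENT_TEXT_DETECTION type you only get either the full text of each page or the individual symbols. It is not hard to assemble paragraphs and lines from the symbols, though; something like this should work (extending your example):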

# Break-type enum from the pre-2.0 client used in the question.
breaks = vision.enums.TextAnnotation.DetectedBreak.BreakType
paragraphs = []
lines = []

for page in annotation.pages:
    for block in page.blocks:
        for paragraph in block.paragraphs:
            para = ""
            line = ""
            for word in paragraph.words:
                for symbol in word.symbols:
                    line += symbol.text
                    break_type = symbol.property.detected_break.type
                    if break_type == breaks.SPACE:
                        line += ' '
                    elif break_type == breaks.EOL_SURE_SPACE:
                        # Line-wrapping break: close the current line.
                        line += ' '
                        lines.append(line)
                        para += line
                        line = ''
                    elif break_type == breaks.LINE_BREAK:
                        # Break at the end of a paragraph: close the current line.
                        lines.append(line)
                        para += line
                        line = ''
            # Keep trailing text that was not followed by a detected break.
            if line:
                lines.append(line)
                para += line
            paragraphs.append(para)

print(paragraphs)
print(lines)
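
The question also asks for bounding boxes, which the loop above does not collect. Below is a rough sketch of one way to get a box per line by merging the boxes of the words on that line. It works on the JSON written to GCS loaded as plain dicts rather than on the protobuf objects, and it assumes the usual camelCase JSON keys (`boundingBox`, `normalizedVertices`, `detectedBreak`) and string break types such as `"SPACE"`. The helper names (`union_box`, `word_vertices`, `lines_with_boxes`, `_word`) and the sample data are made up for illustration and are not part of the Vision API.

```python
SPACES = {"SPACE", "SURE_SPACE", "EOL_SURE_SPACE"}
LINE_ENDINGS = {"EOL_SURE_SPACE", "HYPHEN", "LINE_BREAK"}


def union_box(boxes):
    """Smallest (x_min, y_min, x_max, y_max) covering all vertices, or None."""
    # Zero coordinates may be omitted from the JSON, hence the 0.0 default.
    xs = [v.get("x", 0.0) for box in boxes for v in box]
    ys = [v.get("y", 0.0) for box in boxes for v in box]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def word_vertices(word):
    """Vertices of a word box; PDF output is assumed to use normalizedVertices."""
    box = word.get("boundingBox", {})
    return box.get("normalizedVertices") or box.get("vertices") or []


def lines_with_boxes(full_text_annotation):
    """Return [{"text": ..., "box": ...}], one entry per detected line."""
    results = []
    for page in full_text_annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                text, boxes = "", []
                for word in paragraph.get("words", []):
                    boxes.append(word_vertices(word))
                    for symbol in word.get("symbols", []):
                        text += symbol.get("text", "")
                        # .get() tolerates symbols that carry no "property".
                        break_type = (symbol.get("property", {})
                                      .get("detectedBreak", {})
                                      .get("type"))
                        if break_type in SPACES:
                            text += " "
                        if break_type in LINE_ENDINGS:
                            results.append({"text": text.rstrip(),
                                            "box": union_box(boxes)})
                            text, boxes = "", []
                # Trailing words that were not followed by a line-ending break.
                if text.strip():
                    results.append({"text": text.rstrip(),
                                    "box": union_box(boxes)})
    return results


def _word(text, break_type, x0, y0, x1, y1):
    """Build a hypothetical word dict shaped like the JSON output (demo only)."""
    symbols = [{"text": ch} for ch in text]
    symbols[-1]["property"] = {"detectedBreak": {"type": break_type}}
    corners = [{"x": x0, "y": y0}, {"x": x1, "y": y0},
               {"x": x1, "y": y1}, {"x": x0, "y": y1}]
    return {"boundingBox": {"normalizedVertices": corners}, "symbols": symbols}


# Made-up sample: two words on the first line, one word on the second.
sample = {"pages": [{"blocks": [{"paragraphs": [{"words": [
    _word("Hello", "SPACE", 0.1, 0.1, 0.3, 0.2),
    _word("world", "EOL_SURE_SPACE", 0.35, 0.1, 0.6, 0.2),
    _word("again", "LINE_BREAK", 0.1, 0.25, 0.3, 0.35),
]}]}]}]}

for entry in lines_with_boxes(sample):
    print(entry["text"], entry["box"])
```

With a real output file, the input would presumably be something like `json.loads(output.download_as_string())["responses"][0]["fullTextAnnotation"]`. Paragraph and block boxes need no merging, since each paragraph and block in the response already carries its own `boundingBox`.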

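【Comments】:

This solution gives the same thing as the annotation.Text property, which is already built in.
No, it doesn't: the original question already uses annotation.text, and that is exactly what they are asking about: it does not break the response down into lines and paragraphs. This solution does.
In my case I got the same result from annotation.text and from your code. Don't get me wrong, I like the break-type filtering, which is why I upvoted this answer, but it didn't improve my output.
Yes, the result is the same; the question is about the structure of the result.
One thing I found with this code is that symbol.property may not exist, which triggers an AttributeError. So I wrapped the if symbol.property... lines in a try/except AttributeError block and ignored the error with pass.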
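
As an alternative to the try/except in the last comment, the lookup can be pulled into a small helper that returns None when any level is missing. This is only a sketch: `detected_break_type` is a made-up name, the `SimpleNamespace` objects are stand-ins for real response objects, and the fallback to `type_` rests on the assumption that 2.x releases of the Python client rename the field that way.

```python
from types import SimpleNamespace


def detected_break_type(symbol):
    """Return the symbol's break type, or None if any level is missing."""
    prop = getattr(symbol, "property", None)
    detected = getattr(prop, "detected_break", None)
    # Assumption: older clients expose `type`, newer ones `type_`.
    for name in ("type", "type_"):
        value = getattr(detected, name, None)
        if value is not None:
            return value
    return None


# Stand-in symbols for illustration; real ones come from the API response.
bare_symbol = SimpleNamespace(text="a")
old_style = SimpleNamespace(
    text="b",
    property=SimpleNamespace(detected_break=SimpleNamespace(type=1)))
new_style = SimpleNamespace(
    text="c",
    property=SimpleNamespace(detected_break=SimpleNamespace(type_=3)))

for sym in (bare_symbol, old_style, new_style):
    print(sym.text, detected_break_type(sym))
```

Inside the answer's loop this would replace the repeated `symbol.property.detected_break.type` lookups with a single `break_type = detected_break_type(symbol)` before the comparisons.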