Python - Parsing PDF Data into a Table Format

[Question title]: Python - Parsing PDF Data into a Table Format
[Posted]: 2018-08-31 04:09:18
[Question]:

I am trying to replicate the data from the tables in this PDF: http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf

My current code only pulls the second page of the first table, which is page 11 of the document (labeled page 2). Here is the code I am using:

import io, re
import PyPDF2
import requests

url = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'

r = requests.get(url)
f = io.BytesIO(r.content)

reader = PyPDF2.PdfFileReader(f)
contents = reader.getPage(10).extractText()

data = re.sub( r"([A-Z])", r" \1", contents).split()

csv = open('AWStest.csv', 'w')
csv.write(contents)
csv.close()

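I can currently pull the data in a rough CSV format, but I cannot figure out how to parse it so that what I store matches the table it came from. Here is what it currently looks like; every gap is a line break in the CSV: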

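Col Element Data Element Name Date ModifiedFormatLengthDescriptionElement Submission Guideline Comments Condition (Denominator) Recommended Threshold Member Eligibility Data Contents Guide December 5, 2013 3ME003Insurance Type Code/Product April 1, 2013 Lookup Table

Text 2Type / Product Identification CodeReport the code that defines the type of insurance under which this member's eligibility is maintained. Example: HM=HMO Code Description 9Self-pay 11Other Non-Federal Programs *(use of this value requires disclosure to Data Manager prior to submission) 12Preferred Provider Organization (PPO) *13Point of Service (POS) *14Exclusive Provider Organization (EPO) *15Indemnity Insurance 16Health Maintenance Organization (HMO) Medicare Risk (use to report Medicare Part C/Medicare Advantage Plans) 17Dental Maintenance Organization (DMO) *96Husky Health A97Husky Health B98Husky Health C99Husky Health DAMAutomobile Medical *CHChampus (now TRICARE) *DSDisability *HMHealth Maintenance Organization *LMLiability Medical MAMedicare Part A (Medicare Fee for Service only) Medicare Part B *(Medicare Fee for Service only) MCMedicaid *Medicare Part DOFOther Federal Program (use of this value requires disclosure to Data Manager prior to submission) TVTitle VVAVeterans Affairs Plan *WCWorkers' Compensation *ZZMutually Defined *(use of this value requires disclosure to Data Manager prior to submission) All96.0%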

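This sample represents the header row and the first row of data. I have been able to break the words apart based on capitalization, but unfortunately that also breaks fully capitalized words into single letters. I used this code: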

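import re

with open('AWStest.csv', 'r') as fcsv:
    for line in fcsv:
        line = line.strip()
        # Start a new token at every capital letter; all-caps words such as
        # "HMO" therefore come out as single letters ("H", "M", "O").
        print(re.findall('[A-Z][^A-Z]*', line))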

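I need help figuring out the best way to reproduce this full table in a format that lets me load it into a NoSQL database and query individual rows for their requirements in order to generate reports. What is the best way to extend my code to do this? Is there a better way to scrape the PDF into a more accurate format?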

[Question comments]:

Tags: python parsing pdf web-scraping

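[Solution 1]:

It sounds like the position of the text on the page would help you a lot. I recommend using PyMuPDF to extract the text together with its location data so that you can work out which row each piece of text belongs to.

Here is a code sample that writes the text and its positions to a *.csv file. Hopefully this helps you get started mining the information with Python.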

#!python3
""" Use PyMuPDF to extract text to *.csv file. """
import csv
import json
import os
import sys

import fitz

assert len(sys.argv) == 2, 'Pass file name as parameter'

srcfilename = sys.argv[1]
assert os.path.isfile(srcfilename), 'File {} does not exist'.format(srcfilename)

dstfilename = '{}.csv'.format(srcfilename)
with open(dstfilename, 'w', encoding='utf-8', errors='ignore', newline='') as dstfile:
    writer = csv.writer(dstfile)
    writer.writerow([
        'PAGE',
        'X1',
        'Y1',
        'X2',
        'Y2',
        'TEXT',
    ])
    document = fitz.open(srcfilename)
    for page_number in range(document.page_count):
        page = document.load_page(page_number)
        text_dict = json.loads(page.get_text('json'))
        for block in text_dict['blocks']:
            if block['type'] != 0:  # 0 = text block, 1 = image block
                continue
            for line in block['lines']:
                for span in line['spans']:
                    writer.writerow([
                        page_number,
                        span['bbox'][0],
                        span['bbox'][1],
                        span['bbox'][2],
                        span['bbox'][3],
                        span['text'],
                    ])
    document.close()
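
Once that CSV exists, the x-coordinates that keep recurring are a reasonable first guess at where each table column starts. The following is a minimal sketch, assuming the PAGE, X1, Y1, X2, Y2, TEXT layout written by the script above. The helper name `column_starts`, the `min_count` cutoff, and the sample rows are made up for illustration and are not taken from the actual PDF.

```python
import csv
import io
from collections import Counter


def column_starts(csv_file, page, min_count=2):
    """Return the sorted left edges (X1, rounded down) that recur on one page."""
    counts = Counter()
    for record in csv.DictReader(csv_file):
        if int(record['PAGE']) == page:
            counts[int(float(record['X1']))] += 1
    # Keep only x positions shared by several spans; one-off positions are
    # more likely wrapped or indented text than the start of a column.
    return sorted(x for x, n in counts.items() if n >= min_count)


# Hypothetical rows in the PAGE,X1,Y1,X2,Y2,TEXT layout (not real PDF data).
sample = io.StringIO(
    'PAGE,X1,Y1,X2,Y2,TEXT\n'
    '10,30.2,120.0,40.0,130.0,3\n'
    '10,75.1,120.0,100.0,130.0,ME003\n'
    '10,30.4,300.0,40.0,310.0,4\n'
    '10,75.3,300.0,100.0,310.0,ME004\n'
    '10,200.0,310.0,220.0,320.0,stray\n'
)
starts = column_starts(sample, page=10)
print(starts)
```

With a real file, pass an open file object (opened with `newline=''` and `encoding='utf-8'`) instead of the in-memory sample, then compare the suggested edges against the page by eye before relying on them.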

Here is some code I wrote to mine your PDF and put the contents into a better-formatted *.csv file:

#!python3
import collections
import csv
import json
import os

import fitz  # PyMuPDF package


class MemberEligibility(object):

    """ Row in Member Eligibility Data Contents Guide table. """

    def __init__(self):
        """
        Initialize object. I've made all fields strings but you may want some to
        be dates or integers.
        """
        self.col = ''
        self.element = ''
        self.data_element_name = ''
        self.date_modified = ''
        self.fmt = ''
        self.length = ''
        self.description = ''
        self.comments = ''
        self.condition = ''
        self.recommended_threshold = ''


def get_sorted_list(document, page_number):
    """
    Get text on specified page of document in sorted list. Each list item is a
    (top-left y-coordinate, top-left x-coordinate, text) tuple. List sorted
    top-to-bottom and then left-to-right. Coordinates converted to integers so
    text with slightly different y-coordinates line up.
    """
    text_dict = json.loads(document.load_page(page_number).get_text('json'))
    text_list = []
    for block in text_dict['blocks']:
        if block['type'] == 0:  # 0 = text block, 1 = image block
            for line in block['lines']:
                for span in line['spans']:
                    text_list.append((
                        int(span['bbox'][1]),  # Top-left y-coordinate
                        int(span['bbox'][0]),  # Top-left x-coordinate
                        span['text'],          # Text itself
                    ))
    text_list.sort()
    return text_list


def main():
    # Downloaded PDF to same folder as this script
    script_dir = os.path.dirname(os.path.abspath(__file__))
    pdf_filename = os.path.join(
        script_dir,
        'CT_DSG_-12132014_version_1.2_(with_clarifications).pdf'
    )

    # Mine PDF for data
    document = fitz.open(pdf_filename)
    # Using OrderedDict so iteration will occur in same order as rows appear in
    # PDF
    member_eligibility_dict = collections.OrderedDict()
    row = None  # Current table row; set once text appears in the Col column
    for page_number in range(document.page_count):
        # Page numbers are zero-based. I'm only looking at p. 11 of PDF here.
        if page_number == 10:
            text_list = get_sorted_list(document, page_number)
            for y, x, text in text_list:
                if 115 < y < 575:
                    # Only look at text whose y-coordinates are within the data
                    # portion of the table
                    if 25 < x < 72:
                        # Assuming one row of text per cell in this column but
                        # this doesn't appear to hold on p. 10 of PDF so may
                        # need to be modified if you're going to do whole table
                        row = MemberEligibility()
                        row.col = text
                        member_eligibility_dict[row.col] = row
                    elif row is None:
                        # Skip text that precedes the first Col cell
                        continue
                    elif 72 < x < 118:
                        row.element += text
                    elif 118 < x < 175:
                        row.data_element_name += text
                    elif 175 < x < 221:
                        row.date_modified += text
                    elif 221 < x < 268:
                        row.fmt += text
                    elif 268 < x < 315:
                        row.length += text
                    elif 315 < x < 390:
                        row.description += text
                    elif 390 < x < 633:
                        row.comments += text
                    elif 633 < x < 709:
                        row.condition += text
                    elif 709 < x < 765:
                        row.recommended_threshold += text
    document.close()

    # Write data to *.csv
    csv_filename = os.path.join(script_dir, 'EligibilityDataContentsGuide.csv')
    with open(csv_filename, 'w', encoding='utf-8', errors='ignore', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([
            'Col',
            'Element',
            'Data Element Name',
            'Date Modified',
            'Format',
            'Length',
            'Description',
            'Element Submission Guideline Comments',
            'Condition (Denominator)',
            'Recommended Threshold'
        ])
        for row in member_eligibility_dict.values():
            writer.writerow([
                row.col,
                row.element,
                row.data_element_name,
                row.date_modified,
                row.fmt,
                row.length,
                row.description,
                row.comments,
                row.condition,
                row.recommended_threshold
            ])


if __name__ == '__main__':
    main()

You will probably need to do some more work to get exactly what you want.
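
For the NoSQL side of the question, one possible shape is one document per table row. This is a rough sketch, assuming the row objects keep plain string attributes as in `MemberEligibility` above. `rows_to_documents` is a hypothetical helper, and the class shown here is a trimmed stand-in with three of the ten fields.

```python
import json


class MemberEligibility(object):
    """Trimmed stand-in for the row class above (only three fields shown)."""

    def __init__(self):
        self.col = ''
        self.element = ''
        self.data_element_name = ''


def rows_to_documents(rows):
    """Copy each row object's attributes into a plain dict, one per table row."""
    return [dict(vars(row)) for row in rows]


row = MemberEligibility()
row.col = '3'
row.element = 'ME003'
row.data_element_name = 'Insurance Type Code/Product'

documents = rows_to_documents([row])
# One JSON object per line; each line can be loaded as a separate document.
json_lines = '\n'.join(json.dumps(doc, sort_keys=True) for doc in documents)
print(json_lines)
```

Keeping the field names identical to the class attributes means a query on, say, `element` maps directly back to the corresponding column of the PDF table.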

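[Comments]:

So I am new to Python and was asked to try to build a solution with no real background in the language. What if the amount of text in each cell can vary? I don't see how I would be able to pull the data dynamically and load it into a new database with a similar structure. Thanks for your help!
The x-coordinate of the text tells you which column it belongs to. The y-coordinate and some logic will tell you which row it belongs to. What database structure do you have in mind?
I updated my answer to demonstrate how to mine the PDF. It looks like text in one column is sometimes in the same text block as text in another column. You may need to add some code to my example to get what you want.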
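
The column-by-x, row-by-y idea from these comments can be sketched independently of PyMuPDF. The example below assumes spans arrive as (y, x, text) tuples like those returned by `get_sorted_list`, reuses the x-ranges from `main()` as column edges, and starts a new row whenever text appears in the first column, so a cell may hold any number of text lines. `column_for`, `group_rows`, and the sample spans are illustrative names and values, not part of the answer's code. Unlike the strict inequalities in `main()`, a span sitting exactly on an edge is assigned to the column on its right.

```python
from bisect import bisect_right

# Left edge of each column followed by the right edge of the last column,
# taken from the x-ranges used in main() for page 11 of the PDF.
EDGES = [25, 72, 118, 175, 221, 268, 315, 390, 633, 709, 765]
FIELDS = ['col', 'element', 'data_element_name', 'date_modified', 'fmt',
          'length', 'description', 'comments', 'condition',
          'recommended_threshold']


def column_for(x):
    """Return the field whose x-range contains x, or None if outside the table."""
    index = bisect_right(EDGES, x) - 1
    return FIELDS[index] if 0 <= index < len(FIELDS) else None


def group_rows(spans):
    """Group (y, x, text) spans into one dict per table row."""
    rows = []
    for y, x, text in sorted(spans):
        field = column_for(x)
        if field == 'col':
            # Text in the first column marks the start of a new table row
            rows.append(dict.fromkeys(FIELDS, ''))
        if field is None or not rows:
            continue  # Outside the table, or before the first row starts
        rows[-1][field] = (rows[-1][field] + ' ' + text).strip()
    return rows


# Hypothetical spans: the second row's description wraps onto two lines.
spans = [
    (120, 30, '3'), (120, 80, 'ME003'), (120, 320, 'Type / Product'),
    (300, 30, '4'), (300, 80, 'ME004'), (300, 320, 'First line'),
    (312, 320, 'second line'),
]
rows = group_rows(spans)
print(rows[1]['description'])
```

This still inherits the limitation noted in the answer: if the first column itself wraps onto several lines, or if text in another column begins above the first column's text, rows will be split or attributed incorrectly and extra logic on the y-coordinates is needed.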