Converting tabular PDF data into a text file (or any other readable format) using C++ or Python: answers

【Question title】: Convert a Tabled PDF data into a text (or any other readable format) file using C++ or Python
【Posted】: 2021-11-12 20:51:05
【Question description】:

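I have a PDF file containing a university timetable, generated by the aSc Timetables software.

The data looks like this:

There are about 29 such pages in the PDF file.

I want to process this data in a program, so I need it in a form that is readable from a programming language, preferably C++ or Python.

Can anyone guide me on how to do this? Perhaps there is a library I could use from C++ to convert this data into a text file?

The data I need has the following form.

Suppose that in C++ we have a class named Section (one object represents each section, e.g. an object for "BCS-1A", an object for "BCS-7E", and so on).

So, for BCS-1A: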

Section Object: 

section_name: "BCS-1A" // (section_name is a string data member)
// There will be 7 arrays, one per day of the week, each of size 16. Each index of an array represents one time slot of that day. So, in this case,

monday_schedule[16]; // an array of 16 **linked lists**. Each index can be empty or can hold any number of entries. Each index represents a time slot in the timetable. For example, index 0 represents the 8:45 to 9:15 slot, index 15 (the last) represents the 4:15 to 4:40 slot, and so on.

// For example, monday_schedule[0] will be EMPTY.
// monday_schedule[4] will contain an object that will have following information,

// Subject: Digital Logic Design
// Teacher: Mirza Waqar Baig
// Sub-section: None (there is a sub-section in some lectures)
// Room: R-5

// monday_schedule[5] will contain the same information

// monday_schedule[12] will have two objects,
// and both objects will also have a "Sub-section" attribute
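
For illustration, here is a minimal Python sketch of the structure described above. It is only one possible layout, not the asker's code: the names `Lecture`, `add_lecture`, `DAYS` and `SLOTS_PER_DAY` are made up for this example, and a plain Python list stands in for each linked list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumed names for the seven per-day arrays mentioned in the question.
DAYS = ("monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday")
SLOTS_PER_DAY = 16  # one entry per time slot, indices 0..15


@dataclass
class Lecture:
    subject: str
    teacher: str
    room: str
    sub_section: Optional[str] = None  # only some lectures have one


def _empty_week() -> Dict[str, List[List[Lecture]]]:
    # Every slot starts as an empty list, so it can hold zero or more lectures.
    return {day: [[] for _ in range(SLOTS_PER_DAY)] for day in DAYS}


@dataclass
class Section:
    section_name: str
    schedule: Dict[str, List[List[Lecture]]] = field(default_factory=_empty_week)

    def add_lecture(self, day: str, slot: int, lecture: Lecture) -> None:
        """Append a lecture to the given day and time-slot index."""
        self.schedule[day][slot].append(lecture)


# Values taken from the BCS-1A example in the question.
bcs_1a = Section("BCS-1A")
dld = Lecture("Digital Logic Design", "Mirza Waqar Baig", "R-5")
bcs_1a.add_lecture("monday", 4, dld)
bcs_1a.add_lecture("monday", 5, dld)
```

Using a list per slot keeps the "empty, one, or several lectures" cases uniform: slot 12 on Monday would simply receive two `add_lecture` calls, each with a `sub_section` value.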

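【Question discussion】:

If you want to use Python and convert the table into an editable docx table, you can use pdf2docx, which I have used before: dothinking.github.io/pdf2docx/quickstart.convert.html
Beware that PDF is a rather high-level format: it can contain text (which can easily be parsed by various libraries) and/or raster images. In the latter case, you first need an optical character recognition (OCR) tool. But most of the time when I did that, I only got a first draft that had to be reviewed manually.
@tako0707 I tried the pdf2docx library, but it did not work for me. It opened the document, but during the parsing stage it gave this error: [Warning] Page ignored due to error: 'int' object has no attribute 'value'. I don't know how to fix this :/
@SergeBallesta Can you recommend any library or tutorial that could help with this process? I would really appreciate it.
@TalhaAyub: For this kind of question, Google is better suited than SO... Apart from the well-known tesseract library, googling Python OCR or python ocr pdf should give you some interesting starting points.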
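
As a rough way to tell which of the two cases from the comments applies (embedded text versus raster images), one could first try plain text extraction and fall back to OCR only if almost nothing comes out. The sketch below assumes the third-party pypdf package, which is not mentioned in this thread; the helper names and the `min_chars` threshold are arbitrary choices for illustration.

```python
def has_text_layer(page_texts, min_chars=20):
    """Return True if any page yielded at least min_chars non-whitespace characters.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def extract_page_texts(pdf_path):
    """Extract the embedded text of every page (empty string if a page has none)."""
    # Imported here so has_text_layer can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

A call such as `has_text_layer(extract_page_texts("timetable.pdf"))` (the file name is a placeholder) returning `False` would suggest that the pages are images and that an OCR tool such as tesseract is needed.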

Tags: python c++ python-3.x file pdf

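【Solution 1】:

I put together a repository on GitHub.

I used pdf2image to first convert the PDF into image files and stored those files in an images folder.
Then I used pytesseract to convert those images into txt files and stored them in a texts folder.
After that, I lightly formatted the text and stored it in CSV format in a csvs folder.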
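
The three steps above could be sketched roughly as follows. This is not the code from the repository: the function names, the file naming scheme, and especially the rule in `text_to_rows` (cells separated by two or more spaces) are assumptions made for this example, and real timetable pages would likely need more careful parsing. pdf2image requires poppler and pytesseract requires the tesseract binary to be installed.

```python
import csv
import re
from pathlib import Path


def text_to_rows(text):
    """Split OCR text into rows of cells.

    Assumed rule for this sketch: cells are separated by 2+ whitespace characters.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            rows.append(re.split(r"\s{2,}", line))
    return rows


def pdf_to_csvs(pdf_path, out_dir="."):
    """PDF -> images/ -> texts/ -> csvs/, one file per page in each folder."""
    # Third-party imports are local so text_to_rows works without them.
    from pdf2image import convert_from_path
    import pytesseract

    out = Path(out_dir)
    for name in ("images", "texts", "csvs"):
        (out / name).mkdir(parents=True, exist_ok=True)

    # convert_from_path returns one PIL image per PDF page.
    for number, image in enumerate(convert_from_path(str(pdf_path)), start=1):
        image.save(out / "images" / f"page_{number}.png")

        text = pytesseract.image_to_string(image)
        (out / "texts" / f"page_{number}.txt").write_text(text, encoding="utf-8")

        csv_path = out / "csvs" / f"page_{number}.csv"
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(text_to_rows(text))
```

Keeping the intermediate images and text files, as the answer does, makes it easier to review the OCR output by hand before trusting the CSV files.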

【Discussion】: