Encoding problems when extracting text with PyPDF2: answers

【Question title】: Encoding problems when extracting text with PyPDF2
【Posted】: 2021-02-16 13:24:55
【Question】:

I am using PyPDF2 to extract text from a PDF file. It works, but it does not handle accented characters correctly.

Here is my code:

import PyPDF2

filename = 'document.pdf'

#open allows you to read the file
pdfFileObj = open(filename,'rb')

#The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

#discerning the number of pages will allow us to parse through all the pages
num_pages = pdfReader.numPages


count = 0
text = ""

#The while loop will read each page
while count < num_pages:                      
    pageObj = pdfReader.getPage(count)
    count +=1
    text += pageObj.extractText()
    
if text != "":
    text = text

This is the result I get:

82 %G’nes dues au bruitEurop”ens expos”s ‹ des seuils 
au del‹ de 55 dB.125 MDes habitants dÕIle de France expos”s ‹ des 
valeurs sup”rieures recommand”s par lÕOMS.
90 %Des fran“ais se disent pr”occup”s par 
les questions relatives au bruit.
82 %Personnes d”clarent ’tre g’n”s par des 
nuisances sonores ‹ leur domicile.
45 %Les effets du bruit caus”s chaque ann”es
Les effets du bruit caus”s chaque ann”es
Personnes g’n”es par le bruit.

This is what the PDF looks like: [image]

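【Comments】:

Is this Python 3 or Python 2?
Python 3.
@Soph Could you post some of your text in the question?
This may be a far-fetched idea, but are you sure the text content of the PDF actually matches the image on screen? A PDF stores its text in a separate layer from the image version, so a faulty underlying text layer is usually invisible. If the text was encoded incorrectly when the PDF was created, you will not get anything useful out of it, and you will have to OCR the image layer instead (e.g. with tesseract).
https://ftfy.readthedocs.io/en/latest/ This module might be useful.

Tags: python pdf text-extraction pypdf2

【Solution 1】:

I don't think the problem is your code. I think it is the PDF.

I checked with a sample PDF file: https://www.languagebird.com/wp-content/uploads/2019/10/sample_French_Basics_Grammar_Book-2017-3.pdf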

#! /usr/bin/env python
# -*- coding: utf-8 -*-

import PyPDF2

# Note you don't need to manually open a file object
# You can pass a string reference to a file
pdfReader = PyPDF2.PdfFileReader('sample_French_Basics_Grammar_Book-2017-3.pdf')

text = ""

# Better to loop through the pages using the iterator
# Rather than manual count
for current_page in pdfReader.pages:
    text += current_page.extractText()

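# Output the results; write as UTF-8 so accented characters are saved
# correctly regardless of the platform's default encoding
with open('output.txt', 'w', encoding='utf-8') as f: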
    f.write(text)

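You will notice from the contents of 'output.txt' that the accented characters are represented correctly. The only text errors are smart quotes that end up at the wrong code points, which have to be handled separately.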

The output of extractText() is a Unicode string, so accented characters should not cause any problems as long as the source is encoded correctly.
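
One way to check whether the extracted string itself is wrong (rather than the way it is printed or saved) is to list the non-ASCII characters it contains together with their Unicode names. This is only a minimal sketch using the standard library; `describe_non_ascii` is a hypothetical helper name, and the sample string is copied from the output shown in the question.

```python
import unicodedata
from collections import Counter


def describe_non_ascii(text):
    """Return (character, code point, Unicode name, count) for each non-ASCII character."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]


# Sample line copied from the garbled output in the question.
sample = "Des fran\u201cais se disent pr\u201doccup\u201ds par"

for row in describe_non_ascii(sample):
    print(row)
```

If the names turn out to be quotation marks where accented letters are expected, the wrong characters are already in the string returned by PyPDF2, so changing how the file is written will not help.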

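A PDF is structured with the image layer and the text layer kept separate. The image layer is usually on top so that the page looks clean. Unfortunately, this means you cannot see any problems with the underlying text by eye. Without seeing the PDF you are working with, I suspect that the text added to the PDF when it was created was incorrect.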
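
If re-creating the PDF is not an option, the specific substitutions visible in the question's output could be reversed by hand. The sketch below rests on an assumption that the original poster has not confirmed: the pattern (” for é, ’ for ê, ‹ for à, “ for ç, Õ for the apostrophe) is what would appear if Mac OS Roman bytes were decoded as PDFDocEncoding. `OBSERVED_BYTES`, `REPAIR_TABLE` and `repair` are hypothetical names, and only the five characters seen in that output are covered.

```python
# Garbled character seen in the question's output -> the byte it is assumed
# to stand for. Assumption: the bytes were Mac OS Roman text that was decoded
# as PDFDocEncoding during extraction.
OBSERVED_BYTES = {
    "\u201d": 0x8E,  # right double quote, where "é" is expected
    "\u2019": 0x90,  # right single quote, where "ê" is expected
    "\u2039": 0x88,  # single left angle quote, where "à" is expected
    "\u201c": 0x8D,  # left double quote, where "ç" is expected
    "\u00d5": 0xD5,  # "Õ", where an apostrophe is expected
}

# Re-decode each assumed byte as Mac OS Roman to get the intended character.
REPAIR_TABLE = {
    ord(bad): bytes([byte]).decode("mac_roman")
    for bad, byte in OBSERVED_BYTES.items()
}


def repair(text):
    """Replace the five observed garbled characters in a single pass."""
    return text.translate(REPAIR_TABLE)


# Sample line copied from the garbled output in the question.
garbled = (
    "Personnes d\u201dclarent \u2019tre g\u2019n\u201ds par des "
    "nuisances sonores \u2039 leur domicile."
)
print(repair(garbled))
```

Note that this would also rewrite any genuine quotation marks of these kinds in the text, so the result should be checked against the visible page before relying on it.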

【Comments】: