Python: open PDF in binary mode with UTF-8

【Question title】: Python: open PDF in binary mode with UTF-8
【Posted】: 2020-10-21 08:36:58
【Question】:

I am trying to open a PDF file using PyPDF4.

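import PyPDF4

text = ""

# filename is the path to the PDF; PDFs must be opened in binary mode
pdf_file = open(filename, mode='rb')
pdfReader = PyPDF4.PdfFileReader(pdf_file)
pageObj = pdfReader.getPage(0)
# extract the text of the first page
text = pageObj.extractText()

print(text)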

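This works fine, except that the PDF's content is in German and the special characters (umlauts) come out wrong (for example, zun−chst instead of zunächst).

I can't change the encoding in binary mode, but if I don't use binary mode, I get this error: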

File "/usr/local/lib/python3.8/site-packages/PyPDF4/pdf.py", line 1754, in read
    stream.seek(-1, 2)
io.UnsupportedOperation: can't do nonzero end-relative seeks
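The message comes from Python's text-mode file objects, which only accept end-relative seeks with an offset of zero, whereas the traceback shows a call to stream.seek(-1, 2). A minimal sketch that reproduces the difference on a throwaway file follows; the helper name can_seek_from_end and the temporary file are made up for illustration and are not part of PyPDF4.

```python
import io
import os
import tempfile


def can_seek_from_end(path, mode):
    """Return True if seek(-1, 2) works for a file opened with the given mode."""
    with open(path, mode) as stream:
        try:
            stream.seek(-1, 2)  # the same call that appears in the traceback
        except io.UnsupportedOperation:
            return False
        return True


# Create a small throwaway file standing in for a PDF (it is not a valid PDF).
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n%%EOF\n")
    sample_path = tmp.name

binary_ok = can_seek_from_end(sample_path, "rb")
text_ok = can_seek_from_end(sample_path, "r")
os.remove(sample_path)

print("binary mode:", binary_ok)
print("text mode:", text_ok)
```

This only illustrates why text mode cannot be used here; it does not address the wrong characters in the extracted text.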

There are several threads about this error (e.g. Seeking from end of file throwing unsupported exception), but none of the solutions seem to work for me. Any help is much appreciated, thank you.

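【Question comments】:

This is a bug in pyPDF2, pyPDF3 and pyPDF4; all three behave the same. Since only pyPDF3 seems to be active at the moment, I created an issue at github.com/sfneal/PyPDF3/issues/13

Tags: python utf-8 pypdf

【Solution 1】:

@downbydawn I had the same experience with the bug mentioned in the comment above.

I ended up using a modified version of https://stackoverflow.com/a/26351413/1497139: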

# derived from
# https://stackoverflow.com/a/26351413/1497139

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO

class PDFMiner:
    '''
    PDFMiner wrapper to get PDF Text
    '''

    @classmethod
    def getPDFText(cls, pdfFilenamePath, throwError: bool = True):
        retstr = StringIO()
        # PDFs are binary files, so open in 'rb' mode; the with block closes the file
        with open(pdfFilenamePath, 'rb') as pdfFile:
            parser = PDFParser(pdfFile)
            try:
                document = PDFDocument(parser)
            except Exception as e:
                errMsg = f"error {pdfFilenamePath}:{str(e)}"
                print(errMsg)
                if throwError:
                    raise e
                return ''
            if document.is_extractable:
                rsrcmgr = PDFResourceManager()
                device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                # process every page; the extracted text accumulates in retstr
                for page in PDFPage.create_pages(document):
                    interpreter.process_page(page)
                return retstr.getvalue()
            else:
                print(pdfFilenamePath, "Warning: could not extract text from pdf file.")
                return ''

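【Comments】:

【Solution 2】:

A PDF file is definitely binary; you should absolutely not try to read it using anything other than 'rb' mode.

What you can do is decode the text you extracted. If you know the encoding is UTF-8 (which is probably not true, judging from the sample you show),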

print(text.decode('utf-8'))

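Judging from your single sample, I think it's safe to say the encoding is not UTF-8, but since we don't know which encoding you were using when you looked at the text, this is all guesswork. If you can show the actual bytes in the string, it shouldn't be hard to work out the actual encoding from a few samples, perhaps with the help of a character chart such as https://tripleee.github.io/8bit/. The character you pasted is U+2212, which doesn't seem to correspond directly to any common 8-bit encoding of ä, but maybe that's just an error in the paste.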
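The kind of lookup a character chart provides can also be sketched in code: list the bytes a few 8-bit codecs use for ä, then check whether encoding with one codec and decoding with another would produce the observed character. The function names and the codec list below are arbitrary choices for illustration, not something taken from the question.

```python
def encodings_of(char, codec_names):
    """Map each codec name to the bytes it uses for char (None if unsupported)."""
    result = {}
    for name in codec_names:
        try:
            result[name] = char.encode(name)
        except UnicodeEncodeError:
            result[name] = None
    return result


def mojibake_pairs(original, observed, codec_names):
    """Find (encode_with, decode_with) pairs that turn original into observed."""
    pairs = []
    for enc in codec_names:
        try:
            raw = original.encode(enc)
        except UnicodeEncodeError:
            continue
        for dec in codec_names:
            try:
                if raw.decode(dec) == observed:
                    pairs.append((enc, dec))
            except UnicodeDecodeError:
                continue
    return pairs


# An arbitrary selection of 8-bit codecs; extend it as needed.
CODECS = ["latin-1", "cp1252", "cp437", "cp850", "mac_roman"]

print(encodings_of("ä", CODECS))
print(mojibake_pairs("ä", "\u2212", CODECS))
```

If the second call finds no pair, that is consistent with the observation that U+2212 is not a plain 8-bit misreading of ä, at least for the codecs tried.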

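Perhaps also see Problematic questions about decoding errors for some background. Ideally, if that doesn't already get you to a point where you can solve the problem yourself, update your question to provide the details it asks for.

If PyPDF really thinks the character is "−", its extraction logic may be wrong, or the PDF may be defective. If you can't fix that, you may simply have to remap the problematic characters manually as you find them. You might want to add debug output with logging to highlight any characters outside the printable ASCII range in the extracted text, until you know you have covered all of them.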

import re
import logging

# ...
text = text.replace("\u2212", "ä").replace("\u1234", "ö")  # etc
for match in re.findall(r'(.{1,5})?([^äö\n -\u007f])(.{1,5})?', text):
    logging.warning("{0} found in {1}".format(match[1], "".join(match)))

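Unfortunately, the above doesn't quite work: U+2212 seems to be treated as part of the ASCII range no matter which re flags I pass in. (Also note the placeholder "\u1234"; replace it with something useful, and add more as you find them.)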
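A regex-free variant of the same idea is sketched below, assuming a hand-maintained remapping table. It scans the raw text before any replacement, so characters that are about to be remapped still show up in the log. Only the U+2212 entry comes from the question; the names REMAP, EXPECTED, remap and find_suspects are made up for this sketch.

```python
import logging

# Hypothetical remapping table; only the U+2212 entry comes from the question.
REMAP = {"\u2212": "ä"}
# Non-ASCII characters that are legitimate in German text and need no warning.
EXPECTED = set("äöüÄÖÜß")


def remap(text, table=REMAP):
    """Replace each known-bad character with its intended replacement."""
    return text.translate(str.maketrans(table))


def find_suspects(text, context=5, expected=EXPECTED):
    """Return (char, surrounding_text) for every unexpected non-ASCII character."""
    hits = []
    for index, char in enumerate(text):
        if ord(char) > 0x7F and char not in expected:
            start = max(0, index - context)
            hits.append((char, text[start:index + context + 1]))
    return hits


raw = "zun\u2212chst"
# Log the suspects first, then apply the remapping.
for char, snippet in find_suspects(raw):
    logging.warning("U+%04X found in %r", ord(char), snippet)

print(remap(raw))
```

Whether every U+2212 in a given PDF really stands for ä is an assumption that has to be checked against the document; a genuine minus sign would be remapped as well.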

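【Comments】:

I originally forgot to include the line of code that extracts the text. I can't decode the text string because it is already decoded. I haven't found a way to show the actual bytes yet, but will try.
In brief, repr("zun−chst".encode('utf-8')) displays "b'zun\\xe2\\x88\\x92chst'", where b'...' is Python's way of indicating that this is a byte string, and the \x escapes are used for any bytes that are not printable ASCII characters. This also conveniently shows you what the actual UTF-8 encoding of this character looks like.
OK, how can I see the actual bytes of the string in the PDF?
repr(text) would be a good start, but might not be sufficient. Unfortunately, your code sample reuses the variable text, but where you have print(text) you should be able to print(repr(text)).
print(repr(text)) outputs zun-chst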
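In Python 3, repr() leaves printable non-ASCII characters as they are, so it may not reveal much for an already-decoded string. A small sketch of more explicit ways to inspect the characters follows; the sample string is the one from the question with U+2212 written as an escape, and the helper describe is a name invented here.

```python
import unicodedata


def describe(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (char, f"U+{ord(char):04X}", unicodedata.name(char, "<unnamed>"))
        for char in text
        if ord(char) > 0x7F
    ]


sample = "zun\u2212chst"  # the string from the question, with U+2212 spelled out

print(ascii(sample))           # escapes every non-ASCII character
print(sample.encode("utf-8"))  # the UTF-8 bytes of the already-decoded string
print(describe(sample))
```

This shows what the extracted string contains, not the bytes stored inside the PDF itself, which depend on the fonts and encodings the PDF uses.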