PyPDF2: Why does PdfFileWriter forget changes I made to a document? (Answers)

【Question title】: PyPDF2: Why does PdfFileWriter forget changes I made to a document?
【Posted】: 2019-03-01 03:30:37
【Question】:

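I am trying to modify text in a PDF file. The text can sit in operations of type Tj or BDC. I find the right objects, and if I read them back directly after changing them, they show the updated values.

However, if I pass the complete page to PdfFileWriter, the changes are lost. I am probably updating a copy rather than the real object. I checked id(), and it is different. Does anyone know how to fix this?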

from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.pdf import ContentStream
from PyPDF2.generic import TextStringObject, NameObject
from PyPDF2.utils import b_

source = PdfFileReader(open('some.pdf', "rb"))
output = PdfFileWriter()

for page_idx in range(0, 1):

    # Get the current page and its contents
    page = source.getPage(page_idx)

    content_object = page["/Contents"].getObject()
    content = ContentStream(content_object, source)

    for operands, operator in content.operations:

        if operator == b_("BDC"):

            operands[1][NameObject('/Contents')] = TextStringObject('xyz')

        if operator == b_("Tj"):

            operands[0] = TextStringObject('xyz')

    output.addPage(page)


# Write the stream
outputStream = open("output.pdf", "wb")
output.write(outputStream)
outputStream.close()

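【Comments on the question】:

Does for operands, operator in give you a copy of what is in content?
Possibly, but I'm not 100% sure. That was my thought too. But I haven't found a way to address the objects directly.
I can't find .getObject() in PyPDF2.pdf. I don't understand why you re-read from source: content = ContentStream(content_object, source). I think at that point you abandon the previous page, yet you still call output.addPage(page).
I looked at the source on GitHub. page[NameObject('/Contents')] is of type PyPDF2.generic.EncodedStreamObject, which means its .getObject() is inherited through EncodedStreamObject > StreamObject > DictionaryObject > PdfObject. So the method that ends up being called is this one.
Just checked id(): there is no need to call .getObject(), it returns the same object.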

Tags: python python-3.x pdf pdf-generation pypdf2

【Solution 1】:

The solution is to assign the ContentStream that was iterated over and modified back to the page before passing the page to PdfFileWriter:

page[NameObject('/Contents')] = content
output.addPage(page)

I found the solution via this and this.
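
Why the reassignment matters can be illustrated with a small standard-library model. This is only a sketch and not PyPDF2's implementation: Stream, ParsedContent and write_page are hypothetical stand-ins for the encoded stream object, ContentStream and PdfFileWriter, and the one-operation-per-line format is a simplification. The assumption it encodes is the one raised in the question: parsing produces a new object holding a decoded copy, while the page keeps pointing at the original stream.

```python
import zlib


class Stream:
    """Stand-in for an encoded stream object: it only stores compressed bytes."""

    def __init__(self, raw: bytes):
        self.encoded = zlib.compress(raw)

    def get_data(self) -> bytes:
        return zlib.decompress(self.encoded)


class ParsedContent:
    """Stand-in for ContentStream: parses a decoded copy into a mutable list."""

    def __init__(self, stream):
        # Each line becomes a mutable [operand, operator] pair.
        self.operations = [
            line.split(b" ", 1) for line in stream.get_data().splitlines()
        ]

    def get_data(self) -> bytes:
        # Serialising reads the (possibly edited) operations list.
        return b"\n".join(b" ".join(op) for op in self.operations)


def write_page(page: dict) -> bytes:
    """Stand-in for the writer: emits whatever the page currently references."""
    return page["/Contents"].get_data()


page = {"/Contents": Stream(b"(old) Tj")}
content = ParsedContent(page["/Contents"])

for op in content.operations:
    if op[1] == b"Tj":
        op[0] = b"(xyz)"

lost = write_page(page)       # page still points at the original stream
page["/Contents"] = content   # the fix from the answer, in miniature
kept = write_page(page)
```

In this model the edit only reaches the output once the page's /Contents entry refers to the parsed object, which mirrors the page[NameObject('/Contents')] = content line above.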

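【Comments】:

Awesome! The next problem I ran into was using a plain string instead of a PyPDF2.generic.TextStringObject. It looks fine in the repr, but trying to save the resulting PDF raises an AttributeError (a missing writeToStream or something similar).
You can use createStringObject.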
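
The AttributeError described in the comment above is what one would expect when a writer calls a serialisation method on every operand and one operand is a plain str. The following is a rough standard-library sketch of that failure mode, not PyPDF2 code: TextString, write_to_stream and write_operands are hypothetical names chosen to mirror TextStringObject and writeToStream.

```python
import io


class TextString(str):
    """Hypothetical stand-in for a PDF string object that can serialise itself."""

    def write_to_stream(self, stream):
        stream.write(("(" + self + ")").encode("latin-1"))


def write_operands(operands, stream):
    """Stand-in for the writer: calls the serialisation method on each operand."""
    for operand in operands:
        operand.write_to_stream(stream)


# A wrapped string serialises fine.
buf = io.BytesIO()
write_operands([TextString("xyz")], buf)
wrapped = buf.getvalue()

# A plain str has no write_to_stream method, so the writer fails.
try:
    write_operands(["xyz"], io.BytesIO())
    error = None
except AttributeError as exc:
    error = exc
```

Wrapping the value in the library's own string type before storing it in the operands, for example with TextStringObject as in the question or createStringObject as suggested above, avoids the problem because the writer then finds the method it needs.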