How to write separate DOCX files by page from one DOCX file? Answers

【Question title】: How to write separate DOCX files by page from one DOCX file?
【Posted】: 2020-05-16 12:28:40
【Question description】:

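I have an MS Word document with hundreds of pages.

Every page is identical except for a person's name, which is unique to each page (one page per user).

I want to take this Word document and automate saving each page separately, so that I end up with hundreds of Word documents, one per person, instead of a single document covering everyone. I can then distribute them to the individual people.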

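I have been using the python-docx module, found here: https://python-docx.readthedocs.io/en/latest/

I am struggling to work out how to accomplish this task.

As far as I can tell from my research, it is not possible to loop over pages, because page boundaries are not stored in the .docx file itself; they are computed by the rendering program (i.e. Microsoft Word).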
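Although computed page boundaries are not stored, a .docx can still carry page-break hints in `word/document.xml`: explicit breaks (`<w:br w:type="page"/>`) and, if Word saved them, `<w:lastRenderedPageBreak/>` markers. The following is a minimal sketch, using only the standard library, for checking whether a given file contains such hints. The function names are illustrative, and other mechanisms (section breaks, the "page break before" paragraph property) are not counted here.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_page_break_hints(document_xml):
    """Count page-break elements in the XML text of word/document.xml."""
    root = ET.fromstring(document_xml)
    # Explicit breaks inserted by the author (Ctrl+Enter in Word).
    explicit = sum(
        1 for br in root.iter(f"{{{W}}}br")
        if br.get(f"{{{W}}}type") == "page"
    )
    # Markers Word may write where it last rendered a page boundary.
    rendered = sum(1 for _ in root.iter(f"{{{W}}}lastRenderedPageBreak"))
    return {"explicit": explicit, "last_rendered": rendered}


def inspect_docx(path):
    """Open a .docx (a zip archive) and count the hints in its main part."""
    with zipfile.ZipFile(path) as archive:
        return count_page_break_hints(archive.read("word/document.xml"))
```

If `explicit` roughly matches the expected number of pages, splitting on explicit breaks may be feasible; if both counts are zero, a text marker or a PDF conversion is probably the more practical route.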

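However, python-docx can read the text, and since every page is the same, can I not tell Python: when you see this text (the last piece of text on a given page), treat it as the end of the page, and treat anything after it as a new page?

Ideally, I could write a loop that takes this into account, creates a document up to that point, and repeats for every page. It would also need to carry over all formatting and pictures.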
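The loop described above could be sketched as follows. This is one possible approach under stated assumptions, not a confirmed solution from the thread: it assumes the marker text appears in the last top-level block (paragraph or table) of every page, and it reloads the source file once per page and deletes everything outside that page's range so that styles and image relationships stay intact. The names `group_by_end_marker`, `split_docx_by_marker`, and `out_pattern` are hypothetical.

```python
def group_by_end_marker(texts, is_page_end):
    """Return (start, end) index pairs, end exclusive, one pair per page."""
    ranges, start = [], 0
    for i, text in enumerate(texts):
        if is_page_end(text):
            ranges.append((start, i + 1))
            start = i + 1
    if start < len(texts):  # trailing blocks after the last marker
        ranges.append((start, len(texts)))
    return ranges


def _content_blocks(body):
    # Top-level paragraphs and tables; the trailing sectPr (page setup) is kept out.
    return [
        el for el in body.iterchildren()
        if isinstance(el.tag, str) and not el.tag.endswith("}sectPr")
    ]


def split_docx_by_marker(src_path, marker, out_pattern="page_{:03d}.docx"):
    from docx import Document  # python-docx, imported lazily

    blocks = _content_blocks(Document(src_path).element.body)
    texts = ["".join(el.itertext()) for el in blocks]
    ranges = group_by_end_marker(texts, lambda text: marker in text)

    for number, (start, end) in enumerate(ranges, 1):
        doc = Document(src_path)  # fresh copy for every output file
        body = doc.element.body
        for index, el in enumerate(_content_blocks(body)):
            if not (start <= index < end):
                body.remove(el)  # drop blocks that belong to other pages
        doc.save(out_pattern.format(number))
    return len(ranges)
```

Reloading the file for each page is slow for hundreds of pages but keeps the code short. Images that are no longer referenced may remain inside each output package, and a page break stored at the start of a kept paragraph could leave a blank first page, so the outputs are worth spot-checking in Word.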

I am not opposed to other approaches, such as converting to PDF first if that works.

Any ideas?

【Question discussion】:

@scanny could you give your opinion?
Could you share the Open XML markup for two consecutive sample pages?

Tags: python python-3.x xml openxml python-docx

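【Solution 1】:

I ran into exactly the same problem. Unfortunately, I could not find a way to split a .docx by page. My workaround was to first use python-docx or docx2python (whichever you prefer) to go through each page, extract the unique (person) information, and put it in a list, so that you end up with: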

people = ['person_A', 'person_B', 'person_C', ....]
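One way such a list might be built with python-docx is sketched below. It assumes, purely for illustration, that each page contains a paragraph of the form `Name: <person>`; the pattern and the function names are hypothetical and would need adapting to the real document layout. Note that `Document.paragraphs` only covers top-level body paragraphs, so names stored in tables or text boxes would need a different traversal.

```python
import re

# Hypothetical layout: one paragraph per page that reads "Name: <person>".
NAME_PATTERN = re.compile(r"^Name:\s*(.+?)\s*$")


def extract_people(paragraph_texts, pattern=NAME_PATTERN):
    """Collect the captured name from every paragraph that matches the pattern."""
    people = []
    for text in paragraph_texts:
        match = pattern.match(text)
        if match:
            people.append(match.group(1))
    return people


def people_from_docx(path):
    """Read paragraph texts with python-docx and extract names in document order."""
    from docx import Document  # python-docx, imported lazily

    return extract_people(p.text for p in Document(path).paragraphs)
```

Because the names come out in document order, `people[i]` lines up with page `i` only if every page yields exactly one match; comparing `len(people)` with the page count before splitting is a cheap sanity check.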

Then save the .docx as a PDF, split the PDF by page, and save each page as person_A.pdf and so on, like this:

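# PyPDF2 >= 2.0 class names; older releases used PdfFileReader/PdfFileWriter.
from PyPDF2 import PdfReader, PdfWriter

# `people` is the list of names built above, in page order.
reader = PdfReader("document.pdf")

for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    # Write one single-page PDF per person, named after that person.
    with open(f"{people[i]}.pdf", "wb") as output_stream:
        writer.write(output_stream)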

The result is a set of single-page PDFs saved as person_A.pdf, person_B.pdf, and so on. Hope that helps.
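The "save the .docx as a PDF" step is not spelled out in this answer. If it needs to be scripted, one option is LibreOffice in headless mode; the sketch below assumes the `soffice` executable is installed and on the PATH, and the function names are hypothetical. LibreOffice may paginate slightly differently from Microsoft Word, so the resulting page count should be checked against the length of `people`.

```python
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line for a headless DOCX to PDF conversion."""
    return [
        soffice, "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_to_pdf(docx_path, out_dir="."):
    """Run the conversion and return the path where the PDF is expected."""
    subprocess.run(build_convert_command(docx_path, out_dir), check=True)
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")
```

Saving from Word itself (File > Save As > PDF) avoids the pagination question entirely when the conversion only has to happen once.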

【Discussion】:

【Solution 2】:

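I suggest using another package, aspose-words-cloud, to split the Word document into separate pages. At present it works with files in cloud storage (Aspose cloud storage, Amazon S3, DropBox, Google Drive Storage, Google Cloud Storage, Windows Azure Storage, and FTP Storage). In the near future, however, it will support processing files from the request body (stream).

P.S.: I am a developer evangelist at Aspose.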

# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
import os
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile


# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id = 'xxxxx-xxxxx-xxxx-xxxxx-xxxxxxxxxxx'
client_secret = 'xxxxxxxxxxxxxxxxxx'

words_api = asposewordscloud.WordsApi(client_id, client_secret)
words_api.api_client.configuration.host = 'https://api.aspose.cloud'

remoteFolder = 'Temp'
localFolder = 'C:/Temp'
localFileName = '02_pages.docx'
remoteFileName = '02_pages.docx'

# Upload the file to cloud storage
words_api.upload_file(asposewordscloud.models.requests.UploadFileRequest(
    open(localFolder + '/' + localFileName, 'rb'),
    remoteFolder + '/' + remoteFileName))

# Split the DOCX into pages, returned as a zip file
request = asposewordscloud.models.requests.SplitDocumentRequest(
    name=remoteFileName, format='docx', folder=remoteFolder, zip_output='true')
result = words_api.split_document(request)
print("Result {}".format(result.split_result.zipped_pages.href))

# Download the zip file
request_download = asposewordscloud.models.requests.DownloadFileRequest(
    result.split_result.zipped_pages.href)
response_download = words_api.download_file(request_download)
copyfile(response_download, 'C:/' + result.split_result.zipped_pages.href)

【Discussion】: