Python docx - AttributeError: 'bytes' object has no attribute 'seek'

【Question title】: Python docx - AttributeError: 'bytes' object has no attribute 'seek'
【Posted】: 2023-08-12 13:31:01
【Question description】:

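My input: a docx file supplied as base64-encoded raw bytes.
What I want to achieve: extract the text from this document for further processing.
I tried to follow this answer: extracting text from MS word files in python

My code snippet: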

base64_bytes = input.encode('utf-8')
decoded_data = base64.decodebytes(base64_bytes)
document = Document(decoded_data)
docText = '\n\n'.join([paragraph.text.encode('utf-8') for paragraph in document.paragraphs])

The line document = Document(decoded_data) gives me the following error: AttributeError: 'bytes' object has no attribute 'seek'
decoded_data looks like this: b'PK\\x03\\x04\\x14\\x00\\x08\\x08\\x08\\x00\\x87@CP\\x00...

How should I format the raw data so that I can extract the text from the docx?

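【Question comments】:

input.encode('utf-8'). Is this your actual code? Because this attempts to encode the function object input as UTF-8.
1) Your title says "seek" and your question says "code". Which is it? 2) What exactly is Document, and what kind of argument does it expect?
You say you are following the advice of "Use the native Python docx module...", but then you do not actually follow it. You really do not need to "manually" encode, decode, or even explicitly load the file.
@usr2564301 They only diverge where they have to: their input is base64 content in memory, not a file on disk.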

Tags: python docx

【Solution 1】:

From the official documentation, emphasis mine:

docx.Document(docx=None)

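Return a Document object loaded from docx, where docx can be either a path to a .docx file (a string) or a file-like object. If docx is missing or None, the built-in default document "template" is loaded.

So if you pass a string or string-like argument, it is interpreted as the path to a docx file. To supply content from memory, you need to pass a file-like object instead, namely a BytesIO instance (the whole point of StringIO and BytesIO is to "convert" strings and bytes into file-like objects):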

document = Document(io.BytesIO(decoded_data))
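
Putting the pieces together, the whole pipeline might look roughly like the sketch below. It assumes python-docx is installed and that the input is a base64 string. The helper names base64_to_stream and extract_text are illustrative only and are not part of python-docx.

```python
import base64
import io


def base64_to_stream(b64_text: str) -> io.BytesIO:
    """Decode base64 text and wrap the raw bytes in a seekable file-like object."""
    raw = base64.b64decode(b64_text)
    return io.BytesIO(raw)


def extract_text(b64_text: str) -> str:
    """Return the paragraph text of a base64-encoded .docx (hypothetical helper)."""
    # Imported here so the decoding helper above works without python-docx installed.
    from docx import Document

    document = Document(base64_to_stream(b64_text))
    # paragraph.text is already a str, so no .encode() is needed before joining.
    return "\n\n".join(paragraph.text for paragraph in document.paragraphs)
```

A plain bytes object has no seek method, which is what triggers the AttributeError; the BytesIO wrapper provides seek and read, so Document can treat it like an open file.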

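Side note: you probably want to remove the .encode call in the list comprehension. In Python 3, text (str) and bytes (bytes) are not compatible at all, so that line will raise a TypeError when it tries to join bytes (the encoded text) with a text separator.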
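
As a small illustration of the side note, the sketch below uses made-up strings in place of real paragraph.text values. Joining str values works, while joining encoded bytes with a str separator raises a TypeError.

```python
# Made-up stand-ins for paragraph.text values; not taken from a real document.
paragraphs = ["First paragraph", "Second paragraph"]

# Joining str values with a str separator works.
doc_text = "\n\n".join(paragraphs)

# Joining bytes with a str separator fails in Python 3.
try:
    "\n\n".join([p.encode("utf-8") for p in paragraphs])
    join_error = None
except TypeError as exc:
    join_error = exc

# If bytes are really needed, encode once after joining.
doc_bytes = doc_text.encode("utf-8")
```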
