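【Posted】: 2024-01-09 22:23:01
【Question】:
I have a JSON file that stores various file types (for example pdf, docx, doc) as base64 strings. I have been able to decode the pdf and docx files and read their contents entirely in memory, without first writing them to disk as physical files and reading them back. However, I cannot do the same for doc files.
Can anyone point me in the right direction? I am on Windows and tried textract, but could not get the library to work. I am open to other solutions.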
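# This works with a .docx file
import base64
from io import BytesIO
import docx2txt

# df is an existing pandas DataFrame holding the base64 strings
resume = df.iloc[180]['Candidate_Resume_Attachment_Base64_Image']
resume_bytes = resume.encode('ascii')
decoded = base64.decodebytes(resume_bytes)
result = BytesIO()
result.write(decoded)
docxReader = docx2txt.process(result)  # extracted text of the .docx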
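Since the JSON mixes pdf, docx, and doc payloads, it can help to check which format a decoded blob actually is before choosing a reader. The sketch below is only an illustration: the function names `decode_attachment` and `sniff_format` are hypothetical, and the check relies on the well-known leading bytes of each container (`%PDF-` for PDF, `PK\x03\x04` for ZIP-based files such as docx, and the OLE2 compound-file signature used by legacy doc files). A ZIP or OLE2 signature does not prove the file is a Word document, so treat the result as a hint.

```python
import base64

# OLE2 compound file signature, used by legacy .doc (and .xls, .ppt) files
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def decode_attachment(b64_text: str) -> bytes:
    """Decode a base64 string into raw bytes (hypothetical helper)."""
    return base64.b64decode(b64_text)


def sniff_format(data: bytes) -> str:
    """Guess the container format from the leading bytes (hypothetical helper)."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        return "docx"  # any ZIP-based file matches, docx is an assumption
    if data.startswith(OLE2_MAGIC):
        return "doc"   # any OLE2 file matches, doc is an assumption
    return "unknown"
```

With this, each row could be routed to the matching reader, for example `sniff_format(decode_attachment(resume))`.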
# This does not work with a .doc file
import win32com.client as win32

message = df.iloc[361]['Candidate_Resume_Attachment_Base64_Image']
resume_bytes = message.encode('ascii')
decoded = base64.decodebytes(resume_bytes)
result = BytesIO()
result.write(decoded)
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(result)  # raises the com_error shown below
#error:
ret = self._oleobj_.InvokeTypes(19, LCID, 1, (13, 0), ((16396, 1), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17)),FileName
com_error: (-2147352571, 'Type mismatch.', None, 16)
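The traceback points at the `FileName` argument: Word's `Documents.Open` takes a file path, so a `BytesIO` object cannot be converted and COM reports a type mismatch. One possible workaround, sketched below under the assumption that a short-lived temporary file is acceptable, is to write the decoded bytes to a temp `.doc` file and hand Word its path. The helper names `write_temp_doc` and `read_doc_text_with_word` are hypothetical, and the second function needs Windows, pywin32, and an installed copy of Word, so it has not been exercised here.

```python
import base64
import os
import tempfile


def write_temp_doc(b64_text: str, suffix: str = ".doc") -> str:
    """Decode base64 text and write it to a temp file; return the file path."""
    data = base64.b64decode(b64_text)
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
    return path


def read_doc_text_with_word(path: str) -> str:
    """Open the file in Word via COM and return its text (Windows only)."""
    # Imported lazily so the module still loads where pywin32 is missing.
    import win32com.client as win32

    word = win32.gencache.EnsureDispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(os.path.abspath(path))
        try:
            return doc.Content.Text
        finally:
            doc.Close(False)  # close without saving changes
    finally:
        word.Quit()
```

A caller would typically wrap the two steps in `try`/`finally` and call `os.remove(path)` afterwards, so the temporary file exists only for the duration of the read.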
【Discussion】:
Tags: python base64 doc in-memory