Reading a doc file in memory: answers

【Question title】: Reading a doc file in memory
【Posted】: 2024-01-09 22:23:01
【Question description】:

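I have a JSON that stores various file types (for example pdf, docx, doc) as base64. I have been able to convert the pdf and docx files and read their contents by passing them through memory, rather than writing them out as physical files and reading those. However, I cannot do the same for doc files.

Can anyone point me in the right direction? I am on Windows and tried textract, but could not get the library to work. I am open to other solutions.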

# This works with a docx file
import base64
from io import BytesIO
import docx2txt

resume = df.iloc[180]['Candidate_Resume_Attachment_Base64_Image']
resume_bytes = resume.encode('ascii')
decoded = base64.decodebytes(resume_bytes)
# Wrap the decoded bytes in an in-memory file object
result = BytesIO(decoded)
docxReader = docx2txt.process(result)

# This does not work with a doc file
import win32com.client as win32

message = df.iloc[361]['Candidate_Resume_Attachment_Base64_Image']
resume_bytes = message.encode('ascii')
decoded = base64.decodebytes(resume_bytes)
result = BytesIO(decoded)
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = False
# Fails here: a BytesIO object is passed where Word expects a file name
doc = word.Documents.Open(result)

#error:
    ret = self._oleobj_.InvokeTypes(19, LCID, 1, (13, 0), ((16396, 1), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17)),FileName

com_error: (-2147352571, 'Type mismatch.', None, 16)
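
Since the JSON mixes pdf, docx and doc payloads, it can help to check what each decoded attachment actually is before choosing a reader. The following is a minimal sketch using only the standard library; the function name `decode_attachment` and the labels it returns are illustrative assumptions, not part of the original post. It relies on the well-known file signatures: legacy Office files (including .doc) are OLE2 compound files, .docx files are ZIP archives, and PDFs start with `%PDF`.

```python
import base64
from io import BytesIO

# Well-known leading signatures ("magic bytes").
OLE2_MAGIC = bytes.fromhex('D0CF11E0A1B11AE1')  # legacy Office container (.doc, .xls, .ppt)
ZIP_MAGIC = b'PK\x03\x04'                        # ZIP archive (.docx, .xlsx, ...)
PDF_MAGIC = b'%PDF'


def decode_attachment(b64_text):
    """Hypothetical helper: decode a base64 string and guess the container type.

    Returns (kind, stream), where stream is a BytesIO positioned at offset 0.
    """
    data = base64.b64decode(b64_text)
    if data.startswith(OLE2_MAGIC):
        kind = 'ole2'      # could be .doc, but also .xls or .ppt
    elif data.startswith(ZIP_MAGIC):
        kind = 'zip'       # could be .docx, but also any other ZIP file
    elif data.startswith(PDF_MAGIC):
        kind = 'pdf'
    else:
        kind = 'unknown'
    return kind, BytesIO(data)
```

A signature only identifies the container, so an `'ole2'` result still needs a further check (for example, whether a `WordDocument` stream exists) before treating the payload as a Word document.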

【Question discussion】:

Tags: python base64 doc in-memory

【Solution 1】:

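In case anyone else needs to read a doc file in memory, here is my hacky solution until I find a better one.

1) Read the doc file with the olefile library, which results in a mix of characters in unicode.
2) Capture the text with a regular expression.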

        import base64
        import re
        from io import BytesIO

        import olefile

        # Retrieve the base64 string (here from a DataFrame row) and decode it into bytes
        message = row['text']
        text_bytes = message.encode('ascii')
        decoded = base64.decodebytes(text_bytes)
        # Keep the decoded file in memory
        result = BytesIO(decoded)
        # Open the OLE container and read the raw WordDocument stream
        ole = olefile.OleFileIO(result)
        y = ole.openstream('WordDocument').read()
        y = y.decode('latin-1', errors='ignore')
        # Replace every character outside the ranges below (newline, printable ASCII and Latin letters) with an asterisk. This can probably be shortened by combining it with the pattern used in the next step
        y = re.sub(r'[^\x0A,\u00c0-\u00d6,\u00d8-\u00f6,\u00f8-\u02af,\u1d00-\u1d25,\u1d62-\u1d65,\u1d6b-\u1d77,\u1d79-\u1d9a,\u1e00-\u1eff,\u2090-\u2094,\u2184-\u2184,\u2488-\u2490,\u271d-\u271d,\u2c60-\u2c7c,\u2c7e-\u2c7f,\ua722-\ua76f,\ua771-\ua787,\ua78b-\ua78c,\ua7fb-\ua7ff,\ufb00-\ufb06,\x20-\x7E]', r'*', y)
        # Isolate the body text from the surrounding gibberish
        p = re.compile(r'\*{300,433}((?:[^*]|\*(?!\*{14}))+?)\*{15,}')
        matches = re.findall(p, y)
        # Remove any '*' left inside the capture group
        result = matches[0].replace('*', '')

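In my case I needed to make sure accented characters were not lost during decoding, and since my documents are in English, Spanish and Portuguese, I chose to decode with latin-1. From there I used a regex pattern to identify the text I wanted. After decoding, I found that in all of my documents the capture group is preceded by ~400 '*' and a ':'. I am not sure whether this is the norm for all doc documents decoded this way, but I used it as a starting point to build a regex pattern that isolates the desired text from the rest of the gibberish.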
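
To make the two ideas in this explanation concrete, here is a small sketch that runs on synthetic input rather than a real .doc stream. It reuses the regex from the answer; the helper name `isolate_body` and the sample string are assumptions made for illustration. It also returns `None` when no match is found, whereas indexing the first match directly would raise an `IndexError`.

```python
import re

# Pattern from the answer: a long run of '*' padding, the body text,
# then a closing run of at least 15 '*'.
BODY_PATTERN = re.compile(r'\*{300,433}((?:[^*]|\*(?!\*{14}))+?)\*{15,}')


def isolate_body(masked):
    """Hypothetical helper: return the first captured body with stray '*'
    removed, or None when the padding pattern is not present."""
    match = BODY_PATTERN.search(masked)
    if match is None:
        return None
    return match.group(1).replace('*', '')


# latin-1 maps every byte to exactly one code point, so decoding never fails
# and single-byte accented characters survive the round trip.
raw = 'José: currículo'.encode('latin-1')
text = raw.decode('latin-1')

# Synthetic stand-in for a masked WordDocument stream (not real .doc data).
sample = 'hdr' + '*' * 400 + text + '*' * 20 + 'trailer'
body = isolate_body(sample)
print(body)
```

Whether real documents always carry roughly 400 asterisks of padding is, as noted above, unverified, so the `None` branch is worth handling for files that do not follow the pattern.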

【Discussion】: