How to parse a table in a PDF for a non-English language

【Question title】: How to parse table in PDF for non-english language
【Posted】: 2020-12-13 20:00:01
【Question description】:

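I am using Camelot and tabula to parse PDF files that contain Cyrillic characters. In the output CSV file, however, I get garbled characters instead of the Russian text.

What can help me parse PDF tables in a non-English language?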

import camelot
file = 'file-name.pdf'
tables = camelot.read_pdf(file, pages = "1-end", encoding='utf-8')

Output: 00550529-1295-06-UP. Р§Р§45

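【Comments】:

Does this answer your question? How to get data from pdf in Cyrillic?
Please post a sample PDF.
@mutantkeyboard The approach shown there doesn't work at all.
@StefanoFiorucci-anakin87 I've already found an answer. It parses the pages and converts them into a pandas DataFrame, which works well for me.

Tags: python-3.x parsing pdf python-camelot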

【Solution 1】:

So, basically, Camelot handles Cyrillic well.

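# Shell command, run once before using the script:
#   pip install "camelot-py[cv]"
import pandas as pd
import camelot

file = 'file-name.pdf'
# Read only pages 4 and 5 of the document
tables = camelot.read_pdf(file, pages="4,5", encoding='utf-8')
# First detected table as a pandas DataFrame
df_p4 = tables[0].df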

The output will be quite raw and will need cleaning up, but the characters are not corrupted, which I consider a good result.
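As a side note, the garbled sample in the question ("Р§Р§45") looks like UTF-8 bytes that were displayed as Windows-1251, which suggests the text may have been extracted correctly and then misread when the CSV was opened. That is only an assumption about the asker's setup. The sketch below uses only the standard library; `repair_mojibake` and `rows_to_csv_bytes` are hypothetical helper names introduced for illustration, not part of Camelot or tabula.

```python
import csv
import io


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Windows-1251.

    Assumption: the garbling came from exactly this mix-up.
    """
    return text.encode("cp1251").decode("utf-8")


def rows_to_csv_bytes(rows) -> bytes:
    """Serialize rows to CSV encoded as UTF-8 with a BOM.

    The BOM ("utf-8-sig") helps spreadsheet programs detect the encoding.
    """
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode("utf-8-sig")


# The garbled fragment from the question, repaired
print(repair_mojibake("Р§Р§45"))

# Cyrillic rows written so that a spreadsheet program can read them
data = rows_to_csv_bytes([["код", "значение"], ["ЧЧ45", "1"]])
print(data[:3])  # the UTF-8 byte order mark
```

With a pandas DataFrame such as `df_p4` above, the equivalent step would be `df_p4.to_csv("page4.csv", encoding="utf-8-sig")`, where the file name is just a placeholder.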

【Discussion】: