【Posted】: 2023-03-29 17:10:02
【Question】:
【Question discussion】:
Tags: python-3.x pdftotext tabula
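This table may be a bit difficult for tabula. How about trying guess=False, stream=True?
Update: as of tabula-py 1.0.3, guess and stream should work together. You no longer need to set guess=False in order to use the stream or lattice option.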
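A minimal sketch of the suggestion above, assuming tabula-py and a Java runtime are installed. The helper name `read_stream_tables` is hypothetical, and the import sits inside the function so the snippet loads even where tabula-py is missing.

```python
def read_stream_tables(pdf_path, pages=1):
    """Read borderless tables from a PDF using tabula's stream mode."""
    import tabula  # provided by the tabula-py package; needs Java at runtime

    # stream=True asks tabula to infer columns from whitespace gaps
    # instead of ruling lines.
    # guess=False turns off tabula's automatic guess of the page portion
    # to analyze; per the update above it is optional from tabula-py 1.0.3.
    return tabula.read_pdf(pdf_path, pages=pages, stream=True, guess=False)
```

Calling `read_stream_tables("invoice.pdf")` (a placeholder file name) should return whatever `tabula.read_pdf` returns for that file.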
【Discussion】:
tabula.convert_into("/Downloads/Test_Invoices/Invoice4.pdf", "/Downloads/Test_Invoices/Invoice4.csv", output_format="csv", spreadsheets=True, guess=False, stream=True) I tried your answer, but no tables were extracted.
Check the pages option. By default, tabula-py sets it to 1.
Pass pages="all" or pages=2 to read_pdf() or convert_into(). For further details, it is best to read the manual at github.com/chezou/tabula-py/blob/master/README.md, or you can check the test code at github.com/chezou/tabula-py/blob/master/tests/…
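As a rough illustration of the pages advice, here is a sketch that converts every page of a PDF to CSV. It assumes tabula-py and Java are available; `convert_all_pages` is a hypothetical helper name and the paths you pass in are your own.

```python
def convert_all_pages(pdf_path, csv_path):
    """Write the tables from every page of pdf_path into one CSV file."""
    import tabula  # provided by the tabula-py package; needs Java at runtime

    # pages="all" overrides the default of extracting page 1 only.
    tabula.convert_into(
        pdf_path,
        csv_path,
        output_format="csv",
        pages="all",
        stream=True,
    )
    return csv_path
```

To extract a single page instead, pass an integer such as pages=2 in place of "all".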
I solved this problem with tabula-py:
conda install tabula-py
and
>>> import tabula
>>> area = [70, 30, 750, 570]  # Seems to have to be set manually
>>> page2 = tabula.read_pdf("nar_2021_editorial-2.pdf", guess=False, lattice=False,
...                         stream=True, multiple_tables=False, area=area, pages="all",
...                         )  # The `tabula` docs explain the parameters very well
>>> page2
I got this result:
> 'pages' argument isn't specified. Will extract only from page 1 by default.
> [        ShortTitle                                              Text  \
> 0       Arena3Dweb         3D visualisation of multilayered networks
> 1          Aviator       Monitoring the availability of web services
> 2         b2bTools  Predictions for protein biophysical features and
> 3              NaN                                their conservation
> 4          BENZ WS           Four-level Enzyme Commission (EC) number
> ..             ...                                               ...
> 68  miRTargetLink2              miRNA target gene and target pathway
> 69             NaN                                          networks
> 70       mmCSM-PPI            Effects of multiple point mutations on
> 71             NaN                      protein-protein interactions
> 72        ModFOLD8           Quality estimates for 3D protein models
>
>                                                URL
> 0                    http://bib.fleming.gr/Arena3D
> 1          https://www.ccb.uni-saarland.de/aviator
> 2                    https://bio2byte.be/b2btools/
> 3                                              NaN
> 4                 https://benzdb.biocomp.unibo.it/
> ..                                             ...
> 68  https://www.ccb.uni-saarland.de/mirtargetlink2
> 69                                             NaN
> 70          http://biosig.unimelb.edu.au/mmcsm ppi
> 71                                             NaN
> 72       https://www.reading.ac.uk/bioinf/ModFOLD/
>
> [73 rows x 3 columns]]
This is an iterable object, so you can work with it via for row in page2:
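Judging from the square brackets in the printed result, page2 appears to be a list holding one DataFrame, so a plain for loop over it would yield whole tables rather than rows. A minimal sketch under that assumption; `iter_rows` is a hypothetical helper, and it relies only on each table exposing the pandas `iterrows()` method.

```python
def iter_rows(tables):
    """Yield (table_index, row_index, row) for every row of every table.

    `tables` is assumed to be a list of pandas DataFrames, such as the
    value printed for page2 above.
    """
    for table_index, frame in enumerate(tables):
        # DataFrame.iterrows() yields (index, Series) pairs, one per row.
        for row_index, row in frame.iterrows():
            yield table_index, row_index, row
```

With the result above, `for _, _, row in iter_rows(page2): print(row["ShortTitle"], row["URL"])` would walk the rows one by one.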
Hope this helps.
【Discussion】:
Extracting borderless tables with tabula-py:
tabula-py has a stream option; when it is set to True, tables are detected from the gaps between columns.
from tabula import convert_into
src_pdf = r"src_path"
des_csv = r"des_path"
convert_into(src_pdf, des_csv, guess=False, lattice=False, stream=True, pages="all")
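An earlier comment in this thread reports a conversion that ran but extracted no tables. One quick way to check the output file is sketched here using only the standard library. `count_csv_rows` is a hypothetical helper, and the sketch assumes the CSV is UTF-8 encoded.

```python
import csv


def count_csv_rows(csv_path):
    """Return the number of rows that contain at least one non-blank cell."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return sum(
            1
            for row in csv.reader(handle)
            if any(cell.strip() for cell in row)
        )
```

If `count_csv_rows(des_csv)` returns 0 after the conversion, nothing was extracted, and it may be worth trying different pages, area, or lattice/stream settings.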
【Discussion】: