【Question title】: Tabula-py for borderless table extraction
【Posted】: 2023-03-29 17:10:02
【Question】:

Can anyone suggest how to extract tabular data from a PDF with a Python or Java program, for the borderless table shown below in the PDF file?

【Comments】:

    Tags: python-3.x pdftotext tabula


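    【Answer 1】:

    This table may be a bit difficult for tabula. How about using guess=False, stream=True?

    Update: as of tabula-py 1.0.3, guess and stream should work together. You no longer need to set guess=False in order to use the stream or lattice option.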

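    【Comments】:

    • Hi @chezou, thanks for your comment. I tried your answer with the code below, tabula.convert_into("/Downloads/Test_Invoices/Invoice4.pdf", "/Downloads/Test_Invoices/Invoice4.csv", output_format="csv",spreadsheets=True,guess=False, stream=True), but no table was extracted.
    • Hi @chezou, do you know of any other Python/Java libraries for this?
    • I suggest setting the pages option. By default, tabula-py only reads page 1.
    • Hi @chezou, how do I do that? I'm not very familiar with specifying these parameter values.
    • Pass pages="all" or pages=2 to read_pdf() or convert_into(). For further details, you had better read the manual, github.com/chezou/tabula-py/blob/master/README.md, or you can check the test code, github.com/chezou/tabula-py/blob/master/tests/…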
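As a rough sketch of the advice in the comments above, the options can be collected in one place and passed to `tabula.read_pdf()`. The helper names `build_read_options` and `extract_tables` are made up for this illustration and are not part of tabula-py; only `read_pdf` and its `pages`, `stream`, and `guess` keywords come from the library. It assumes tabula-py and a Java runtime are installed.

```python
def build_read_options(pages="all", stream=True, guess=True):
    """Collect keyword arguments for tabula.read_pdf().

    Hypothetical helper for this example, not a tabula-py function.
    pages may be an int (2), a list ([1, 3]) or a string ("all", "1-3").
    """
    return {"pages": pages, "stream": stream, "guess": guess}


def extract_tables(pdf_path, **options):
    """Read tables from pdf_path with stream mode enabled by default."""
    # Imported lazily so build_read_options() can be used (and tested)
    # on machines without tabula-py or Java.
    import tabula

    return tabula.read_pdf(pdf_path, **build_read_options(**options))
```

A call such as `extract_tables("Invoice4.pdf", pages=2)` would then read only the second page. Whether stream mode finds the table still depends on the PDF's layout.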
    【Answer 2】:

    I solved this problem with tabula-py:

    conda install tabula-py
    

    >>> import tabula
    >>> area = [70, 30, 750, 570]  # Seems to have to be done manually
    >>> page2 = tabula.read_pdf("nar_2021_editorial-2.pdf", guess=False, lattice=False,
    ...                         stream=True, multiple_tables=False, area=area, pages="all")
    >>> # The `tabula` docs explain the parameters very well
    >>> page2
    

    I got this result:

    > 'pages' argument isn't specified.Will extract only from page 1 by default.
    > [        ShortTitle                                              Text  \
    > 0       Arena3Dweb         3D visualisation of multilayered networks
    > 1          Aviator       Monitoring the availability of web services
    > 2         b2bTools  Predictions for protein biophysical features and
    > 3              NaN                                their conservation
    > 4          BENZ WS          Four-level Enzyme Commission (EC) number
    > ..             ...                                               ...
    > 68  miRTargetLink2              miRNA target gene and target pathway
    > 69             NaN                                          networks
    > 70       mmCSM-PPI            Effects of multiple point mutations on
    > 71             NaN                      protein-protein interactions
    > 72        ModFOLD8           Quality estimates for 3D protein models
    >
    >                                                URL
    > 0                    http://bib.fleming.gr/Arena3D
    > 1          https://www.ccb.uni-saarland.de/aviator
    > 2                    https://bio2byte.be/b2btools/
    > 3                                              NaN
    > 4                 https://benzdb.biocomp.unibo.it/
    > ..                                             ...
    > 68  https://www.ccb.uni-saarland.de/mirtargetlink2
    > 69                                             NaN
    > 70          http://biosig.unimelb.edu.au/mmcsm ppi
    > 71                                             NaN
    > 72       https://www.reading.ac.uk/bioinf/ModFOLD/
    >
    > [73 rows x 3 columns]]
    

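    This is an iterable object, so you can work through it with for row in page2: (the enclosing brackets in the output show it is a list, so each item is a whole table rather than a single row).

    Hope this helps.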
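Since the `area` above had to be found by hand, here is a minimal sketch of one way to derive it from page margins. tabula-py's `area` option takes `[top, left, bottom, right]` in PDF points, measured from the top-left corner of the page. The helper name `area_from_margins` is invented for this example, and the US Letter page size (612 x 792 points) is an assumption, as the answer does not state the page size of its PDF.

```python
def area_from_margins(page_width, page_height, top, left, bottom, right):
    """Return [top, left, bottom, right] in points for tabula's `area` option.

    Hypothetical helper: each margin is the distance from the matching
    page edge to the region that contains the table.
    """
    area = [top, left, page_height - bottom, page_width - right]
    if not (area[0] < area[2] and area[1] < area[3]):
        raise ValueError("margins leave no room for a table region")
    return area


# Assumed page size: US Letter, 612 x 792 points (1 point = 1/72 inch).
LETTER_WIDTH, LETTER_HEIGHT = 612, 792
area = area_from_margins(LETTER_WIDTH, LETTER_HEIGHT,
                         top=70, left=30, bottom=42, right=42)
print(area)  # [70, 30, 750, 570]
```

The margins still have to be measured or estimated for each document. The helper only removes the arithmetic from the step.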

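    【Comments】:

      【Answer 3】:

      Borderless table extraction with tabula-py:

      tabula-py has a stream option. When it is set to True, tables are detected from the gaps (whitespace) between columns rather than from ruling lines.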

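      from tabula import convert_into

      src_pdf = r"src_path"  # path to the input PDF
      des_csv = r"des_path"  # path for the output CSV
      # stream=True: find columns from whitespace gaps; pages="all": read every page
      convert_into(src_pdf, des_csv, guess=False, lattice=False, stream=True, pages="all")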
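To give a feel for what detecting a table "based on gaps" means, here is a toy sketch in plain Python. It is not tabula's actual algorithm, which works on the positions of text in the PDF. It only shows the general idea on fixed-width text lines: character columns that are blank in every row are treated as separators. The function names and the sample rows are made up for this illustration.

```python
def find_column_gaps(lines, min_gap=2):
    """Return (start, end) spans of character columns blank in every line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    blank = [all(row[i] == " " for row in padded) for i in range(width)]

    gaps, start = [], None
    # The trailing False sentinel closes a gap that runs to the end.
    for i, is_blank in enumerate(blank + [False]):
        if is_blank and start is None:
            start = i
        elif not is_blank and start is not None:
            # Ignore narrow gaps and leading indentation (start == 0).
            if i - start >= min_gap and start > 0:
                gaps.append((start, i))
            start = None
    return gaps


def split_row(line, gaps):
    """Cut one text line into cells at the detected gaps."""
    cells, pos = [], 0
    for start, end in gaps:
        cells.append(line[pos:start].strip())
        pos = end
    cells.append(line[pos:].strip())
    return cells


# Made-up sample: a borderless table as whitespace-aligned text.
lines = [
    "Name      Qty   Price",
    "Widget    2     9.50",
    "Gadget    10    19.99",
]
gaps = find_column_gaps(lines)
for line in lines:
    print(split_row(line, gaps))
```

A cell whose text wraps onto a second line leaves the other cells of that row empty, which is one plausible reason for the NaN entries that stream-mode output often contains.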
      
           
      

      【Comments】:
