How to read a borderless table in Java using tabula 1.0.3 (answers)

【Question title】: How to read a borderless table using tabula 1.0.3 in Java
【Posted】: 2021-07-19 08:21:12
【Question description】:

I am trying to read data from a text-based PDF using Tabula. In some PDFs the table has no visible bottom border. Is there a way to read such a PDF?

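            import java.io.File;
            import java.util.List;
            import org.apache.pdfbox.pdmodel.PDDocument;
            import technology.tabula.ObjectExtractor;
            import technology.tabula.Page;
            import technology.tabula.RectangularTextContainer;
            import technology.tabula.Table;
            import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

            // try-with-resources closes the document once extraction is done
            try (PDDocument pd = PDDocument.load(new File(filename))) {
                int totalPages = pd.getNumberOfPages();
                System.out.println("Total Pages in Document: " + totalPages);

                ObjectExtractor oe = new ObjectExtractor(pd);
                SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();

                // extract the first page (page numbers start at 1)
                Page page = oe.extract(1);
                // detect the ruled tables on the page and extract their text
                List<Table> tables = sea.extract(page);
                System.out.println("table*** " + tables.size());

                for (Table table : tables) {
                    List<List<RectangularTextContainer>> rows = table.getRows();
                    // starts at 1, so the first row of each table is skipped
                    for (int i = 1; i < rows.size(); i++) {
                        List<RectangularTextContainer> cells = rows.get(i);
                        for (int j = 0; j < cells.size(); j++) {
                            System.out.print(cells.get(j).getText() + "|");
                        }
                        System.out.println(); // end the printed row
                    }
                }
            }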

【Question comments】:

Tags: pdf text-extraction tabula

【Solution 1】:

SpreadsheetExtractionAlgorithm expects the table to be completely delimited by visible borders.

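You could perhaps use a combination of SpreadsheetExtractionAlgorithm and BasicExtractionAlgorithm. Some hints:

Use Page.getArea(...) to obtain a region of the Page.
Use BasicExtractionAlgorithm on the region below the detected table.
Merge the extracted data.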
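
The geometry and merging parts of these hints can be sketched with the standard library alone. This is a rough illustration, not tabula's API: the class name TableRegions and the helpers areaBelow and mergeRows are hypothetical. It assumes that tabula's Page and Table are java.awt.geom.Rectangle2D subclasses with a top-left origin (y grows downward), so "below the table" means larger y values, and that each extracted row has already been converted to a list of cell strings. The rectangle returned by areaBelow would supply the bounds for Page.getArea(...), and the rows found there by BasicExtractionAlgorithm would be the second argument to mergeRows.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper class; none of these names come from tabula.
final class TableRegions {

    // Returns the strip of the page directly below the detected table,
    // limited to the table's horizontal extent. Assumes y grows downward.
    // Empty when the table already reaches the bottom of the page.
    static Optional<Rectangle2D.Float> areaBelow(Rectangle2D page, Rectangle2D table) {
        float top = (float) table.getMaxY();
        float bottom = (float) page.getMaxY();
        if (bottom <= top) {
            return Optional.empty();
        }
        return Optional.of(new Rectangle2D.Float(
                (float) table.getMinX(), top, (float) table.getWidth(), bottom - top));
    }

    // Appends the rows found below the ruled table to the ruled table's rows,
    // dropping rows whose cells are all null or blank.
    static List<List<String>> mergeRows(List<List<String>> ruled, List<List<String>> unruled) {
        List<List<String>> merged = new ArrayList<>(ruled);
        for (List<String> row : unruled) {
            boolean blank = row.stream().allMatch(s -> s == null || s.trim().isEmpty());
            if (!blank) {
                merged.add(row);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up coordinates, for illustration only.
        Rectangle2D.Float page = new Rectangle2D.Float(0, 0, 612, 792);
        Rectangle2D.Float table = new Rectangle2D.Float(50, 100, 400, 200);
        System.out.println(areaBelow(page, table));

        List<List<String>> ruled = Arrays.asList(Arrays.asList("a", "b"));
        List<List<String>> unruled = Arrays.asList(
                Arrays.asList("", " "), Arrays.asList("c", "d"));
        System.out.println(mergeRows(ruled, unruled));
    }
}
```

Restricting the strip to the table's own left and right edges is a design choice that keeps unrelated text beside the table out of the second pass; widening it to the full page width may work better when the borderless rows are wider than the ruled part.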

【Comments】: