[Posted]: 2021-07-19 08:21:12
[Question]:
I am trying to use Tabula to read data from text-based PDFs. In some of these PDFs, the table has no visible bottom border. Is there a way to read such PDFs?
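import java.io.File;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.RectangularTextContainer;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

// Fragment from inside a method that declares `throws IOException`; `filename` is defined elsewhere.
try (PDDocument pd = PDDocument.load(new File(filename))) {
    int totalPages = pd.getNumberOfPages();
    System.out.println("Total Pages in Document: " + totalPages);
    ObjectExtractor oe = new ObjectExtractor(pd);
    // Lattice-style extraction: cells are derived from the ruling lines drawn on the page
    SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
    Page page = oe.extract(1); // pages are 1-indexed
    // Extract text from the tables after detecting them
    List<Table> tables = sea.extract(page);
    System.out.println("table*** " + tables.size());
    for (Table table : tables) {
        List<List<RectangularTextContainer>> rows = table.getRows();
        // Starts at 1, so the first row (presumably the header) is skipped
        for (int i = 1; i < rows.size(); i++) {
            List<RectangularTextContainer> cells = rows.get(i);
            for (int j = 0; j < cells.size(); j++) {
                System.out.print(cells.get(j).getText() + "|");
            }
            System.out.println(); // end each row on its own line
        }
    }
}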
[Discussion]:
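`SpreadsheetExtractionAlgorithm` builds cells from the ruling lines on the page, so a missing bottom border can plausibly leave the last row undetected. Tabula also ships `BasicExtractionAlgorithm` (its stream mode), which infers rows and columns from text positions instead of ruling lines, and may be worth trying on these PDFs. The sketch below is not Tabula code. It is a minimal, self-contained illustration of the underlying idea, grouping text chunks into rows by their vertical position with no borders involved. The `Chunk` class, the `groupIntoRows` helper, the sample coordinates, and the tolerance value are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RowGrouping {
    /** Hypothetical stand-in for a positioned piece of text; not a Tabula type. */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups chunks into rows using only their coordinates.
     * A chunk joins the current row when its y is within `tolerance`
     * of the y of the chunk that started that row.
     */
    static List<List<String>> groupIntoRows(List<Chunk> chunks, double tolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y));

        List<List<Chunk>> rows = new ArrayList<>();
        double rowY = 0;
        for (Chunk c : sorted) {
            if (rows.isEmpty() || Math.abs(c.y - rowY) > tolerance) {
                rows.add(new ArrayList<>()); // start a new row
                rowY = c.y;
            }
            rows.get(rows.size() - 1).add(c);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Chunk> row : rows) {
            // Order cells left to right within the row
            row.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            List<String> texts = new ArrayList<>();
            for (Chunk c : row) {
                texts.add(c.text);
            }
            result.add(texts);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up coordinates for a three-row table with no ruling lines
        List<Chunk> chunks = Arrays.asList(
            new Chunk("Qty", 120, 10), new Chunk("Item", 20, 10),
            new Chunk("Apple", 20, 30.4), new Chunk("3", 120, 30),
            new Chunk("Pear", 20, 50), new Chunk("5", 120, 50.2));
        for (List<String> row : groupIntoRows(chunks, 2.0)) {
            System.out.println(String.join("|", row));
        }
    }
}
```

Because rows come from text positions alone, the last row is found whether or not a bottom border is drawn. The trade-off is that a tolerance that is too large merges adjacent rows, and one that is too small splits a single row.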
Tags: pdf text-extraction tabula