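As long as the figures in question are properly tagged (as they are in your example document), you can determine their bounding boxes on the basis of the PDFBox PDFGraphicsStreamEngine.
In fact, you can determine the bounding box of all the content of a page using the BoundingBoxFinder from the linked answer (which is based on PDFGraphicsStreamEngine); you merely have to retrieve the bounding box information marked content sequence by marked content sequence.
The following class does this by storing the bounding box information in a hierarchy of MarkedContent objects: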
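    import java.awt.geom.Rectangle2D;
    import java.io.IOException;
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    import org.apache.pdfbox.cos.COSDictionary;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDPage;

    // BoundingBoxFinder comes from the linked answer; import it from wherever you keep it.
    // The rectangle field and the add(...) method used below are inherited from it.
    public class MarkedContentBoundingBoxFinder extends BoundingBoxFinder {
        public MarkedContentBoundingBoxFinder(PDPage page) {
            super(page);
            // The artificial root entry collects everything on the page.
            contents.add(content);
        }

        @Override
        public void processPage(PDPage page) throws IOException {
            super.processPage(page);
            // Close the artificial root entry so it receives the box of the whole page.
            endMarkedContentSequence();
        }

        @Override
        public void beginMarkedContentSequence(COSName tag, COSDictionary properties) {
            MarkedContent current = contents.getLast();
            // Content seen since the last begin/end belongs to the enclosing sequence.
            if (rectangle != null) {
                if (current.boundingBox != null)
                    add(current.boundingBox);
                current.boundingBox = rectangle;
            }
            // Start collecting from scratch for the new sequence.
            rectangle = null;
            MarkedContent newContent = new MarkedContent(tag, properties);
            contents.addLast(newContent);
            current.children.add(newContent);

            super.beginMarkedContentSequence(tag, properties);
        }

        @Override
        public void endMarkedContentSequence() {
            MarkedContent current = contents.removeLast();
            if (rectangle != null) {
                // Merge the new content with what the sequence had already collected.
                if (current.boundingBox != null)
                    add(current.boundingBox);
                current.boundingBox = (Rectangle2D) rectangle.clone();
            } else if (current.boundingBox != null)
                rectangle = (Rectangle2D) current.boundingBox.clone();
            // Either way, rectangle now carries this sequence's box (if any) on to the parent.

            super.endMarkedContentSequence();
        }

        public static class MarkedContent {
            public MarkedContent(COSName tag, COSDictionary properties) {
                this.tag = tag;
                this.properties = properties;
            }

            public final COSName tag;
            public final COSDictionary properties;
            public final List<MarkedContent> children = new ArrayList<>();
            public Rectangle2D boundingBox = null;
        }

        // Root of the hierarchy; its properties are null because it is not a real sequence.
        public final MarkedContent content = new MarkedContent(COSName.DOCUMENT, null);
        // Stack of the currently open marked content sequences, innermost last.
        public final Deque<MarkedContent> contents = new ArrayDeque<>();
    }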
(MarkedContentBoundingBoxFinder utility class)
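To see the begin/end bookkeeping in isolation, here is a minimal sketch that needs only the JDK. It is not the class above and does not use PDFBox: MarkedContentStackSketch, Node, draw, begin, end and finish are hypothetical stand-ins, where draw plays the role of the graphics engine reporting painted content, and the union logic is assumed to match what BoundingBoxFinder's add does.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in (no PDFBox): replays begin/end events on plain rectangles.
class MarkedContentStackSketch {
    static class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();
        Rectangle2D boundingBox; // stays null if the sequence has no content

        Node(String tag) {
            this.tag = tag;
        }
    }

    final Node root = new Node("Document");
    private final Deque<Node> stack = new ArrayDeque<>();
    private Rectangle2D rectangle; // content seen since the last begin/end

    MarkedContentStackSketch() {
        stack.add(root);
    }

    // Stand-in for the engine reporting a painted area.
    void draw(double x, double y, double w, double h) {
        Rectangle2D r = new Rectangle2D.Double(x, y, w, h);
        if (rectangle == null)
            rectangle = r;
        else
            rectangle.add(r);
    }

    void begin(String tag) {
        Node current = stack.getLast();
        if (rectangle != null) {
            // Pending content belongs to the enclosing sequence.
            if (current.boundingBox != null)
                rectangle.add(current.boundingBox);
            current.boundingBox = rectangle;
        }
        rectangle = null;
        Node child = new Node(tag);
        stack.addLast(child);
        current.children.add(child);
    }

    void end() {
        Node current = stack.removeLast();
        if (rectangle != null) {
            if (current.boundingBox != null)
                rectangle.add(current.boundingBox);
            current.boundingBox = (Rectangle2D) rectangle.clone();
        } else if (current.boundingBox != null) {
            // Nothing new was drawn: hand the stored box on to the parent.
            rectangle = (Rectangle2D) current.boundingBox.clone();
        }
    }

    // Closes the root entry, like the overridden processPage does.
    void finish() {
        end();
    }

    public static void main(String[] args) {
        MarkedContentStackSketch sketch = new MarkedContentStackSketch();
        sketch.begin("Figure");
        sketch.draw(0, 0, 10, 10);
        sketch.end();
        sketch.begin("P");
        sketch.draw(20, 2, 5, 5);
        sketch.end();
        sketch.finish();
        System.out.println(sketch.root.tag + " " + sketch.root.boundingBox);
        for (Node child : sketch.root.children)
            System.out.println(" " + child.tag + " " + child.boundingBox);
    }
}
```

The point of the sketch is that each sequence keeps a clone of its own box while the running rectangle is handed on to the parent, so the root ends up with the union of all its children.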
You can apply the MarkedContentBoundingBoxFinder to a PDPage pdPage like this:
    MarkedContentBoundingBoxFinder boxFinder = new MarkedContentBoundingBoxFinder(pdPage);
    boxFinder.processPage(pdPage);
    MarkedContent markedContent = boxFinder.content;
(excerpt from the DetermineBoundingBox helper method drawMarkedContentBoundingBoxes)
You can output the bounding boxes from that markedContent object like this:
    void printMarkedContentBoundingBoxes(MarkedContent markedContent, String prefix) {
        StringBuilder builder = new StringBuilder();
        builder.append(prefix).append(markedContent.tag.getName());
        builder.append(' ').append(markedContent.boundingBox);
        System.out.println(builder.toString());
        for (MarkedContent child : markedContent.children)
            printMarkedContentBoundingBoxes(child, prefix + " ");
    }
(DetermineBoundingBox helper method)
For your example document, you get:
    Document java.awt.geom.Rectangle2D$Double[x=90.35800170898438,y=758.10498046875,w=128.63946533203125,h=10.2509765625]
     Figure java.awt.geom.Rectangle2D$Double[x=90.35800170898438,y=758.10498046875,w=44.6771240234375,h=10.2509765625]
     P java.awt.geom.Rectangle2D$Double[x=136.79600524902344,y=760.1184081963065,w=43.137100359018405,h=6.383056943803922]
     Figure java.awt.geom.Rectangle2D$Double[x=184.2926788330078,y=758.10498046875,w=34.70478820800781,h=10.2509765625]
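If you only need the figures, you can walk the hierarchy and keep the boxes of the entries with the matching tag. The following is a rough, JDK-only sketch: FigureBoxCollector and its Node class are hypothetical stand-ins for MarkedContent with a plain String tag, and the sample values are rounded from the output above. With the real class you would presumably compare against markedContent.tag.getName() instead.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects the boxes of all entries carrying a given tag.
class FigureBoxCollector {
    // Stand-in for MarkedContent, reduced to what the traversal needs.
    static class Node {
        final String tag;
        final Rectangle2D boundingBox;
        final List<Node> children = new ArrayList<>();

        Node(String tag, Rectangle2D boundingBox) {
            this.tag = tag;
            this.boundingBox = boundingBox;
        }
    }

    static List<Rectangle2D> collect(Node node, String tag) {
        List<Rectangle2D> result = new ArrayList<>();
        collect(node, tag, result);
        return result;
    }

    // Depth-first walk; entries without content (null box) are skipped.
    private static void collect(Node node, String tag, List<Rectangle2D> result) {
        if (tag.equals(node.tag) && node.boundingBox != null)
            result.add(node.boundingBox);
        for (Node child : node.children)
            collect(child, tag, result);
    }

    // Sample hierarchy with values rounded from the example output.
    static Node sample() {
        Node document = new Node("Document", new Rectangle2D.Double(90.358, 758.105, 128.639, 10.251));
        document.children.add(new Node("Figure", new Rectangle2D.Double(90.358, 758.105, 44.677, 10.251)));
        document.children.add(new Node("P", new Rectangle2D.Double(136.796, 760.118, 43.137, 6.383)));
        document.children.add(new Node("Figure", new Rectangle2D.Double(184.293, 758.105, 34.705, 10.251)));
        return document;
    }

    public static void main(String[] args) {
        for (Rectangle2D box : collect(sample(), "Figure"))
            System.out.println(box);
    }
}
```

Keep in mind that these boxes are in the coordinate system the finder reports, so no conversion is attempted here.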
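Similarly, you can draw the bounding boxes onto the PDF using the drawMarkedContentBoundingBoxes method of DetermineBoundingBox. For your example document, this frames the same marked content sequences whose boxes are listed above.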