What is the equivalent of PDFTextStripper in pdfbox snapshot 2.0

【Question title】: What is the equivalent of PDFTextStripper in pdfbox snapshot 2.0
【Posted】: 2015-09-12 21:01:25
【Question description】:

I am currently using pdfbox 1.8 to analyze PDF documents. Below is a very stripped-down example of what I am doing.

 import java.util.List;
 import java.io.IOException;
 import javax.swing.JFileChooser;
 import org.apache.pdfbox.pdmodel.PDDocument;
 import org.apache.pdfbox.pdmodel.PDPage;
 import org.apache.pdfbox.pdmodel.common.PDStream;

 public class Main 
 {
   private static PDDocument reader;

   public static void main(String[] args)
   {
       JFileChooser chooser = new JFileChooser();
       int result = chooser.showOpenDialog(null);
       if(result == JFileChooser.APPROVE_OPTION)
       {
           try
           {
               reader = PDDocument.load(chooser.getSelectedFile());
               for(int pagenum = 1; pagenum <= reader.getNumberOfPages(); pagenum++)
               {
                   System.out.println("===== Page:" + pagenum + " ======");
                   System.out.println(extract(pagenum));
               }

           }
           catch(Exception e) { e.printStackTrace(); }

       }
   }

   public static String extract(int pagenum) throws IOException
   {
       List allPages = reader.getDocumentCatalog().getAllPages();
       PDPage page = (PDPage) allPages.get(pagenum-1);
       PDStream contents = page.getContents();
       CustomPDFTextStripper stripper = new CustomPDFTextStripper();        
       if (contents != null) 
       {
           stripper.processStream(page, page.findResources(), page.getContents().getStream());
       }
       return stripper.getContents();
   }
 }

and

 import org.apache.pdfbox.util.PDFTextStripper;
 import java.io.IOException;
 import org.apache.pdfbox.util.TextPosition;

 public class CustomPDFTextStripper extends PDFTextStripper
 {
   private final StringBuilder builder;
   private float lastBase;
   public CustomPDFTextStripper() throws IOException
   {
       super.setSortByPosition(true);
       builder = new StringBuilder();
       lastBase = Float.MAX_VALUE;
   }

   public String getContents() { return builder.toString(); }

   @Override
   protected void processTextPosition(TextPosition textPos)
   {
       float ascent = textPos.getY();
       if(ascent > lastBase)
           builder.append("\n");
       lastBase = textPos.getY() + textPos.getHeight();
       builder.append(textPos.getCharacter());
       // I want to be able to do stuff here and
       // I need to read spaces and newline characters
   }
 }

I can't seem to find an equivalent solution in the pdfbox 2.0 snapshot (I know it is unstable and not yet released). I tried using something like this:

 CustomPDFTextStripper stripper = new CustomPDFTextStripper();        
 StringWriter dummy = new StringWriter();
 stripper.setPageStart(""+(pagenum-1));
 stripper.setPageEnd(""+(pagenum-1));
 stripper.writeText(reader, dummy);

But it does not handle spaces, and it does not deliver accurate textPos data in the processTextPosition method.

Any ideas on how to get all the same TextPosition data in 2.0 as in 1.8?

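========== Edit, June 26, 2015, 8:00 PM CST ===========

OK, I spent some time looking into it and found the problem. getWidthOfSpace() returns wildly different results in 1.8 and 2.0.

In 1.8 it is about 2.49, with a character width of about 5.

In 2.0 it is about 27.5, with a character width of about 5.

Clearly the 27.5 in 2.0 is wrong.

Just run the following test and you will see: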

 @Override
 protected void processTextPosition(TextPosition textPos)
 {
    float spaceWidth = textPos.getWidthOfSpace();
    float width = textPos.getWidth();
    System.out.println(textPos.getCharacter() + " - Width of Space=" + spaceWidth + " - width=" + width);
    builder.append(textPos.getCharacter());
 }

(Of course, 2.0 uses getUnicode() instead of getCharacter().)

===== Edit, June 27, 2015, 8:00 PM CST ======

Here is a link to the PDF used in the test: Hello World

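【Comments】:

If this works in 1.8 but not in 2.0 (especially with the PrintTextLocations example), then please open an issue in JIRA and attach your PDF.
If you downloaded it just today, there is also a temporary bug. Try a revision from 2 days ago. Or use the current version, revert the last change in the file BaseParser.java (rev 1687653), and build it again. Or watch the comments in PDFBOX-2301; it may be fixed by the end of this week.
I didn't check 2.0 with the code above - I checked my complex original code, and the TextPosition data was wrong. I'll check the code above tomorrow night and follow your suggestions.
Please explain what you mean by *the TextPosition data was wrong*. Also, regarding *I didn't check 2.0 with the code above*: please provide code that can be used to reproduce the problem.
Can you share a sample PDF to reproduce it?

Tags: java pdf pdfbox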

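【Solution 1】:

There is indeed a bug in the current space width calculation. PDFTextStreamEngine.showGlyph(Matrix, PDFont, int, String, Vector) currently (this is a SNAPSHOT, so things may be different by tonight) calculates the width like this: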

float horizontalScalingText = getGraphicsState().getTextState().getHorizontalScaling()/100f;
[...]
// the space width has to be transformed into display units
float spaceWidthDisplay = spaceWidthText * fontSizeText * horizontalScalingText *
        textRenderingMatrix.getScalingFactorX()  * ctm.getScalingFactorX();

(PDFTextStreamEngine.java, revision 1688116)

But textRenderingMatrix has already been built in PDFStreamEngine.showText(byte[]) like this:

float horizontalScaling = textState.getHorizontalScaling() / 100f;
[...]
Matrix parameters = new Matrix(
        fontSize * horizontalScaling, 0, // 0
        0, fontSize,                     // 0
        0, textState.getRise());         // 1
[...]
Matrix textRenderingMatrix = parameters.multiply(textMatrix).multiply(ctm);

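(PDFStreamEngine.java, revision 1688116)

Thus, both the font size and the horizontal scaling are multiplied into the space width twice. Furthermore, the current transformation matrix is both fully multiplied into textRenderingMatrix and partially applied again as ctm.getScalingFactorX(); this can make for the most interesting combined results.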

Most likely it suffices to remove these values as explicit factors from the spaceWidthDisplay calculation in PDFTextStreamEngine.showGlyph(Matrix, PDFont, int, String, Vector).
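As a rough illustration of the double counting described above, the following sketch reduces every matrix to a single axis-aligned X scaling factor (so no rotation or skew) and compares the snapshot-style formula with a variant that drops the explicit factors. The class and method names are hypothetical, and the numbers (a space of 250/1000 text units, font size 11, no further scaling) are made up for the example; this is not PDFBox code.

```java
// Simplified scalar model: each matrix is reduced to one X scaling factor.
// All names and values here are hypothetical and only illustrate the arithmetic.
class SpaceWidthDoubleCounting {

    /** X scaling of the text rendering matrix when every matrix is a pure axis-aligned scaling. */
    static float renderingScaleX(float fontSize, float hScale,
                                 float textMatrixScaleX, float ctmScaleX) {
        return fontSize * hScale * textMatrixScaleX * ctmScaleX;
    }

    /** Mirrors the snapshot formula: explicit factors on top of the rendering matrix scaling. */
    static float snapshotStyle(float spaceWidthText, float fontSize, float hScale,
                               float textMatrixScaleX, float ctmScaleX) {
        float trmScaleX = renderingScaleX(fontSize, hScale, textMatrixScaleX, ctmScaleX);
        return spaceWidthText * fontSize * hScale * trmScaleX * ctmScaleX;
    }

    /** Variant with the explicit factors removed: the rendering matrix scaling is used alone. */
    static float withoutExplicitFactors(float spaceWidthText, float fontSize, float hScale,
                                        float textMatrixScaleX, float ctmScaleX) {
        return spaceWidthText * renderingScaleX(fontSize, hScale, textMatrixScaleX, ctmScaleX);
    }

    public static void main(String[] args) {
        float spaceWidthText = 0.25f; // made-up: 250 glyph units / 1000
        float fontSize = 11f;         // made-up font size
        float hScale = 1f;            // 100% horizontal scaling
        float textMatrixScaleX = 1f;  // identity text matrix
        float ctmScaleX = 1f;         // identity CTM

        System.out.println("snapshot-style:           "
                + snapshotStyle(spaceWidthText, fontSize, hScale, textMatrixScaleX, ctmScaleX));
        System.out.println("without explicit factors: "
                + withoutExplicitFactors(spaceWidthText, fontSize, hScale, textMatrixScaleX, ctmScaleX));
    }
}
```

In this model the snapshot-style result is larger than the other one by the font size times the horizontal scaling times the CTM scaling, which is exactly the part that is counted twice.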

In version 1.8.9, the space width was calculated in PDFStreamEngine.processEncodedText(byte[]) like this:

float spaceWidthDisp = spaceWidthText * fontSizeText * horizontalScalingText 
                        * textMatrix.getXScale() * ctm.getXScale();

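For interesting current transformation and text matrices this also produces interesting results, but at least the factors in question above are not multiplied in twice.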
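One way to see why separate X scaling factors can give odd results for less trivial matrices: the scaling of a combined matrix is in general not the product of the scalings of its parts. The sketch below uses a tiny hand-written 2x2 matrix class (hypothetical, not the PDFBox Matrix class) and assumes the X scaling factor is the length of the transformed x unit vector. It combines a 90 degree rotation as the text matrix with a CTM that stretches only the x axis.

```java
// Hypothetical minimal model of the linear part of a PDF matrix [a b; c d],
// using the row-vector convention, so m1.multiply(m2) applies m1 first.
class ScalingFactorDemo {

    static final class Mat {
        final double a, b, c, d;

        Mat(double a, double b, double c, double d) {
            this.a = a; this.b = b; this.c = c; this.d = d;
        }

        /** Returns this x other. */
        Mat multiply(Mat o) {
            return new Mat(a * o.a + b * o.c, a * o.b + b * o.d,
                           c * o.a + d * o.c, c * o.b + d * o.d);
        }

        /** Assumed definition: length of the image of the x unit vector. */
        double scalingFactorX() {
            return Math.hypot(a, b);
        }
    }

    /** Separate factors multiplied together, as in the 1.8.9-style formula. */
    static double productOfSeparateFactors(Mat textMatrix, Mat ctm) {
        return textMatrix.scalingFactorX() * ctm.scalingFactorX();
    }

    /** Scaling factor taken from the combined matrix. */
    static double factorOfCombinedMatrix(Mat textMatrix, Mat ctm) {
        return textMatrix.multiply(ctm).scalingFactorX();
    }

    public static void main(String[] args) {
        Mat rotate90 = new Mat(0, 1, -1, 0); // text runs upwards
        Mat stretchX = new Mat(2, 0, 0, 1);  // page stretched along x only

        System.out.println("product of separate factors: "
                + productOfSeparateFactors(rotate90, stretchX));
        System.out.println("factor of combined matrix:   "
                + factorOfCombinedMatrix(rotate90, stretchX));
    }
}
```

Here the text runs along the y axis, which the CTM does not stretch, so the combined matrix leaves the advance direction unscaled while the product of the separate factors still picks up the x stretch.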

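【Comments】:

Thanks for the information - I was able to get a correct space width with the following: float spaceWidth = textPos.getFont().getSpaceWidth() * textPos.getTextMatrix().getScaleX() / 1000; I will mark your answer as the solution and keep an eye on future SNAPSHOTs - thanks
*I was able to get a correct space width with the following* - I'm afraid that will only work sometimes, as it ignores a number of factors.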