[Question title]: Convert a PDF file to an image
[Posted]: 2013-08-13 21:38:29
[Question description]:

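I want to convert a PDF document to an image. I am using Ghost4J.

Problem: Ghost4J needs the gsdll32.dll file at runtime, and I do not want to use that DLL.

Question 1: Is there any way to do the conversion in Ghost4J without the DLL?

Question 2: I found a solution in the PDFBox API. The class org.apache.pdfbox.pdmodel.PDPage has a method convertToImage() that converts a PDF page to an image.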

PDDocument doc = PDDocument.load(new File("/document.pdf"));
List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
PDPage page = pages.get(0);
BufferedImage image = page.convertToImage();
File outputfile = new File("/image.png");
ImageIO.write(image, "png", outputfile);
doc.close();

My PDF document contains only text. When I run this code I get an exception:

Aug 12, 2013 6:00:24 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Exception in thread "main" java.lang.ExceptionInInitializerError
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getawtFont(PDTrueTypeFont.java:481)
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.drawString(PDSimpleFont.java:109)
    at org.apache.pdfbox.pdfviewer.PageDrawer.processTextPosition(PageDrawer.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:496)
    at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:125)
    at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:781)
    at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:712)
    at ge.eid.esignature.adessa.pades.sign.PDFtoImage.main(PDFtoImage.java:25)
Caused by: java.lang.IllegalArgumentException
    at java.nio.Buffer.position(Buffer.java:216)
    at sun.font.TrueTypeFont.lookupName(TrueTypeFont.java:1153)
    at sun.font.TrueTypeFont.getPostscriptName(TrueTypeFont.java:1205)
    at java.awt.Font.getPSName(Font.java:1156)
    at org.apache.pdfbox.pdmodel.font.FontManager.loadFonts(FontManager.java:101)
    at org.apache.pdfbox.pdmodel.font.FontManager.<clinit>(FontManager.java:53)
    ... 13 more

[Question discussion]:

    Tags: java pdf pdf-generation pdfbox ghost4j


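    [Solution 1]:

    You can easily convert the pages of the 04-Request-Headers.pdf file to images.

    The following converts every page of a PDF to an image in Java using PDFBox.

    Solution for Apache PDFBox 1.8.*:

    Required JAR: pdfbox-1.8.3.jar

    or the Maven dependency: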

    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>1.8.3</version>
    </dependency>
    

    The solution is as follows:

    package com.pdf.pdfbox.examples;
    
    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.util.List;
    
    import javax.imageio.ImageIO;
    
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    
    @SuppressWarnings("unchecked")
    public class ConvertPDFPagesToImages {
        public static void main(String[] args) {
            try {
                String sourceDir = "C:/Documents/04-Request-Headers.pdf"; // path of the PDF file to convert
                String destinationDir = "C:/Documents/Converted_PdfFiles_to_Image/"; // converted page images are saved here
    
                File sourceFile = new File(sourceDir);
                File destinationFile = new File(destinationDir);
                if (!destinationFile.exists()) {
                    destinationFile.mkdir();
                    System.out.println("Folder Created -> "+ destinationFile.getAbsolutePath());
                }
                if (sourceFile.exists()) {
                    System.out.println("Images copied to Folder: "+ destinationFile.getName());
                    PDDocument document = PDDocument.load(sourceDir);
                    List<PDPage> list = document.getDocumentCatalog().getAllPages();
                    System.out.println("Total files to be converted -> "+ list.size());
    
                    String fileName = sourceFile.getName().replace(".pdf", "");
                    int pageNumber = 1; // page numbers in the output file names start at 1
                    for (PDPage page : list) {
                        BufferedImage image = page.convertToImage();
                        File outputfile = new File(destinationDir + fileName +"_"+ pageNumber +".png");
                        System.out.println("Image Created -> "+ outputfile.getName());
                        ImageIO.write(image, "png", outputfile);
                        pageNumber++;
                    }
                    document.close();
                    System.out.println("Converted Images are saved at -> "+ destinationFile.getAbsolutePath());
                } else {
                    System.err.println(sourceFile.getName() +" File not exists");
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
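
    A side note on the file naming in the loop above: String.replace(".pdf", "") removes every occurrence of ".pdf" in the name and is case-sensitive. Below is a minimal sketch of a stricter helper, assuming the name_N.ext pattern used above. PageImageNames and its method names are made up for illustration and are not part of PDFBox.

```java
import java.util.Locale;

public class PageImageNames {
    /** Strips one trailing ".pdf" (any letter case) from a file name, if present. */
    public static String baseName(String pdfFileName) {
        if (pdfFileName.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            return pdfFileName.substring(0, pdfFileName.length() - 4);
        }
        return pdfFileName;
    }

    /** Builds the output name for a page; pageIndex is zero-based, the name is one-based. */
    public static String imageName(String pdfFileName, int pageIndex, String extension) {
        return baseName(pdfFileName) + "_" + (pageIndex + 1) + "." + extension;
    }

    public static void main(String[] args) {
        System.out.println(imageName("04-Request-Headers.pdf", 0, "png"));
    }
}
```

    With this, a file such as my.pdf.notes.PDF keeps the inner ".pdf" and loses only the extension.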
    

    The image can be written in jpg, jpeg, png, bmp, or gif format.

    Note: I have listed only the most commonly used image formats.

    ImageIO.write(image , "jpg", new File( destinationDir +fileName+"_"+pageNumber+".jpg" ));
    ImageIO.write(image , "jpeg", new File( destinationDir +fileName+"_"+pageNumber+".jpeg" ));
    ImageIO.write(image , "png", new File( destinationDir +fileName+"_"+pageNumber+".png" ));
    ImageIO.write(image , "bmp", new File( destinationDir +fileName+"_"+pageNumber+".bmp" ));
    ImageIO.write(image , "gif", new File( destinationDir +fileName+"_"+pageNumber+".gif" ));
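
    One thing worth knowing about the calls above: ImageIO.write returns false, rather than throwing, when no installed writer can handle the given image and format. Depending on the JDK, this may happen for images with an alpha channel written as bmp or jpg. Below is a small JDK-only sketch for checking this up front. ImageFormatCheck is a made-up helper name, and the blank 10x10 image stands in for a rendered page.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

public class ImageFormatCheck {
    /** Returns true if ImageIO found a writer for this image and format name. */
    public static boolean canWrite(BufferedImage image, String format) throws IOException {
        ImageIO.setUseCache(false); // keep the encoding in memory, no temp files
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        return ImageIO.write(image, format, sink);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a rendered page: an opaque RGB image.
        BufferedImage rgb = new BufferedImage(10, 10, BufferedImage.TYPE_INT_RGB);
        for (String format : new String[] {"jpg", "jpeg", "png", "bmp", "gif"}) {
            System.out.println(format + " -> " + canWrite(rgb, format));
        }
    }
}
```

    If the result is false for your image type, nothing is written, so it is worth checking the return value of ImageIO.write in the conversion loop as well.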
    

    Console output:

    Images copied to Folder: Converted_PdfFiles_to_Image
    Total files to be converted -> 13
    Aug 06, 2014 1:35:49 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: i
    Image Created -> 04-Request-Headers_1.png
    Aug 06, 2014 1:35:50 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: i
    Image Created -> 04-Request-Headers_2.png
    Aug 06, 2014 1:35:51 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: i
    Image Created -> 04-Request-Headers_3.png
    Aug 06, 2014 1:35:51 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: i
    Image Created -> 04-Request-Headers_4.png
    Aug 06, 2014 1:35:52 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: i
    Image Created -> 04-Request-Headers_5.png
    Aug 06, 2014 1:35:52 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: i
    Image Created -> 04-Request-Headers_6.png
    Aug 06, 2014 1:35:53 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: i
    Image Created -> 04-Request-Headers_7.png
    Aug 06, 2014 1:35:53 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: i
    Image Created -> 04-Request-Headers_8.png
    Aug 06, 2014 1:35:54 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: i
    Image Created -> 04-Request-Headers_9.png
    Aug 06, 2014 1:35:54 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: i
    Image Created -> 04-Request-Headers_10.png
    Aug 06, 2014 1:35:54 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: i
    Image Created -> 04-Request-Headers_11.png
    Aug 06, 2014 1:35:55 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: i
    Image Created -> 04-Request-Headers_12.png
    Aug 06, 2014 1:35:55 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: i
    Image Created -> 04-Request-Headers_13.png
    Converted Images are saved at -> C:\Documents\Converted_PdfFiles_to_Image
    

    Solution for Apache PDFBox 2.0.*:

    Required JARs: pdfbox-2.0.16.jar, fontbox-2.0.16.jar, commons-logging-1.2.jar

    or the pom.xml dependencies:

    <!-- https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox -->
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.16</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.pdfbox/fontbox -->
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>fontbox</artifactId>
        <version>2.0.16</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/commons-logging/commons-logging -->
    <dependency>
        <groupId>commons-logging</groupId>
        <artifactId>commons-logging</artifactId>
        <version>1.2</version>
    </dependency>
    

    Solution for version 2.0.16:

    package com.pdf.pdfbox.examples;
    
    import java.awt.image.BufferedImage;
    import java.io.File;
    
    import javax.imageio.ImageIO;
    
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.rendering.ImageType;
    import org.apache.pdfbox.rendering.PDFRenderer;
    
    /**
     * 
     * @author venkataudaykiranp
     * 
     * @version 2.0.16(Apache PDFBox version support)
     *
     */
    public class ConvertPDFPagesToImages {
        public static void main(String[] args) {
            try {
                String sourceDir = "C:\\Users\\venkataudaykiranp\\Downloads\\04-Request-Headers.pdf"; // path of the PDF file to convert
                String destinationDir = "C:\\Users\\venkataudaykiranp\\Downloads\\Converted_PdfFiles_to_Image/"; // converted page images are saved here
    
                File sourceFile = new File(sourceDir);
                File destinationFile = new File(destinationDir);
                if (!destinationFile.exists()) {
                    destinationFile.mkdir();
                    System.out.println("Folder Created -> "+ destinationFile.getAbsolutePath());
                }
                if (sourceFile.exists()) {
                    System.out.println("Images copied to Folder Location: "+ destinationFile.getAbsolutePath());             
                    PDDocument document = PDDocument.load(sourceFile);
                    PDFRenderer pdfRenderer = new PDFRenderer(document);
    
                    int numberOfPages = document.getNumberOfPages();
                    System.out.println("Total files to be converting -> "+ numberOfPages);
    
                    String fileName = sourceFile.getName().replace(".pdf", "");             
                    String fileExtension= "png";
                    /*
                     * 600 dpi gives better image clarity, but each image file is roughly three times
                     * the size of its 300 dpi version.
                     * Ex:  1. For 300dpi 04-Request-Headers_2.png expected size is 797 KB
                     *      2. For 600dpi 04-Request-Headers_2.png expected size is 2.42 MB
                     */
                    int dpi = 300; // use a lower dpi to save disk space; for professional use you can go above 300 dpi
    
                    for (int i = 0; i < numberOfPages; ++i) {
                        File outPutFile = new File(destinationDir + fileName +"_"+ (i+1) +"."+ fileExtension);
                        BufferedImage bImage = pdfRenderer.renderImageWithDPI(i, dpi, ImageType.RGB);
                        ImageIO.write(bImage, fileExtension, outPutFile);
                    }
    
                    document.close();
                    System.out.println("Converted Images are saved at -> "+ destinationFile.getAbsolutePath());
                } else {
                    System.err.println(sourceFile.getName() +" File not exists");
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
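
    To get a feel for what the dpi setting costs: PDF page dimensions are given in points (1/72 inch), so pixel dimensions grow linearly with dpi and the pixel count grows with its square. The sketch below only does that arithmetic. RenderSize is a made-up name, the US Letter page size is an example, and PDFBox's own rounding may differ by a pixel.

```java
public class RenderSize {
    /** PDF user space uses 72 units ("points") per inch. */
    static final float POINTS_PER_INCH = 72f;

    /** Approximate pixel length of a page side given in points, rendered at the given dpi. */
    public static int pixels(float points, int dpi) {
        return Math.round(points * dpi / POINTS_PER_INCH);
    }

    /** Uncompressed size in bytes of an RGB raster (3 bytes per pixel). */
    public static long rawRgbBytes(float widthPt, float heightPt, int dpi) {
        return 3L * pixels(widthPt, dpi) * pixels(heightPt, dpi);
    }

    public static void main(String[] args) {
        // US Letter is 612 x 792 points (8.5 x 11 inches).
        for (int dpi : new int[] {72, 150, 300, 600}) {
            System.out.println(dpi + " dpi -> " + pixels(612f, dpi) + " x " + pixels(792f, dpi)
                    + " px, " + rawRgbBytes(612f, 792f, dpi) + " raw bytes");
        }
    }
}
```

    Going from 300 to 600 dpi quadruples the raw pixel data. PNG compression is why the files on disk grow by less than that, as with the 797 KB and 2.42 MB figures quoted in the code comment.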
    

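    [Discussion]:

    • I get this error: May 26, 2015 11:43:31 AM org.apache.pdfbox.util.PDFStreamEngine processOperator INFO: unsupported/disabled operation: BDC May 26, 2015 11:43:31 AM org.apache.pdfbox.util.PDFStreamEngine processOperator INFO: unsupported/disabled operation: EMC. I am using the pdfbox 1.8.9 jar.
    • The latest version of PDFBox is slightly different. Use the PDFRenderer class.
    • There is a problem with PDFs that contain both text and image content. In the final image generated from the input PDF, I have seen the text dropped and only the image parts (background images and so on) shown. Any help with this would be appreciated.
    • @yeppe Please provide a link to the PDF file and I will take a look.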
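    [Solution 2]:

    You can try the non-sequential parser to avoid errors with some PDF files (those with incremental updates):

    // PDFBox 1.8.x; the second argument is the scratch file, null keeps the data in memory
    PDDocument doc = PDDocument.loadNonSeq(new File("/document.pdf"), null);

    [Discussion]:

    • Thanks a lot, this helped me.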
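    [Solution 3]:

    Going through PDFBox is a good way to avoid native bindings. Try PDFImageWriter from PDFBox; I did the same thing in a few lines and it works well. You have to load the PDDocument and pass it to the writer.

    // PDFBox 1.8.x: writeImage is an instance method, as in the PDFImageWriter usage in Solution 6.
    // The original snippet left the start page blank; 1 (the first page) is an assumption.
    new PDFImageWriter().writeImage(doc, "png", null, 1, Integer.MAX_VALUE, "picture");
    

    Works for all pages.

    new PDFImageWriter().writeImage(doc, "png", null, 0, 0, "picture");
    

    See: PDFImageWriter Javadoc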

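    [Discussion]:

    • It throws the same exception! :(
    • Is PDFImageWriter more reliable than ImageIO? I would rather use ImageIO because it looks simpler... unless it is less reliable
    • In my experience this does not write any images from the PDF. Can you confirm? I.e. my PDF contains an image, but it does not show up in the PNG
    • Where does PDFImageWriter write the generated images?
    • ImageIO is what you use with PDFBox 2.0 and later. See the migration guide: pdfbox.apache.org/2.0/migration.html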
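    [Solution 4]:

    You may have tried to convert a corrupted PDF file. I got the same error when the PDF file contained JPXEncoded streams.

    [Discussion]:

    • Several PDF parsers now have jbig2 decoders and should be able to handle this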
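    [Solution 5]:

    You can easily convert a PDF to images with PDFBox. The renderImageWithDPI method of PDFBox's PDFRenderer class renders a PDF page to an image.

    // PDFBox 2.0.x; ImageIOUtil comes from the separate pdfbox-tools artifact
    PDDocument doc = PDDocument.load(new File("filepath/sample.pdf"));
    PDFRenderer pdfRenderer = new PDFRenderer(doc);
    int pageNo = 0; // zero-based page index
    BufferedImage bim = pdfRenderer.renderImageWithDPI(pageNo, 300, ImageType.RGB);
    String fileName = "image-" + (pageNo + 1) + ".png";
    ImageIOUtil.writeImage(bim, fileName, 300);
    doc.close();
    

    [Discussion]: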

    [Solution 6]:

    // PDFBox 1.8.x API. PdfInfo is the poster's own settings holder.
    try {
        PDDocument document = PDDocument.load(PdfInfo.getPDFWAY());
        if (document.isEncrypted()) {
            document.decrypt(PdfInfo.getPASSWORD());
        }
        // Map the configured color mode to a BufferedImage type.
        if ("bilevel".equalsIgnoreCase(PdfInfo.getCOLOR())) {
            PdfInfo.setIMAGETYPE(BufferedImage.TYPE_BYTE_BINARY);
        } else if ("indexed".equalsIgnoreCase(PdfInfo.getCOLOR())) {
            PdfInfo.setIMAGETYPE(BufferedImage.TYPE_BYTE_INDEXED);
        } else if ("gray".equalsIgnoreCase(PdfInfo.getCOLOR())) {
            PdfInfo.setIMAGETYPE(BufferedImage.TYPE_BYTE_GRAY);
        } else if ("rgb".equalsIgnoreCase(PdfInfo.getCOLOR())) {
            PdfInfo.setIMAGETYPE(BufferedImage.TYPE_INT_RGB);
        } else if ("rgba".equalsIgnoreCase(PdfInfo.getCOLOR())) {
            PdfInfo.setIMAGETYPE(BufferedImage.TYPE_INT_ARGB);
        } else {
            System.exit(2); // unknown color mode
        }
        PDFImageWriter imageWriter = new PDFImageWriter();
        boolean success = imageWriter.writeImage(document, PdfInfo.getIMAGE_FORMAT(), PdfInfo.getPASSWORD(),
                PdfInfo.getSTART_PAGE(), PdfInfo.getEND_PAGE(), PdfInfo.getOUTPUT_PREFIX(),
                PdfInfo.getIMAGETYPE(), PdfInfo.getRESOLUTION());
        if (!success) {
            System.exit(1);
        }
        document.close();
    } catch (IOException | CryptographyException | InvalidPasswordException ex) {
        Logger.getLogger(PdfToImae.class.getName()).log(Level.SEVERE, null, ex);
    }

    public class PdfInfo {
        private static String PDFWAY;
        private static String OUTPUT_PREFIX;
        private static String PASSWORD;
        private static int START_PAGE=1;
        private static int END_PAGE=Integer.MAX_VALUE;
        private static String IMAGE_FORMAT="jpg";
        private static String COLOR="rgb";
        private static int RESOLUTION=256;
        private static int IMAGETYPE=24;
        private static String filename;
        private static String filePath="";
        // getters and setters are not shown in the original post
    }
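
    The color-mode chain above can also be written as a small lookup that fails with an exception instead of calling System.exit. This is only a sketch using JDK classes. ColorModeMapper and imageTypeFor are made-up names; the mode strings and BufferedImage constants are the ones from the code above.

```java
import java.awt.image.BufferedImage;
import java.util.Locale;

public class ColorModeMapper {
    /** Maps a color mode name (case-insensitive) to a BufferedImage type constant. */
    public static int imageTypeFor(String color) {
        switch (color.toLowerCase(Locale.ROOT)) {
            case "bilevel": return BufferedImage.TYPE_BYTE_BINARY;
            case "indexed": return BufferedImage.TYPE_BYTE_INDEXED;
            case "gray":    return BufferedImage.TYPE_BYTE_GRAY;
            case "rgb":     return BufferedImage.TYPE_INT_RGB;
            case "rgba":    return BufferedImage.TYPE_INT_ARGB;
            default:
                // Let the caller decide how to react instead of exiting the JVM here.
                throw new IllegalArgumentException("Unknown color mode: " + color);
        }
    }

    public static void main(String[] args) {
        System.out.println(imageTypeFor("rgb") == BufferedImage.TYPE_INT_RGB);
    }
}
```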
      

    [Discussion]:

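    [Solution 7]:

    For the error:

    org.apache.pdfbox.util.PDFStreamEngine processOperator INFO: unsupported/disabled operation

    In addition to the Apache pdfbox jar, you need the fontbox-1.7.1 jar on the classpath. This will solve your problem, because PDFBox uses fontbox-1.7.1 internally.

    [Discussion]:

    • The "INFO: unsupported/disabled operation" message is harmless and can be ignored. Nobody should be using 1.7.1. The current version is 2.0.8.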
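
    If the harmless INFO lines clutter the console, they can probably be filtered at the logging level. The sketch below assumes PDFBox's commons-logging output ends up in java.util.logging, which the timestamp-then-"INFO:" layout of the messages in this thread suggests. With a different logging backend (log4j, for example) the configuration would have to go there instead. QuietPdfBoxLogging is a made-up class name.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietPdfBoxLogging {
    // Keep a strong reference: java.util.logging holds loggers weakly, so a configured
    // logger that nobody references can be garbage-collected together with its level.
    private static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");

    /** Raises the threshold for all org.apache.pdfbox.* loggers to WARNING. */
    public static void silenceInfo() {
        PDFBOX_LOGGER.setLevel(Level.WARNING);
    }

    public static void main(String[] args) {
        silenceInfo();
        // Child loggers inherit the level from the org.apache.pdfbox parent.
        Logger engine = Logger.getLogger("org.apache.pdfbox.util.PDFStreamEngine");
        System.out.println("INFO enabled: " + engine.isLoggable(Level.INFO));
        System.out.println("WARNING enabled: " + engine.isLoggable(Level.WARNING));
    }
}
```

    Call silenceInfo() once before loading the document. Warnings and errors from PDFBox still get through.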