What is the easiest way to extract data from a PDF? Answers

【Question title】: What is the easiest way to extract data from a PDF?
【Posted】: 2011-07-26 14:37:34
【Question description】:

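I need to extract data from some PDF documents (using Java). I need to know the easiest way to do it.

I tried iText. It is rather complicated for my needs. Besides, I believe it cannot be used freely in commercial projects, so it is not an option. I also tried PDFBox and ran into various NoClassDefFoundError errors.

I searched on Google and found several other options, such as PDF Clown and jPod, but I don't have time to try all of these libraries. I am relying on the community's experience with reading PDFs from Java.

Note that I don't need to create or manipulate PDF documents. I only need to extract text data from PDF documents with a moderate level of layout complexity.

Please suggest the quickest and easiest way to extract text from PDF documents. Thanks.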

【Question comments】:

Tags: java pdf

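【Solution 1】:

I recommend trying Apache Tika. Apache Tika is basically a toolkit that extracts data from many types of documents, including PDFs.

The benefits of Tika (besides being free) are that it used to be a subproject of Apache Lucene, which is a very robust open-source search engine. Tika includes a built-in PDF parser that uses a SAX content handler to pass PDF data to your application. It can also extract data from encrypted PDFs, and it allows you to create or subclass an existing parser to customize the behavior.

The code is simple. To extract data from a PDF, all you need to do is create a Parser class that implements the Parser interface and define a parse() method: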

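// HELLO_MIME_TYPE is assumed to be a String constant declared elsewhere in the parser class.
public void parse(
   InputStream stream, ContentHandler handler,
   Metadata metadata, ParseContext context)
   throws IOException, SAXException, TikaException {

   // Record the content type and any custom metadata for the document.
   metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE);
   metadata.set("Hello", "World");

   // Emit the content as XHTML SAX events; this skeleton writes an empty document.
   XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
   xhtml.startDocument();
   xhtml.endDocument();
}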

Then, to run the parser, you could do something like this:

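// Needs java.io.{File, FileInputStream, InputStream}, org.xml.sax.ContentHandler,
// org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext,
// org.apache.tika.parser.pdf.PDFParser and org.apache.tika.sax.BodyContentHandler.
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
// try-with-resources closes the stream even if parsing throws.
try (InputStream input = new FileInputStream(new File(resourceLocation))) {
    // The Parser interface declares the four-argument parse(), so pass a ParseContext.
    parser.parse(input, textHandler, metadata, new ParseContext());
}
System.out.println("Title: " + metadata.get("title"));
System.out.println("Author: " + metadata.get("Author"));
System.out.println("content: " + textHandler.toString());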
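
The answer mentions that Tika hands content to your application through a SAX content handler. As a rough illustration of what such a handler does, here is a minimal sketch that uses only the JDK's SAX classes. `TextCollector` and its `collect` helper are hypothetical names, not Tika classes, and the XHTML string passed in merely stands in for the events a parser would emit.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects character data from SAX events into plain text (illustrative only). */
final class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one text node in several chunks, so always append.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Treat the end of a paragraph as a line break in the plain-text output.
        if ("p".equals(qName)) {
            text.append('\n');
        }
    }

    String text() {
        return text.toString();
    }

    /** Feeds a small XHTML string through the JDK SAX parser and returns the collected text. */
    static String collect(String xhtml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XHTML", e);
        }
    }
}
```

Calling `TextCollector.collect("<html><body><p>Hello</p><p>World</p></body></html>")` yields the two words on separate lines. A handler of this kind is the place to filter or restructure text if the default body handler's output does not suit your layout.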

【Comments】:

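【Solution 2】:

I am using JPedal and I am very happy with the results. It is not free, but the quality is high, and its output for generating images from PDFs and for text extraction is very good.

Because it is a paid library, support is always available.

【Comments】:

Thanks @Mauricio, but unfortunately the library needs to be free. :-(
Believe me, I tried many free libraries, and none of them matched JPedal in performance or options. I believe the license is around $800, so it is fairly cheap for the features you get. If you really need this, you should ask your company to provide the best product available.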

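【Solution 3】:

I used PDFBox to extract text for Lucene indexing without much trouble. If I remember correctly, its error/warning logging is quite verbose. What is causing the errors you are getting?

【Comments】:

For Lucene, my IDE says the classes are not available. In fact, the entire searchengine package is unavailable. (I downloaded the latest PDFBox release from the Apache site.)
Next I tried using PDFParser. This is the error I got: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory at org.apache.pdfbox.pdfparser.BaseParser.<clinit>(BaseParser.java:58)
Thanks @Petteri. Lucene still doesn't work. The other errors are fixed. Now I need a good tutorial from which I can quickly learn how to extract text from PDF documents. Can you point me to some good tutorials?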
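
The NoClassDefFoundError quoted above names org/apache/commons/logging/LogFactory, which suggests that the Commons Logging JAR PDFBox depends on was not on the runtime classpath. A quick way to confirm which dependencies are missing before running the parser is sketched below. `ClasspathCheck` is a hypothetical helper written for this thread, not part of PDFBox, and the class names in `main` are simply the ones from the stack trace.

```java
import java.util.ArrayList;
import java.util.List;

/** Reports which classes cannot be found on the current classpath (illustrative only). */
final class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns the subset of the given class names that cannot be loaded. */
    static List<String> missing(String... classNames) {
        List<String> result = new ArrayList<>();
        for (String name : classNames) {
            if (!isAvailable(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace in the comment above.
        System.out.println("Missing: " + missing(
                "org.apache.commons.logging.LogFactory",
                "org.apache.pdfbox.pdfparser.BaseParser"));
    }
}
```

If LogFactory shows up as missing, adding the commons-logging JAR next to the PDFBox JAR (or letting a build tool such as Maven resolve PDFBox's dependencies) is the usual remedy.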

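【Solution 4】:

I know this post is quite old, but I would suggest using iText from here: http://sourceforge.net/projects/itext/ If you are using Maven, you can pull in the JAR from Maven Central: http://mvnrepository.com/artifact/com.itextpdf/itextpdf

I don't see how using it is difficult: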

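    // Needs com.itextpdf.text.pdf.PdfReader and com.itextpdf.text.pdf.parser.PdfTextExtractor.
    PdfReader pdf = new PdfReader("path to your pdf file");
    // In current iText 5 releases PdfTextExtractor cannot be instantiated;
    // getTextFromPage is a static method and pageNumber is 1-based.
    String output = PdfTextExtractor.getTextFromPage(pdf, pageNumber);
    assert output.contains("whatever you want to validate on that page");
    pdf.close();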

【Comments】:

PdfTextExtractor is private

【Solution 5】:

Import these classes and add the JAR file pdfbox-app 2.0.

   import org.openqa.selenium.WebDriver;
   import org.openqa.selenium.WebElement;
   import org.openqa.selenium.support.FindBy;
   import org.testng.Assert;
   import org.testng.annotations.Test;

   import java.io.File;
   import java.io.IOException;
   import java.text.ParseException;
   import java.util.List;

   import org.apache.log4j.Logger;
   import org.apache.log4j.PropertyConfigurator;
   import org.apache.pdfbox.pdmodel.PDDocument;
   import org.apache.pdfbox.text.PDFTextStripper;
   import org.openqa.selenium.By;
   import org.openqa.selenium.chrome.ChromeDriver;


   import com.coencorp.selenium.framework.BasePage;
   import com.coencorp.selenium.framework.ExcelReadWrite;
   import com.relevantcodes.extentreports.LogStatus;

Add this code to the class.

   public void showList() throws InterruptedException, IOException {
       showInspectionsLink.click();
       waitForElement(hideInspectionsLink);
       printButton.click();
       Thread.sleep(10000); // crude wait for the browser download to finish
       String downloadPath = "C:\\Users\\Updoer\\Downloads";
       File getLatestFile = getLatestFilefromDir(downloadPath);
       String fileName = getLatestFile.getName();
       Assert.assertTrue(fileName.equals("Inspections.pdf"),
               "Downloaded file name is not matching with expected file name");
       Thread.sleep(10000);
       //testVerifyPDFInURL();

       // The PDFBox part: load the document, report its page count, and print its text.
       // PDDocument is Closeable, so try-with-resources releases the file handle.
       try (PDDocument pd = PDDocument.load(new File("C:\\Users\\Updoer\\Downloads\\Inspections.pdf"))) {
           System.out.println("Total Pages:" + pd.getNumberOfPages());
           PDFTextStripper pdf = new PDFTextStripper();
           System.out.println(pdf.getText(pd));
       }
   }

Add these methods to the same class.

   public void testVerifyPDFInURL() {
       WebDriver driver = new ChromeDriver();
       driver.get("C:\\Users\\Updoer\\Downloads\\Inspections.pdf");
       driver.findElement(By.linkText("Adeeb Khan")).click();
       String getURL = driver.getCurrentUrl();
       Assert.assertTrue(getURL.contains(".pdf"));
   }

   private File getLatestFilefromDir(String dirPath) {
       File dir = new File(dirPath);
       File[] files = dir.listFiles();
       if (files == null || files.length == 0) {
           return null;
       }

       File lastModifiedFile = files[0];
       for (int i = 1; i < files.length; i++) {
           if (lastModifiedFile.lastModified() < files[i].lastModified()) {
               lastModifiedFile = files[i];
           }
       }
       return lastModifiedFile;
   }
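
The helper above returns whatever entry in the directory was modified last, and returns null when the directory is empty. If other downloads can land in the same folder, it may be safer to look only at PDF files before handing the result to PDFTextStripper. The following is a minimal sketch of that idea using java.nio. `LatestPdf` is a hypothetical name, and matching on the ".pdf" extension is an assumption about how the files are named.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

/** Finds the most recently modified PDF in a directory (illustrative only). */
final class LatestPdf {

    /** Returns the newest regular file ending in ".pdf", or empty if there is none. */
    static Optional<Path> find(Path dir) throws IOException {
        // Files.list holds a directory handle, so close the stream when done.
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".pdf"))
                    .max(Comparator.comparingLong(LatestPdf::modifiedMillis));
        }
    }

    /** Reads the last-modified time, rethrowing I/O failures as unchecked. */
    private static long modifiedMillis(Path p) {
        try {
            return Files.getLastModifiedTime(p).toMillis();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Returning an Optional makes the empty-directory case explicit at the call site, for example `LatestPdf.find(Paths.get(downloadPath)).orElseThrow(...)`, instead of risking a NullPointerException on `getName()`.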

【Comments】:

Of the code you posted, perhaps only 5 lines are relevant to the question at hand.