Running multiple PDFs through a PDFBox program

【Question title】: Running multiple PDFs through a PDFBox program
【Posted】: 2020-10-14 23:13:45
【Question description】:

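I am trying to use PDFBox in Eclipse to run several PDF files in a folder through a text reader. The reader extracts certain terms and writes them to a text file, which I then convert to an Excel sheet. At the moment I have this program, which works for a single PDF file: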

public static void main(String args[]) throws IOException {

  // Load an existing document
  File file = new File("ADE_acetylfuranoside_120319_pfister.pdf");
  PDDocument document = PDDocument.load(file);

  // Instantiate the PDFTextStripper class
  PDFTextStripper pdfStripper = new PDFTextStripper();

  // Retrieve the text from the PDF document
  String text = pdfStripper.getText(document);

  // ... "actual code that extracts the text" ...

  // Redirect System.out to output.txt and print the result there
  PrintStream o = new PrintStream(new File("output.txt"));
  PrintStream console = System.out;
  System.setOut(o);
  System.out.println(finalSheet);
}

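My problem is that I want to run 500 PDFs in one folder through this program in Eclipse without typing in each file name individually. I would also like the output to look like this: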

Name1, Number1, ID1
Name2, Number2, ID2

But I think that, the way it is written now, running multiple PDFs would just keep overwriting the first line.

Thanks for your help!

【Question discussion】:

Tags: java eclipse path

【Solution 1】:

For the first part, you can simply use the File class with a FileFilter:

// directoryName could be as simple as "."
File folder = new File(directoryName);
File[] listOfFiles = folder.listFiles(new FileFilter() {
    @Override
    public boolean accept(File pathname) {
        // Keep only files whose name ends in .pdf (case-insensitive)
        return pathname.getName().toLowerCase().endsWith(".pdf");
    }
});

This gives you an array of File objects, one for each PDF file in the given folder/directory. Now you can loop over it with the code you already have.
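As a rough illustration of that loop, here is a minimal sketch that uses only the standard library. The class name `PdfFolderLoop`, the methods `listPdfFiles` and `processAll`, and the `extractor` parameter are hypothetical names chosen for this example. In the real program, `extractor` would presumably wrap the per-file code from the question (`PDDocument.load` followed by `PDFTextStripper.getText` and the term extraction).

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

final class PdfFolderLoop {

    // Returns the PDF files directly inside `folder`, sorted by path.
    // File.listFiles returns null when `folder` is not a readable
    // directory, so that case is mapped to an empty array.
    static File[] listPdfFiles(File folder) {
        File[] files = folder.listFiles(
                f -> f.isFile()
                        && f.getName().toLowerCase(Locale.ROOT).endsWith(".pdf"));
        if (files == null) {
            return new File[0];
        }
        Arrays.sort(files);
        return files;
    }

    // Applies `extractor` to every PDF in `folder` and collects one
    // result per file. `extractor` is a hypothetical stand-in for the
    // code that handles a single PDF.
    static List<String> processAll(File folder, Function<File, String> extractor) {
        List<String> results = new ArrayList<>();
        for (File pdf : listPdfFiles(folder)) {
            results.add(extractor.apply(pdf));
        }
        return results;
    }
}
```

Passing the per-file work in as a function keeps the folder traversal separate from the PDF handling, so the loop can be tried out with something as simple as `File::getName` before the PDFBox code is plugged in.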

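On the output side, you probably want to tie each output to its input. I am a little confused by your code, so I am guessing that you just want one output file per input file. So, perhaps, something like this: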

// index is the value you used to loop through the `listOfFiles` array
try (FileWriter fileWriter = new FileWriter(listOfFiles[index].getName() + ".output.txt")) {
    // text is the String you want in the file
    fileWriter.write(text);
}

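This creates a file named (taking the name from your example) "ADE_acetylfuranoside_120319_pfister.pdf.output.txt". Obviously, the naming can be changed. With this approach, a new file is created for each input file.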
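If the goal is instead a single combined file with one line per PDF, as the "Name1, Number1, ID1" example in the question suggests, one option is to open the writer once, before the loop, and write one row per input file. The following is only a sketch under that assumption. `CombinedOutput`, `writeSummary` and `rowFor` are hypothetical names, and `rowFor` stands in for whatever code turns one PDF into its formatted row.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Function;

final class CombinedOutput {

    // Writes one line per PDF to `outputFile`. The writer is opened once,
    // outside the loop, so later rows follow earlier ones instead of
    // replacing them. Any previous content of `outputFile` is discarded
    // when the writer is opened.
    static void writeSummary(List<File> pdfs, Path outputFile,
                             Function<File, String> rowFor) throws IOException {
        try (BufferedWriter out =
                     Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8)) {
            for (File pdf : pdfs) {
                // rowFor is assumed to return an already formatted row,
                // for example "Name1, Number1, ID1"
                out.write(rowFor.apply(pdf));
                out.newLine();
            }
        }
    }
}
```

The overwriting the question worries about comes from creating a new output stream for the same file on every run through the loop. Keeping a single writer open for the whole loop avoids that.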

【Discussion】: