【Question title】: In need of a clear example on how to get the word count of DOC and DOCX files
【Posted】: 2014-05-05 18:38:05
【Question】:

I can read a DOC file and get its word count, but the count is wrong.

My code:

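import java.io.File;
import java.io.FileInputStream;

import org.apache.poi.hpsf.PropertySet;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordCounter {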
    public static void main(String[] args) throws Throwable {
        processDOC();
    }

    private static void processDOC() throws Throwable {
        File file = new File("/Users/yjiang/Desktop/whatever.doc");
        File file2 = new File("/Users/yjiang/Desktop/Test.docx");
        File file3 = new File("/Users/yjiang/Desktop/QB Tests 4-14-2014.xls");
        File file4 = new File("/Users/yjiang/Desktop/QB Tests 4-14-2014.xlsx");

        try {
            FileInputStream fs = new FileInputStream(file);
            POIFSFileSystem poifsFileSystem = new POIFSFileSystem(fs);
            DirectoryEntry directoryEntry = poifsFileSystem.getRoot();
            DocumentEntry documentEntry = (DocumentEntry) directoryEntry.getEntry(SummaryInformation.DEFAULT_STREAM_NAME);
            DocumentInputStream dis = new DocumentInputStream(documentEntry);
            PropertySet ps = new PropertySet(dis);
            SummaryInformation si = new SummaryInformation(ps);

            System.out.println(si.getWordCount());
        } catch (Exception e) {
            e.printStackTrace();
        }


        try {
            HWPFDocument hwpfDocument = new HWPFDocument(new FileInputStream(file));
            System.out.println(hwpfDocument.getDocProperties().getCWords()); // actually 71 words using word count in MSWord, returned 57.
            System.out.println(hwpfDocument.getDocProperties().getCWordsFtnEnd());
            XWPFDocument xwpfDocument = new XWPFDocument(new FileInputStream(file2)); // actually 71 words using word count in MSWord, returned 57.
            System.out.println(xwpfDocument.getProperties().getExtendedProperties().getUnderlyingProperties().getWords());



            System.out.println();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

"whatever.doc" has 71 words, but when I run this it returns only 57.

It also seems I can't use the same approach to read a DOCX file. When I try, I get the following:

org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)

Could someone provide an example?
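
The exception above is POI reporting that the file is a ZIP-based OOXML package rather than an OLE2 container, so the POIFS/HWPF classes cannot open it. As a rough, JDK-only sketch of that distinction, the leading bytes of a file can be checked before choosing between HWPFDocument and XWPFDocument. The names WordFormatSniffer and detect are made up for illustration and are not part of POI, and the check treats any ZIP file as OOXML, which is a simplification.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper (not a POI class): guesses the container format from a file's first bytes. */
final class WordFormatSniffer {
    enum Format { OLE2, OOXML, UNKNOWN }

    // OLE2 compound-file signature, used by .doc and .xls.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local-file-header signature "PK\3\4"; .docx and .xlsx are ZIP packages.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Classifies a header that has already been read into memory. */
    static Format detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Format.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Format.OOXML; // assumption: every ZIP is OOXML
        return Format.UNKNOWN;
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    static Format detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = new byte[8];
            int filled = 0;
            while (filled < header.length) {
                int read = in.read(header, filled, header.length - filled);
                if (read < 0) break; // file shorter than 8 bytes
                filled += read;
            }
            return detect(Arrays.copyOf(header, filled));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more file paths on the command line.
        for (String arg : args) {
            System.out.println(arg + " -> " + detect(Paths.get(arg)));
        }
    }
}
```

A genuine .doc should come back as OLE2 and a .docx as OOXML. A .docx that has merely been renamed to .doc would still come back as OOXML, which is the situation the exception describes.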

【Comments】:

  • If you ask Word to update the document statistics, does POI then see the correct value?
  • How do I get the document statistics updated? By the way, I'm on a Mac.
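
Some background on the comments above: the numbers returned by getCWords() and by the extended properties are statistics stored in the file by whichever application last saved it; POI reads them back and does not recount the text. In a DOCX package that value normally sits in a `<Words>` element inside docProps/app.xml. The sketch below reads it using only the JDK, to show where the number comes from. StoredWordCountDemo and its methods are hypothetical names, the regex stands in for proper XML parsing, and minimalPackage builds a stand-in ZIP rather than a valid Word document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical demo: reads the cached Words statistic straight from a DOCX-style package. */
final class StoredWordCountDemo {
    private static final String APP_XML = "docProps/app.xml";
    // Simplification: a regex instead of an XML parser; tolerates an optional namespace prefix.
    private static final Pattern WORDS =
            Pattern.compile("<(?:\\w+:)?Words>(\\d+)</(?:\\w+:)?Words>");

    /** Returns the stored word count, or -1 if the package carries no such value. */
    static int storedWordCount(byte[] packageBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(packageBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (!APP_XML.equals(e.getName())) continue;
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zip.read(chunk)) != -1) buf.write(chunk, 0, n);
                Matcher m = WORDS.matcher(new String(buf.toByteArray(), StandardCharsets.UTF_8));
                return m.find() ? Integer.parseInt(m.group(1)) : -1;
            }
            return -1;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Builds a tiny ZIP holding only docProps/app.xml; a stand-in for a real .docx. */
    static byte[] minimalPackage(int words) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<Properties xmlns=\"http://schemas.openxmlformats.org/officeDocument/2006/extended-properties\">"
                + "<Words>" + words + "</Words></Properties>";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(APP_XML));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Prints whatever number was stored, regardless of the document's actual text.
        System.out.println(storedWordCount(minimalPackage(57)));
    }
}
```

Because the value is only as fresh as the last save, it can drift from what Word shows on screen, which would be consistent with the 57 versus 71 mismatch in the question.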

Tags: java apache apache-poi docx doc


【Solution 1】:

I also found that the built-in word counters give odd counts, but text extraction seems more reliable, so I used this solution:

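// Needs: java.io.*, java.util.Arrays, java.util.Locale,
// org.apache.poi.hwpf.extractor.WordExtractor,
// org.apache.poi.xwpf.extractor.XWPFWordExtractor, org.apache.poi.xwpf.usermodel.XWPFDocument
public long getWordCount(File file) throws IOException {
    // Compare extensions case-insensitively so "REPORT.DOCX" is accepted too.
    String name = file.getName().toLowerCase(Locale.ROOT);
    String text;
    // try-with-resources closes the stream and the extractor even if extraction fails.
    try (InputStream in = new FileInputStream(file)) {
        if (name.endsWith(".docx")) {
            try (XWPFWordExtractor extractor = new XWPFWordExtractor(new XWPFDocument(in))) {
                text = extractor.getText();
            }
        }
        else if (name.endsWith(".doc")) {
            try (WordExtractor extractor = new WordExtractor(in)) {
                text = extractor.getText();
            }
        }
        else {
            throw new IllegalArgumentException("Not a MS Word file.");
        }
    }

    // Split on whitespace and count only tokens that contain at least one letter or digit.
    return Arrays.stream(text.split("\\s+"))
     .filter(s -> s.matches("^.*[\\p{L}\\p{N}].*$"))
     .count();
}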

You can tweak the regex at the bottom if needed, but overall it has proven quite robust.
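
To see what that regex does and does not count, the split-and-filter step can be tried on plain strings without POI. This is a small sketch; WordTokenCounter and countWords are illustrative names, and the sample sentences are made up.

```java
import java.util.Arrays;

/** Illustrative only: the answer's split-and-filter step applied to a plain string. */
final class WordTokenCounter {
    static long countWords(String text) {
        return Arrays.stream(text.split("\\s+"))
                // Keep tokens with at least one Unicode letter or digit;
                // this drops stand-alone punctuation and the empty token
                // produced by leading whitespace.
                .filter(token -> token.matches("^.*[\\p{L}\\p{N}].*$"))
                .count();
    }

    public static void main(String[] args) {
        System.out.println(countWords("Hello, world - it's 2014!")); // prints 4
        System.out.println(countWords("   "));                       // prints 0
        System.out.println(countWords("foo\u00A0bar"));              // prints 1
    }
}
```

The last line shows one place where tweaking may be needed: Java's `\s` does not match a non-breaking space (U+00A0) by default, so two words joined by one are counted as a single token.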
