Lucene: TF-IDF of each token

【Question title】: Lucene: TF-IDF of each token
【Posted】: 2021-07-12 17:34:41
【Question】:

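I'm trying to learn Lucene 8, and this is my first time using Lucene.

I want the TF-IDF of each term so that I can get the top 10,000 tokens in my Lucene directory. I've tried many approaches, but I'm stuck and don't know how to proceed. Here is an example of what I did: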

private static void getTokensForField(IndexReader reader, String fieldName) throws IOException {

        List<LeafReaderContext> list = reader.leaves();
        Similarity similarity = new ClassicSimilarity();

        int docnum = reader.numDocs();

        for (LeafReaderContext lrc : list) {
            Terms terms = lrc.reader().terms(fieldName);
            if (terms != null) {
                TermsEnum termsEnum = terms.iterator();

                BytesRef term;
                while ((term = termsEnum.next()) != null) {
                    // Cast to double: long / long and int / int would truncate the result.
                    // terms.size() is the number of unique terms in this segment (-1 if unknown).
                    double tf = (double) termsEnum.totalTermFreq() / terms.size();
                    // docFreq() is per segment here, while docnum covers the whole index.
                    double idf = Math.log((double) docnum / termsEnum.docFreq());
                    // System.out.println(term.utf8ToString() + "\tTF: " + tf + "\tIDF: " + idf);
                }
            }
        }
    }

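I have been researching this topic, but the resources I found were not very useful.

I also searched the internet, but everything I found is deprecated.

Do you have any suggestions?

【Comments】:

Tags: java lucene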

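【Answer 1】:

The simplest way I know of to access statistics such as TF and IDF is to use the Explanation class.

But, just to clarify (and apologies if I am telling you what you already know): a term-frequency value is for a term within a document, so the same term can yield different values across different documents.

I'm not quite sure what that means for your wish to "get the top 10,000 tokens in my Lucene directory". Perhaps it means you need to compute the TF of each term in each document, and then pick the "best" value for that term, based on your needs?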
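One possible reading of that suggestion, sketched in plain Java without any Lucene calls: score every (term, document) pair, keep the highest score per term, then sort. The class `TopTermsSketch`, its methods, and the weighting (tf = sqrt(freq), idf = 1 + ln((N + 1) / (n + 1)), a common TF-IDF variant) are assumptions made for illustration, not something taken from the question or from a particular Lucene API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Lucene): ranks terms by their best per-document TF-IDF.
class TopTermsSketch {

    // Assumed weighting for this sketch: tf = sqrt(freq), idf = 1 + ln((N + 1) / (n + 1)).
    static double tfIdf(int freq, int docFreq, int numDocs) {
        double tf = Math.sqrt(freq);
        double idf = 1.0 + Math.log((numDocs + 1.0) / (docFreq + 1.0));
        return tf * idf;
    }

    // docs holds one token list per document; returns up to topN terms, best score first.
    static List<Map.Entry<String, Double>> topTerms(List<List<String>> docs, int topN) {
        Map<String, Integer> docFreq = new HashMap<>();
        List<Map<String, Integer>> freqsPerDoc = new ArrayList<>();
        for (List<String> doc : docs) {
            // Count how often each term occurs in this document.
            Map<String, Integer> freqs = new HashMap<>();
            for (String token : doc) {
                freqs.merge(token, 1, Integer::sum);
            }
            freqsPerDoc.add(freqs);
            // Each distinct term adds one to its document frequency.
            for (String term : freqs.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }

        // Keep only the highest per-document score seen for each term.
        Map<String, Double> best = new HashMap<>();
        for (Map<String, Integer> freqs : freqsPerDoc) {
            for (Map.Entry<String, Integer> e : freqs.entrySet()) {
                double score = tfIdf(e.getValue(), docFreq.get(e.getKey()), docs.size());
                best.merge(e.getKey(), score, Double::max);
            }
        }

        // Sort by score descending, breaking ties alphabetically.
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(best.entrySet());
        ranked.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(ranked.subList(0, Math.min(topN, ranked.size())));
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("war", "peace", "war"),
                List.of("peace"),
                List.of("peace", "love"));
        for (Map.Entry<String, Double> e : topTerms(docs, 2)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Taking the maximum is only one choice of "best"; summing or averaging the per-document scores would be a one-line change in the `merge` call. With a real index, the per-document frequencies would have to come from Lucene rather than from in-memory token lists.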

Here is a simple example of building an Explanation:

private static void getExplanation(IndexSearcher searcher, Query query, int docID) throws IOException {
    Explanation explanation = searcher.explain(query, docID);
    //explanation.getDescription(); // do what you need with this data
    //explanation.getDetails();     // do what you need with this data
}

So, you can call this method while iterating over the hits for a query:

private static void printHits(Query query) throws IOException {
    // try-with-resources closes the reader once all hits have been explained
    try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)))) {
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs results = searcher.search(query, 100); // or whatever you need instead of 100
        ScoreDoc[] hits = results.scoreDocs;
        for (ScoreDoc hit : hits) {
            getExplanation(searcher, query, hit.doc);
        }
    }
}

The information provided by explanation.getDetails() is essentially the same as what you see when you analyze a query in Luke.

As text:

0.14566182 weight(body:war in 3) [BM25Similarity], result of:
  0.14566182 score(freq=1.0), computed as boost * idf * tf from:
    0.2876821 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
      4 n, number of documents containing term
      5 N, total number of documents with field
    0.50632906 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
      1.0 freq, occurrences of term within document
      1.2 k1, term saturation parameter
      0.75 b, length normalization parameter
      3.0 dl, length of field
      4.0 avgdl, average length of field
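
To make the arithmetic in this output easier to follow, here is a small plain-Java sketch that recomputes idf, tf and the final score from the formulas and inputs printed above. The class `Bm25Check` and its method names are hypothetical, and the boost of 1.0 is an assumption: the output does not list a boost value, but 0.2876821 * 0.50632906 matches the printed score.

```java
// Hypothetical helper that recomputes the numbers shown in the explanation output.
class Bm25Check {

    // idf = log(1 + (N - n + 0.5) / (n + 0.5))
    static double idf(long n, long docCount) {
        return Math.log(1.0 + (docCount - n + 0.5) / (n + 0.5));
    }

    // tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    static double tf(double freq, double k1, double b, double dl, double avgdl) {
        return freq / (freq + k1 * (1.0 - b + b * dl / avgdl));
    }

    // score = boost * idf * tf
    static double score(double boost, double idf, double tf) {
        return boost * idf * tf;
    }

    public static void main(String[] args) {
        double idf = idf(4, 5);                   // n = 4, N = 5
        double tf = tf(1.0, 1.2, 0.75, 3.0, 4.0); // freq, k1, b, dl, avgdl
        double score = score(1.0, idf, tf);       // boost assumed to be 1.0
        System.out.println("idf=" + idf + " tf=" + tf + " score=" + score);
    }
}
```

The sketch uses double while Lucene reports float values, so the results should agree with the explanation only to about six decimal places.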

【Comments】: