How can I improve Lucene.net indexing speed? (Answers)

【Question title】: How can I improve Lucene.net indexing speed?
【Posted】: 2016-07-30 06:06:01
【Question description】:

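I am using Lucene.net to index my PDF files. Indexing 15,000 PDFs takes about 40 minutes, and the indexing time grows as the number of PDF files in my folder increases.

How can I improve indexing speed in Lucene.net?
Is there any other indexing service with fast indexing performance?

I am using the latest version of Lucene.net (Lucene.net 3.0.3).

Here is my indexing code.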

public void refreshIndexes() 
        {
            // Create Index Writer
            string strIndexDir = @"E:\LuceneTest\index";
            IndexWriter writer = new IndexWriter(Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir)), new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29), true, IndexWriter.MaxFieldLength.UNLIMITED);

            // Find all files in root folder create index on them
            List<string> lstFiles = searchFiles(@"E:\LuceneTest\PDFs");
            foreach (string strFile in lstFiles)
            {
                Document doc = new Document();
                string FileName = System.IO.Path.GetFileNameWithoutExtension(strFile);
                string Text = ExtractTextFromPdf(strFile);
                string Path = strFile;
                string ModifiedDate = Convert.ToString(File.GetLastWriteTime(strFile));
                string DocumentType = string.Empty;
                string Vault = string.Empty;

                string headerText = Text.Substring(0, Text.Length < 150 ? Text.Length : 150);
                foreach (var docs in ltDocumentTypes)
                {
                    if (headerText.ToUpper().Contains(docs.searchText.ToUpper()))
                    {
                        DocumentType = docs.DocumentType;
                        Vault = docs.VaultName;
                    }
                }

                if (string.IsNullOrEmpty(DocumentType))
                {
                    DocumentType = "Default";
                    Vault = "Default";
                }

                doc.Add(new Field("filename", FileName, Field.Store.YES, Field.Index.ANALYZED));
                doc.Add(new Field("text", Text, Field.Store.YES, Field.Index.ANALYZED));
                doc.Add(new Field("path", Path, Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("modifieddate", ModifiedDate, Field.Store.YES, Field.Index.ANALYZED));
                doc.Add(new Field("documenttype", DocumentType, Field.Store.YES, Field.Index.ANALYZED));
                doc.Add(new Field("vault", Vault, Field.Store.YES, Field.Index.ANALYZED));

                writer.AddDocument(doc);
            }
            writer.Optimize();
            writer.Dispose();
        }

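【Comments】:

Do you really need to call writer.Optimize()? Isn't writer.Commit() enough?
Thanks for the reply, @SimonSvensson. Optimize() is not required. I tried Commit() instead, and performance did not improve.
@Munavvar, before proposing any changes, have you tried adding some benchmarks around the relevant methods? I would be particularly interested in the searchFiles and ExtractTextFromPdf methods. I believe the problem may lie in the latter, since your code looks fine (except for the date, which should not be analyzed). Also, how large are your PDF files? You could limit indexing and analysis to a relevant number of characters.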

Tags: c# performance lucene lucene.net full-text-indexing

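【Solution 1】:

The indexing part looks fine. Note that IndexWriter is thread-safe, so if you are on a multi-core machine, using Parallel.ForEach (with MaxDegreeOfParallelism set to the number of cores; experiment with this value) will probably help.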
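
As a rough illustration of the parallel approach, the sketch below wraps Parallel.ForEach with a capped degree of parallelism. The names `ParallelIndexing`, `IndexAll`, `indexOne`, and `maxWorkers` are hypothetical and not part of the question's code; `indexOne` stands in for the per-file work (extract the text, build the Document, call `writer.AddDocument(doc)` on one shared IndexWriter).

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelIndexing
{
    // Runs indexOne once per file on at most maxWorkers threads.
    // indexOne is a hypothetical callback; in the question's code it would
    // extract the PDF text, build the Document, and add it to a single
    // shared IndexWriter.
    public static void IndexAll(IEnumerable<string> files, Action<string> indexOne, int maxWorkers)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxWorkers };
        Parallel.ForEach(files, options, indexOne);
    }
}
```

A call might look like `ParallelIndexing.IndexAll(lstFiles, strFile => { /* build doc, writer.AddDocument(doc) */ }, Environment.ProcessorCount);`. This assumes that everything done inside the callback, in particular ExtractTextFromPdf, is safe to run concurrently, which the question does not show.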

But the document type detection part is driving your GC crazy. All those ToUpper() calls are painful.

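Outside the lstFiles loop, create an upper-cased copy of each ltDocumentTypes searchText, keeping the document type and vault name alongside it:
```
var upperDocTypes = ltDocumentTypes
    .Select(x => new { SearchText = x.searchText.ToUpper(), x.DocumentType, x.VaultName })
    .ToList();
```
Inside the file loop, but outside the document type loop, create another string:
```
string headerTextUpper = headerText.ToUpper();
```

When it finds a match, `break`. This exits the loop as soon as you find a match and prevents all subsequent iterations. Of course, this means the first match wins, whereas in your code the last match wins (if that makes a difference to you).

string headerText = Text.Substring(0, Text.Length < 150 ? Text.Length : 150);
string headerTextUpper = headerText.ToUpper(); // upper-case the header once per file
foreach (var docType in upperDocTypes)
{
    if (headerTextUpper.Contains(docType.SearchText))
    {
        DocumentType = docType.DocumentType;
        Vault = docType.VaultName;
        break; // first match wins
    }
}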

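Depending on the size of ltDocumentTypes, this may not give you much improvement.

I would bet that ExtractTextFromPdf is the most expensive part. Running this through a profiler, or instrumenting it with some Stopwatches, will show you where the cost is.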
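
One way to do the Stopwatch instrumentation is sketched below, under the assumption that the loop stays single-threaded. `StageTimer`, `Measure`, and `Elapsed` are hypothetical names introduced for illustration; the class keeps one System.Diagnostics.Stopwatch per named stage and accumulates time across calls.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Accumulates wall-clock time per named stage. Not thread-safe:
// intended for a plain sequential loop like the one in the question.
public sealed class StageTimer
{
    private readonly Dictionary<string, Stopwatch> _watches = new Dictionary<string, Stopwatch>();

    // Runs work, adds its elapsed time to the given stage, and returns its result.
    public T Measure<T>(string stage, Func<T> work)
    {
        Stopwatch watch;
        if (!_watches.TryGetValue(stage, out watch))
        {
            watch = new Stopwatch();
            _watches[stage] = watch;
        }
        watch.Start();
        try { return work(); }
        finally { watch.Stop(); }
    }

    // Total time recorded for a stage, or zero if the stage was never measured.
    public TimeSpan Elapsed(string stage)
    {
        Stopwatch watch;
        return _watches.TryGetValue(stage, out watch) ? watch.Elapsed : TimeSpan.Zero;
    }
}
```

Inside the loop this could be used as `string Text = timer.Measure("extract", () => ExtractTextFromPdf(strFile));` and `timer.Measure("index", () => { writer.AddDocument(doc); return true; });`. Comparing `timer.Elapsed("extract")` with `timer.Elapsed("index")` after the loop should indicate which stage dominates.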

【Comments】: