Search performance of an Elasticsearch index with large documents (PDFs): answers

【Question title】: Search performance of an Elasticsearch index with large documents (PDFs)
【Posted】: 2016-02-25 15:02:41
【Question】:

I am new to Elasticsearch and would appreciate some help and tuning tips on how to improve the performance of my index.

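I currently have about 4,500 documents in the index, taking up about 34 GB on disk, consisting of PDFs and some metadata. The PDFs are indexed with the Mapper Attachment plugin. Each file is between 10 MB and 150 MB, with a few as large as 250 MB.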

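My problem is that search operations take a long time, sometimes up to several seconds. I filter on 0 to 7 fields, sort on 2, and run a query_string query against the document text (base64 encoded), the title, and some other meta fields. I also paginate all results (at most 450 pages of 10 documents each) and highlight the parts that matched. I suspect this is part of my problem, but I cannot really do without it.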

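The server has 8 GB of RAM, and ES_HEAP_SIZE for Elasticsearch is set to 2 GB. I guess this is another part of my problem and that the bottleneck is here, right? I do not know how far I can increase it, because the same machine also runs the web server. The server can of course be upgraded.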

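I have not changed any shard-related settings from their defaults. The server is currently hosted in Azure, but right now I do not know whether it has SSDs or spinning disks.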

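I am not surprised by the slow searches, but I would like to understand the reason.

What can I do to improve performance?

As requested, here is a sample query.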

{
"index": "publications",
"body": {
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        [{
                            "range": {
                                "pubdate": {
                                    "gte": "2010-01-01T00:00:00",
                                    "lte": "2016-01-01T00:00:00"
                                }
                            }
                        }, {
                            "term": {
                                "author": "daniel"
                            }
                        }, {
                            "term": {
                                "title": "rock"
                            }
                        }]
                    ]
                }
            },
            "query": {
                "query_string": {
                    "fields": [
                        ["title", "author", "files.data", "articleId"]
                    ],
                    "query": "Hard"
                }
            }
        }
    },
    "highlight": {
        "fields": {
            "title": {},
            "author": {},
            "articleId": {},
            "exact_articleId": {},
            "files.data": {}
        }
    },
    "sort": {
        "date": {
            "order": "desc"
        },
        "_score": {
            "order": "desc"
        }
    },
    "size": 10,
    "from": 0,
    "fields": ["id", "title", "pubdate", "orderable", "articleId", "author", "languages", "types", "exact_title", "files.file", "files.name", "bibas_date"],
    "_source": ["files.file", "files.name"]
}

}
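
The `from` and `size` values in the query above drive the pagination described in the question. As a rough sketch of the arithmetic only: with the default query-then-fetch search, each shard has to produce its own top `from + size` hits before the coordinating node merges them, so later pages cost more than early ones. The helper names `page_window` and `merge_cost` are hypothetical, and the default of 5 shards is taken from the `number_of_shards` setting in the mapping.

```python
def page_window(page, size=10):
    """Translate a 1-based page number into Elasticsearch from/size values."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {"from": (page - 1) * size, "size": size}


def merge_cost(page, size=10, shards=5):
    """Upper bound on the hits the coordinating node has to merge and sort.

    Each shard returns up to from + size sorted entries for the requested page.
    """
    window = page_window(page, size)
    per_shard = window["from"] + window["size"]
    return per_shard * shards


if __name__ == "__main__":
    for page in (1, 45, 450):
        print(page, page_window(page), merge_cost(page))
```

For the last page mentioned in the question (page 450 of 10 hits), this gives `from = 4490` and up to 22,500 entries to merge across 5 shards. The real number is capped by how many documents actually match, so treat it as an upper bound rather than a measurement.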

Here is my mapping.

{
"publications": {
    "aliases": {},
    "mappings": {
        "publication": {
            "properties": {
                "articleId": {
                    "type": "string",
                    "store": true,
                    "analyzer": "analyzer_standard_whitespace"
                },
                "author": {
                    "type": "string",
                    "store": true,
                    "analyzer": "analyzer_standard",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                },
                "date": {
                    "type": "date",
                    "store": true,
                    "format": "dateOptionalTime"
                },
                "created": {
                    "type": "date",
                    "store": true,
                    "format": "dateOptionalTime"
                },
                "exact_articleId": {
                    "type": "string",
                    "store": true,
                    "analyzer": "analyzer_keyword"
                },
                "exact_title": {
                    "type": "string",
                    "store": true,
                    "analyzer": "analyzer_keyword"
                },
                "files": {
                    "properties": {
                        "data": {
                            "type": "attachment",
                            "path": "full",
                            "fields": {
                                "data": {
                                    "type": "string",
                                    "store": true,
                                    "term_vector": "with_positions_offsets",
                                    "analyzer": "analyzer_standard_whitespace"
                                },
                                "author": {
                                    "type": "string"
                                },
                                "title": {
                                    "type": "string"
                                },
                                "name": {
                                    "type": "string"
                                },
                                "date": {
                                    "type": "date",
                                    "format": "dateOptionalTime"
                                },
                                "keywords": {
                                    "type": "string"
                                },
                                "content_type": {
                                    "type": "string"
                                },
                                "content_length": {
                                    "type": "integer"
                                },
                                "language": {
                                    "type": "string"
                                }
                            }
                        },
                        "description": {
                            "type": "string",
                            "store": true
                        },
                        "file": {
                            "type": "string",
                            "store": true,
                            "analyzer": "analyzer_keyword"
                        },
                        "name": {
                            "type": "string",
                            "store": true,
                            "analyzer": "analyzer_keyword"
                        }
                    }
                },
                "id": {
                    "type": "integer",
                    "store": true
                },
                "keywords": {
                    "type": "string",
                    "store": true,
                    "analyzer": "analyzer_keyword"
                },
                "languages": {
                    "type": "string",
                    "store": true,
                    "analyzer": "analyzer_keyword"
                },
                "orderable": {
                    "type": "boolean",
                    "store": true
                },
                "pubdate": {
                    "type": "date",
                    "store": true,
                    "format": "dateOptionalTime"
                },
                "title": {
                    "type": "string",
                    "store": true,
                    "analyzer": "analyzer_standard",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                },
                "types": {
                    "type": "string",
                    "store": true,
                    "analyzer": "analyzer_keyword"
                },
                "updated": {
                    "type": "date",
                    "store": true,
                    "format": "dateOptionalTime"
                }
            }
        }
    },
    "settings": {
        "index": {
            "creation_date": "1451478916341",
            "analysis": {
                "analyzer": {
                    "analyzer_standard_whitespace": {
                        "filter": "lowercase",
                        "tokenizer": "whitespace"
                    },
                    "analyzer_standard": {
                        "filter": "lowercase",
                        "tokenizer": "standard"
                    },
                    "analyzer_keyword": {
                        "filter": "lowercase",
                        "tokenizer": "keyword"
                    }
                }
            },
            "number_of_shards": "5",
            "uuid": "PjlrtJUUT1CrGduRpxPETw",
            "version": {
                "created": "1070399"
            },
            "number_of_replicas": "1"
        }
    },
    "warmers": {}
}

}

【Comments】:

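I would split all PDFs into single pages (easily done with PDFBox and similar tools) and store each page as a separate ES document. That would definitely be much faster.
@vlad.golubev There are barely enough details here. What is the query, what tests were performed, is performance monitored in any way, what do the logs show, where is the bottleneck, what is the mapping, what are the settings on all nodes, and so on. This is a very open-ended post.
I agree with you, @AndreiStefan. Still, Val's comment is valid, and removing the bottleneck in this kind of use case matters. I am nevertheless voting to close this as too broad to answer. (P.S. It cannot be closed because of the bounty.) :/
I would also increase ES_HEAP_SIZE to 4 GB: since you have 8 GB, you can safely give half of it to ES. But definitely split your PDFs and index one page per document. I once had a similar use case and this worked out very well. As @AndreiStefan said, though, we are still missing a lot of information to help you properly.
What details should I add to the question? I can certainly post the query, but I cannot find any logs with performance information, and I have not run any tests. I am a bit unsure about increasing ES_HEAP_SIZE because the website runs on the same machine. Maybe I just need to upgrade it.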
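
The page-per-document suggestion in the comments could look roughly like the sketch below. It assumes the page text has already been extracted (for example with PDFBox, as the comment suggests) and only shows how one publication might be turned into one small document per page in the newline-delimited format that the `_bulk` API expects. The index name `publication_pages`, the type name `page`, and the field names `publication_id`, `page` and `text` are hypothetical and not part of the original mapping.

```python
import json


def page_documents(publication_id, metadata, page_texts):
    """Yield (doc_id, body) pairs, one per page of a single publication."""
    for page_number, text in enumerate(page_texts, start=1):
        body = dict(metadata)  # copy so every page carries the shared metadata
        body["publication_id"] = publication_id
        body["page"] = page_number
        body["text"] = text
        yield f"{publication_id}-{page_number}", body


def bulk_lines(index, doc_type, docs):
    """Render documents as newline-delimited JSON for the _bulk API."""
    lines = []
    for doc_id, body in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"  # a bulk body must end with a newline


if __name__ == "__main__":
    pages = ["text of page one", "text of page two"]  # placeholder page text
    docs = page_documents(42, {"title": "Hard Rock"}, pages)
    print(bulk_lines("publication_pages", "page", docs), end="")
```

With this layout a hit points at a single page, so highlighting only has to process that page's text rather than a whole 150 MB PDF. Collapsing several page hits back into one publication would have to happen in the application or through an aggregation on `publication_id`, which this sketch does not cover.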

Tags: elasticsearch

【Solution 1】:

Try excluding the large field (data) from _source. I think it will definitely solve your problem.

【Comments】:

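I do not include data anywhere except in the query_string query, where it is needed to search the contents of the files.
According to your mapping, the data field is kept in the source. To exclude it, add something like this to the mapping: "_source": {"excludes": ["files.data.data"]}.
I do not think you understood my question. Removing the data I want to search is not a solution.
@Umo You can still search this data; you are not removing it from the index. You are only storing it differently. Give it a try.
I have read up on this and now understand your suggestion. This could work, thank you. I cannot test it right now, though. I have to wait a few months until my current project is finished.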
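
As a minimal sketch of the change discussed above, the snippet below adds a `_source.excludes` entry to a type mapping and prints the resulting JSON. The helper name `with_source_excludes` is hypothetical, the sample mapping is a cut-down stand-in for the `publication` type in the question, and the path `files.data.data` is copied from the comment as written. Whether that is exactly the right path for the attachment field would need to be checked against the real index.

```python
import json


def with_source_excludes(mapping, type_name, excluded_paths):
    """Return a copy of `mapping` whose `_source` excludes the given paths."""
    updated = json.loads(json.dumps(mapping))  # deep copy via a JSON round trip
    type_mapping = updated.setdefault(type_name, {})
    source_cfg = type_mapping.setdefault("_source", {})
    excludes = source_cfg.setdefault("excludes", [])
    for path in excluded_paths:
        if path not in excludes:  # keep the list free of duplicates
            excludes.append(path)
    return updated


if __name__ == "__main__":
    # Cut-down stand-in for the "publication" mapping from the question.
    mapping = {
        "publication": {
            "properties": {
                "title": {"type": "string", "store": True},
            }
        }
    }
    new_mapping = with_source_excludes(mapping, "publication", ["files.data.data"])
    print(json.dumps(new_mapping, indent=2, sort_keys=True))
```

An excluded field is still analyzed and indexed, so it stays searchable; it just no longer travels with `_source`. Documents indexed before the change would presumably keep their old `_source`, so a reindex is likely needed before any effect shows up.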