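【Posted】:2018-07-26 07:54:35
【Question】:
I am using Elasticsearch on a huge dataset of all Wikipedia article names; there are about 5 million of them. The field name in the index is articlenames. The index is created as follows: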
curl -XPUT "http://localhost:9200/index_wiki_articlenames/" -d'
{
"settings":{
"analysis":{
"filter":{
"nGram_filter":{
"type":"edgeNGram",
"min_gram":1,
"max_gram":20,
"token_chars":[
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"tokenizer":{
"edge_ngram_tokenizer":{
"type":"edgeNGram",
"min_gram":"1",
"max_gram":"20",
"token_chars":[
"letter",
"digit"
]
}
},
"analyzer":{
"nGram_analyzer":{
"type":"custom",
"tokenizer":"edge_ngram_tokenizer",
"filter":[
"lowercase",
"asciifolding"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings":{
"name":{
"properties":{
"articlenames":{
"type":"text",
"analyzer":"nGram_analyzer"
}
}
}
}
}'
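To make the index settings above easier to reason about, here is a rough pure-Python sketch of the tokens that nGram_analyzer would presumably emit. It is only an approximation: the function name edge_ngrams is hypothetical, the regex stands in for token_chars of letter and digit, and the asciifolding filter is deliberately left out.

```python
import re


def edge_ngrams(text, min_gram=1, max_gram=20):
    """Approximate an edge n-gram tokenizer (token_chars: letter, digit)
    followed by a lowercase filter. asciifolding is NOT modelled here."""
    tokens = []
    # [^\W_]+ keeps runs of Unicode letters and digits; everything else
    # (spaces, hyphens, punctuation) acts as a token boundary.
    for word in re.findall(r"[^\W_]+", text):
        word = word.lower()
        # Emit every prefix of the word from min_gram up to max_gram chars.
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens


if __name__ == "__main__":
    print(edge_ngrams("sachin t"))
    print(edge_ngrams("Bronisław-Komorowski"))
```

Since the mapping defines no separate search_analyzer, the query text goes through the same analyzer as the indexed text, so a query such as "sachin t" is also broken into short prefixes like "s" and "t".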
I also consulted the following links to try to solve my problem, but to no avail:
Edge NGram with phrase matching
https://hackernoon.com/elasticsearch-building-autocomplete-functionality-494fcf81a7cf
My goal is to get results like the following for the input query "sachin t":
sachin tendulkar
sachin tendulkar centuries
sachin tejas
sachin top 60 quotes
sachin talwalkar
sachin tawade
sachin taps
For the query "sachin te":
sachin tendulkar
sachin tendulkar centuries
sachin tejas
For the query "sachin ta":
sachin talwalkar
sachin tawade
sachin taps
For the query "sachin ten":
sachin tendulkar
sachin tendulkar centuries
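The matching behaviour described by these examples can be written down as a small reference function. This is only a sketch of the desired semantics in plain Python, not of how Elasticsearch evaluates anything; phrase_prefix_match is a hypothetical name, and it assumes the query must match from the start of the title, which holds for all the examples listed.

```python
def phrase_prefix_match(query, titles):
    """Return titles whose leading words equal the query words,
    with the last query word treated as a prefix."""
    q = query.lower().split()
    if not q:
        return []
    matches = []
    for title in titles:
        words = title.lower().split()
        if len(words) < len(q):
            continue
        # All query words except the last must match exactly ...
        exact_ok = words[: len(q) - 1] == q[:-1]
        # ... and the last query word only needs to be a prefix.
        prefix_ok = words[len(q) - 1].startswith(q[-1])
        if exact_ok and prefix_ok:
            matches.append(title)
    return matches


TITLES = [
    "sachin tendulkar",
    "sachin tendulkar centuries",
    "sachin tejas",
    "sachin top 60 quotes",
    "sachin talwalkar",
    "sachin tawade",
    "sachin taps",
]

if __name__ == "__main__":
    for query in ("sachin t", "sachin te", "sachin ta", "sachin ten"):
        print(query, "->", phrase_prefix_match(query, TITLES))
```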
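Keep in mind that the dataset is huge, and some article names may contain special characters, e.g. "Bronisław-Komorowski".
I am able to get output for smaller datasets of up to 100,000 records, but as soon as my dataset changes to 0.5 to 5 million records I am unable to get output.
My query is: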
http://127.0.0.1:9200/index_wiki_articlenames/_search?&q=articlenames:sachin-t+articlenames:sachin-t.*&filter_path=hits.hits._source.articlenames&size=50
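For reference, the URI search above can be unpacked into a request body. In a URI search the q parameter maps to a query_string query, so the sketch below, using only the Python standard library and sending nothing over the network, builds what should be the equivalent body. The variable names are illustrative, and filter_path is kept separate because it is a URL parameter for response filtering rather than part of the body.

```python
import json
from urllib.parse import parse_qs, urlsplit

URL = (
    "http://127.0.0.1:9200/index_wiki_articlenames/_search"
    "?&q=articlenames:sachin-t+articlenames:sachin-t.*"
    "&filter_path=hits.hits._source.articlenames&size=50"
)

# Decode the query string; "+" becomes a space and the stray leading "&" is ignored.
params = parse_qs(urlsplit(URL).query)

# The q parameter of a URI search corresponds to a query_string query.
body = {
    "size": int(params["size"][0]),
    "query": {"query_string": {"query": params["q"][0]}},
}

# filter_path stays on the URL; it only trims the response.
filter_path = params["filter_path"][0]

if __name__ == "__main__":
    print(json.dumps(body, indent=2))
    print("filter_path =", filter_path)
```

Seen this way, the query consists of two clauses on articlenames, one for the literal sachin-t and one for sachin-t.*, each of which is passed through the field's analyzer.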
【Comments】:
-
What is your query?
-
I just added it at the end of the post.
-
Regarding the query, it is better to use the Query DSL (elastic.co/guide/en/elasticsearch/reference/current/…) rather than the URI search API.
-
What do you mean by "as soon as my dataset changes to 0.5 to 5 million records I am unable to get output"? What happens then? Does ES die? Does it take too long to return?
Tags: elasticsearch search full-text-search n-gram incremental-search