ElasticSearch - Combine filters & Composite Query to get unique fields combinations

[Question title]: ElasticSearch - Combine filters & Composite Query to get unique fields combinations
[Posted]: 2021-07-20 12:44:01
[Question]:

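Well.. I am quite a "newbie" with ES, so when it comes to aggregations... there is no word in the dictionary to describe my level :p

Today I am facing an issue: I am trying to create a query that does something similar to a SQL DISTINCT, but applied on top of filters. I have this document (an abstraction of the real situation, of course):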

{
  "id": "1",
  "createdAt": 1626783747,
  "updatedAt": 1626783747,
  "isAvailable": true,
  "kind": "document",
  "classification": {
    "id": 1,
    "name": "a_name_for_id_1"
  },
  "structure": {
    "material": "cartoon",
    "thickness": 5
  },
  "shared": true,
  "objective": "stackoverflow"
}

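Since all the data in the above document can vary, some values may be redundant across documents, such as classification.id, kind, structure.material.

So, to meet my requirement, I would like to "group by" these 3 fields in order to get each unique combination of them. Going a bit deeper, with the following data I should get the following possibilities: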

[{
        "id": "1",
        "createdAt": 1626783747,
        "updatedAt": 1626783747,
        "isAvailable": true,
        "kind": "document",
        "classification": {
            "id": 1,
            "name": "a_name_for_id_1"
        },
        "structure": {
            "material": "cartoon",
            "thickness": 5
        },
        "shared": true,
        "objective": "stackoverflow"
    },
    {
        "id": "2",
        "createdAt": 1626783747,
        "updatedAt": 1626783747,
        "isAvailable": true,
        "kind": "document",
        "classification": {
            "id": 2,
            "name": "a_name_for_id_2"
        },
        "structure": {
            "material": "iron",
            "thickness": 3
        },
        "shared": true,
        "objective": "linkedin"
    },
    {
        "id": "3",
        "createdAt": 1626783747,
        "updatedAt": 1626783747,
        "isAvailable": false,
        "kind": "document",
        "classification": {
            "id": 2,
            "name": "a_name_for_id_2"
        },
        "structure": {
            "material": "paper",
            "thickness": 1
        },
        "shared": false,
        "objective": "tiktok"
    },
    {
        "id": "4",
        "createdAt": 1626783747,
        "updatedAt": 1626783747,
        "isAvailable": true,
        "kind": "document",
        "classification": {
            "id": 3,
            "name": "a_name_for_id_3"
        },
        "structure": {
            "material": "cartoon",
            "thickness": 5
        },
        "shared": false,
        "objective": "snapchat"
    },
    {
        "id": "5",
        "createdAt": 1626783747,
        "updatedAt": 1626783747,
        "isAvailable": true,
        "kind": "document",
        "classification": {
            "id": 3,
            "name": "a_name_for_id_3"
        },
        "structure": {
            "material": "paper",
            "thickness": 1
        },
        "shared": true,
        "objective": "twitter"
    },
    {
        "id": "6",
        "createdAt": 1626783747,
        "updatedAt": 1626783747,
        "isAvailable": false,
        "kind": "document",
        "classification": {
            "id": 3,
            "name": "a_name_for_id_3"
        },
        "structure": {
            "material": "iron",
            "thickness": 3
        },
        "shared": true,
        "objective": "facebook"
    }
]

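Based on the above, I should get the following results in the "buckets":

document 1 cartoon
document 2 iron
document 2 paper
document 3 cartoon
document 3 paper
document 3 iron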

Of course, this is for the sake of the example (to keep things simple, there are no duplicates yet).
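To make the expected buckets concrete, here is a minimal Python sketch (plain Python, not an Elasticsearch query) that derives the six combinations from a trimmed copy of the sample data. The names `docs` and `combination` are only illustrative and are not part of any API.

```python
# Trimmed copy of the six sample documents: only the three grouped fields are kept.
docs = [
    {"kind": "document", "classification": {"id": 1}, "structure": {"material": "cartoon"}},
    {"kind": "document", "classification": {"id": 2}, "structure": {"material": "iron"}},
    {"kind": "document", "classification": {"id": 2}, "structure": {"material": "paper"}},
    {"kind": "document", "classification": {"id": 3}, "structure": {"material": "cartoon"}},
    {"kind": "document", "classification": {"id": 3}, "structure": {"material": "paper"}},
    {"kind": "document", "classification": {"id": 3}, "structure": {"material": "iron"}},
]


def combination(doc):
    """Return the (kind, classification.id, structure.material) triple of a document."""
    return (doc["kind"], doc["classification"]["id"], doc["structure"]["material"])


# A set removes duplicates, which is the SQL DISTINCT behaviour being asked for.
unique_combinations = sorted({combination(doc) for doc in docs})

for kind, classification_id, material in unique_combinations:
    print(kind, classification_id, material)
```

With real data containing duplicates, the set would collapse them, which is the part the sample data cannot show.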

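But on top of that, I also need some "pre-filters":

Documents that are available: isAvailable=true
Documents whose structure thickness is between 2 and 4, inclusive: 2 <= structure.thickness <= 4
Documents that are shared: shared=true

Compared to the first set of results, I should only get the following combinations:

document 1 cartoon -> not a valid result, thickness > 4
document 2 iron
document 2 paper -> not a valid result, isAvailable != true
document 3 cartoon -> not a valid result, thickness > 4
document 3 paper -> not a valid result, thickness < 2
document 3 iron -> not a valid result, isAvailable != true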
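The filter-then-group logic can be sketched in plain Python to double-check the expected outcome. This is only an illustration of the requirement, not an Elasticsearch query; `rows` is a compact re-encoding of the six sample documents, and `passes_prefilters` is a hypothetical helper name.

```python
# (classification.id, structure.material, structure.thickness, isAvailable, shared)
rows = [
    (1, "cartoon", 5, True, True),
    (2, "iron", 3, True, True),
    (2, "paper", 1, False, False),
    (3, "cartoon", 5, True, False),
    (3, "paper", 1, True, True),
    (3, "iron", 3, False, True),
]

# Rebuild documents with the same shape as the sample data (kind is always "document").
docs = [
    {
        "kind": "document",
        "classification": {"id": cid},
        "structure": {"material": material, "thickness": thickness},
        "isAvailable": available,
        "shared": shared,
    }
    for cid, material, thickness, available, shared in rows
]


def passes_prefilters(doc):
    """Apply the three pre-filters: available, shared, and 2 <= thickness <= 4."""
    return (
        doc["isAvailable"] is True
        and doc["shared"] is True
        and 2 <= doc["structure"]["thickness"] <= 4
    )


# Filter first, then keep the distinct (kind, classification.id, material) triples.
combinations = sorted(
    {
        (d["kind"], d["classification"]["id"], d["structure"]["material"])
        for d in docs
        if passes_prefilters(d)
    }
)
print(combinations)
```

Only one combination survives all three filters, which matches the list above.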

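If you are still reading, well... thank you! xD

So, as you can see, I need all possible combinations of these fields, following the static pattern kind <> classification_id <> structure_material, that match the filters on isAvailable, thickness, shared.

Regarding the output, the hits do not matter to me, since I do not need the documents but only the combinations kind <> classification_id <> structure_material :)

Thanks for any help :)

Max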

[Comments]:

Tags: elasticsearch elasticsearch-aggregation

[Solution 1]:

You can use a Cardinality aggregation together with your existing filters. Please check this URL and let me know if you have any questions. https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html

[Comments]:

Hi, thanks for your answer. I tried your approach without success.. I got: Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default

[Solution 2]:

Thanks to a colleague, I finally got it working as expected!

The query:

GET index-latest/_search
{
   "size": 0,
   "query": {
      "bool": {
         "filter": [
            {
               "term": {
                  "isAvailable": true
               }
            },
            {
               "range": {
                  "structure.thickness": {
                     "gte": 2,
                     "lte": 4
                  }
               }
            },
            {
               "term": {
                  "shared": true
               }
            }
         ]
      }
   },
   "aggs": {
      "my_agg_example": {
         "composite": {
            "size": 10,
            "sources": [
               {
                  "kind": {
                     "terms": {
                        "field": "kind.keyword",
                        "order": "asc"
                     }
                  }
               },
               {
                  "classification_id": {
                     "terms": {
                        "field": "classification.id",
                        "order": "asc"
                     }
                  }
               },
               {
                  "structure_material": {
                     "terms": {
                        "field": "structure.material.keyword",
                        "order": "asc"
                     }
                  }
               }
            ]
         }
      }
   }
}

The result is then:

{
   "took": 11,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "skipped": 0,
      "failed": 0
   },
   "hits": {
      "total": {
         "value": 1,
         "relation": "eq"
      },
      "max_score": null,
      "hits": []
   },
   "aggregations": {
      "my_agg_example": {
         "after_key": {
            "kind": "document",
            "classification_id": 2,
            "structure_material": "iron"
         },
         "buckets": [
            {
               "key": {
                  "kind": "document",
                  "classification_id": 2,
                  "structure_material": "iron"
               },
               "doc_count": 1
            }
         ]
      }
   }
}
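The after_key in the response is what the composite aggregation uses for pagination: passing it back as "after" in the next request returns the next page of buckets. Below is a minimal Python sketch of that loop. `build_body` and `iter_buckets` are hypothetical helper names, and `search` is an assumed callable that sends the body to the index and returns the parsed JSON response (for example a thin wrapper around whatever client you use); none of these come from the original answer.

```python
def build_body(after_key=None, page_size=10):
    """Build the request body from the answer, optionally resuming after a given key."""
    composite = {
        "size": page_size,
        "sources": [
            {"kind": {"terms": {"field": "kind.keyword", "order": "asc"}}},
            {"classification_id": {"terms": {"field": "classification.id", "order": "asc"}}},
            {"structure_material": {"terms": {"field": "structure.material.keyword", "order": "asc"}}},
        ],
    }
    if after_key is not None:
        # Resume right after the last bucket of the previous page.
        composite["after"] = after_key
    return {
        "size": 0,  # no hits needed, only the aggregation buckets
        "query": {
            "bool": {
                "filter": [
                    {"term": {"isAvailable": True}},
                    {"range": {"structure.thickness": {"gte": 2, "lte": 4}}},
                    {"term": {"shared": True}},
                ]
            }
        },
        "aggs": {"my_agg_example": {"composite": composite}},
    }


def iter_buckets(search, page_size=10):
    """Yield every bucket, following after_key until a page comes back empty.

    `search` is an assumed callable: request body (dict) -> parsed response (dict).
    """
    after_key = None
    while True:
        response = search(build_body(after_key, page_size))
        agg = response["aggregations"]["my_agg_example"]
        yield from agg["buckets"]
        after_key = agg.get("after_key")
        if after_key is None or not agg["buckets"]:
            break
```

With only one matching combination, as here, the loop issues a second request that comes back empty and then stops.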

So, as we can see, we get the following bucket:

{
    "key": {
        "kind": "document",
        "classification_id": 2,
        "structure_material": "iron"
    },
    "doc_count": 1
}

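Note: pay attention to your field types. Putting .keyword on classification.id results in no buckets at all... .keyword should only be used for types such as strings (as far as I understand; correct me if I am wrong)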
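A small Python sketch of that rule, assuming the index uses the default dynamic mapping, where string values are indexed as text with a keyword sub-field and numbers have no such sub-field. `FIELD_TYPES` is a hypothetical, hand-written summary of the mapping, and the helper names are illustrative only.

```python
# Hypothetical summary of the index mapping (assumption: default dynamic mapping).
FIELD_TYPES = {
    "kind": "text",
    "classification.id": "long",
    "structure.material": "text",
}


def aggregatable_field(path, field_types=FIELD_TYPES):
    """Return the field path to aggregate on.

    Text fields are not aggregatable by default, so their keyword sub-field is used;
    other types (numbers, booleans, keyword) are used as they are.
    """
    if field_types[path] == "text":
        return path + ".keyword"
    return path


def terms_source(name, path):
    """Build one entry of the composite aggregation's sources list."""
    return {name: {"terms": {"field": aggregatable_field(path), "order": "asc"}}}


sources = [
    terms_source("kind", "kind"),
    terms_source("classification_id", "classification.id"),
    terms_source("structure_material", "structure.material"),
]
print(sources)
```

If the mapping was defined explicitly (for example with kind already declared as keyword), the table would need to reflect that and no suffix would be added.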

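As expected, we get the following result (compared to the initial question):

document 2 iron

Note: be aware that the order of the elements in aggs.<name>.composite.sources does affect the results that are returned.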
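As far as I understand, this is because composite buckets are sorted by the source values in the order the sources are declared, so with a limited size a different source order can put different buckets on the first page. The plain-Python sketch below imitates that with tuple sorting on the unfiltered combinations from the question; it is an analogy, not Elasticsearch code, and the variable names are illustrative.

```python
# The six unfiltered combinations from the question: (kind, classification_id, material).
combos = [
    ("document", 1, "cartoon"),
    ("document", 2, "iron"),
    ("document", 2, "paper"),
    ("document", 3, "cartoon"),
    ("document", 3, "paper"),
    ("document", 3, "iron"),
]

page_size = 3  # plays the role of composite.size

# Sources declared as kind, classification_id, structure_material.
kind_first = sorted(combos, key=lambda c: (c[0], c[1], c[2]))

# Sources declared as structure_material, classification_id, kind.
material_first = sorted(combos, key=lambda c: (c[2], c[1], c[0]))

first_page_kind_first = kind_first[:page_size]
first_page_material_first = material_first[:page_size]

print(first_page_kind_first)
print(first_page_material_first)
```

Both orderings contain the same six combinations overall; only the order, and therefore the content of each page, changes.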

Thanks!

[Comments]: