Elasticsearch: Filtering Parents by Filtered Child Document Count (Answers)

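【Question title】: Elasticsearch Filtering Parents by Filtered Child Document Count
【Posted】: 2016-01-05 06:25:25
【Question】:

I am trying to run some Elasticsearch queries against a data set I have. I have a user document that is the parent of many child page view documents. I want to return all users who have viewed a specific page an arbitrary number of times (the number comes from a user input box). So far I have a has_child query that returns every user who has a page view with certain ids. However, that returns those parents no matter how many matching children they have. Next, I tried writing an aggregation over those query results, which essentially repeats the same has_child filter in aggregation form. That gives me the correct document count for the filtered child documents. What I still need is to use this count to go back and filter the parents. In words, the query is: "return all users who have viewed a specific page more than 4 times". I may need to restructure my data. Any ideas?

Here is my query so far: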

curl -XGET 'http://localhost:9200/development_users/_search?pretty=true' -d '
{
    "query" : { 
      "has_child" : {
        "type" : "page_view",
        "query" : {
          "terms" : {
            "viewed_id" : [175,180]
          }
        }
      }
    },
    "aggs" : {
      "to_page_view": {
        "children": {
          "type" : "page_view"
        },
        "aggs" : {
          "page_views_that_match" : {
            "filter" : { "terms": { "viewed_id" : [175,180] } }
          }
        }
      }
    }
}'

This returns a response like the following:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "development_users",
      "_type" : "user",
      "_id" : "22548",
      "_score" : 1.0,
      "_source":{"id":22548,"account_id":1009}
    } ]
  },
  "aggregations" : {
    "to_page_view" : {
      "doc_count" : 53,
      "page_views_that_match" : {
        "doc_count" : 2
      }
    }
  }
}

The associated mapping:

{
  "development_users" : {
    "mappings" : {
      "page_view" : {
        "dynamic" : "false",
        "_parent" : {
          "type" : "user"
        },
        "_routing" : {
          "required" : true
        },
        "properties" : {
          "created_at" : {
            "type" : "date",
            "format" : "date_time"
          },
          "id" : {
            "type" : "integer"
          },
          "viewed_id" : {
            "type" : "integer"
          },
          "time_on_page" : {
            "type" : "integer"
          },
          "title" : {
            "type" : "string"
          },
          "type" : {
            "type" : "string"
          },
          "updated_at" : {
            "type" : "date",
            "format" : "date_time"
          },
          "url" : {
            "type" : "string"
          }
        }
      },
      "user" : {
        "dynamic" : "false",
        "properties" : {
          "account_id" : {
            "type" : "integer"
          },
          "id" : {
            "type" : "integer"
          }
        }
      }
    }
  }
}

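【Question discussion】:

Is "id" the same as "viewable_id"? In general, posting your mapping makes it easier for people to figure out how to answer your question.
Yes, that was a typo on my part; it is id. I have also just added the mapping.
Cool, thanks. I think I know how to do this; testing it now.
Hmm, I think I get it. Is "page_view.id" the id of the page? So there can be many "page_view"s with the same "id", right?
Actually, I made another mistake. It should be a separate field, "viewed_id". And yes, there can be multiple page views with the same "viewed_id", and all of them should be counted.

Tags: elasticsearch

【Solution 1】: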

Okay, so this is a bit involved. I made some simplifications to keep things straight in my head. First, I used this mapping:

PUT /test_index
{
    "mappings": {
        "page_view": {
            "_parent": {
               "type": "development_user"
            },
            "properties": {
                "viewed_id": {
                    "type": "string"
                }
            }
        },
        "development_user": {
            "properties": {
                "id": {
                    "type": "string"
                }
            }
        }
    }
}

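Then I added some data. In this little universe there are three users and two pages. I want to find the users who have viewed "page_a" at least twice, so if the query is constructed correctly, only user 3 should be returned.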

POST /test_index/development_user/_bulk
{"index":{"_type":"development_user","_id":1}}
{"id":"user_1"}
{"index":{"_type":"page_view","_parent":1}}
{"viewed_id":"page_a"}
{"index":{"_type":"development_user","_id":2}}
{"id":"user_2"}
{"index":{"_type":"page_view","_parent":2}}
{"viewed_id":"page_b"}
{"index":{"_type":"development_user","_id":3}}
{"id":"user_3"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_a"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_a"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_b"}
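
As a sanity check on the expected result, the sample data above can be replayed in plain Python without Elasticsearch. This is only an in-memory sketch: the `page_views` list and the `users_with_min_views` helper are illustrative names made up for this example, not part of any Elasticsearch API.

```python
from collections import Counter

# (parent user _id, viewed_id) pairs mirroring the page_view documents
# in the bulk request above. The list itself is an assumption for illustration.
page_views = [
    (1, "page_a"),
    (2, "page_b"),
    (3, "page_a"),
    (3, "page_a"),
    (3, "page_b"),
]


def users_with_min_views(views, page_ids, min_count):
    """Return the sorted parent ids having at least `min_count` views of any page in `page_ids`."""
    wanted = set(page_ids)
    # Count only the child documents that match the page filter, grouped by parent.
    counts = Counter(parent for parent, viewed in views if viewed in wanted)
    return sorted(parent for parent, n in counts.items() if n >= min_count)


print(users_with_min_views(page_views, ["page_a"], 2))  # prints [3]
```

Users 1 and 3 both have a "page_a" view, but only user 3 has two of them, which is the result the Elasticsearch query should reproduce.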

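To get this answer we will use aggregations. Note that I do not want documents returned the normal way (hence "size": 0), but I do want to filter down the documents being analyzed, because that makes things more efficient. So I used the same basic has_child filter you had before.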

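The aggregation tree starts with terms_parent_id, which simply puts each parent document in its own bucket. Inside it is children_page_view, which switches to the child documents and, through its filter_page_ids sub-aggregation, narrows them to the ones I want ("page_a"). Next to it in the hierarchy is bucket_selector_page_id_term_count, which uses a bucket selector (you will need ES 2.x) to keep only the parent buckets whose filtered child count meets the criterion. Finally, a top hits aggregation shows the documents that match the requirement.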

POST /test_index/development_user/_search
{
   "size": 0,
   "query": {
      "has_child": {
         "type": "page_view",
         "query": {
            "terms": {
               "viewed_id": [
                  "page_a"
               ]
            }
         }
      }
   },
   "aggs": {
      "terms_parent_id": {
         "terms": {
            "field": "id"
         },
         "aggs": {
            "children_page_view": {
               "children": {
                  "type": "page_view"
               },
               "aggs": {
                  "filter_page_ids": {
                     "filter": {
                        "terms": {
                           "viewed_id": [
                              "page_a"
                           ]
                        }
                     }
                  }
               }
            },
            "bucket_selector_page_id_term_count": {
               "bucket_selector": {
                  "buckets_path": {
                     "children_count": "children_page_view>filter_page_ids._count"
                  },
                  "script": "children_count >= 2"
               }
            },
            "top_hits_users": {
               "top_hits": {
                  "_source": {
                     "include": [
                        "id"
                     ]
                  }
               }
            }
         }
      }
   }
}

This returns:

{
   "took": 14,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "terms_parent_id": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "user_3",
               "doc_count": 1,
               "children_page_view": {
                  "doc_count": 3,
                  "filter_page_ids": {
                     "doc_count": 2
                  }
               },
               "top_hits_users": {
                  "hits": {
                     "total": 1,
                     "max_score": 1,
                     "hits": [
                        {
                           "_index": "test_index",
                           "_type": "development_user",
                           "_id": "3",
                           "_score": 1,
                           "_source": {
                              "id": "user_3"
                           }
                        }
                     ]
                  }
               }
            }
         ]
      }
   }
}

Here is all the code I used:

http://sense.qbox.io/gist/43f24461448519dc884039db40ebd8e2f5b7304f
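
Because the matching users come back inside the aggregation buckets rather than in the top-level hits, a client has to dig them out of the response. The following is a minimal sketch of that step, assuming the response has already been parsed into a Python dict; `matching_users` and `sample_response` are hypothetical names, and the sample is a trimmed copy of the response shown above.

```python
def matching_users(response, agg_name="terms_parent_id", hits_name="top_hits_users"):
    """Collect (_id, _source) for every bucket that survived the bucket selector.

    Illustrative helper only; the default names match the aggregation names
    used in the query above.
    """
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    users = []
    for bucket in buckets:
        for hit in bucket.get(hits_name, {}).get("hits", {}).get("hits", []):
            users.append((hit["_id"], hit["_source"]))
    return users


# Trimmed copy of the response shown above.
sample_response = {
    "hits": {"total": 2, "max_score": 0, "hits": []},
    "aggregations": {
        "terms_parent_id": {
            "buckets": [
                {
                    "key": "user_3",
                    "doc_count": 1,
                    "children_page_view": {
                        "doc_count": 3,
                        "filter_page_ids": {"doc_count": 2},
                    },
                    "top_hits_users": {
                        "hits": {
                            "total": 1,
                            "hits": [
                                {
                                    "_index": "test_index",
                                    "_type": "development_user",
                                    "_id": "3",
                                    "_score": 1,
                                    "_source": {"id": "user_3"},
                                }
                            ],
                        }
                    },
                }
            ]
        }
    },
}

print(matching_users(sample_response))  # prints [('3', {'id': 'user_3'})]
```

One caveat worth keeping in mind: a terms aggregation returns only its top buckets (10 by default), so on a larger data set the "size" of terms_parent_id would probably need to be raised for this to see every parent.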

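【Discussion】:

Nice one! I would love to test this on a larger data set!
Actually, it just occurred to me that generalizing this approach to multiple "viewed_id"s might not be trivial. I think it is entirely doable, though.
The script syntax has changed; in later versions it should be "script": "params.children_count >= 2"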
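
The two comments above (multiple "viewed_id"s and the changed script syntax) can be folded into a small request builder. This is a sketch under stated assumptions, not the answerer's code: `build_query` and its `use_params_prefix` flag are hypothetical names, and the body it produces simply mirrors the query from the answer with the page ids and threshold made configurable.

```python
import json


def build_query(page_ids, min_count, use_params_prefix=False):
    """Build the search body from the answer for arbitrary page ids and threshold.

    Hypothetical helper. With use_params_prefix=True the script uses the
    "params.children_count" form mentioned in the last comment.
    """

    def page_filter():
        # Fresh dict each time so the query and the aggregation do not share state.
        return {"terms": {"viewed_id": list(page_ids)}}

    variable = "params.children_count" if use_params_prefix else "children_count"
    return {
        "size": 0,
        "query": {"has_child": {"type": "page_view", "query": page_filter()}},
        "aggs": {
            "terms_parent_id": {
                "terms": {"field": "id"},
                "aggs": {
                    "children_page_view": {
                        "children": {"type": "page_view"},
                        "aggs": {"filter_page_ids": {"filter": page_filter()}},
                    },
                    "bucket_selector_page_id_term_count": {
                        "bucket_selector": {
                            "buckets_path": {
                                "children_count": "children_page_view>filter_page_ids._count"
                            },
                            "script": "%s >= %d" % (variable, int(min_count)),
                        }
                    },
                    "top_hits_users": {"top_hits": {"_source": {"include": ["id"]}}},
                },
            }
        },
    }


print(json.dumps(build_query(["page_a"], 2), indent=3))
```

Passing several ids, for example `build_query([175, 180], 5, use_params_prefix=True)`, counts views of any of the listed pages together, since that is what a single terms filter matches. Requiring the threshold for each page separately would need one filter sub-aggregation per page, which this sketch does not attempt.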