【Question title】: Creating a materialized view for calculating histogram data
【Posted】: 2020-04-11 08:38:25
【Question】:

I have created a table:

CREATE TABLE results
(
    id UUID,
    date_time DateTime,
    item_id UInt32,
    value UInt16
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(date_time)
ORDER BY (date_time, item_id);

I want to create a materialized view that stores hourly histogram data for value. For example, I would like to get output like this:

toStartOfHour          item_id    value    count
2019-12-18 00:00:00    1          0        4       /* number of rows with value between 0 and 100 and date_time between 2019-12-18 00:00:00 and 2019-12-18 01:00:00 */
2019-12-18 00:00:00    1          100      7       /* number of rows with value between 100 and 200 and date_time between 2019-12-18 00:00:00 and 2019-12-18 01:00:00 */

I tried something like this:

CREATE MATERIALIZED VIEW results_histogram_by_hour
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMMDD(date_time)
ORDER BY (date_time, item_id)
POPULATE
AS SELECT toStartOfHour(date_time) AS date_time,
          item_id,
          multiply(floor(value / 100), 100) AS value,
          countState() AS count
FROM results
GROUP BY date_time,
         item_id,
         value;

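This materialized view definition gives correct results when it is populated. But as time passes and new rows arrive, the results become wrong. Wrong in what way? I don't know; I can't find a pattern.

I'm not sure whether I have found a bug in ClickHouse or I am doing something wrong.

Is my materialized view definition correct?

【Comments】:

Tags: sql clickhouse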


【Answer 1】:

Another approach:

https://clickhouse.yandex/docs/en/operations/table_engines/summingmergetree/#nested-structures

SummingMergeTree can sum the values in key/value arrays (the nested column's name must end in Map, e.g. valueMap).

CREATE TABLE results
(
    id UInt64,
    date_time DateTime,
    item_id UInt32,
    value UInt16
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(date_time)
ORDER BY (date_time, item_id);


insert into results 
select number,
       now(),
       number%7 item_id,
       number%9957 value
from numbers(10000);


CREATE MATERIALIZED VIEW results_histogram_by_hour
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMMDD(date_time)
ORDER BY (date_time, item_id) POPULATE AS
SELECT
    date_time,
    item_id,
    groupArray(value) AS `valueMap.bin`,
    groupArray(cnt) AS `valueMap.cnt`
FROM
(
    SELECT
        toStartOfHour(date_time) AS date_time,
        item_id,
        intDiv(value, 1000) AS value,
        sum(toUInt64(1)) AS cnt
    FROM results
    GROUP BY
        date_time,
        item_id,
        value
)
GROUP BY
    date_time,
    item_id;


insert into results 
select number,
       now(),
       number%7 item_id,
       number%9957 value
from numbers(10000);

SELECT *
FROM results_histogram_by_hour
WHERE item_id = 4

─item_id─┬─valueMap.bin──────────┬─valueMap.cnt──────────────────────────────┐
       4 │ [0,7,6,1,5,2,3,4,8,9] │ [149,143,143,143,143,142,143,143,143,136] │
─────────┴───────────────────────┴───────────────────────────────────────────┘
─item_id─┬─valueMap.bin──────────┬─valueMap.cnt──────────────────────────────┐
       4 │ [0,7,6,1,5,2,3,4,8,9] │ [149,143,143,143,143,142,143,143,143,136] │
─────────┴───────────────────────┴───────────────────────────────────────────┘    

SELECT
    date_time,
    item_id,
    sumMap(valueMap.bin, valueMap.cnt)
FROM results_histogram_by_hour
WHERE item_id = 4
GROUP BY
    date_time,
    item_id

─item_id─┬─sumMap(valueMap.bin, valueMap.cnt)────────────────────────────────┐
       4 │ ([0,1,2,3,4,5,6,7,8,9],[298,286,284,286,286,286,286,286,286,272]) │
─────────┴───────────────────────────────────────────────────────────────────┘

optimize table results_histogram_by_hour final;

SELECT *
FROM results_histogram_by_hour
WHERE item_id = 4

─item_id─┬─valueMap.bin──────────┬─valueMap.cnt──────────────────────────────┐
       4 │ [0,1,2,3,4,5,6,7,8,9] │ [298,286,284,286,286,286,286,286,286,272] │
─────────┴───────────────────────┴───────────────────────────────────────────┘
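
The arrays can also be unfolded back into one row per bin, which is closer to the output asked for in the question. A minimal sketch, assuming the results_histogram_by_hour view defined in this answer; the names hist, bin and cnt are only illustrative, and the bin key is multiplied by 1000 because intDiv(value, 1000) was used as the bin above:

```sql
-- Sketch only: one row per (hour, item, bin), read from the SummingMergeTree view.
SELECT
    date_time,
    item_id,
    bin * 1000 AS value,  -- lower bound of the bin
    cnt
FROM
(
    SELECT
        date_time,
        item_id,
        -- sumMap returns a tuple: (sorted keys, summed values)
        sumMap(valueMap.bin, valueMap.cnt) AS hist
    FROM results_histogram_by_hour
    GROUP BY
        date_time,
        item_id
)
ARRAY JOIN
    tupleElement(hist, 1) AS bin,
    tupleElement(hist, 2) AS cnt
ORDER BY
    date_time,
    item_id,
    bin;
```

Aggregating with sumMap at query time means the result should not depend on whether the background merges (or OPTIMIZE ... FINAL) have already run.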

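【Comments】:

    【Answer 2】:

    AggregatingMergeTree treats the ORDER BY (primary key) columns as dimensions; all other columns are metrics. If a metric column is not an aggregate function state (-State), it is collapsed as if by any(): rows with the same key are merged and an arbitrary value is kept. In your view, value is not part of ORDER BY, so different bins for the same date_time and item_id get merged into one row.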

    CREATE table results_histogram_by_hour
    (date_time DateTime,
     item_id UInt32,
     value UInt16,
     count AggregateFunction(count) 
    ) ENGINE = AggregatingMergeTree() 
    PARTITION BY toYYYYMMDD(date_time) 
    ORDER BY (date_time, item_id);
    
    insert into results_histogram_by_hour 
    select toStartOfHour(now()) date_time,
           1 item_id,
           1 value,
           countState()
    group by date_time, item_id, value;
    
    insert into results_histogram_by_hour 
    select toStartOfHour(now()) date_time,
           1 item_id,
          99 value,
           countState()
    group by date_time, item_id, value;
    
    optimize table results_histogram_by_hour final;
    
    select * from results_histogram_by_hour;
    
    ┌───────────date_time─┬─item_id─┬─value─┬─count─┐
    │ 2019-12-18 21:00:00 │       1 │     1 │       │
    └─────────────────────┴─────────┴───────┴───────┘
    
    -- the same inserts with ORDER BY (date_time, item_id, value) instead:
    ┌───────────date_time─┬─item_id─┬─value─┬─count─┐
    │ 2019-12-18 21:00:00 │       1 │     1 │       │
    │ 2019-12-18 21:00:00 │       1 │    99 │       │
    └─────────────────────┴─────────┴───────┴───────┘
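
The count column appears empty in the output above most likely because it holds an intermediate aggregation state rather than a plain number. A minimal sketch of how the counts might be read back, assuming the results_histogram_by_hour table from this answer; the alias cnt is my own choice and is deliberately different from the column name count:

```sql
-- Sketch only: finalize the stored count states at query time.
SELECT
    date_time,
    item_id,
    value,
    countMerge(count) AS cnt  -- combines the partial states into a number
FROM results_histogram_by_hour
GROUP BY
    date_time,
    item_id,
    value
ORDER BY
    date_time,
    item_id,
    value;
```

Grouping by all three dimensions at query time gives the same totals whether or not the parts have been merged yet.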
    

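    If you don't like the idea of a long/wide/heavy index (PRIMARY KEY), you can use different column sets for ORDER BY and PRIMARY KEY (the primary key must be a prefix of the sorting key). All engines use the ORDER BY column set for merging/collapsing.

    【Comments】: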
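
Putting the two points together, the view from the question could be defined roughly as follows. This is a sketch under assumptions, not a tested fix: the view name results_histogram_by_hour_mv is hypothetical, an explicit target table with TO is used in place of the inline ENGINE ... POPULATE form, and intDiv replaces floor(value / 100) so that value stays an integer.

```sql
-- Sketch only. `value` is in the sorting key, so rows for different bins
-- are never collapsed together; PRIMARY KEY keeps the index itself short.
CREATE TABLE results_histogram_by_hour
(
    date_time DateTime,
    item_id UInt32,
    value UInt16,
    count AggregateFunction(count)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMMDD(date_time)
ORDER BY (date_time, item_id, value)
PRIMARY KEY (date_time, item_id);

-- Hypothetical view name; it writes into the table above on every insert into results.
CREATE MATERIALIZED VIEW results_histogram_by_hour_mv
TO results_histogram_by_hour
AS SELECT
    toStartOfHour(date_time) AS date_time,
    item_id,
    toUInt16(intDiv(value, 100) * 100) AS value,  -- lower bound of the 100-wide bin
    countState() AS count
FROM results
GROUP BY
    date_time,
    item_id,
    value;
```

A view created this way only sees rows inserted after it exists, so rows already in results would have to be backfilled separately with an INSERT ... SELECT using the same query.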
