Compressed chunks: performance and data size (answers)

【Question title】: Compressed chunks: performance and data size
【Posted】: 2022-01-02 04:02:33
【Question】:

I'm new to TimescaleDB and have started exploring the documentation. It's pretty clear, but it looks like I'm missing something important.

I created a table:

CREATE TABLE session_created
(
    event_timestamp timestamp without time zone NOT NULL,
    client_id integer,
    client_version text,
    identifier text,
    platform text,
    remote_addr text,
    country text,
    type smallint,
    session text
);

CREATE INDEX session_created_client_id_idx ON session_created USING btree (client_id ASC NULLS LAST);
CREATE INDEX session_created_event_timestamp_idx ON session_created USING btree (event_timestamp ASC NULLS LAST);

Then I made it a hypertable with the following compression settings:

SELECT create_hypertable('session_created','event_timestamp');

ALTER TABLE session_created SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'event_timestamp'
);

SELECT add_compression_policy('session_created', INTERVAL '1 days');

I filled the table with a million rows per day, from November 9, 2021 to November 2, 2021, like this:

INSERT INTO session_created
(
    event_timestamp,
    client_id,
    client_version,
    identifier,
    platform,
    remote_addr,
    country,
    type,
    session
)
SELECT '2021-11-23 00:00:00'::timestamp + s.id * interval '85 milliseconds', 
    s.id % 500000, 
    '1.0.1234', 
    'deviceid-' || s.id % 500000,
    'Android',
    '127.0.0.' || s.id % 256,
    'RU',
    0,
    md5(random()::text || clock_timestamp()::text)::uuid
FROM generate_series(1, 1000000) AS s(id);

The target table contains three chunks, and I compressed two of them to try out this core TimescaleDB feature:

SELECT compress_chunk(chunk_name)
FROM show_chunks('session_created', older_than => INTERVAL '1 day') chunk_name;

The problem is that the compressed data takes about 2.5 times as much space as the data did before compression.

before compression	after compression
1702 MB	4178 MB

In addition, analytical queries against the compressed data take more time:

select event_timestamp::date as date, count(distinct client_id) as clients
from session_created
where event_timestamp between (current_date - 7)::date and (current_date - 6)::date - interval '1 second'
group by 1

So the question is: what did I miss in the documentation? What is the root cause of the problem?

【Comments】:

Tags: timescaledb

【Answer 1】:

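When setting up compression, the "time" column used to create the hypertable is typically not used as the segment-by column. The segment-by column (timescaledb.compress_segmentby) should be a column whose values are shared across many rows of the data set. For example, if we have a table of device readings (device_id, event_timestamp, event_id, reading), the segment-by column could be device_id (say you have a few thousand devices and the device_readings table holds millions or billions of rows). Note that the data in the segment-by column is never stored in compressed form. Only the non-segment-by columns are compressed.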
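
As a rough illustration of the advice above: compression batches rows that share the same segment-by value, so segmenting by a nearly unique timestamp leaves about one original row per compressed row, and the per-row overhead can make the table larger. The sketch below assumes the session_created hypertable from the question and segments by client_id, as the follow-up comment describes. The `event_timestamp DESC` ordering is my assumption, not something stated in the thread, and whether chunks must be decompressed before the settings can change depends on the TimescaleDB version.

```sql
-- Sketch only: assumes the session_created hypertable from the question.

-- 1. Decompress any chunks that are already compressed, because older
--    TimescaleDB 2.x releases refuse to change compression settings
--    while compressed chunks exist.
SELECT decompress_chunk(c, if_compressed => true)
FROM show_chunks('session_created') AS c;

-- 2. Segment by a column whose values repeat across many rows,
--    and order the rows inside each segment by time (assumed ordering).
ALTER TABLE session_created SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'client_id',
  timescaledb.compress_orderby = 'event_timestamp DESC'
);

-- 3. Recompress the chunks older than one day.
SELECT compress_chunk(c, if_not_compressed => true)
FROM show_chunks('session_created', older_than => INTERVAL '1 day') AS c;

-- 4. Compare total size before and after compression.
SELECT pg_size_pretty(before_compression_total_bytes) AS before_size,
       pg_size_pretty(after_compression_total_bytes)  AS after_size
FROM hypertable_compression_stats('session_created');
```

With the sample data, client_id takes 500,000 distinct values, so each segment still holds only a handful of rows per chunk. A lower-cardinality column such as platform or country might compress better, but that is a guess to check with the statistics query in step 4.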

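【Comments】:

Thanks for the explanation, I had missed the idea of segmenting. Compression is good enough when the client_id column is used for segmenting. Queries with client_id = ? in the WHERE clause work great, but analytical queries like the one above still perform worse.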
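
One way to see where the time goes in the analytical case is to look at the execution plan. The sketch below reuses the query from the question unchanged and only wraps it in EXPLAIN. It is a diagnostic aid, not a fix, and the plan you get will depend on your data and TimescaleDB version.

```sql
-- Sketch only: the analytical query from the question, wrapped in EXPLAIN.
-- ANALYZE actually runs the query; BUFFERS reports how many pages were read.
EXPLAIN (ANALYZE, BUFFERS)
SELECT event_timestamp::date AS date, count(DISTINCT client_id) AS clients
FROM session_created
WHERE event_timestamp BETWEEN (current_date - 7)::date
                          AND (current_date - 6)::date - INTERVAL '1 second'
GROUP BY 1;
```

Compressed chunks should appear in the plan as DecompressChunk custom scan nodes. Comparing their timings with the scans on uncompressed chunks shows how much of the total is spent on decompression rather than on the aggregation itself.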