【Question title】: Pivot on Multiple Columns using Tablefunc
【Posted】: 2013-03-03 03:59:42
【Question】:

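Has anyone used tablefunc to pivot on multiple variables, as opposed to only using the row name? The documentation notes:

The "extra" columns are expected to be the same for all rows with the same row_name value.

I'm not sure how to do this without combining the columns I want to pivot on (which I highly doubt will give me the speed I need). One possible way would be to make entity numeric and add it to localt as milliseconds, but that seems like a shaky way to proceed.

I've edited the data used in an answer to this question: PostgreSQL Crosstab Query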

 CREATE TEMP TABLE t4 (
  timeof   timestamp
 ,entity    character
 ,status    integer
 ,ct        integer);

 INSERT INTO t4 VALUES 
  ('2012-01-01', 'a', 1, 1)
 ,('2012-01-01', 'a', 0, 2)
 ,('2012-01-02', 'b', 1, 3)
 ,('2012-01-02', 'c', 0, 4);

 SELECT * FROM crosstab(
     'SELECT timeof, entity, status, ct
      FROM   t4
      ORDER  BY 1,2,3'
     ,$$VALUES (1::text), (0::text)$$)
 AS ct ("Section" timestamp, "Attribute" character, "1" int, "0" int);

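Returns:

Section             | Attribute | 1 | 0
--------------------+-----------+---+---
2012-01-01 00:00:00 | a         | 1 | 2
2012-01-02 00:00:00 | b         | 3 | 4

So, as the documentation states, the extra column (i.e. "Attribute") is assumed to be the same for each row name (i.e. "Section"). Thus it reports b for the second row, even though 'entity' also has a 'c' value for that 'timeof' value.

Desired output: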

Section                   | Attribute | 1 | 0
--------------------------+-----------+---+---
2012-01-01 00:00:00       |     a     | 1 | 2
2012-01-02 00:00:00       |     b     | 3 |  
2012-01-02 00:00:00       |     c     |   | 4

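Any thoughts or references?

Some more background: I potentially need to do this for billions of rows, and I'm testing storing the data in long and wide formats to see whether tablefunc can go from long to wide format more efficiently than regular aggregate functions.
I will have about 100 measurements made every minute for around 300 entities. Often we will need to compare the different measurements made in a given second for a given entity, so we will need the wide format very often. Also, the measurements made on a particular entity are highly variable.

EDIT: I found a resource on this: http://www.postgresonline.com/journal/categories/24-tablefunc

【Comments】:

  • +1 Good question with a (much appreciated) working test case that clearly demonstrates the problem. Only the PostgreSQL version is missing. I'll assume the current version, 9.2. But the solution is the same for the last couple of major versions.
  • This solution is valid only for this question, and I mean only "this question", because it does not work for a pivot table over a two-column index. So either the title is incorrect, or the accepted answer should be the second one (id=answer-15559942).

Tags: sql postgresql pivot crosstab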


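【Answer 1】:

The problem with your query is that b and c share the same timestamp 2012-01-02 00:00:00, and you have the timestamp column timeof first in your query, so - even though you added bold emphasis - b and c are just extra columns that fall into the same group 2012-01-02 00:00:00. Only the first (b) is returned, since (quoting the manual):

The row_name column must be first. The category and value columns must be the last two columns, in that order. Any columns between row_name and category are treated as "extra". The "extra" columns are expected to be the same for all rows with the same row_name value.

Emphasis mine.
Just swap the order of the first two columns to make entity the row name, and it works as desired: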

SELECT * FROM crosstab(
      'SELECT entity, timeof, status, ct
       FROM   t4
       ORDER  BY 1'
      ,'VALUES (1), (0)')
 AS ct (
    "Attribute" character
   ,"Section" timestamp
   ,"status_1" int
   ,"status_0" int);

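entity must be unique, of course.

To reiterate:

  • row_name first
  • (optional) extra columns next
  • category (as defined by the second parameter) and value last

Extra columns are filled from the first row of each row_name partition. Values from other rows are ignored, since there is only one column per row_name to fill. Typically those values would be the same for every row of one row_name, but that's up to you.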
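
One way to check whether the "extra" columns really are constant per row_name before calling crosstab() is a plain aggregate over the source table. The following is a minimal sketch against the t4 table from the question, assuming timeof is the intended row name and entity the extra column:

```sql
-- List every row_name value (timeof) that carries more than one distinct
-- value in the would-be "extra" column (entity). For each row returned,
-- crosstab() would keep only the entity of the first row in that group.
SELECT timeof
     , count(DISTINCT entity) AS distinct_entities
FROM   t4
GROUP  BY timeof
HAVING count(DISTINCT entity) > 1;
```

With the question's sample data this should flag 2012-01-02, which carries both b and c.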

For the different setup in your answer:

SELECT localt, entity
     , msrmnt01, msrmnt02, msrmnt03, msrmnt04, msrmnt05  -- , more?
FROM   crosstab(
        'SELECT dense_rank() OVER (ORDER BY localt, entity)::int AS row_name
              , localt, entity -- additional columns
              , msrmnt, val
         FROM   test
         -- WHERE  ???   -- instead of LIMIT at the end
         ORDER  BY localt, entity, msrmnt
         -- LIMIT ???'   -- instead of LIMIT at the end
     , $$SELECT generate_series(1,5)$$)  -- more?
     AS ct (row_name int, localt timestamp, entity int
          , msrmnt01 float8, msrmnt02 float8, msrmnt03 float8, msrmnt04 float8, msrmnt05 float8 -- , more?
            )
LIMIT 1000  -- ??!!

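No wonder the queries in your test perform terribly. Your test setup has about 14 million rows, and you process all of them before throwing most of the result away with LIMIT 1000. For a reduced result set, add WHERE conditions or a LIMIT to the source query instead!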
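
As an illustration of filtering in the source query, here is a sketch against the test table from the other answer. The five-minute window and the msrmnt <= 5 filter are assumptions chosen only for the example; with 200 entities and one row per minute, this window should produce about 1000 result rows without any outer LIMIT.

```sql
-- Assumes the tablefunc extension is installed and public.test exists
-- as defined in the benchmark answer.
SELECT localt, entity
     , msrmnt01, msrmnt02, msrmnt03, msrmnt04, msrmnt05
FROM   crosstab(
        $q$SELECT dense_rank() OVER (ORDER BY localt, entity)::int AS row_name
                , localt, entity          -- "extra" columns
                , msrmnt, val             -- category and value
           FROM   public.test
           -- Restrict the input instead of discarding output rows later.
           WHERE  localt >= timestamp '2012-01-01 00:00'
           AND    localt <  timestamp '2012-01-01 00:05'
           AND    msrmnt <= 5             -- only the categories pivoted below
           ORDER  BY 1, msrmnt$q$
      , $$SELECT generate_series(1,5)$$)
       AS ct (row_name int, localt timestamp, entity int
            , msrmnt01 float8, msrmnt02 float8, msrmnt03 float8
            , msrmnt04 float8, msrmnt05 float8);
```

The existing index on localt gives the planner a chance to avoid reading the whole table, though whether it is used depends on the data and settings.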

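Plus, the array you work with is needlessly expensive on top of that. I generate a surrogate row name with dense_rank() instead.

db<>fiddle here - with a simpler test setup and fewer rows.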
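
The same dense_rank() idea can be applied to the small t4 table from the question. This is a sketch, not part of the original answer: the surrogate row_name numbers each (timeof, entity) pair, so timeof and entity both become "extra" columns that are constant within their row_name.

```sql
-- crosstab() lives in the tablefunc extension.
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT "Section", "Attribute", "1", "0"
FROM   crosstab(
        $q$SELECT dense_rank() OVER (ORDER BY timeof, entity)::int AS row_name
                , timeof, entity          -- "extra" columns
                , status, ct              -- category and value
           FROM   t4
           ORDER  BY 1$q$
      , $$VALUES (1), (0)$$)              -- category order = output column order
       AS ct (row_name int, "Section" timestamp, "Attribute" character
            , "1" int, "0" int);
```

With the question's sample data this should give three rows, one each for a, b and c, with NULL where a status is missing, which matches the desired output shown in the question.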

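【Comments】:

  • Hey Erwin, thanks so much for the response. I'll go ahead and give you credit, but I think I left some ambiguity in my question. I'll post back when I have time.
  • @AndreSilva: I am sure there is a way. Please post your problem as a new question. Comments are not the place. You can always link to this one for context, and drop a comment here linking back to get my attention.
【Answer 2】:

OK, so I ran this on a table closer to my use case. Either I'm doing it wrong, or crosstab is not suitable for my use.

First I made some similar data: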

CREATE TABLE public.test (
    id serial primary key,
    msrmnt integer,
    entity integer,
    localt timestamp,
    val    double precision
);
CREATE INDEX ix_test_msrmnt
   ON public.test (msrmnt);
 CREATE INDEX ix_public_test_201201_entity
   ON public.test (entity);
CREATE INDEX ix_public_test_201201_localt
  ON public.test (localt);
insert into public.test (msrmnt, entity, localt, val)
select *
from(
SELECT msrmnt, entity, localt, random() as val 
FROM generate_series('2012-01-01'::timestamp, '2012-01-01 23:59:00'::timestamp, interval '1 minutes') as localt
join 
(select *
FROM generate_series(1, 50, 1) as msrmnt) as msrmnt
on 1=1
join 
(select *
FROM generate_series(1, 200, 1) as entity) as entity
on 1=1) as data;

Then I ran the crosstab code a few times:

explain analyze
SELECT (timestamp 'epoch' + row_name[1] * INTERVAL '1 second')::date As localt, row_name[2] as entity
    ,msrmnt01,msrmnt02,msrmnt03,msrmnt04,msrmnt05,msrmnt06,msrmnt07,msrmnt08,msrmnt09,msrmnt10
    ,msrmnt11,msrmnt12,msrmnt13,msrmnt14,msrmnt15,msrmnt16,msrmnt17,msrmnt18,msrmnt19,msrmnt20
    ,msrmnt21,msrmnt22,msrmnt23,msrmnt24,msrmnt25,msrmnt26,msrmnt27,msrmnt28,msrmnt29,msrmnt30
    ,msrmnt31,msrmnt32,msrmnt33,msrmnt34,msrmnt35,msrmnt36,msrmnt37,msrmnt38,msrmnt39,msrmnt40
    ,msrmnt41,msrmnt42,msrmnt43,msrmnt44,msrmnt45,msrmnt46,msrmnt47,msrmnt48,msrmnt49,msrmnt50
    FROM crosstab('SELECT ARRAY[extract(epoch from localt), entity] as row_name, msrmnt, val
               FROM public.test
               ORDER BY localt, entity, msrmnt',$$VALUES  ( 1::text),( 2::text),( 3::text),( 4::text),( 5::text),( 6::text),( 7::text),( 8::text),( 9::text),(10::text)
                                                         ,(11::text),(12::text),(13::text),(14::text),(15::text),(16::text),(17::text),(18::text),(19::text),(20::text)
                                                         ,(21::text),(22::text),(23::text),(24::text),(25::text),(26::text),(27::text),(28::text),(29::text),(30::text)
                                                         ,(31::text),(32::text),(33::text),(34::text),(35::text),(36::text),(37::text),(38::text),(39::text),(40::text)
                                                         ,(41::text),(42::text),(43::text),(44::text),(45::text),(46::text),(47::text),(48::text),(49::text),(50::text)$$)
        as ct (row_name integer[],msrmnt01 double precision, msrmnt02 double precision,msrmnt03 double precision, msrmnt04 double precision,msrmnt05 double precision, 
                    msrmnt06 double precision,msrmnt07 double precision, msrmnt08 double precision,msrmnt09 double precision, msrmnt10 double precision
                 ,msrmnt11 double precision, msrmnt12 double precision,msrmnt13 double precision, msrmnt14 double precision,msrmnt15 double precision, 
                    msrmnt16 double precision,msrmnt17 double precision, msrmnt18 double precision,msrmnt19 double precision, msrmnt20 double precision
                 ,msrmnt21 double precision, msrmnt22 double precision,msrmnt23 double precision, msrmnt24 double precision,msrmnt25 double precision, 
                    msrmnt26 double precision,msrmnt27 double precision, msrmnt28 double precision,msrmnt29 double precision, msrmnt30 double precision
                 ,msrmnt31 double precision, msrmnt32 double precision,msrmnt33 double precision, msrmnt34 double precision,msrmnt35 double precision, 
                    msrmnt36 double precision,msrmnt37 double precision, msrmnt38 double precision,msrmnt39 double precision, msrmnt40 double precision
                 ,msrmnt41 double precision, msrmnt42 double precision,msrmnt43 double precision, msrmnt44 double precision,msrmnt45 double precision, 
                    msrmnt46 double precision,msrmnt47 double precision, msrmnt48 double precision,msrmnt49 double precision, msrmnt50 double precision)
limit 1000

On the third run I got this:

QUERY PLAN
Limit  (cost=0.00..20.00 rows=1000 width=432) (actual time=110236.673..110237.667 rows=1000 loops=1)
  ->  Function Scan on crosstab ct  (cost=0.00..20.00 rows=1000 width=432) (actual time=110236.672..110237.598 rows=1000 loops=1)
Total runtime: 110699.598 ms

Then I ran the standard solution a few times:

explain analyze
select localt, entity, 
 max(case when msrmnt =  1 then val else null end) as msrmnt01
,max(case when msrmnt =  2 then val else null end) as msrmnt02
,max(case when msrmnt =  3 then val else null end) as msrmnt03
,max(case when msrmnt =  4 then val else null end) as msrmnt04
,max(case when msrmnt =  5 then val else null end) as msrmnt05
,max(case when msrmnt =  6 then val else null end) as msrmnt06
,max(case when msrmnt =  7 then val else null end) as msrmnt07
,max(case when msrmnt =  8 then val else null end) as msrmnt08
,max(case when msrmnt =  9 then val else null end) as msrmnt09
,max(case when msrmnt = 10 then val else null end) as msrmnt10
,max(case when msrmnt = 11 then val else null end) as msrmnt11
,max(case when msrmnt = 12 then val else null end) as msrmnt12
,max(case when msrmnt = 13 then val else null end) as msrmnt13
,max(case when msrmnt = 14 then val else null end) as msrmnt14
,max(case when msrmnt = 15 then val else null end) as msrmnt15
,max(case when msrmnt = 16 then val else null end) as msrmnt16
,max(case when msrmnt = 17 then val else null end) as msrmnt17
,max(case when msrmnt = 18 then val else null end) as msrmnt18
,max(case when msrmnt = 19 then val else null end) as msrmnt19
,max(case when msrmnt = 20 then val else null end) as msrmnt20
,max(case when msrmnt = 21 then val else null end) as msrmnt21
,max(case when msrmnt = 22 then val else null end) as msrmnt22
,max(case when msrmnt = 23 then val else null end) as msrmnt23
,max(case when msrmnt = 24 then val else null end) as msrmnt24
,max(case when msrmnt = 25 then val else null end) as msrmnt25
,max(case when msrmnt = 26 then val else null end) as msrmnt26
,max(case when msrmnt = 27 then val else null end) as msrmnt27
,max(case when msrmnt = 28 then val else null end) as msrmnt28
,max(case when msrmnt = 29 then val else null end) as msrmnt29
,max(case when msrmnt = 30 then val else null end) as msrmnt30
,max(case when msrmnt = 31 then val else null end) as msrmnt31
,max(case when msrmnt = 32 then val else null end) as msrmnt32
,max(case when msrmnt = 33 then val else null end) as msrmnt33
,max(case when msrmnt = 34 then val else null end) as msrmnt34
,max(case when msrmnt = 35 then val else null end) as msrmnt35
,max(case when msrmnt = 36 then val else null end) as msrmnt36
,max(case when msrmnt = 37 then val else null end) as msrmnt37
,max(case when msrmnt = 38 then val else null end) as msrmnt38
,max(case when msrmnt = 39 then val else null end) as msrmnt39
,max(case when msrmnt = 40 then val else null end) as msrmnt40
,max(case when msrmnt = 41 then val else null end) as msrmnt41
,max(case when msrmnt = 42 then val else null end) as msrmnt42
,max(case when msrmnt = 43 then val else null end) as msrmnt43
,max(case when msrmnt = 44 then val else null end) as msrmnt44
,max(case when msrmnt = 45 then val else null end) as msrmnt45
,max(case when msrmnt = 46 then val else null end) as msrmnt46
,max(case when msrmnt = 47 then val else null end) as msrmnt47
,max(case when msrmnt = 48 then val else null end) as msrmnt48
,max(case when msrmnt = 49 then val else null end) as msrmnt49
,max(case when msrmnt = 50 then val else null end) as msrmnt50
from sample
group by localt, entity
limit 1000

On the third run I got this:

QUERY PLAN
Limit  (cost=2257339.69..2270224.77 rows=1000 width=24) (actual time=19795.984..20090.626 rows=1000 loops=1)
  ->  GroupAggregate  (cost=2257339.69..5968242.35 rows=288000 width=24) (actual time=19795.983..20090.496 rows=1000 loops=1)
        ->  Sort  (cost=2257339.69..2293339.91 rows=14400088 width=24) (actual time=19795.626..19808.820 rows=50001 loops=1)
              Sort Key: localt
              Sort Method: external merge  Disk: 478568kB
              ->  Seq Scan on sample  (cost=0.00..249883.88 rows=14400088 width=24) (actual time=0.013..2245.247 rows=14400000 loops=1)
Total runtime: 20197.565 ms

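So, for my case, crosstab does not look like a solution so far. And this is just one day of data, when I will have multiple years. In fact, I will probably have to go with wide-format (denormalized) tables, even though the set of measurements made per entity is variable and new measurements get introduced, but I won't go into that here.

Here are some of my settings, using Postgres 9.2.3: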

name                    setting
max_connections             100
shared_buffers          2097152
effective_cache_size    6291456
maintenance_work_mem    1048576
work_mem                 262144

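【Comments】:

  • Just stumbled upon this. I didn't notice it at the time. I added a solution to my answer. It is too late for you by now, but I keep linking to this question and did not want to leave it unanswered.
【Answer 3】:

In my original question I should have used this for my sample data: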

CREATE TEMP TABLE t4 (
 timeof    date
,entity    integer
,status    integer
,ct        integer);
INSERT INTO t4 VALUES 
 ('2012-01-01', 1, 1, 1)
,('2012-01-01', 1, 0, 2)
,('2012-01-01', 3, 0, 3)
,('2012-01-02', 2, 1, 4)
,('2012-01-02', 3, 1, 5)
,('2012-01-02', 3, 0, 6);

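With this, I have to pivot on both timeof and entity. Since tablefunc only uses one column as the row name, you need to find a way to stuff both dimensions into that column (http://www.postgresonline.com/journal/categories/24-tablefunc). I went with an array, just like the example in that link.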

SELECT (timestamp 'epoch' + row_name[1] * INTERVAL '1 second')::date 
           as localt, 
           row_name[2] As entity, status1, status0
FROM crosstab('SELECT ARRAY[extract(epoch from timeof), entity] as row_name,
                    status, ct
               FROM t4 
               ORDER BY timeof, entity, status'
     ,$$VALUES (1::text), (0::text)$$) 
          as ct (row_name integer[], status1 int, status0 int)

FWIW, I tried using a character array, and so far this looks faster for my setup (PostgreSQL 9.2.3).

Here is the result, which is also the desired output.

localt           | entity | status1 | status0
-----------------+--------+---------+--------
2012-01-01       |   1    |    1    |   2
2012-01-01       |   3    |         |   3
2012-01-02       |   2    |    4    |
2012-01-02       |   3    |    5    |   6

I'm curious how this performs on a larger data set and will report back later.

【Comments】:
