【Question title】: Need help identifying dups in the table
【Posted】: 2015-04-17 20:18:37
【Question description】:

What I have:

  1. data_source_1
  2. data_source_2
  3. data_sources_view

About the tables:

data_source_1:

No duplicates:

db=# select count(*) from (select distinct * from data_source_1);
count 
--------
543243
(1 row)

db=# select count(*) from (select * from data_source_1);
count 
--------
543243
(1 row)

data_source_2:

No duplicates:

db=# select count(*) from (select * from data_source_2);
count 
-------
5304
(1 row)

db=# select count(*) from (select distinct * from data_source_2);
count 
-------
5304
(1 row)

data_sources_view:

Has duplicates:

db=# select count(*) from (select distinct * from data_sources_view);
count 
--------
538714
(1 row)

db=# select count(*) from (select * from data_sources_view);
count 
--------
548547
(1 row)

The view is simple:

CREATE VIEW data_sources_view
AS SELECT * 
FROM (
      (
       SELECT a, b, 'data_source_1' as source
       FROM data_source_1
      )
      UNION ALL 
      ( 
       SELECT a, b, 'data_source_2' as source
       FROM data_source_2
      )
);

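What I want to know:

  • How can the view contain duplicates when the source tables have none, and 'data_source_x' as source rules out overlap between the two sources?
  • How can I identify the duplicates?

What I tried: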

db=# create table t1 as select * from data_sources_view;
SELECT
db=# 
db=# create table t2 as select distinct * from data_sources_view;
SELECT
db=# create table t3 as select * from t1 minus select * from t2;
SELECT
db=# select 't1' as table_name, count(*) from t1 UNION ALL
db-# select 't2' as table_name, count(*) from t2 UNION ALL
db-# select 't3' as table_name, count(*) from t3;
table_name | count 
------------+--------
t1 | 548547
t3 | 0
t2 | 538714
(3 rows)

Database:

Redshift (PostgreSQL)

【Question comments】:

  • To identify the duplicates, just run select a,b,source from data_sources_view group by a,b,source having count(*) > 1;
  • Almost there, I got 9657. Thanks for the tip.
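The query from the comment above can be tried on a toy data set. This is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in for Redshift; the table view_rows and its contents are hypothetical and only imitate the shape of data_sources_view.

```python
import sqlite3

# Assumption: SQLite stands in for Redshift; view_rows is a made-up
# stand-in for data_sources_view with a few repeated rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE view_rows (a INTEGER, b INTEGER, source TEXT);
    INSERT INTO view_rows VALUES
        (1, 10, 'data_source_1'),
        (1, 10, 'data_source_1'),
        (1, 10, 'data_source_1'),
        (2, 20, 'data_source_1'),
        (3, 30, 'data_source_2'),
        (3, 30, 'data_source_2');
""")

# One result row per (a, b, source) combination that occurs more than once.
duplicate_groups = conn.execute("""
    SELECT a, b, source, COUNT(*) AS copies
    FROM view_rows
    GROUP BY a, b, source
    HAVING COUNT(*) > 1
    ORDER BY a, b, source
""").fetchall()

total_rows = conn.execute("SELECT COUNT(*) FROM view_rows").fetchone()[0]
distinct_rows = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT a, b, source FROM view_rows)"
).fetchone()[0]

# Each duplicated group contributes (copies - 1) surplus rows.
excess_rows = sum(copies - 1 for _, _, _, copies in duplicate_groups)

print(duplicate_groups)
print(len(duplicate_groups), total_rows - distinct_rows, excess_rows)
```

The query returns one row per duplicated combination, not one row per surplus copy, so a group that occurs three times is counted once. That may be why the 9657 reported above is smaller than the gap between the two view counts (548547 - 538714 = 9833); summing copies - 1 over the groups gives the number of surplus rows.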

Tags: sql postgresql duplicates amazon-redshift


【Solution 1】:

The reason is that your data sources have more than two columns. If you run these counts:

select count(*) from (select distinct a, b from data_source_1);

select count(*) from (select distinct a, b from data_source_2);

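You should find that they differ from the count(*) you get on the same tables. The view selects only a and b, so rows that differ only in the other columns become identical once those columns are dropped.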
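The effect can be reproduced on a small scale. This is a sketch only, assuming SQLite (through Python's sqlite3 module) in place of Redshift; the third column c, the sample rows, and the helper count_rows are hypothetical, since the question does not show the real table definitions.

```python
import sqlite3

# Assumption: SQLite stands in for Redshift. Column c and the rows are
# invented to illustrate a table with more columns than the view selects.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_source_1 (a INTEGER, b INTEGER, c INTEGER);
    INSERT INTO data_source_1 VALUES (1, 10, 100), (1, 10, 200), (2, 20, 300);
""")


def count_rows(sql):
    """Return the number of rows produced by the given SELECT statement."""
    return conn.execute(f"SELECT COUNT(*) FROM ({sql})").fetchone()[0]


all_rows = count_rows("SELECT * FROM data_source_1")
distinct_full = count_rows("SELECT DISTINCT * FROM data_source_1")
distinct_ab = count_rows("SELECT DISTINCT a, b FROM data_source_1")

# Full rows are unique, but the (a, b) projection is not.
print(all_rows, distinct_full, distinct_ab)
```

Here the table has no duplicate rows (all_rows equals distinct_full), yet projecting onto a and b leaves fewer distinct rows, which matches the pattern seen in the view.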

【Comments】:

    【Solution 2】:

    UNION vs. UNION ALL

    1. UNION: if a row already exists in the top query's result, the same row from the bottom query is suppressed.


    2. UNION ALL: rows are duplicated because the data exists in both tables (both records are shown).

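The difference between the two operators can be sketched as follows. The example assumes SQLite (through Python's sqlite3 module) rather than Redshift, and the tables top_rows and bottom_rows with their contents are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for Redshift; both tables are made up and
# share exactly one row, (2, 20).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE top_rows (a INTEGER, b INTEGER);
    CREATE TABLE bottom_rows (a INTEGER, b INTEGER);
    INSERT INTO top_rows VALUES (1, 10), (2, 20);
    INSERT INTO bottom_rows VALUES (2, 20), (3, 30);
""")

# UNION removes duplicate rows from the combined result.
union_rows = conn.execute("""
    SELECT a, b FROM top_rows
    UNION
    SELECT a, b FROM bottom_rows
    ORDER BY a, b
""").fetchall()

# UNION ALL keeps every row from both queries.
union_all_rows = conn.execute("""
    SELECT a, b FROM top_rows
    UNION ALL
    SELECT a, b FROM bottom_rows
    ORDER BY a, b
""").fetchall()

print(union_rows)
print(union_all_rows)
```

In the view from the question each branch adds a different source literal, so a row from one branch can never equal a row from the other. Note that UNION also removes duplicates that come from a single branch, so switching the view to UNION would hide repeated (a, b) pairs rather than explain them.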

    【Comments】:
