Identifying the first unique value in PostgreSQL: answers

【Question title】: Identifying first unique value in PostgreSQL
【Posted】: 2022-01-19 01:09:35
【Question】:

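I have a table crashes with about one million rows. Each row contains data on either:

every car crash that did not occur near a school, or
every car crash that occurred near a school, with one additional row per school if the crash occurred near more than one school (e.g., a crash near 4 schools takes up 4 rows). The highest number of rows/nearby schools for a single crash is 10.

I want to add a column to the table that returns "1" for exactly one occurrence of each crash_id that appears in multiple rows, and "0" for any subsequent occurrence of the same crash_id in the crash_id column. It does not matter which row of each crash_id gets the 1 or the 0.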

I have tried all the suggestions in the replies to this similar question, but I could not get any of them to work for me.

FWIW, I got this working in Excel with this formula:

=(COUNTIF($C$2:$C2,$C2)=1)+0

But that was on a small table, not one with a million rows.

What I have tried so far:

-- Attempt 1: keep up to five rows per crash_id
SELECT *
FROM
(
    SELECT *, ROW_NUMBER() OVER (PARTITION BY crash_id) AS rn
    FROM crashes
) AS A1
WHERE rn < 6;

-- Attempt 2: keep one (arbitrary) row per crash_id
SELECT *
FROM
(
    SELECT *, ROW_NUMBER() OVER (PARTITION BY crash_id) AS rn
    FROM crashes
) AS A1
WHERE rn = 1;

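I know this is not the best database design, but it gets me most of what I need, except for what I described above.

【Question discussion】:

A minimal reproducible example is a great start when asking SQL questions. Also note that homework-related questions require extra effort.
"...for the first occurrence of each unique crash_id..." -- How do you define which of the 10 rows is the first one? Remember that in a relational database, rows have no inherent insertion order.
First of all, this is a bad database design. There should be a crash table and a crash-at-school table with crash_id as a foreign key. Repeated crash_id values in the crash table are a code smell. What is the unique key of the crash table? If there is none, how will you identify the rows to update?
@jarlh This is not homework. If you would look at what I have tried so far: ' SELECT * FROM ( SELECT * , ROW_NUMBER() OVER(PARTITION BY crash_id) AS row from crash ) AS A1 WHERE row
@TheImpaler In my case it does not matter which one is identified as the first. I just want a column that lets me filter the table so that I only see one row per crash (i.e. WHERE id_unique = '1').

Tags: sql postgresql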

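【Solution 1】:

Just a simple test that adds a first_crash column to crashes.

But it needs something to determine which row is the first, because tables are inherently unordered sets.
The example uses the id for that.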

create table crashes (
 id serial primary key, 
 crash_id int, 
 school_id int
);

alter table crashes 
  add constraint uniq_school_crash unique (crash_id, school_id);

insert into crashes (crash_id, school_id) values
  (101,11), (101,10), (101,12)
, (102,25), (102,24), (102,23)
, (103,null);

alter table crashes
 add column first_crash int default 0;

update crashes c
set first_crash = 1
where first_crash = 0
  and ( 
       school_id is null
    or not exists (
      select 1
      from crashes c2
      where c2.crash_id = c.crash_id
        and c2.id < c.id
    ));

select * from crashes order by id;
id | crash_id | school_id | first_crash
-: | -------: | --------: | ----------:
 1 |      101 |        11 |           1
 2 |      101 |        10 |           0
 3 |      101 |        12 |           0
 4 |      102 |        25 |           1
 5 |      102 |        24 |           0
 6 |      102 |        23 |           0
 7 |      103 |      null |           1

db<>fiddle here

Extra

Update using row_number

-- using row_number
update crashes c
set first_crash = q.rn
from ( 
      select id
      , row_number() over (partition by crash_id 
                           order by id asc) as rn
      from crashes
) q
where q.rn = 1 
  and q.id = c.id;
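
If a stored column is not strictly required, the same flag can also be computed at query time. The following is only a minimal sketch, assuming the crashes table and id column from the example above; the view name crashes_flagged and the column name is_first are hypothetical.

```sql
-- Sketch: compute the flag on the fly instead of storing it.
-- crashes_flagged and is_first are illustrative names, not from the original answer.
create view crashes_flagged as
select c.*,
       case
         when row_number() over (partition by c.crash_id order by c.id) = 1
         then 1
         else 0
       end as is_first
from crashes c;

-- One row per crash_id:
select * from crashes_flagged where is_first = 1;
```

A view like this always reflects the current rows, at the cost of recomputing the window function on every query.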

Using a temporary table

-- using temporary table
create temporary table tmp_crashes (
 id int primary key, 
 crash_id int
);

insert into tmp_crashes (id, crash_id)
select min(id), crash_id
from crashes
group by crash_id
order by min(id);

update crashes t
set first_crash = 1
from tmp_crashes tmp
where tmp.id = t.id;
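
Whichever variant is used, a quick sanity check may help confirm the result. This sketch assumes the first_crash column from the example above and should return no rows when every crash_id has exactly one flagged row.

```sql
-- Sketch: list any crash_id that does not have exactly one row with first_crash = 1.
select crash_id,
       sum(first_crash) as flagged_rows
from crashes
group by crash_id
having sum(first_crash) <> 1;
```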

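【Discussion】:

I tried this on my dataset and it takes hours to execute. Any idea why, or how to speed it up? Could it be that a dataset of 1m rows is too large for a query like this?
What indexes are on the table?
I don't have any...
Whatever SQL trick you use, that is probably one of the reasons it runs slowly.
OK, then I will have to learn how to add indexes. Thanks for your help.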
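
For readers in the same position, here is a minimal sketch of adding an index in PostgreSQL. It assumes the crashes table and id column from the answer; the index name is hypothetical. A composite index on (crash_id, id) matches the lookup used by the not exists variant (c2.crash_id = c.crash_id and c2.id < c.id), but whether the planner actually uses it depends on the data, so it is worth checking with explain.

```sql
-- Sketch: composite index covering the crash_id / id lookup.
-- crashes_crash_id_id_idx is an illustrative name.
create index crashes_crash_id_id_idx
  on crashes (crash_id, id);

-- Refresh planner statistics after building the index.
analyze crashes;

-- Inspect the plan for a lookup of the same shape as the correlated subquery.
explain
select 1
from crashes c2
where c2.crash_id = 101
  and c2.id < 2;
```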