【Question title】: Best way to select records having the more recent date from grouped duplicates
【Posted】: 2017-02-07 13:50:31
【Question description】:

I have a table like this:

CREATE TABLE [dbo].[TestToDelete](
    [id] [int] NULL,
    [Email] [nvarchar](50) NULL,
    [RawEmail] [nvarchar](50) NULL,
    [Status] [tinyint] NULL,
    [ValidationDate] [datetime] NULL
) ON [PRIMARY]

INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (1, N'a@a.ru', N'aaa@a.ru', 11, CAST(N'2017-02-07 14:00:30.300' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (2, N'a@a.ru', N'aaa@a.ru', 11, CAST(N'2017-02-07 14:00:52.347' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (3, N'a@a.ru', N'aaa@a.ru', 11, CAST(N'2017-02-07 14:00:58.117' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (4, N'a@a.ru', N'aaa@a.ru', 22, CAST(N'2017-02-07 14:01:08.360' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (5, N'b@b.ru', N'bbb@b.ru', 11, CAST(N'2017-02-07 14:01:21.783' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (6, N'b@b.ru', N'bbb@b.ru', 11, CAST(N'2017-02-07 14:01:29.310' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (7, N'b@b.ru', N'bbb@b.ru', 22, CAST(N'2017-02-07 14:01:37.050' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (8, NULL, N'bbb@b.ru', 0, CAST(N'2017-02-07 14:02:10.643' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (9, NULL, N'aaa@a.ru', 0, CAST(N'2017-02-07 14:02:22.160' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (10, N'anew@a.ru', N'aaa@a.ru', 11, CAST(N'2017-02-07 15:30:01.637' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (11, N'anew@a.ru', N'aaa@a.ru', 11, CAST(N'2017-02-07 15:30:06.657' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (12, N'anew@a.ru', N'aaa@a.ru', 11, CAST(N'2017-02-07 15:30:12.160' AS DateTime))

I need to select the records (the Email, RawEmail, and Status fields) that occur 3 or more times and have the more recent date. In this table that is

'anew@a.ru | aaa@a.ru | 11'

而不是

'a@a.ru | aaa@a.ru | 11'

because anew@a.ru has the more recent date.

The query that performs this selection:

select * from
(
   select email, rawEmail, Status, 
   ROW_NUMBER() OVER(PARTITION BY rawEmail ORDER BY vdate DESC) num 
   from
      (select max([ValidationDate]) vdate, email, rawEmail, Status
         from TestToDelete where status in (11, 22)
         group by rawEmail, email, status 
         having count(*) > 2
      ) tmp 
)final where num = 1

Is it possible to do this with fewer subqueries (rather than the current 3)?


Update: expected output for 3 or more occurrences:

anew@a.ru | aaa@a.ru | 11

Expected output for 2 or more occurrences:

anew@a.ru | aaa@a.ru | 11
b@b.ru | bbb@b.ru | 11
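
The two expected outputs above differ only in the occurrence threshold. As a minimal sketch, the query from the question can take that threshold from a variable. `@MinOccurrences` is a hypothetical name introduced here for illustration, and the derived-table alias `ranked` replaces the original `final`; everything else follows the question's own query.

```sql
-- @MinOccurrences is an assumed, illustrative variable (not in the original question).
DECLARE @MinOccurrences int = 3;   -- use 2 for the "2 or more" case

SELECT email, rawEmail, Status
FROM (
    -- Rank the qualifying groups inside each rawEmail, newest first.
    SELECT email, rawEmail, Status,
           ROW_NUMBER() OVER (PARTITION BY rawEmail ORDER BY vdate DESC) AS num
    FROM (
        -- One row per (rawEmail, email, Status) group that occurs often enough.
        SELECT MAX(ValidationDate) AS vdate, email, rawEmail, Status
        FROM dbo.TestToDelete
        WHERE Status IN (11, 22)
        GROUP BY rawEmail, email, Status
        HAVING COUNT(*) >= @MinOccurrences
    ) AS tmp
) AS ranked
WHERE num = 1;
```

With the sample data, a threshold of 2 also lets the `b@b.ru | bbb@b.ru | 11` group through, since it is the only qualifying group for its RawEmail.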

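【Comments】:

  • Do you mean 3 or more records that have the latest datetime?
  • @vkp 3 or more identical records, one of which has the latest datetime.
  • @vkp If one tuple has the dates 2010, 2011, 2016 and a second tuple has the dates 1997, 1998 and 2017, I need the second one, because 2017 is more recent.
  • Please show your expected output: is it one record, or 3 records in total?
  • You are about as close as you can get. You have to rank the grouped subset and pick the top-ranked row. Since you can't do both in one query, you need at least two queries.

Tags: sql sql-server group-by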


【Solution 1】:

Using a common table expression (with), row_number(), and count() over().

If we can partition count() by RawEmail, Status, then:

;with cte as (
    select
        rn = row_number() over (
            partition by RawEmail
            order by ValidationDate desc
            )
        , cnt = count(*) over (
            partition by RawEmail, status
            )
        , *
    from TestToDelete
    where status in (11, 22)
    )
select * 
from cte o 
where o.rn=1 
  and o.cnt > 2

Result: http://rextester.com/WYVZ86149

+----+-----+----+-----------+----------+--------+---------------------+
| rn | cnt | id |   Email   | RawEmail | Status |   ValidationDate    |
+----+-----+----+-----------+----------+--------+---------------------+
|  1 |   7 | 12 | anew@a.ru | aaa@a.ru |     11 | 07.02.2017 15:30:12 |
+----+-----+----+-----------+----------+--------+---------------------+

If we cannot partition count() by RawEmail, Status, then:

;with cte as (
    select
        rn = row_number() over (
            partition by RawEmail
            order by ValidationDate desc
            )
        , cnt = count(*) over (
            partition by RawEmail
            )
        , *
    from TestToDelete
    where status in (11, 22)
    )
select * 
from cte o 
where o.rn=1 
  and o.cnt > 2
  and exists (
    select 1
      from cte i 
      where i.RawEmail = o.RawEmail
        and i.Email != o.Email
      )

Result: http://rextester.com/YTQ30810

+----+-----+----+-----------+----------+--------+---------------------+
| rn | cnt | id |   Email   | RawEmail | Status |   ValidationDate    |
+----+-----+----+-----------+----------+--------+---------------------+
|  1 |   7 | 12 | anew@a.ru | aaa@a.ru |     11 | 07.02.2017 15:30:12 |
+----+-----+----+-----------+----------+--------+---------------------+

【Comments】:

  • Thank you very much. The first query does exactly what I want.
【Solution 2】:

Try this:

;WITH CTE AS (
   SELECT *,
          RANK() OVER (PARTITION BY RawEmail
                       ORDER BY ValidationDate DESC) AS rn,
          COUNT(*) OVER (PARTITION BY Email, Status) AS cnt
   FROM TestToDelete
) 
SELECT *
FROM CTE 
WHERE rn = 1 AND cnt >= 3
ORDER BY ValidationDate DESC

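The query uses a common table expression that applies two window functions:

  • RANK picks the latest record per RawEmail, or several records in case of a tie
  • COUNT determines how many rows fall into each Email, Status slice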
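
To make the tie behaviour mentioned above concrete, here is a small self-contained sketch that compares RANK with ROW_NUMBER. The inline rows and the `x@x.ru` address are made up for illustration and are not taken from the question's table.

```sql
-- Illustrative data only: two rows share the latest ValidationDate.
SELECT v.RawEmail,
       v.ValidationDate,
       -- ROW_NUMBER always yields distinct numbers, so one tied row is picked arbitrarily.
       ROW_NUMBER() OVER (PARTITION BY v.RawEmail ORDER BY v.ValidationDate DESC) AS by_row_number,
       -- RANK gives tied rows the same number, so both tied rows get 1.
       RANK()       OVER (PARTITION BY v.RawEmail ORDER BY v.ValidationDate DESC) AS by_rank
FROM (VALUES
        (N'x@x.ru', CAST('2017-02-07T15:30:12' AS datetime)),
        (N'x@x.ru', CAST('2017-02-07T15:30:12' AS datetime)),
        (N'x@x.ru', CAST('2017-02-07T14:00:30' AS datetime))
     ) AS v (RawEmail, ValidationDate);
```

Filtering on `by_rank = 1` keeps both tied rows, while filtering on `by_row_number = 1` keeps exactly one, which is why the choice between the two functions matters when timestamps can repeat.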

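【Comments】:

  • It returns only the top 1. I need all the tuples that occur 3 or more times. But if such tuples share the same RawEmail, I want to use the one with the most recent date.
【Solution 3】:

I believe this will do what you need without a CTE or ROW_NUMBER():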

SELECT TOP 1 email, rawEmail, Status
FROM TestToDelete 
WHERE status IN (11, 22)
GROUP BY rawEmail, email, status 
HAVING COUNT(*) > 2
ORDER BY MAX(validationDate) DESC
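
TOP 1 on its own keeps a single row overall, so it cannot return one row per RawEmail. One possible variation, offered as a sketch and not as the answerer's method, is SQL Server's `TOP (1) WITH TIES` ordered by a `ROW_NUMBER()` that restarts for each RawEmail. Every group that is the newest within its RawEmail then gets the value 1 and ties for first place. This assumes a SQL Server version that supports window functions in ORDER BY.

```sql
-- Sketch: one grouped query, no derived table or CTE.
SELECT TOP (1) WITH TIES email, rawEmail, Status
FROM dbo.TestToDelete
WHERE Status IN (11, 22)
GROUP BY rawEmail, email, Status
HAVING COUNT(*) > 2
-- The window function is evaluated over the grouped rows:
-- the newest group per rawEmail gets 1, and WITH TIES keeps every row that sorts as 1.
ORDER BY ROW_NUMBER() OVER (PARTITION BY rawEmail
                            ORDER BY MAX(ValidationDate) DESC);
```

The trade-off is readability: the filtering logic hides in the ORDER BY clause, so a comment explaining the trick is worth keeping next to the query.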

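【Comments】:

  • But top 1 returns only one row. I need all the rows that occur 3 or more times, and if there are two matching tuples, I want the most recent one.
  • You need to edit your question. It only shows 1 record being returned.
【Solution 4】:

If I understand correctly, your current query does not do what you intend. You need to take the date into account. To get the rows that have 3 records with the same status on the most recent date: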

select ttd.*
from (select ttd.*,
             count(*) over (partition by email, rawemail, cast(ValidationDate as date)) as cnt,
             rank() over (partition by email, rawemail order by cast(ValidationDate as date) desc) as seqnum
      from TestToDelete ttd
      where status in (11, 22)
     ) ttd
where seqnum = 1 and cnt >= 3;

If you only want the email and status, use this select list instead:

select distinct email, rawemail, status
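
Spelled out in full, that substitution might look like the sketch below. It reuses the answer's query and changes only the outer select list; the status filter is written with integer literals here to match the tinyint column.

```sql
select distinct ttd.email, ttd.rawemail, ttd.status
from (select ttd.*,
             -- rows per email/rawemail on the same calendar day
             count(*) over (partition by email, rawemail, cast(ValidationDate as date)) as cnt,
             -- 1 = most recent calendar day for this email/rawemail
             rank() over (partition by email, rawemail order by cast(ValidationDate as date) desc) as seqnum
      from TestToDelete ttd
      where status in (11, 22)
     ) ttd
where seqnum = 1 and cnt >= 3;
```

`distinct` is needed because every row of a qualifying day survives the filter, and those rows repeat the same email, rawemail, and status values.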

Edit:

It occurs to me that you may want to know whether the three most recent records all have the same status. That is easier:

select email, rawemail, max(status)
from (select ttd.*,
             row_number() over (partition by email, rawemail         
                                 order by ValidationDate desc) as seqnum
      from TestToDelete ttd
      where status in (11, 22)
     ) ttd
where seqnum <= 3
group by email, rawemail
having min(status) = max(status)

【Comments】:

  • My current query does exactly what I want. But I want to minimize the number of subqueries.