Abbreviation of Strings that Remains Unique: Answers

【Question title】: Abbreviation of Strings that Remains Unique
【Posted】: 2019-03-11 14:52:00
【Question description】:

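I have a list of unique strings (the original idea was the column names of a table). The task is to abbreviate the list as far as possible while the entries stay distinct.

For example, AAA, AB can be abbreviated to AA, AB (but not to A, AB, because A is a prefix of both AAA and AB). AAAA, BAAAA can be shortened to A, B. But A1, A2 cannot be abbreviated at all.

Here is the sample data: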

create table tab as 
select 'AAA' col from dual union all
select 'AABA' col from dual union all
select 'COL1' col from dual union all
select 'COL21' col from dual union all
select 'AAAAAA' col from dual union all
select 'BBAA' col from dual union all
select 'BAAAA' col from dual union all
select 'AB' col from dual;

The expected result is:

COL    ABR_COL                
------ ------------------------
AAA    AAA                      
AAAAAA AAAA                     
AABA   AAB                      
AB     AB                       
BAAAA  BA                       
BBAA   BB                       
COL1   COL1                     
COL21  COL2
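
The rule behind the expected result can be written down outside SQL as a reference check. The following is a minimal Python sketch, not the SQL solution the question asks for. The function name `shortest_unique_prefixes` is made up for illustration, and the sketch assumes the input names are unique, as stated in the question.

```python
def shortest_unique_prefixes(names):
    """Map each name to its shortest prefix that is a prefix of no other name.

    A name that is itself a prefix of another name has no such prefix
    and keeps its full spelling.
    """
    result = {}
    for name in names:
        others = [other for other in names if other != name]
        result[name] = name  # fallback: no abbreviation possible
        for length in range(1, len(name) + 1):
            prefix = name[:length]
            # The first prefix that no other name starts with is the shortest one.
            if not any(other.startswith(prefix) for other in others):
                result[name] = prefix
                break
    return result


if __name__ == "__main__":
    sample = ["AAA", "AABA", "COL1", "COL21", "AAAAAA", "BBAA", "BAAAA", "AB"]
    for col, abbr in sorted(shortest_unique_prefixes(sample).items()):
        print(col, abbr)
```

Applied to the sample data, this should reproduce the table above, including AAA, which stays unabbreviated because it is a prefix of AAAAAA.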

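I managed a brute-force solution consisting of four subqueries. I am intentionally not posting it, because I hope a simpler solution exists and I don't want to distract from it.

By the way, R has a similar function called abbreviate, but I'm looking for an SQL solution. An Oracle solution is preferred; solutions for other RDBMS are welcome.

【Question discussion】:

Tags: sql r oracle distinct-values

【Solution 1】:

I would do the filtering in the recursive CTE: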

with potential_abbreviations(col, abbr, lev) as (
      -- anchor: every name starts out as its own abbreviation
      select col, col as abbr, 1 as lev
      from tab
      union all
      -- recursive step: drop the last character, but only while the shorter
      -- prefix is not a prefix of any other name
      select pa.col, substr(pa.abbr, 1, length(pa.abbr) - 1) as abbr, lev + 1
      from potential_abbreviations pa
      where length(abbr) > 1 and
            not exists (select 1
                        from tab
                        where tab.col like substr(pa.abbr, 1, length(pa.abbr) - 1) || '%' and
                              tab.col <> pa.col
                       )
     )
-- per name, keep the deepest recursion level, i.e. the shortest abbreviation
select pa.col, pa.abbr
from (select pa.*, row_number() over (partition by pa.col order by pa.lev desc) as seqnum
      from potential_abbreviations pa
     ) pa
where seqnum = 1

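Here is a dbfiddle.

lev is not strictly necessary. You could use length(abbr) (ascending) in the order by instead, since the highest lev corresponds to the shortest abbreviation. However, when I use recursive CTEs I usually include a recursion counter, so this is habit.

Doing the extra comparison inside the CTE may look more complicated, but it simplifies execution: the recursion stops at the right value.

This was also tested on unique single-letter col values.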
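
To make the stopping behaviour concrete, here is a minimal Python sketch that mirrors the recursion: start from the full name and keep dropping the last character until the shorter prefix would also match another name. This is an illustration of the idea only, not part of the answer; the function name `abbreviate_by_shrinking` is hypothetical, and unique input names are assumed.

```python
def abbreviate_by_shrinking(names):
    """Shrink each name from the right until a shorter prefix would collide."""
    result = {}
    for name in names:
        abbr = name  # anchor member: the name itself
        while len(abbr) > 1:
            shorter = abbr[:-1]
            # Same test as the NOT EXISTS subquery: does any *other* name
            # start with the shorter prefix?
            if any(other != name and other.startswith(shorter) for other in names):
                break  # the recursion stops here
            abbr = shorter
        result[name] = abbr
    return result


if __name__ == "__main__":
    sample = ["AAA", "AABA", "COL1", "COL21", "AAAAAA", "BBAA", "BAAAA", "AB"]
    print(abbreviate_by_shrinking(sample))
```

Because every extension of a collision-free prefix is also collision-free, stopping at the first collision yields the shortest usable prefix, which is why no later filtering step is needed.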

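【Discussion】:

【Solution 2】:

This can indeed be done with a recursive CTE. I didn't really manage to get it shorter than three subqueries (plus one query), but at least it is not limited by the length of the strings. The steps are roughly as follows:

Compute all possible abbreviations with a recursive CTE. This selects every column name as itself and then recursively shortens the column name by one letter:

Table: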

 col    abbr
 --- -------
 AAA    AAA
 AAA    AA
 AAA    A
 ...

For each abbreviation, count how often it occurs:

Table:

ABBR    CONFLICT
----    --------
AA      3
AAA     2
AABA    1
...

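Select the abbreviations that are unique, plus the abbreviation that is simply the column name itself, and rank them by length. In the example you can see that AAA conflicts with another abbreviation, but it must still be selected because it equals the unshortened name.

Table: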

COL     ABBR    CONFLICT    POS
-------------------------------
AAA     AAA     2           1
AAAAAA  AAAA    1           1
AAAAAA  AAAAA   1           2
AAAAAA  AAAAAA  1           3
AABA    AAB     1           1
...

For each column, pick the top-ranked abbreviation (or the column name itself).

Table:

COL     ABBR    POS
-------------------
AAA     AAA     1
AAAAAA  AAAA    1
AABA    AAB     1
...
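
The four steps above can be sketched in Python to check the logic independently of the database. This is only an illustration under the assumption of unique input names; `abbreviate_by_counting` is a hypothetical name, and the sketch picks the shortest candidate by length.

```python
from collections import Counter


def abbreviate_by_counting(names):
    """Enumerate all prefixes, count them, then pick the shortest usable one."""
    # Step 1: every (name, prefix) pair, from one character up to the full name.
    candidates = [
        (name, name[:n]) for name in names for n in range(1, len(name) + 1)
    ]
    # Step 2: how many names share each prefix.
    conflict = Counter(abbr for _, abbr in candidates)
    # Steps 3 and 4: keep conflict-free prefixes plus the name itself,
    # and remember the shortest one per name.
    result = {}
    for name, abbr in candidates:
        if conflict[abbr] == 1 or abbr == name:
            if name not in result or len(abbr) < len(result[name]):
                result[name] = abbr
    return result


if __name__ == "__main__":
    sample = ["AAA", "AABA", "COL1", "COL21", "AAAAAA", "BBAA", "BAAAA", "AB"]
    print(abbreviate_by_counting(sample))
```

The `abbr == name` clause plays the same role as the second condition in step three: it guarantees that a name such as AAA, which has no conflict-free prefix, still gets a row.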

Full SQL

This gives the following SQL, with the steps above as CTEs:

with potential_abbreviations(col,abbr) as (
  -- every name, followed by all of its shorter prefixes
  select
      col
    , col as abbr
  from tab
  union all
  select
    col
  , substr(abbr, 1, length(abbr)-1 ) as abbr
  from potential_abbreviations
  where length(abbr) > 1
)
, abbreviation_counts as (
  -- number of names sharing each prefix
  select abbr
       , count(*) as conflict
  from potential_abbreviations
  group by abbr
)
, all_unique_abbreviations(col,abbr,conflict,pos) as (
-- all abbr of one col are prefixes of each other,
-- so ordering by abbr puts the shortest first
select
    p.col
  , p.abbr
  , conflict
  , rank() over (partition by col order by p.abbr) as pos
  from potential_abbreviations p
    join abbreviation_counts c on p.abbr = c.abbr
    where conflict = 1 or p.col = p.abbr
)
select col, abbr, pos
from all_unique_abbreviations
where pos = 1
 order by col, abbr

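Result (AC1 and AD are not in the question's sample data; presumably the fiddle contains these extra rows)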

COL     ABBR
------- ----
AAA     AAA
AAAAAA  AAAA
AABA    AAB
AB      AB
AC1     AC
AD      AD
BAAAA   BA
BBAA    BB
COL1    COL1
COL21   COL2

SQL Fiddle

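【Discussion】:

+1 for the recursive CTE, which saves one subquery.
Please give others some time to try; as I said, I'm looking for the simplest possible solution.
Fair enough :) Maybe I can get rid of one more CTE :)

【Solution 3】:

I found a second approach and am posting it separately rather than adding it to my first answer, because it is shorter and different. The steps are as follows:

Recursively compute all potential abbreviations of each name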

SQL

  select
      col
    , col as abbr
  from tab
  union all
  select
    col
  , substr(abbr, 1, length(abbr)-1 ) as abbr
  from potential_abbreviations a
  where length(abbr) > 1

Result

 col    abbr
 --- -------
 AAA    AAA
 AAA    AA
 AAA    A
 ...

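Then count the conflicts between abbreviations, and also keep track of the column name that led to each abbreviation. We only want to keep abbreviations that cause no conflict, so the min() aggregate is of no consequence: with count(*) = 1 there is exactly one col per group.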

SQL

select
    abbr
  , count(*) as conflicts
  , min(col) as best_candidate
  from potential_abbreviations
 group by abbr
having count(*) = 1

Result

ABBR    CONFLICTS BEST_CANDIDATE
------- --------- ---------------
AAAA    1         AAAAAA
AAAAA   1         AAAAAA
AAAAAA  1         AAAAAA
AAB     1         AABA
AABA    1         AABA
...

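Finally, left join the potential abbreviations with the conflict-free candidates, and fall back to the column name itself if there is no conflict-free abbreviation: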

SQL

select
    p.col as col
  , nvl(min(c.abbr), p.col) as abbr
  from potential_abbreviations p
  left join conflict_free c on p.col = c.best_candidate
 where c.conflicts = 1 or p.abbr = p.col
 group by p.col
  order by col, abbr
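
The conflict-free lookup and the fallback can be sketched in Python as follows. This is an illustration only, assuming unique input names; `abbreviate_with_fallback` is a hypothetical name. `min(..., default=name)` stands in for `nvl(min(c.abbr), p.col)`: all candidates of one name are prefixes of each other, so the smallest string is also the shortest.

```python
from collections import defaultdict


def abbreviate_with_fallback(names):
    """Shortest conflict-free prefix per name, or the name itself if none exists."""
    # Which names produce each prefix?
    owners = defaultdict(list)
    for name in names:
        for n in range(1, len(name) + 1):
            owners[name[:n]].append(name)
    # Equivalent of "having count(*) = 1": prefix -> its single owner.
    conflict_free = {
        abbr: cols[0] for abbr, cols in owners.items() if len(cols) == 1
    }
    result = {}
    for name in names:
        mine = [abbr for abbr, owner in conflict_free.items() if owner == name]
        # Left join plus nvl: fall back to the full name when nothing matches.
        result[name] = min(mine, default=name)
    return result


if __name__ == "__main__":
    sample = ["AAA", "AABA", "COL1", "COL21", "AAAAAA", "BBAA", "BAAAA", "AB"]
    print(abbreviate_with_fallback(sample))
```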

Full SQL

with potential_abbreviations(col,abbr) as (
  -- every name, followed by all of its shorter prefixes
  select
      col
    , col as abbr
  from tab
  union all
  select
    col
  , substr(abbr, 1, length(abbr)-1 ) as abbr
  from potential_abbreviations a
 where length(abbr) > 1
)
, conflict_free as (
    -- prefixes that belong to exactly one name; min(col) is that name
    select
        abbr
      , count(*) as conflicts
      , min(col) as best_candidate
      from potential_abbreviations
     group by abbr
    having count(*) = 1
)
-- shortest conflict-free prefix per name, falling back to the name itself
select
    p.col as col
  -- , c.best_candidate
  , nvl(min(c.abbr), p.col) as abbr
  -- , min(c.abbr) over (partition by c.best_candidate) shortest
  from potential_abbreviations p
  left join conflict_free c on p.col = c.best_candidate
 where c.conflicts = 1 or p.abbr = p.col
 group by p.col, c.best_candidate
 order by col, abbr

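Result (AC1 and AD are not in the question's sample data; presumably the fiddle contains these extra rows)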

COL     ABBR
------- ----
AAA     AAA
AAAAAA  AAAA
AABA    AAB
AB      AB
AC1     AC
AD      AD
BAAAA   BA
BBAA    BB
COL1    COL1
COL21   COL2

SQL Fiddle

Note: in PostgreSQL the recursive CTE must be introduced with with recursive, whereas Oracle does not accept the recursive keyword at all.
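
As a small portability check, the query can also be run in SQLite through Python's built-in sqlite3 module. SQLite is not mentioned in this thread, so treat the following as a sketch under that assumption: it uses the with recursive spelling, which SQLite accepts, and the standard coalesce in place of Oracle's nvl. The helper name `run_in_sqlite` is made up for illustration.

```python
import sqlite3

# Sample names from the question.
NAMES = ["AAA", "AABA", "COL1", "COL21", "AAAAAA", "BBAA", "BAAAA", "AB"]

# Solution 3, adapted: "with recursive" and coalesce() instead of nvl().
QUERY = """
with recursive potential_abbreviations(col, abbr) as (
  select col, col as abbr
  from tab
  union all
  select col, substr(abbr, 1, length(abbr) - 1) as abbr
  from potential_abbreviations
  where length(abbr) > 1
)
, conflict_free as (
  select abbr, count(*) as conflicts, min(col) as best_candidate
  from potential_abbreviations
  group by abbr
  having count(*) = 1
)
select p.col as col, coalesce(min(c.abbr), p.col) as abbr
from potential_abbreviations p
left join conflict_free c on p.col = c.best_candidate
where c.conflicts = 1 or p.abbr = p.col
group by p.col, c.best_candidate
order by col, abbr
"""


def run_in_sqlite(names):
    """Load the names into an in-memory table and return (col, abbr) rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table tab (col text)")
        conn.executemany("insert into tab (col) values (?)", [(n,) for n in names])
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for col, abbr in run_in_sqlite(NAMES):
        print(col, abbr)
```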

【Discussion】: