【Posted】: 2015-09-09 06:01:24
【Question】:
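I have a "dictionary" in PostgreSQL 9.3. I want to fetch all terms and split them across pages by triples (the first three characters), with at most 30 terms per page. No triple may be split between pages: for example, the first page should contain the terms from "aaa" to "aaf", the second from "aag" to "aan", but no page should contain only part of a triple's terms.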
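To make the requirement concrete, here is a minimal Python sketch of the paging rule (it is an illustration, not the asker's code). The function name `paginate` and its parameters are hypothetical. It assumes the terms fit in memory and that a triple with more terms than the page size is simply given an oversized page of its own.

```python
from itertools import groupby


def paginate(terms, page_size=30, prefix_len=3):
    """Pack terms into pages without ever splitting a prefix group."""
    def key(term):
        return term[:prefix_len].lower()

    pages, current = [], []
    # groupby needs its input sorted by the same key
    for _, group in groupby(sorted(terms, key=key), key=key):
        group = list(group)
        # Start a new page if this whole group would not fit on the current one
        if current and len(current) + len(group) > page_size:
            pages.append(current)
            current = []
        current.extend(group)
    if current:
        pages.append(current)
    return pages


# Hypothetical sample data with a tiny page size to show the behaviour
sample = ["aaa1", "aaa2", "aab1", "aac1", "aac2"]
pages = paginate(sample, page_size=3)
print(pages)
```

With a page size of 3, the two "aac" terms move to the second page together instead of being split.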
Here is the query I have so far:
WITH results AS (
    WITH terms AS (
        WITH triples AS (
            -- 1. triples with cumulative numbers of appearances:
            SELECT
                LOWER(substring("term" FROM 1 FOR 3)) AS triple,
                ROW_NUMBER() OVER(PARTITION BY LOWER(substring("term" FROM 1 FOR 3))) AS rnum
            FROM terms
            GROUP BY triple, "term"
        )
        -- 2. GROUPs by rnum, removes triple duplicates:
        SELECT
            triples.triple,
            MAX(triples.rnum) AS amount
        FROM triples
        GROUP BY triples.triple
    )
    -- 3. makes { "triple": triple, "amount": amount },
    -- assigns "page number" (~30 per page):
    SELECT
        COALESCE(substring(terms.triple FROM 1 FOR 1), '') AS first,
        ('{ "triple": "' || COALESCE(terms.triple, '') || '", "amount": ' || terms.amount || ' }')::json AS terms,
        (sum((terms.amount)::int) OVER (ORDER BY terms.triple)) / 30 AS chunk
    FROM terms
    GROUP BY first, terms.triple, terms.amount
    ORDER BY first, terms.triple
)
-- 4. collects "page" triples into rows:
SELECT
    first,
    COALESCE(json_agg(results.terms), ('{ "triple" :' || NULL || ', "amount":' || 1 || '}')::json) AS triplesdata,
    sum((results.terms->>'amount')::int) AS sum,
    chunk
FROM results
GROUP BY first, chunk
ORDER BY results.first, json_agg(results.terms)->0->>'triple'
To be clear, SELECT #1 gives me:
triple | rnum
--------+------
аар | 1
аба | 1
абе | 1
абе | 2
аби | 1
аби | 2
абл | 1
...
SELECT #2 gives me all the triples and the number of words that start with each of them:
triple | amount
--------+--------
аар | 1
аба | 1
абе | 2
аби | 2
абл | 1
або | 1
абс | 1
...
SELECT #3 gives me almost the same information, but the triples are now JSON and a chunk number column has been added:
first | terms | chunk
-------+----------------------------------+-------
а | { "triple": "аар", "amount": 1 } | 0
а | { "triple": "аба", "amount": 1 } | 0
а | { "triple": "абе", "amount": 2 } | 0
а | { "triple": "аби", "amount": 2 } | 0
а | { "triple": "абл", "amount": 1 } | 0
а | { "triple": "або", "amount": 1 } | 0
а | { "triple": "абс", "amount": 1 } | 0
...
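The chunk column comes from integer-dividing a running total of amount by 30. The Python sketch below mimics that arithmetic. The counts in `counts` are made up for illustration and are not the real data; `assign_chunks` is a hypothetical helper name.

```python
from itertools import accumulate

PAGE_SIZE = 30

# Hypothetical (triple, amount) pairs, already sorted by triple
counts = [("aaa", 12), ("aab", 10), ("aac", 7), ("aad", 69), ("aae", 5)]


def assign_chunks(pairs, page_size=PAGE_SIZE):
    """Mimic sum(amount) OVER (ORDER BY triple) / 30 with integer division."""
    running = accumulate(amount for _, amount in pairs)
    return [(triple, amount, total // page_size)
            for (triple, amount), total in zip(pairs, running)]


chunks = assign_chunks(counts)
for row in chunks:
    print(row)
```

In this sketch the running totals are 12, 22, 29, 98 and 103, so the chunk numbers are 0, 0, 0, 3 and 3. A triple holding more than 30 terms still lands whole in a single chunk, so a chunk can exceed the limit, and chunk numbers may skip values.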
The whole query gives me:
first | triplesdata | sum | chunk
-------+-----------------------------------------------+-----+-------
а | [{ "triple": "аар", "amount": 1 } ...(others) | 28 | 0
a | [{ "triple": "аве", "amount": 5 } ...(others) | 30 | 1
...
д | [{ "triple": "доб", "amount": 69 }, ... | 89 | 138
...
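I can work with this, but some chunks contain too much data: some triples should be broken down into "quadruples", and deeper still into "multiples".
I wrote a Python script that does this job recursively.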
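The script itself is not shown in the question. As a rough idea of what such a recursion could look like, here is a minimal sketch under stated assumptions: the function name `split_by_prefix` and its parameters are hypothetical, the terms fit in memory, and a group is split further only while it exceeds the limit and at least one of its terms is longer than the current prefix.

```python
from collections import defaultdict


def split_by_prefix(terms, limit=30, length=3):
    """Group terms by a prefix of `length` characters and recurse with a
    longer prefix for every group that holds more than `limit` terms."""
    groups = defaultdict(list)
    for term in terms:
        groups[term[:length].lower()].append(term)

    result = {}
    for prefix, members in sorted(groups.items()):
        # Stop when the group fits, or when no term is long enough
        # for a longer prefix to separate anything
        if len(members) <= limit or all(len(t) <= length for t in members):
            result[prefix] = members
        else:
            result.update(split_by_prefix(members, limit, length + 1))
    return result


# Hypothetical sample data with a tiny limit to force one extra level
sample = ["doba", "dobb", "dobc", "dock", "doe"]
split = split_by_prefix(sample, limit=2)
print(split)
```

Here the "dob" group has three terms, which is over the limit of 2, so it is broken into the four-character groups "doba", "dobb" and "dobc", while "doc" and "doe" stay as triples.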
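But I'm curious: is it possible to do this recursive work in PostgreSQL itself?
Another question: which index (or indexes) would be best for the terms.term column?
And one more question: what am I doing wrong? I'm fairly new to SQL.
Update: there is no accepted answer so far, because none of the answers addresses my question. Yes, I'm using the Python script for now, but I would still like to get some answers.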
【Comments】:
-
I think you could simplify your query a lot by using the function show_trgm(text) from the additional supplied module "pg_trgm": postgresql.org/docs/9.1/static/pgtrgm.html
-
Thanks, I'll dig into that.
-
I couldn't figure out how to simplify it. :(
-
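You don't need to "nest" the CTEs; you can write them one after another:
with cte_1 as ( ...), cte_2 as (...), cte_3 as (...) select ...
CTEs support recursive queries; maybe that is what you are looking for. Could you post the complete create table statements for the tables involved, including some sample data (ideally as insert into statements)?
-
With this nested form, in pgAdmin's SQL window I can select (with the mouse, I mean) and execute the selects one by one, from the innermost to the outermost. Thanks, I'll try to prepare some test data and post it here later.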
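The chained (non-nested) CTE form suggested in the comments can be sketched as follows. This is only an illustration: it assumes SQLite (through Python's standard sqlite3 module) as a stand-in for PostgreSQL so that it runs without a server, uses made-up sample terms, and counts terms per triple with count(*), which matches the MAX(rnum) approach only if every term appears once.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for PostgreSQL;
# the "WITH a AS (...), b AS (...) SELECT ..." syntax is the same in both.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (term TEXT)")
conn.executemany(
    "INSERT INTO terms VALUES (?)",
    [("abacus",), ("abate",), ("abbey",), ("abbot",), ("able",)],  # made-up data
)

query = """
WITH triples AS (      -- step 1: first three characters of each term
    SELECT lower(substr(term, 1, 3)) AS triple
    FROM terms
),
counts AS (            -- step 2: number of terms per triple
    SELECT triple, count(*) AS amount
    FROM triples
    GROUP BY triple
)
SELECT triple, amount
FROM counts
ORDER BY triple
"""

rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

Each CTE can refer to the ones defined before it, so the steps read top to bottom instead of inside out.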
Tags: json postgresql postgresql-9.3 recursive-query window-functions