【Posted】: 2022-01-15 15:57:28
【Problem description】:
I have a token_count table:

docid | tokenid | tf | log_ave_tf
    1 |       1 |  1 | null
    1 |       2 |  2 | null
    2 |       1 |  3 | null
    2 |       2 |  1 | null
The log_ave_tf column was added with:

ALTER TABLE token_count
ADD COLUMN log_ave_tf real;
I am trying to compute the values for the log_ave_tf column using the following formula:
log_ave_tf = (1 + log(tf)) / (1 + log(average tf for document))
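As a sanity check, the formula (with the base-2 log that the UPDATE statement uses) can be evaluated by hand for the sample rows above. This is a minimal Python sketch, not part of the original question:

```python
import math

def log_ave_tf(tf, avg_tf):
    # (1 + log2(tf)) / (1 + log2(average tf for the document)),
    # matching the base-2 log(2, ...) used in the UPDATE statement
    return (1 + math.log2(tf)) / (1 + math.log2(avg_tf))

# From the sample table: doc 1 has tf values [1, 2] (avg 1.5),
#                        doc 2 has tf values [3, 1] (avg 2.0).
print(round(log_ave_tf(2, 1.5), 4))  # 1.2619
print(round(log_ave_tf(3, 2.0), 4))  # 1.2925
```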
This is the query I am running:
UPDATE token_count tc
SET log_ave_tf = (1 + log(2, tf)) / (1 + log(2, subquery.avg_tf))
FROM (
    SELECT docid, avg(tf) AS avg_tf
    FROM token_count
    GROUP BY docid
) subquery
WHERE subquery.docid = tc.docid;
On a 1,000-document dataset this ran in about a minute and a half. I then tried it on a 100,000-document dataset (36 million rows in token_count) and had to cancel the query after 5 hours. Eventually I need it to work on a 4-million-document dataset. Is there a way to optimize this query so it doesn't take so long?
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) on the 1,000-document dataset:
"Update on token_count tc (cost=37563.77..92185.42 rows=1128913 width=94) (actual time=89287.844..89287.847 rows=0 loops=1)"
" Buffers: shared hit=2319962 read=13056 dirtied=17040 written=922"
" -> Hash Join (cost=37563.77..92185.42 rows=1128913 width=94) (actual time=768.179..83652.020 rows=1128913 loops=1)"
" Hash Cond: (tc.docid = subquery.docid)"
" Buffers: shared hit=32402 read=8796 dirtied=1 written=922"
" -> Seq Scan on token_count tc (cost=0.00..31888.13 rows=1128913 width=30) (actual time=0.089..702.652 rows=1128913 loops=1)"
" Buffers: shared hit=16206 read=4393 written=922"
" -> Hash (cost=37552.67..37552.67 rows=888 width=96) (actual time=767.982..767.983 rows=1001 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 93kB"
" Buffers: shared hit=16196 read=4403 dirtied=1"
" -> Subquery Scan on subquery (cost=37532.69..37552.67 rows=888 width=96) (actual time=766.111..767.517 rows=1001 loops=1)"
" Buffers: shared hit=16196 read=4403 dirtied=1"
" -> HashAggregate (cost=37532.69..37543.79 rows=888 width=36) (actual time=766.105..767.119 rows=1001 loops=1)"
" Group Key: token_count.docid"
" Batches: 1 Memory Usage: 321kB"
" Buffers: shared hit=16196 read=4403 dirtied=1"
" -> Seq Scan on token_count (cost=0.00..31888.13 rows=1128913 width=8) (actual time=0.010..231.895 rows=1128913 loops=1)"
" Buffers: shared hit=16196 read=4403 dirtied=1"
"Planning Time: 0.222 ms"
"Execution Time: 89288.014 ms"
【Comments】:
- Please update your question with the results of running EXPLAIN for your SQL statement.
- It sounds like you are missing an index.
- The change you are making is a denormalization. Is there a reason you can't store the log_ave_tf value in a separate table keyed only by docid?
- Heavy in-place updates of a relation are discouraged for structural reasons, such as HOT updates exhausting the space on heap pages when the relation has indexes. Video: cybertec-postgresql.com/en/…
- @dai What kind of index, and on which columns, would be suitable in this case? I would add the log_ave_tf column to the document table.
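The normalization suggested in the comments (store the per-document denominator once, keyed by docid, instead of updating every token_count row in place) can be sketched as follows. This is an illustrative sketch only, using SQLite from Python with a user-defined log2 function standing in for Postgres's log(2, x); the doc_norm table name is hypothetical:

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("log2", 1, math.log2)  # stand-in for Postgres log(2, x)
cur = conn.cursor()

cur.execute("CREATE TABLE token_count (docid INTEGER, tokenid INTEGER, tf REAL)")
cur.executemany("INSERT INTO token_count VALUES (?, ?, ?)",
                [(1, 1, 1), (1, 2, 2), (2, 1, 3), (2, 2, 1)])

# One small row per document (the denominator of the formula),
# rather than one in-place UPDATE per token_count row.
cur.execute("""
    CREATE TABLE doc_norm AS
    SELECT docid, 1 + log2(avg(tf)) AS denom
    FROM token_count
    GROUP BY docid
""")

# log_ave_tf is then derived at query time with a join.
rows = cur.execute("""
    SELECT tc.docid, tc.tokenid,
           (1 + log2(tc.tf)) / dn.denom AS log_ave_tf
    FROM token_count tc
    JOIN doc_norm dn ON dn.docid = tc.docid
    ORDER BY tc.docid, tc.tokenid
""").fetchall()

for docid, tokenid, val in rows:
    print(docid, tokenid, round(val, 4))
```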
Tags: sql postgresql