[Posted at]: 2022-01-14 03:53:58
[Problem description]:
I am trying to populate two tables:
token:
word    | df (the number of documents containing a word)
=========================================================
"dog"   | 5
"cat"   | 2
"horse" | 1
token_count:
tokenid | docid | tf (the number of times a word occurs in a document)
=======================================================================
1       | 1     | 6
2       | 2     | 2
3       | 2     | 1
using data from the documents table:
id | title | body
=============================
1 | "dog" | "about dogs"
2 | "cats" | "about cats"
To do this I use ts_stat( 'select to_tsvector(''english'', body) from documents' ), which returns a table containing each word, the number of documents that contain it, and the number of times it occurs in the entire column. The second column is exactly what I need for the token table, but the third column counts occurrences across the whole column rather than per document.
word  | ndoc | nentry
=====================
dog   | 5    | 6
cat   | 2    | 2
horse | 1    | 1
This code populates the token table and completes in 3 seconds for a hundred documents:
INSERT INTO token (word, document_frequency)
SELECT
    word,
    ndoc
FROM
    ts_stat('select to_tsvector(''english'', body) from documents');
I tried running the following code on a smaller dataset of 15 documents and it works, but when I run it against the current dataset (100 documents) it never stops running.
WITH temp_data AS (
    SELECT
        id,
        (ts_stat('select to_tsvector(''english'', body) from documents where id=' || id)).*
    FROM documents
)
INSERT INTO token_count (docid, tokenid, tf)
SELECT
    id,
    (SELECT id FROM token WHERE word = temp_data.word LIMIT 1),
    nentry
FROM temp_data;
How can I optimize this query?
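For reference, this is the kind of single-pass rewrite I am hoping for (a rough sketch, not tested at scale; it assumes token.word stores the lexemes produced by to_tsvector, that word is unique in token — the LIMIT 1 above suggests it may not be — and that the server is PostgreSQL 9.6 or newer so unnest(tsvector) is available):
-- Sketch only: build each document's tsvector once, unnest it to get
-- per-document term frequencies, and join on token.word instead of
-- running a correlated subquery for every row.
INSERT INTO token_count (docid, tokenid, tf)
SELECT
    d.id,
    t.id,
    coalesce(array_length(lex.positions, 1), 1) AS tf
FROM documents d
CROSS JOIN LATERAL unnest(to_tsvector('english', d.body))
       AS lex(word, positions, weights)
JOIN token t ON t.word = lex.word;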
EXPLAIN ANALYZE for the 15-document dataset:
"Insert on token_count (cost=1023803.22..1938766428.23 rows=9100000 width=28) (actual time=59875.204..59875.206 rows=0 loops=1)"
" CTE temp_data"
" -> Result (cost=0.00..1023803.22 rows=9100000 width=44) (actual time=0.144..853.320 rows=42449 loops=1)"
" -> ProjectSet (cost=0.00..45553.23 rows=9100000 width=36) (actual time=0.142..809.366 rows=42449 loops=1)"
" -> Seq Scan on wikitable (cost=0.00..19.10 rows=910 width=4) (actual time=0.010..0.029 rows=16 loops=1)"
" -> CTE Scan on temp_data (cost=0.00..1937742625.00 rows=9100000 width=28) (actual time=0.509..59652.279 rows=42449 loops=1)"
" SubPlan 2"
" -> Limit (cost=0.00..212.92 rows=1 width=4) (actual time=1.381..1.381 rows=1 loops=42449)"
" -> Seq Scan on token (cost=0.00..425.84 rows=2 width=4) (actual time=1.372..1.372 rows=1 loops=42449)"
" Filter: ((word)::text = temp_data.word)"
" Rows Removed by Filter: 10384"
"Planning Time: 0.202 ms"
"Execution Time: 59876.350 ms"
EXPLAIN ANALYZE for the 30-document dataset:
"Insert on token_count (cost=1023803.22..6625550803.23 rows=9100000 width=28) (actual time=189910.438..189910.439 rows=0 loops=1)"
" CTE temp_data"
" -> Result (cost=0.00..1023803.22 rows=9100000 width=44) (actual time=0.191..2018.758 rows=92168 loops=1)"
" -> ProjectSet (cost=0.00..45553.23 rows=9100000 width=36) (actual time=0.189..1919.726 rows=92168 loops=1)"
" -> Seq Scan on wikitable (cost=0.00..19.10 rows=910 width=4) (actual time=0.013..0.053 rows=31 loops=1)"
" -> CTE Scan on temp_data (cost=0.00..6624527000.00 rows=9100000 width=28) (actual time=1.009..189412.022 rows=92168 loops=1)"
" SubPlan 2"
" -> Limit (cost=0.00..727.95 rows=1 width=4) (actual time=2.029..2.029 rows=1 loops=92168)"
" -> Seq Scan on token (cost=0.00..727.95 rows=1 width=4) (actual time=2.020..2.020 rows=1 loops=92168)"
" Filter: ((word)::text = temp_data.word)"
" Rows Removed by Filter: 16463"
"Planning Time: 0.234 ms"
"Execution Time: 189913.688 ms"
[Discussion]:
- How long is "never"? You could add an EXPLAIN ANALYZE, as was done in this question: How do I increase the speed of my Postgres statement?
- Yesterday I tried running that code on a dataset of 100 000 documents and had to kill it after an hour and a half. I have just added the EXPLAIN ANALYZE for the 15-document dataset and will now do the same for 30 documents.
- Adding an index on column word of table token might solve part of the problem (see the sketch after these comments).
- The query will always be slow unless the field is covered by a full-text search index. Even then, you are trying to split and count every word in that index rather than find specific matches. That will be slow.
- What are you trying to do? If you want to rank results, you can use ts_rank.
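Regarding the index suggestion above, a minimal sketch (assuming token.word is the varchar column visible in the plans; the actual schema is not shown in the post):
-- Let the per-word lookup use an index scan instead of the sequential
-- scan (10k+ rows removed by filter per loop) seen in the plans above.
CREATE INDEX token_word_idx ON token (word);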
Tags: sql postgresql tf-idf tsvector