PostgreSQL - full text search index - unexpected query results

【Question title】: PostgreSQL - full text search index - unexpected query results
【Posted】: 2015-01-21 00:11:44
【Question】:

I have a table with a bunch of columns, and I created a full-text index on it like this:

CREATE INDEX phrasetable_exp_idx ON msc.mytable 
USING gin(to_tsvector('norwegian', coalesce(msc.mytable.col1,'') || ' ' || 
                 coalesce(msc.mytable.col2,'') || ' ' || 
                 coalesce(msc.mytable.col3,'') || ' ' ||
                 coalesce(msc.mytable.col4,'') || ' ' ||
                 coalesce(msc.mytable.col5,'') || ' ' ||
                 coalesce(msc.mytable.col6,'') || ' ' ||
                 coalesce(msc.mytable.col7,'')));

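I tried a few searches and they are lightning fast. For one particular search, however, I don't get the expected result. My table has a row where col1 and col2 both hold the exact value "Importkompetanse Oslo AS", and col3 holds "9999". Only the to_tsquery('9999') query returns that row, which confirms that it really does have "Importkompetanse Oslo AS" in col1 and col2, yet the first two queries return no matches.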

SELECT *
FROM msc.mytable
WHERE to_tsvector('norwegian', coalesce(msc.mytable.col1,'') || ' ' || 
                 coalesce(msc.mytable.col2,'') || ' ' || 
                 coalesce(msc.mytable.col3,'') || ' ' ||
                 coalesce(msc.mytable.col4,'') || ' ' ||
                 coalesce(msc.mytable.col5,'') || ' ' ||
                 coalesce(msc.mytable.col6,'') || ' ' ||
                 coalesce(msc.mytable.col7,''))
@@ --to_tsquery('Importkompetanse&Oslo&AS') -- nada
   plainto_tsquery('Importkompetanse') -- nada
   --to_tsquery('9999') -- OK!

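Does anyone know why my search returns no results?

EDIT:

For some reason, to_tsquery returns something like this: "'9999':9 'importkompetans':1,6". It looks like the word importkompetanse has been cut off?

However, if I set the configuration to simple instead of norwegian, I get the expected results and everything looks fine. Why is that?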

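【Question comments】:

Tags: postgresql full-text-search full-text-indexing

【Solution 1】:

You are mixing configurations between your tsvector and tsquery values. You should use the same configuration for both, for example: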

select to_tsvector('norwegian', 'Importkompetanse Oslo AS')
       @@ to_tsquery('norwegian', 'Importkompetanse&Oslo&AS');


That is why it worked with the 'simple' configuration (which is evidently your default).
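To make the mismatch concrete, here is a small sketch that runs the same text against a query built with the matching configuration and one built with 'simple'. It uses only literals, so no table is needed. The column aliases are made up for illustration, and the remarks in the comments rely on the lexeme 'importkompetans' reported in the question rather than on output I have verified.

```sql
-- Same document text each time; only the configuration given to to_tsquery() differs.
SELECT
    -- Both sides stemmed by the 'norwegian' configuration: the lexemes should line up.
    to_tsvector('norwegian', 'Importkompetanse Oslo AS')
        @@ to_tsquery('norwegian', 'Importkompetanse & Oslo & AS') AS same_config,
    -- Vector stemmed with 'norwegian', query left unstemmed by 'simple':
    -- 'importkompetans' (stored) vs 'importkompetanse' (searched) should not match.
    to_tsvector('norwegian', 'Importkompetanse Oslo AS')
        @@ to_tsquery('simple', 'Importkompetanse & Oslo & AS')    AS mixed_config,
    -- Inspect what each side actually contains.
    to_tsvector('norwegian', 'Importkompetanse') AS stored_lexemes,
    to_tsquery('simple', 'Importkompetanse')     AS simple_query;
```

A purely numeric token such as '9999' is not changed by stemming, which would explain why that query matched even with mixed configurations.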

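Note: you can always use ts_debug() to debug text search. For example, 'Importkompetanse' is not being cut off; 'importkompetans' is simply the appropriate lexeme for that word (in the 'norwegian' configuration).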
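A minimal sketch of such a ts_debug() call, using the sample text from the question. Running it under both configurations side by side should show how each one turns tokens into lexemes; no output is listed here because it depends on the installed dictionaries.

```sql
-- How the 'norwegian' configuration tokenizes and normalizes the text.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('norwegian', 'Importkompetanse Oslo AS');

-- The same text under 'simple', which lowercases tokens but does not stem them.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('simple', 'Importkompetanse Oslo AS');
```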

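Off-topic: you are using a very long, expression-based index, which will only be used when your queries contain exactly the same expression. You did that correctly in your examples, but it makes your queries very long, and if you change the index expression later you have to make sure every place that uses it is updated too.

You can use a simple (SQL) function to shorten your queries: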

create or replace function col_tsvector(mytable)
  returns tsvector
  immutable
  language sql
  as $function$
select to_tsvector('norwegian',
  coalesce($1.col1, '') || ' ' || 
  coalesce($1.col2, '') || ' ' || 
  coalesce($1.col3, '') || ' ' ||
  coalesce($1.col4, '') || ' ' ||
  coalesce($1.col5, '') || ' ' ||
  coalesce($1.col6, '') || ' ' ||
  coalesce($1.col7, ''))
$function$;

This way you can greatly simplify both the index definition and your queries. (You can even use attribute notation.)
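As a rough sketch of what that simplification could look like, assuming col_tsvector() was created with the row type of msc.mytable as its argument and that both the function and the table are reachable through the current search_path. The index name mytable_fts_idx is hypothetical and chosen only to avoid clashing with the existing index.

```sql
-- Index on the function result instead of the long concatenation.
-- (mytable_fts_idx is a made-up name for this example.)
CREATE INDEX mytable_fts_idx ON msc.mytable
USING gin (col_tsvector(mytable));

-- The query repeats the same short expression, with an explicit configuration
-- on the tsquery side so both sides are normalized the same way.
SELECT *
FROM msc.mytable
WHERE col_tsvector(mytable) @@ to_tsquery('norwegian', 'Importkompetanse & Oslo & AS');

-- Attribute notation: a function taking the row type can be written like a column.
SELECT *
FROM msc.mytable
WHERE mytable.col_tsvector @@ plainto_tsquery('norwegian', 'Importkompetanse');
```

If the concatenated columns ever change, only the function body and the index need rebuilding; the queries themselves stay the same.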

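【Comments】:

Hi, thanks for your valuable input, I'm new to postgres :) I'm not sure what you mean by mixing configurations? What I'm trying to do is create an index that matches the phrase, all of the words, or some of the words, and ranks the matches in that order. Will my index do that?
Also, importkompetans is not a valid word in Norwegian. We have compound words, and importkompetanse is the valid word here. Importkompetans is like "import competenc" in English, with the last letter missing.
@LarsAnundskås by mixing I mean that you should use exactly the same configuration for the to_tsvector() and to_tsquery() calls, as in my example. You omitted the configuration parameter in your call to to_tsquery(), which means PostgreSQL uses the default configuration for that call. And importkompetans is not meant to be a word; it is the lexeme of a word. Lexemes help the full-text search engine recognize similar forms such as plurals and inflections.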
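To see which configuration the one-argument forms fall back to, a short sketch along these lines may help. It relies on the default_text_search_config setting; changing it for the session is shown only as an illustration, and passing the configuration explicitly, as in the answer, remains the more robust choice.

```sql
-- Which configuration do to_tsquery(text) and plainto_tsquery(text) use by default?
SHOW default_text_search_config;

-- Optionally change it for the current session only.
SET default_text_search_config = 'pg_catalog.norwegian';

-- With the session default changed, the one-argument call should now stem
-- the query terms the same way the 'norwegian' index expression does.
SELECT to_tsquery('Importkompetanse & Oslo & AS');
```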