Can PostgreSQL's to_tsvector function return tokens/words and not lexemes? Answers

【Question title】: Can PostgreSQL's to_tsvector function return tokens/words and not lexemes?
【Posted】: 2017-10-11 11:26:41
【Question description】:

PostgreSQL's to_tsvector function is really useful, but for my data set it does a little more than I want.

For example:

select * 
from to_tsvector('english', 'This is my favourite game. I enjoy everything about it.');

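produces: 'enjoy':7 'everyth':8 'favourit':4 'game':5

I'm not bothered that the stop words get filtered out, that's fine. But some words get completely mangled, such as everything and favourite.

Is there a way to change this behaviour, or is there a different function that does this?

PS: Yes, I can write my own query that does this (and I have), but I want a faster method.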

【Question comments】:

Tags: postgresql nlp lemmatization

【Solution 1】:

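The behaviour you are seeing and do not want is "stemming". If you do not want that, you have to call to_tsvector with a different text search configuration. The "simple" configuration, which uses the simple dictionary, does no stemming, so it should fit your use case.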

select * 
from to_tsvector('simple', 'This is my favourite game. I enjoy everything about it.');

produces the following output

'about':9 'enjoy':7 'everything':8 'favourite':4 'game':5 'i':6 'is':2 'it':10 'my':3 'this':1
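
To see where the difference comes from, ts_lexize can be called on a single dictionary with a single word. This is only a small sketch, and it assumes the built-in english_stem and simple dictionaries of a default installation, which are what the english and simple configurations normally map plain words to.

```sql
-- Snowball stemmer behind the 'english' configuration: reduces words to stems
SELECT ts_lexize('english_stem', 'everything');
SELECT ts_lexize('english_stem', 'favourite');

-- 'simple' dictionary: lowercases the word but does not stem it
SELECT ts_lexize('simple', 'Everything');
SELECT ts_lexize('simple', 'favourite');
```

The stems returned by the first two calls should line up with the 'everyth' and 'favourit' lexemes in the question, while the last two should return the words unchanged apart from case.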

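If you still want the stop words removed, then as far as I can see you have to define your own dictionary and a configuration that uses it. See the example below, although you may want to read the documentation to make sure it does exactly what you want.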

CREATE TEXT SEARCH DICTIONARY only_stop_words (
    Template = pg_catalog.simple,
    Stopwords = english
);
CREATE TEXT SEARCH CONFIGURATION public.only_stop_words ( COPY = pg_catalog.simple );
ALTER TEXT SEARCH CONFIGURATION public.only_stop_words ALTER MAPPING FOR asciiword WITH only_stop_words;
select * 
from to_tsvector('only_stop_words', 'The This is my favourite game. I enjoy everything about it.');

'enjoy':8 'everything':9 'favourite':5 'game':6
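
To check that the new configuration behaves as intended, ts_debug shows which dictionary handled each token. When searching, the query should also be parsed with the same configuration as the vector. The following is a sketch that assumes the only_stop_words dictionary and configuration above were created in the public schema; the column aliases same_config and mixed_config are made up for illustration.

```sql
-- Inspect token handling: stop words are expected to show an empty lexemes array
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('public.only_stop_words', 'I enjoy everything about it.');

-- Same configuration on both sides: the unstemmed word should match
SELECT to_tsvector('public.only_stop_words', 'This is my favourite game.')
       @@ to_tsquery('public.only_stop_words', 'favourite') AS same_config;

-- Mixed configurations: 'english' stems the query term to 'favourit',
-- which is not present in the unstemmed vector, so this should not match
SELECT to_tsvector('public.only_stop_words', 'This is my favourite game.')
       @@ to_tsquery('english', 'favourite') AS mixed_config;
```

Note that the configuration above only remaps the asciiword token type. Other token types copied from pg_catalog.simple still go through the plain simple dictionary, so stop words are not removed for them.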

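【Comments】:

I see. Is there a dictionary that removes stop words but does not do stemming?
@Petar I added another variant that removes the stop words
That's perfect. Thanks!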