How to extract entities that are not adjacent words with RASA NLU

【Question title】: How to extract entities that are not adjacent words with RASA NLU
【Posted】: 2018-10-17 12:32:09
【Question description】:

https://github.com/RasaHQ/rasa_nlu/issues/1468#issue-370187480

Rasa NLU version: 0.13.6

Operating system (windows, osx, ...): windows

Content of model configuration file (yml):

language: "en"

pipeline:
- name: tokenizer_whitespace
- name: intent_entity_featurizer_regex
- name: ner_crf
- name: ner_synonyms
- name: intent_featurizer_count_vectors
- name: intent_classifier_tensorflow_embedding
  intent_tokenization_flag: true
  intent_split_symbol: "+"
path: ./models/nlu
data: ./data/training_nlu.json

Issue:

How can I extract an entity whose words are not adjacent? Here is an example:

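I need to train my NLU to understand public grievances such as STREET LIGHT OUT, POTHOLE IN STREET, and STREET LIGHTS ON DAYS.

My entity value is STREET LIGHT OUT, which means someone wants to report that a street light is not working. He or she would phrase it like this:

The Street light adjacent to Dr Vasanth Shetty's Clinic, WH Hanumanthappa Layout, Ulsoor Road, Bangalore 42 is fused since a week.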

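"Street light" alone is not an entity, and "fused" alone is not my entity either. "Street light fused" is a synonym of my entity value. Is it possible to train the NLU to extract "street light fused" from this sentence? If so, how?

If not, is splitting "street light" and "fused" into separate entities the only solution? At the same time, the NLU can extract entities that contain multiple words, and tokenizer_whitespace only splits on whitespace, so I would expect "street light fused" to be extractable from the sentence above.

Please suggest whether there is a better way to get my entity without splitting it into multiple entities.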

Here are more examples of the same problem:

Example 1:

Garbage not picked from past 10 days, need immediate attention for clearance.

Here I can pick out "Garbage not picked" as the issue. I can train my NLU to extract this named entity with ner_crf, using a training snippet like the following:

{
  "text": "Garbage not picked from past 10 days,need immediate attention for clearance",
  "intent": "inform_grevience",
  "entities": [
    { "start": 0, "end": 18, "value": "Garbage not picked", "entity": "issue" }
  ]
}
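In this JSON training format, `start` and `end` are character offsets into `text`, and the slice `text[start:end]` is expected to equal `value`. The sketch below computes the offsets instead of counting them by hand. `make_entity` is a hypothetical helper written for this illustration, not part of Rasa NLU.

```python
import json


def make_entity(text, value, entity):
    """Build one entity annotation whose offsets match `value` inside `text`."""
    start = text.find(value)
    if start == -1:
        raise ValueError("%r is not a contiguous substring of the text" % value)
    return {"start": start, "end": start + len(value), "value": value, "entity": entity}


text = "Garbage not picked from past 10 days,need immediate attention for clearance"
example = {
    "text": text,
    "intent": "inform_grevience",
    "entities": [make_entity(text, "Garbage not picked", "issue")],
}

span = example["entities"][0]
# The slice selected by the offsets must be exactly the entity value.
assert text[span["start"]:span["end"]] == span["value"]
print(json.dumps(example, indent=2))
```

For this sentence the helper yields `start` 0 and `end` 18, because "Garbage not picked" opens the text and is 18 characters long.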

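Example 2:

From past 10 days, garbage near 10th main is not picked, need immediate action.

Different citizens report the same problem using different sentences.

Can I also use ner_crf to extract "Garbage not picked" from Example 2?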
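The difficulty with Example 2 can be made concrete. An entity annotation has a single `start` and a single `end`, so it can only mark one unbroken stretch of text. The sketch below checks whether a value is contiguous in a sentence, or whether its words merely appear in order with gaps. Both helper functions are hypothetical, and the Example 2 string is an assumed English wording of that sentence.

```python
def contiguous_span(text, value):
    """Return (start, end) if `value` occurs as one unbroken substring, else None."""
    start = text.lower().find(value.lower())
    return None if start == -1 else (start, start + len(value))


def words_in_order(text, value):
    """True if every word of `value` appears in `text` in the same order, gaps allowed."""
    # A shared iterator is consumed left to right, so each word must follow the last.
    tokens = iter(text.lower().replace(",", " ").split())
    return all(word in tokens for word in value.lower().split())


example1 = "Garbage not picked from past 10 days,need immediate attention for clearance"
# Assumed English wording of Example 2, used only for illustration.
example2 = "From past 10 days, garbage near 10th main is not picked, need immediate action"
value = "Garbage not picked"

for sentence in (example1, example2):
    print(contiguous_span(sentence, value), words_in_order(sentence, value))
```

Example 1 gives a usable span, while Example 2 gives none even though all three words are present in order. A single start/end annotation therefore cannot label "Garbage not picked" in Example 2 as it stands.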

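【Comments】:

Tags: rasa-nlu

【Solution 1】:

I will suggest two alternative approaches, both of which rely on intents. I believe the only entity in the utterance you provided is the address information.

So you could train each example as an entirely separate intent (with no entities):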

## intent:streetLightOut
- The Street light adjacent to Dr Vasanth Shetty's Clinic , WH Hanumanthappa Layout, Ulsoor Road, Bangalore 42 is out.
- I'd like to report a street light that is burnt out
- street light out

## intent:streetLightAlwaysOn
- The Street light adjacent to Dr Vasanth Shetty's Clinic , WH Hanumanthappa Layout, Ulsoor Road, Bangalore 42 is always on.
- I'd like to report a street light that never turns off
- street light on constantly

## intent:potholeInStreet
- There's a pothole at the intersection of 10th and main
- pothole
- pothole on 11th street near Wal-Mart
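With one intent per grievance, the issue no longer has to be extracted as an entity; it can be read off the predicted intent. The following is a minimal sketch of that lookup. It assumes the parse result is a dict with an `intent` entry holding `name` and `confidence`; check the exact shape against your Rasa NLU version. `ISSUE_BY_INTENT`, `issue_from_parse`, and the 0.5 threshold are hypothetical choices made for this illustration.

```python
# Hypothetical mapping from the intent names above to the issue label the bot records.
ISSUE_BY_INTENT = {
    "streetLightOut": "STREET LIGHT OUT",
    "streetLightAlwaysOn": "STREET LIGHTS ON DAYS",
    "potholeInStreet": "POTHOLE IN STREET",
}


def issue_from_parse(parse_result, min_confidence=0.5):
    """Pick the issue label from an NLU parse result, or None if unsure or unknown."""
    intent = parse_result.get("intent") or {}
    if intent.get("confidence", 0.0) < min_confidence:
        return None
    return ISSUE_BY_INTENT.get(intent.get("name"))


# Hand-written stand-in for a parse result; real values come from the trained model.
sample = {
    "text": "street light near the clinic is out",
    "intent": {"name": "streetLightOut", "confidence": 0.91},
    "entities": [],
}
print(issue_from_parse(sample))
```

Returning None for low-confidence or unmapped intents lets the bot ask a follow-up question rather than record a wrong issue.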

Alternatively, since you are using the TensorFlow embedding classifier, you could use hierarchical intents:

## intent:streetLight+out
- The Street light adjacent to Dr Vasanth Shetty's Clinic , WH Hanumanthappa Layout, Ulsoor Road, Bangalore 42 is out.
- I'd like to report a street light that is burnt out
- street light out

## intent:streetLight+alwaysOn
- The Street light adjacent to Dr Vasanth Shetty's Clinic , WH Hanumanthappa Layout, Ulsoor Road, Bangalore 42 is always on.
- I'd like to report a street light that never turns off
- street light on constantly

## intent:potHole
- There's a pothole at the intersection of 10th and main
- pothole
- pothole on 11th street near Wal-Mart
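The "+" in these intent names ties in with the pipeline in the question, which sets `intent_tokenization_flag: true` and `intent_split_symbol: "+"`. The sketch below shows only the label-splitting idea, that is, which parts two composite intents have in common. It is plain Python for illustration, not Rasa's internal implementation, and `split_intent` is a hypothetical helper.

```python
from collections import Counter


def split_intent(label, split_symbol="+"):
    """Break a composite intent label into its parts, e.g. 'a+b' -> ['a', 'b']."""
    return label.split(split_symbol)


labels = ["streetLight+out", "streetLight+alwaysOn", "potHole"]
parts = {label: split_intent(label) for label in labels}
# Count how many intents each part appears in; shared parts link related intents.
shared = Counter(part for tokens in parts.values() for part in tokens)

for label, tokens in parts.items():
    print(label, "->", tokens)
print("shared parts:", [part for part, n in shared.items() if n > 1])
```

Here `streetLight` is the only part shared by two intents, which is the sense in which the two street-light intents are related while `potHole` stands apart.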

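The main reason I suggest these approaches is that entity extraction in Rasa is highly positional and places little weight on the words themselves (and it does not use word vectors). Since every street light issue will probably contain these words or similar ones, the words themselves seem to carry the most value.

This blog post has some information about TensorFlow and hierarchical intents: https://medium.com/rasa-blog/supervised-word-vectors-from-scratch-in-rasa-nlu-6daf794efcd8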

【Discussion】: