How to improve the results from DBpedia Spotlight? (Answers)

【Question title】: How to improve the results from DBpedia Spotlight?
【Posted】: 2019-12-07 16:14:53
【Question description】:

I am using DBpedia Spotlight to extract DBpedia resources, as shown below.

import json
from SPARQLWrapper import SPARQLWrapper, JSON
import requests
import urllib.parse

## initial consts
BASE_URL = 'http://api.dbpedia-spotlight.org/en/annotate?text={text}&confidence={confidence}&support={support}'
TEXT = "Tolerance, safety and efficacy of Hedera helix extract in inflammatory bronchial diseases under clinical practice conditions: a prospective, open, multicentre postmarketing study in 9657 patients.     In this postmarketing study 9657 patients (5181 children) with bronchitis (acute or chronic bronchial inflammatory disease) were treated with a syrup containing dried ivy leaf extract. After 7 days of therapy, 95% of the patients showed improvement or healing of their symptoms. The safety of the therapy was very good with an overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders with 1.5%). In those patients who got concomitant medication as well, it could be shown that the additional application of antibiotics had no benefit respective to efficacy but did increase the relative risk for the occurrence of side effects by 26%. In conclusion, it is to say that the dried ivy leaf extract is effective and well tolerated in patients with bronchitis. In view of the large population considered, future analyses should approach specific issues concerning therapy by age group, concomitant therapy and baseline conditions."
CONFIDENCE = '0.5'
SUPPORT = '10'
REQUEST = BASE_URL.format(
    text=urllib.parse.quote_plus(TEXT), 
    confidence=CONFIDENCE, 
    support=SUPPORT
)
HEADERS = {'Accept': 'application/json'}
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
all_urls = []

r = requests.get(url=REQUEST, headers=HEADERS)
response = r.json()
resources = response['Resources']
for res in resources:
    all_urls.append(res['@URI'])
print(all_urls)

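My text is as follows:

Tolerance, safety and efficacy of Hedera helix extract in inflammatory bronchial diseases under clinical practice conditions: a prospective, open, multicentre postmarketing study in 9657 patients. In this postmarketing study 9657 patients (5181 children) with bronchitis (acute or chronic bronchial inflammatory disease) were treated with a syrup containing dried ivy leaf extract. After 7 days of therapy, 95% of the patients showed improvement or healing of their symptoms. The safety of the therapy was very good with an overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders with 1.5%). In those patients who got concomitant medication as well, it could be shown that the additional application of antibiotics had no benefit respective to efficacy but did increase the relative risk for the occurrence of side effects by 26%. In conclusion, it is to say that the dried ivy leaf extract is effective and well tolerated in patients with bronchitis. In view of the large population considered, future analyses should approach specific issues concerning therapy by age group, concomitant therapy and baseline conditions.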

The results I get are as follows.

['http://dbpedia.org/resource/Hedera', 
'http://dbpedia.org/resource/Helix', 
'http://dbpedia.org/resource/Bronchitis', 
'http://dbpedia.org/resource/Cough_medicine',
'http://dbpedia.org/resource/Hedera', 
'http://dbpedia.org/resource/After_7',
'http://dbpedia.org/resource/Gastrointestinal_tract',
'http://dbpedia.org/resource/Antibiotics',
'http://dbpedia.org/resource/Relative_risk',
'http://dbpedia.org/resource/Hedera',
'http://dbpedia.org/resource/Bronchitis']
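
The list above repeats some URIs (Hedera, Bronchitis) because Spotlight reports one resource per mention in the text. A minimal sketch of collecting each URI only once, in first-seen order, could look like the following. The helper name `unique_uris` and the sample response are made up for illustration, and the fallback for a missing `Resources` key is an assumption about how the service responds when nothing is annotated.

```python
from typing import Any, Dict, List


def unique_uris(response: Dict[str, Any]) -> List[str]:
    """Return the annotated URIs in first-seen order, without repeats."""
    # Assumption: 'Resources' may be absent when nothing was annotated,
    # so fall back to an empty list instead of raising KeyError.
    resources = response.get("Resources", [])
    # dict.fromkeys keeps insertion order and drops later duplicates.
    return list(dict.fromkeys(res["@URI"] for res in resources))


# Hand-written sample shaped like the JSON used in the question's code.
sample = {
    "Resources": [
        {"@URI": "http://dbpedia.org/resource/Hedera"},
        {"@URI": "http://dbpedia.org/resource/Helix"},
        {"@URI": "http://dbpedia.org/resource/Hedera"},
    ]
}
print(unique_uris(sample))
```

This only removes repeats; it does not change which resources Spotlight chooses.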

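As you can see, the results are not very good.

For example, consider Hedera helix extract in the text above. Although DBpedia has a resource for Hedera helix (http://dbpedia.org/resource/Hedera_helix), Spotlight outputs it as two URIs: http://dbpedia.org/resource/Hedera and http://dbpedia.org/resource/Helix.

For my dataset, I want the result to be the longest term that exists in DBpedia. What improvements can I make to get the output I want in this case?

I am happy to provide more details if needed.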

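【Comments】:

Post-process the results, train on your own dataset, or use another tool or even several tools. In general, this problem is not trivial to solve.
@AKSW Thank you for your comment. Do you have any suggestions for other tools I could try, or for post-processing techniques I could use here? I look forward to hearing from you. Thank you very much :)
No, that is NLP, which is not my field. Noun-phrase detection followed by linking to DBpedia is what your corner case needs. As usual, corner cases can be tricky: NLP starts with basic steps such as sentence detection, then POS tagging, then NP detection, and so on, so any earlier error affects the later steps.
@AKSW Thank you very much. I will certainly look into the areas you mentioned :)
pyspotlight may be of interest. It probably will not improve the recognition rate, but at least you will write less code. It also returns more results than your code above.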
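
One possible reading of the "post-process the results" suggestion is to merge annotations that sit next to each other in the text whenever the joined phrase is itself a DBpedia resource. The sketch below is only an illustration under stated assumptions: it relies on the `@surfaceForm` and `@offset` fields of each Spotlight resource, the function name `merge_adjacent` is invented here, and `exists` is a hypothetical callback that says whether a URI is a real resource (in practice it could wrap a SPARQL ASK query against the DBpedia endpoint). Building the candidate URI by replacing spaces with underscores and capitalizing the first letter is a naive heuristic, not DBpedia's official naming rule.

```python
from typing import Any, Callable, Dict, List

PREFIX = "http://dbpedia.org/resource/"


def merge_adjacent(text: str,
                   resources: List[Dict[str, Any]],
                   exists: Callable[[str], bool]) -> List[str]:
    """Greedily merge annotations separated by one space when the joined
    phrase maps to an existing resource according to `exists`."""
    ordered = sorted(resources, key=lambda r: int(r["@offset"]))
    merged: List[str] = []
    i = 0
    while i < len(ordered):
        start = int(ordered[i]["@offset"])
        end = start + len(ordered[i]["@surfaceForm"])
        uri = ordered[i]["@URI"]
        j = i + 1
        while j < len(ordered):
            nxt_start = int(ordered[j]["@offset"])
            nxt_end = nxt_start + len(ordered[j]["@surfaceForm"])
            # Only merge mentions that are separated by a single space.
            if text[end:nxt_start] != " ":
                break
            phrase = text[start:nxt_end].replace(" ", "_")
            candidate = PREFIX + phrase[0].upper() + phrase[1:]
            if not exists(candidate):
                break
            uri, end = candidate, nxt_end
            j += 1
        merged.append(uri)
        i = j
    return merged


# Hand-written example data; `known` stands in for a real lookup.
demo_text = "efficacy of Hedera helix extract in bronchitis"
demo_resources = [
    {"@URI": PREFIX + "Hedera", "@surfaceForm": "Hedera", "@offset": "12"},
    {"@URI": PREFIX + "Helix", "@surfaceForm": "helix", "@offset": "19"},
    {"@URI": PREFIX + "Bronchitis", "@surfaceForm": "bronchitis", "@offset": "36"},
]
known = {PREFIX + "Hedera_helix"}
print(merge_adjacent(demo_text, demo_resources, lambda u: u in known))
```

Keeping the lookup behind a callback means the merging logic can be tried offline before any endpoint is queried.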

Tags: sparql wikipedia dbpedia linked-data spotlight-dbpedia

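【Solution 1】:

Although I am answering this question quite late, you can use the BabelNet (Babelfy) API from Python to get DBpedia URIs that cover longer spans of text. I reproduced the problem using the code below: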

from babelpy.babelfy import BabelfyClient

# Adjacent string literals are concatenated, so this is a single string
# whose character offsets match the original abstract.
text = (
    "Tolerance, safety and efficacy of Hedera helix extract in inflammatory "
    "bronchial diseases under clinical practice conditions: a prospective, open, "
    "multicentre postmarketing study in 9657 patients.     In this postmarketing "
    "study 9657 patients (5181 children) with bronchitis (acute or chronic "
    "bronchial inflammatory disease) were treated with a syrup containing dried ivy "
    "leaf extract. After 7 days of therapy, 95% of the patients showed improvement "
    "or healing of their symptoms. The safety of the therapy was very good with an "
    "overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders "
    "with 1.5%). In those patients who got concomitant medication as well, it could "
    "be shown that the additional application of antibiotics had no benefit "
    "respective to efficacy but did increase the relative risk for the occurrence "
    "of side effects by 26%. In conclusion, it is to say that the dried ivy leaf "
    "extract is effective and well tolerated in patients with bronchitis. In view "
    "of the large population considered, future analyses should approach specific "
    "issues concerning therapy by age group, concomitant therapy and baseline "
    "conditions."
)

# Instantiate the Babelfy client with your API key.
params = dict()
params['lang'] = 'english'
babel_client = BabelfyClient("**Your Registration Code For API**", params)

# Send the text to Babelfy for annotation.
babel_client.babelfy(text)

# Print all merged entities.
print(babel_client.all_merged_entities)

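For every merged entity in the text, the output has the sample format shown below. You can store and process this dictionary structure further to extract the DBpedia URIs.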

{'start': 34,
'end': 45,
'text': 'Hedera helix',
'isEntity': True,
'tokenFragment': {'start': 6, 'end': 7},
'charFragment': {'start': 34, 'end': 45},
'babelSynsetID': 'bn:00021109n',
'DBpediaURL': 'http://dbpedia.org/resource/Hedera_helix',
'BabelNetURL': 'http://babelnet.org/rdf/s00021109n',
'score': 1.0,
'coherenceScore': 0.0847457627118644,
'globalScore': 0.0013494092960806407,
'source': 'BABELFY'},
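
Pulling the DBpedia URIs out of dictionaries in the format shown above might look like the following sketch. The function name `dbpedia_links` is made up for illustration, and it assumes that entries without a DBpedia match carry an empty or missing `DBpediaURL` value; the second sample entry is invented to exercise that case.

```python
from typing import Any, Dict, Iterable, List, Tuple


def dbpedia_links(entities: Iterable[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Return (matched text, DBpedia URL) pairs for entities that have a URL."""
    links: List[Tuple[str, str]] = []
    for ent in entities:
        url = ent.get("DBpediaURL")
        # Assumption: unmatched fragments have an empty or missing URL.
        if url:
            links.append((ent["text"], url))
    return links


# First entry copied from the sample output; second entry is invented.
sample_entities = [
    {"start": 34, "end": 45, "text": "Hedera helix", "isEntity": True,
     "DBpediaURL": "http://dbpedia.org/resource/Hedera_helix"},
    {"start": 47, "end": 53, "text": "extract", "isEntity": False,
     "DBpediaURL": ""},
]
print(dbpedia_links(sample_entities))
```

With the client from the answer, the same function could be given `babel_client.all_merged_entities` as its argument.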

【Comments】: