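【Question title】:Find initialisms from abbreviations in a text
【Posted】:2020-04-13 11:25:45
【Question description】:

I have a list of acronyms, and I want to find their definitions in a text and put them into a dictionary. I have written some code, but it is hard-coded and does not produce the desired result. I would like the final result to look like this: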

 {'NBA': ' National Basketball Association', 'NCAA': 'National Collegiate Athletic Association'}

Code:

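d = {}  # acronym -> full form (was `dict`, which shadows the built-in and never matched the `d` used below)
full_form = ' '
s = " NBA  comes from the words National Basketball Association is a men's professional basketball league in North America, composed of 30 teams. On the other hand NCAA stands for The National Collegiate Athletic Association"
list_str = s.split()  # words of the text; this definition was missing from the posted snippet


acro = ['NBA', 'NCAA']

for char in range(len(acro)):
    for n, word in enumerate(list_str):
        # first letter of the acronym matches the first letter of this word
        if acro[char][0] == word[0] and word not in acro:
            full_form += word + ' '
            print(full_form)
            # second letter matches the next word
            if acro[char][1] == list_str[n+1][0] and word not in acro:
                print(list_str[n+1])
                full_form += list_str[n+1] + ' '
                # third letter matches the word after that (hard-coded to three letters)
                if acro[char][2] == list_str[n+2][0] and word not in acro:
                    full_form += list_str[n+2] + ''
                    d[acro[char]] = full_form
print(d)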
out: {'NBA': ' National Basketball Association', 'NCAA': ' National Basketball AssociationNorth National National North National Collegiate Athletic'}

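Any help on how to achieve the expected result in a Pythonic way would be greatly appreciated.

【Comments】:

  • You can apply a regular expression here.
  • Do you want your code to understand definitions in arbitrary text? If so, that is an ML/DS topic; look up Named Entity Recognition, but it is not easy.
  • ^ Yes, I do. Do you think only NLP can do this?

Tags: python python-3.x string list dictionary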


【Solution 1】:

Here is a simple example of how to apply a regular expression:

import re

s = " NBA  comes from the words National Basketball Association is a men's professional basketball league in North America, composed of 30 teams. On the other hand NCAA stands for The National Collegiate Athletic Association"
acro = ['NBA', 'NCAA', 'STFU']

# One pattern per acronym: the acronym, then (lazily) any text, then one
# capitalised word per letter of the acronym.
patterns = [f'({a}).+?({" ".join(c + "[a-z]+" for c in a)})(?: |$)' for a in acro]
# Python 3.8+ (uses the := assignment expression)
result = dict(m.groups() for p in patterns if (m := re.search(p, s)))
# Earlier versions (same result without :=)
result = dict(m.groups() for m in (re.search(p, s) for p in patterns) if m)

Here is the regular expression generated for 'NCAA':

(NCAA).+?(N[a-z]+ C[a-z]+ A[a-z]+ A[a-z]+)(?: |$)
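
To inspect the pattern generated for any acronym, the list comprehension can be pulled out into a small helper. This is only a sketch: the name `build_pattern` is hypothetical, and the `re.escape` calls are an added precaution (not part of the answer) in case an acronym contains regex metacharacters.

```python
import re

def build_pattern(acronym):
    # One capitalised word per letter, e.g. "N[a-z]+ C[a-z]+ ..."
    words = " ".join(re.escape(c) + "[a-z]+" for c in acronym)
    # Group 1: the acronym; group 2: the candidate full form.
    return f"({re.escape(acronym)}).+?({words})(?: |$)"

print(build_pattern("NCAA"))
```

For plain alphabetic acronyms, `re.escape` leaves the letters unchanged, so the helper should yield the same pattern as the comprehension in the answer.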

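【Comments】:

  • Thanks. I can't test it, though, because I'm not on Python 3.8, which this new operator requires.
  • @hipocampus777, I've added an option for older Python versions.
  • Thanks. It works when the acronym comes before the full form, but not the other way around, when the full form precedes the acronym.
  • @hipocampus777, yes, that's why I called it a simple example. The point was to show how a regular expression can be applied to this task. You will need to refine the regex, but the main idea stays the same.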
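
One possible refinement for the ordering issue raised above, offered as a sketch rather than as the answerer's method: search for the run of capitalised words on its own, without requiring the acronym to appear first. The function name `find_expansions` and the sample text are made up for illustration. The sketch assumes every letter of the acronym maps to exactly one capitalised word, so full forms containing words such as "of" or hyphenated terms would be missed.

```python
import re

def find_expansions(text, acronyms):
    """Map each acronym to the first run of capitalised words matching its letters."""
    found = {}
    for acronym in acronyms:
        # \b anchors each letter to the start of a word; \s+ allows any whitespace between words.
        words = r"\s+".join(rf"\b{re.escape(c)}[a-z]+" for c in acronym)
        match = re.search(words, text)
        if match:
            found[acronym] = match.group(0)
    return found

# Hypothetical sample: one full form precedes its acronym, the other follows it.
text = ("The National Collegiate Athletic Association, or NCAA, is older than "
        "the NBA, which is short for National Basketball Association.")
print(find_expansions(text, ["NBA", "NCAA", "STFU"]))
```

Because the acronym itself is no longer part of the pattern, this version does not check that the acronym occurs in the text at all. Whether that trade-off is acceptable depends on the input.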