【Question title】: Extracting integers from a list
【Posted】: 2017-07-07 06:20:30
【Question】:

I have a string like this:

fmt_string = "I am a smoker male of 25 years who wants a policy for 30 yrs with a sum assured amount of 1000000 rupees"

After removing the stop words from the string above, I now have the following list:

['smoker', 'male', '25', 'years', 'wants', 'policy', '30', 'yrs', 
'sum', 'assured', 'amount', '1000000', 'rupees']

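I want to extract only 25, 30, and 1000000 from this list, but the code should allow for 25 appearing either before or after "years", 30 appearing after "policy", and 1000000 appearing anywhere.

The final output should look like this: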

'1000000 30 25 male smoker'

I just want robust code that returns a result like this no matter where these values appear.

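【Comments】:

  • Please post your code.
  • Have you tried to solve this yourself?
  • To filter the string and keep only the integer values, you can use this line of code: integer_values = [e for e in fmt_string.split() if isinstance(e, int)]
  • How about a regular expression! re.findall(r'\d+', fmt_string)
  • for value in d: if value == 'male': print value+1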
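
The two one-liners suggested in the comments behave differently, which a short comparison may make clearer. This is only a sketch: the variable names `by_isinstance`, `by_isdigit`, and `by_regex` are made up here, and `str.isdigit` is swapped in as an alternative that the comments do not mention. Because `str.split()` always yields strings, the `isinstance(e, int)` test never succeeds.

```python
import re

fmt_string = ("I am a smoker male of 25 years who wants a policy for 30 yrs "
              "with a sum assured amount of 1000000 rupees")

# split() returns strings, so no element is ever an int: the result is empty.
by_isinstance = [e for e in fmt_string.split() if isinstance(e, int)]

# Alternative (not from the thread): keep tokens made only of digits.
by_isdigit = [int(e) for e in fmt_string.split() if e.isdigit()]

# The regex from the comments: every run of digits in the string.
by_regex = [int(n) for n in re.findall(r"\d+", fmt_string)]

print(by_isinstance)  # []
print(by_isdigit)     # [25, 30, 1000000]
print(by_regex)       # [25, 30, 1000000]
```

Neither working variant says which number is the age, the term, or the amount; that is the part the answers below try to address.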

Tags: python python-2.7 nlp


【Solution 1】:

This should help:

import re
# Variation in places of the numbers in strings:
str1 = "I am a smoker male of 25 years who wants a policy for 30  yrs with a sum assured amount of 1000000 rupees"
str2 = "I am a smoker male of 25 years who wants a for 30 policy  yrs with a sum assured amount of 1000000 rupees"
str3 = "I am a smoker male of years 25 who wants a for 30 policy yrs with a sum assured amount of 1000000 rupees"
str4 = "I am a smoker male of 25 years who wants a for 30 policy yrs with a sum assured amount of 1000000 rupees"

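# Groups 3/5: number before/after "years"; groups 7/8: number after/before "policy"; group 9: the 7-digit amount.
regex = r".*?(((\d{2})\s?years)|(years\s?(\d{2}))).*(policy.*?(\d{2})|(\d{2}).*?policy).*(\d{7}).*$"
# Only one group of each pair matches. Python 3.5+ replaces an unmatched group
# with an empty string; earlier versions raise an "unmatched group" error.
replacements = r"\9 \7 \8 \3 \5"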

res_str1 = re.sub(regex, replacements, str1)
res_str2 = re.sub(regex, replacements, str2)
res_str3 = re.sub(regex, replacements, str3)
res_str4 = re.sub(regex, replacements, str4)


def clean_spaces(string):
    return re.sub(r"\s{1,2}", ' ', string)


print(clean_spaces(res_str1))
print(clean_spaces(res_str2))
print(clean_spaces(res_str3))
print(clean_spaces(res_str4))

Output:

1000000 30 25 
1000000 30 25 
1000000 30 25
1000000 30 25
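
The same pattern can also be read through the match object instead of through backreferences in a replacement string, which avoids having to deal with unmatched groups and stray spaces. A minimal sketch, reusing the regex from this answer unchanged; the function name `extract_numbers` is hypothetical:

```python
import re

# Regex copied from the answer above; group numbers are counted by opening parenthesis.
REGEX = (r".*?(((\d{2})\s?years)|(years\s?(\d{2})))"
         r".*(policy.*?(\d{2})|(\d{2}).*?policy).*(\d{7}).*$")


def extract_numbers(text):
    """Return 'amount term age', or None if the text does not match."""
    match = re.match(REGEX, text)
    if match is None:
        return None
    # In each pair only one group matched; the other one is None.
    age = match.group(3) or match.group(5)
    term = match.group(7) or match.group(8)
    amount = match.group(9)
    return " ".join([amount, term, age])


str1 = "I am a smoker male of 25 years who wants a policy for 30  yrs with a sum assured amount of 1000000 rupees"
str3 = "I am a smoker male of years 25 who wants a for 30 policy yrs with a sum assured amount of 1000000 rupees"
print(extract_numbers(str1))  # 1000000 30 25
print(extract_numbers(str3))  # 1000000 30 25
```

This keeps every limitation of the pattern itself (two-digit ages and terms, a seven-digit amount, fixed ordering of the three parts); it only changes how the captured values are collected.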

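Update

The regex above has some bugs. While trying to improve it, I noticed that it is inefficient and ugly, because it scans every character each time. We can do better by sticking with your original approach of parsing words. So my new solution is: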

# Algorithm
# for each_word in the_list:
#     maintain a pre_list of terms that come before a number
#     if each_word is number:
#         if any element of desired_terms_list exists in pre_list:
#             pair the number & the desired_term and insert into the_dictionary
#             remove this desired_term from desired_terms_list
#             reset the pre_list
#         else:
#             put the number in number_at_hand
#     else:
#         if no number_at_hand:
#             add the current word into pre_list
#         else:
#             if the current_word is an element of desired_terms_list:
#                 pair the number & the desired_term and insert into the_dictionary
#                 remove this desired_term from desired_terms_list
#                 reset number_at_hand

Code:

from pprint import pprint


class Extractor:
    def __init__(self, search_terms, string):
        self.pre_list = list()
        self.list = string.split()
        self.terms_to_look_for = list(search_terms)  # copy: extract() removes matched terms
        self.dictionary = {}

    @staticmethod
    def is_number(string):
        try:
            int(string)
            return True
        except ValueError:
            return False

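    def check_pre_list(self):
        # Return the first remaining search term seen before the current number.
        for term in self.terms_to_look_for:
            if term in self.pre_list:
                return term
        # Give up only after every remaining term has been checked.
        return None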

    def extract(self):
        number_at_hand = str()
        for word in self.list:
            if Extractor.is_number(word):
                check_result = self.check_pre_list()
                if check_result is not None:
                    self.dictionary[check_result] = word
                    self.terms_to_look_for.remove(check_result)
                    self.pre_list = list()
                else:
                    number_at_hand = word
            else:
                if number_at_hand == '':
                    self.pre_list.append(word)
                else:
                    if word in self.terms_to_look_for:
                        self.dictionary[word] = number_at_hand
                        self.terms_to_look_for.remove(word)
                        number_at_hand = str()
        return self.dictionary

Usage:

ex1 = Extractor(['years', 'policy', 'amount'],
                'I am a smoker male of 25 years who wants a policy for 30 yrs with a sum assured amount of 1000000 rupees')
ex2 = Extractor(['years', 'policy', 'amount'],
                'I am a smoker male of 25 years who wants a for 30 yrs policy with a sum assured amount of 1000000 rupees')
ex3 = Extractor(['years', 'policy', 'amount'],
                'I am a smoker male of years 25 who wants a policy for 30 yrs with a sum assured amount of 1000000 rupees')
ex4 = Extractor(['years', 'policy', 'amount'],
                'I am a smoker male of years 25 who wants a for 30 yrs policy with a sum assured amount of 1000000 rupees')
pprint(ex1.extract())
pprint(ex2.extract())
pprint(ex3.extract())
pprint(ex4.extract())

Output:

{'amount': '1000000', 'policy': '30', 'years': '25'}
{'amount': '1000000', 'policy': '30', 'years': '25'}
{'amount': '1000000', 'policy': '30', 'years': '25'}
{'amount': '1000000', 'policy': '30', 'years': '25'}

I hope this performs better now.
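
The dictionary above still has to be turned into the string the question asks for. One possible way is sketched below; the helper `format_result` and its `order` and `keywords` parameters are hypothetical and not part of the answer, and the keywords 'male' and 'smoker' are taken from the desired output in the question.

```python
def format_result(extracted, words,
                  order=("amount", "policy", "years"),
                  keywords=("male", "smoker")):
    """Join the extracted numbers and any keywords present into one string."""
    # Numbers in the requested order, skipping terms that were not found.
    numbers = [extracted[term] for term in order if term in extracted]
    # Keywords are kept only if they actually occur in the word list.
    found = [word for word in keywords if word in words]
    return " ".join(numbers + found)


words = ['smoker', 'male', '25', 'years', 'wants', 'policy', '30', 'yrs',
         'sum', 'assured', 'amount', '1000000', 'rupees']
extracted = {'amount': '1000000', 'policy': '30', 'years': '25'}
print(format_result(extracted, words))  # 1000000 30 25 male smoker
```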

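【Comments】:

  • It throws an error: TypeError: 'str' object is not callable
  • Does it (the non-regex version) show the error for the same input? I get no error!
  • Traceback (most recent call last): File "", line 9, in  File "", line 27, in extract TypeError: 'str' object is not callable
  • Does changing number_at_hand = str() to number_at_hand = '' help? I don't understand why you are getting this error. I wrote this code for python3.6 and also ran it with python2.7.12. I get no error. Did you change anything? Can you share your code?
  • Yeah, I just changed it to number_at_string=' ' and now it works fine.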
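
The thread never establishes why `str()` failed for the commenter. One common cause of this exact message, offered here only as an assumption, is that the name `str` was rebound to a string earlier in the session, so `str()` calls that string instead of the built-in type. A minimal reproduction (the function name `call_shadowed_str` is made up):

```python
def call_shadowed_str():
    # Hypothetical reproduction: rebinding `str` hides the built-in type
    # inside this function.
    str = "some earlier value"
    try:
        return str()  # calls the string object, not the built-in str
    except TypeError as exc:
        return exc


error = call_shadowed_str()
print(error)  # 'str' object is not callable
```

Writing `''` instead of `str()` does not look the name up at all, which would explain why the change in the last comment helped.
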
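
【Solution 2】:

Use re.findall to pick out the matching items in the list, join them with commas, split the result back into a list, apply reverse() to it, and then join again with ' '.

Your data: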

li = ['smoker', 'male', '25', 'years', 'wants', 'policy', '30', 'yrs', 'sum', 'assured', 'amount', '1000000', 'rupees']

import re

temp=",".join([l for l in li if re.findall('1000000|30|25|male|smoker',l)]).split(",")

temp.reverse()
temp = " ".join(temp)

Output:

'1000000 30 25 male smoker'

Hope this answer helps.
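
A variant that does not hard-code the numbers is sketched below. It keeps any all-digit token plus a small keyword set; the name `KEYWORDS` and its contents are assumptions based on the desired output in the question, not something this answer specifies.

```python
li = ['smoker', 'male', '25', 'years', 'wants', 'policy', '30', 'yrs',
      'sum', 'assured', 'amount', '1000000', 'rupees']

# Assumed keyword set, taken from the desired output in the question.
KEYWORDS = {"male", "smoker"}

# Keep digit-only tokens and keywords, preserving their order in the list.
selected = [word for word in li if word.isdigit() or word in KEYWORDS]

# The desired output happens to be the reverse of the order in the list.
result = " ".join(reversed(selected))
print(result)  # 1000000 30 25 male smoker
```

Like the answer itself, this only works because the requested order is the exact reverse of the input order; it does not check which word each number belongs to.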

【Comments】:

  • This answer is about the same as print '1000000 30 25 male smoker'.