【Question Title】: Regular expression to match range of dates with months included
【Posted】: 2019-09-26 01:01:28
【Question】:

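I need to match a string to determine whether it is a valid date range. My strings can contain the month as text and the year as digits, in no particular order (there is no fixed format such as MM-YYYY-DD).

A valid string could be: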

February 2016 - March 2019

September 2015 to August 2019

April 2015 to present

September 2018 - present

Invalid strings:

George Mason University august 2019

Stratusburg university February 2018

Some text and month followed by year

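I have looked at questions such as a) Constructing Regular Expressions to match numeric ranges

b) Regex to match month name followed by year

and many others, but the input strings in most of those questions seem to follow a fixed month-and-year pattern, which mine do not.

I tried this regex in Python: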

import re

pat = r"(\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?(\d{1,2}(st|nd|rd|th)?)?(([,.\-\/])\D?)?((19[7-9]\d|20\d{2})|\d{2})*"

st =  "University of Pennsylvania February 2018"

re.search(pat, st)

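But this matches both the valid and the invalid strings from my examples, and I want to exclude the invalid strings from my final output.

For the input "University of Pennsylvania February 2018", the expected output should be False.

For "February 2018 to present", the output must be True.

【Comments】:

Tags: python regex string date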


【Answer 1】:

This regex validates date ranges that follow the format MONTH YEAR (MONTH YEAR | PRESENT).

import re
# the first line holds two valid ranges, to make the example harder
text = """
February 2016 - March 2019 February 2017 - March 2019
September 2015 to August 2019
April 2015 to present
September 2018 - present
George Mason University august 2019
Stratusburg university February 2018
Some text and month followed by year
"""
# writing the regex on a single line would make it very ugly, so build it from parts
MONTHS_RE = ['Jan(?:uary)?', 'Feb(?:ruary)?', 'Mar(?:ch)?', 'Apr(?:il)?', 'May', 'Jun(?:e)?', 'Jul(?:y)?',
             'Aug(?:ust)?', 'Sep(?:tember)?', 'Oct(?:ober)?', '(?:Nov|Dec)(?:ember)?']
# match a month name and capture it: (Jan(?:uary)?|Feb(?:ruary)?|...|(?:Nov|Dec)(?:ember)?)
RE_MONTH = '({})'.format('|'.join(MONTHS_RE))
# matches a month followed by a 2- to 4-digit year; used twice in the final regex
RE_DATE = r'{RE_MONTH}(?:[\s]+)(\d{{2,4}})'.format(RE_MONTH=RE_MONTH)
# final regex
RE_VALID_RANGE = re.compile('{RE_DATE}.+?(?:{RE_DATE}|(present))'.format(RE_DATE=RE_DATE), flags=re.IGNORECASE)


# if you want to collect both the valid and the invalid lines
valid_ranges = []
invalid_ranges = []
for line in text.split('\n'):
    if line:
        groups = re.findall(RE_VALID_RANGE, line)
        if groups:
            # If you want to do something with the ranges:
            # groups holds every valid range found in the line (1 or 2 here)
            # every tuple has 5 elements because the regex has 5 capturing groups
            # either M2 and Y2 are set and present is empty, or the reverse (because of (?:{RE_DATE}|(present)))
            M1, Y1, M2, Y2, present = groups[0]  # loop over groups if you want to check every range
            valid_ranges.append(line)
        else:
            invalid_ranges.append(line)

print('VALID: ', valid_ranges)
print('INVALID:', invalid_ranges)


# finditer yields only the valid ranges; a line containing two ranges yields two matches
# if you need whole lines, this is not what you want
valid_ranges = []
for match in re.finditer(RE_VALID_RANGE, text):
    # if you want to check the ranges
    M1, Y1, M2, Y2, present = match.groups()
    valid_ranges.append(match.group(0))  # the text is returned here
print('VALID USING <finditer>: ',  valid_ranges)

Output:

VALID:  ['February 2016 - March 2019 February 2017 - March 2019', 'September 2015 to August 2019', 'April 2015 to present', 'September 2018 - present']
INVALID: ['George Mason University august 2019', 'Stratusburg university February 2018', 'Some text and month followed by year']
VALID USING <finditer>:  ['February 2016 - March 2019', 'February 2017 - March 2019', 'September 2015 to August 2019', 'April 2015 to present', 'September 2018 - present']

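I hate writing a long regex in a single string variable; I prefer to break it up so that I can still understand what it does when I read my code six months later. Notice how finditer splits the first line into two valid range strings.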
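Another way to keep a long pattern readable, not used in this answer, is Python's re.VERBOSE flag, which ignores whitespace inside the pattern and allows inline comments. The sketch below restates the same MONTH YEAR (MONTH YEAR | present) idea. The names MONTH and RANGE_VERBOSE are illustrative and do not come from the answer.

```python
import re

# Illustrative month alternation; whitespace and newlines are ignored under re.VERBOSE.
MONTH = r"""
    (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?
      |Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)
"""

# Same five capturing groups as the answer: M1, Y1, M2, Y2, present.
RANGE_VERBOSE = re.compile(rf"""
    ({MONTH}) \s+ (\d{{2,4}})        # start: month name, then a 2-4 digit year
    .+?                              # separator text, matched lazily
    (?: ({MONTH}) \s+ (\d{{2,4}})    # end: another month and year
      | (present) )                  # or the literal word present
""", re.IGNORECASE | re.VERBOSE)

match = RANGE_VERBOSE.search("April 2015 to present")
print(match.groups())  # ('April', '2015', None, None, 'present')
```

Unlike findall, which reports unmatched groups as empty strings, a match object's groups() reports them as None.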

If you only want to extract the ranges, you can use this:

valid_ranges = re.findall(RE_VALID_RANGE, text)

But this returns the groups [(M1, Y1, M2, Y2, present), ...] rather than the text:

[('February', '2016', 'March', '2019', ''), ('February', '2017', 'March', '2019', ''), ('September', '2015', 'August', '2019', ''), ('April', '2015', '', '', 'present'), ('September', '2018', '', '', 'present')]
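The question asks for a plain True/False answer per string. One possible way to get that is a stricter variant than the pattern above, sketched here under two assumptions: the separator is a hyphen or the word "to", and years have four digits. It uses re.fullmatch so that any extra text around the range makes the line invalid. The names STRICT_RANGE and is_valid_range are hypothetical.

```python
import re

MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
         r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
DATE = MONTH + r"\s+\d{4}"  # assumption: four-digit years only

# Assumed separators: a hyphen or the word "to", with optional whitespace around it.
STRICT_RANGE = re.compile(
    DATE + r"\s*(?:-|to)\s*(?:" + DATE + r"|present)", re.IGNORECASE)


def is_valid_range(line: str) -> bool:
    """Return True only if the whole stripped line is a single date range."""
    return STRICT_RANGE.fullmatch(line.strip()) is not None


print(is_valid_range("February 2018 to present"))                  # True
print(is_valid_range("University of Pennsylvania February 2018"))  # False
```

Because of fullmatch, a line holding two ranges, or a range preceded by other words, is rejected. That is a deliberate difference from the search-based pattern in the answer.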

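【Comments】:

  • Jun(?:e)? = June?
  • @CharifDZ I wish I could upvote you in an infinite loop! Thank you for the clear and precise explanation. This is exactly what I was looking for, and it shows how efficiently you code.
  • @CharifDZ I would like to make one small change (entirely optional, but I want to cover this edge case too). I want pals october 2018 to october 2019 to count as valid, so I changed the regex to '([\w]\s)?{RE_DATE}.+?(?:{RE_DATE}|(present))', but the sentence above is still not treated as valid. Any suggestions? In general I want optional_word month1 year1 ( - or to ) month2 year2 | present to count as valid.
  • What exactly is the sentence? To me it looks like the others, so why would it not match? Can you post the sentence in a comment?
  • @CharifDZ It was a typo. I added october and it worked. My bad!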
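A note on the optional-word variant discussed in these comments: findall and search scan for the pattern anywhere in the line, so a leading word does not need to appear in the pattern at all. Also, ([\w]\s)? matches only a single word character followed by one whitespace character. If the intent is to allow at most one leading word and nothing else before the range, an anchored pattern is one option. This is a sketch, and PREFIXED_RANGE is a hypothetical name.

```python
import re

MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
         r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
DATE = rf"({MONTH})\s+(\d{{2,4}})"

# ^(?:\w+\s+)? allows zero or one whole leading word before the first date.
PREFIXED_RANGE = re.compile(
    rf"^(?:\w+\s+)?{DATE}.+?(?:{DATE}|(present))$", re.IGNORECASE)

print(bool(PREFIXED_RANGE.search("pals october 2018 to october 2019")))    # True
print(bool(PREFIXED_RANGE.search("George Mason University august 2019")))  # False
```

With the anchors in place, a line that starts with two or more extra words is rejected even if a valid range follows them.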
【Answer 2】:

Maybe you could loosen the boundaries of your expression with a simpler one such as:

(?i)^\S+\s+(\d{2})?(\d{2})\s*(?:[-_]|to)\s*(present|\S+)\s*(\d{2})?(\d{2})?$

or perhaps,

(?i)\S+\s+(\d{2})?(\d{2})\s*(?:[-_]|to)\s*(present|\S+)\s*(\d{2})?(\d{2})?

Test

import re

regex = r"(?i)^\S+\s+(\d{2})?(\d{2})\s*(?:[-_]|to)\s*(present|\S+)\s*(\d{2})?(\d{2})?$"
string = """
February 2016 - March 2019
September 2015 to August 2019
April 2015 to present
September 2018 - present
Feb. 2016 - March 2019
Sept 2015 to Aug. 2019
April 2015 to present
Nov. 2018 - present

Invalid string:
George Mason University august 2019

Stratusburg university February 2018

Some text and month followed by year
"""

print(re.findall(regex, string, re.M))

Output

[('20', '16', 'March', '20', '19'), ('20', '15', 'August', '20', '19'), ('20', '15', 'present', '', ''), ('20', '18', 'present', '', ''), ('20', '16', 'March', '20', '19'), ('20', '15', 'Aug.', '20', '19'), ('20', '15', 'present', '', ''), ('20', '18', 'present', '', '')]

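If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you like, you can also watch in this link how it matches against some sample inputs.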
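One trade-off of the loose expression is that \S+ accepts any token where a month name is expected. The sketch below shows this and adds a crude post-check that is not part of the answer. The helper names looks_like_month and is_range are hypothetical, and comparing only the first three letters is a rough heuristic that would also accept words such as "marble".

```python
import re

LOOSE = re.compile(
    r"(?i)^\S+\s+(\d{2})?(\d{2})\s*(?:[-_]|to)\s*(present|\S+)\s*(\d{2})?(\d{2})?$")

# Assumption: a token counts as a month if its first three letters match one.
MONTH_PREFIXES = {"jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec"}


def looks_like_month(token: str) -> bool:
    return token[:3].lower() in MONTH_PREFIXES


def is_range(line: str) -> bool:
    """Loose structural match first, then check the month-like tokens."""
    line = line.strip()
    match = LOOSE.match(line)
    if match is None:
        return False
    start = line.split()[0]   # the token consumed by the leading \S+
    end = match.group(3)      # either "present" or the end-month token
    return looks_like_month(start) and (
        end.lower() == "present" or looks_like_month(end))


print(bool(LOOSE.match("Foo 2016 - bar 2019")))  # True: the loose pattern accepts it
print(is_range("Foo 2016 - bar 2019"))           # False: rejected by the post-check
print(is_range("Feb. 2016 - March 2019"))        # True
```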


【Comments】:

  • Thanks for the explanation and the code. I will give it a try.