【Question title】: Regular expression for recognizing different date formats [duplicate]
【Posted】: 2019-08-07 07:41:27
【Question】:

I have to use a regular expression to recognize different date formats inside a string, as shown below.

date can contain 21/12/2018
or 12/21/2018
or 2018/12/21
or 12/2018
or 21-12-2018
or 12-21-2018
or 2018-12-21
or 21-Jan-2018
or Jan 21,2018
or 21st Jan 2018
or 21-Jan-2018
or Jan 21,2018
or 21st Jan 2018
or Jan 21, 2018
or Jan 21, 2018
or 2018 Dec. 21
or 2018 Dec 21
or 21st of Jan 2018
or 21st of Jan 2018
or Jan 2018
or Jan 2018
or Jan. 2018
or Jan, 2018
or 2018
[should recognize (year only), (year and month), (year, month and day), year is mandatory in every date format to be recognized]  
[months are abbreviated to three letters, first letter capital]

My regular expression is as follows:

\b(((((0?[1-9]|[12][0-9]|3[01])(\s*(st|nd|rd|th)?\s*(of)?\s*)?)|(20[012]\d)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[\/\-\.\,\s]*){1,3})\b

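It does not work as expected, and it also matches other patterns. Every date pattern to be recognized must be one of three kinds, (year only), (year and month) or (year, month and day), and the year is mandatory. What corrections are needed to make it work? Please help.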

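【Question comments】:

  • I didn't downvote, but I did flag this as too broad. You would need to write a very long regex alternation.
  • Regex is a poor choice for a problem with this many ors. I think you would be better off writing a parser.
  • 21-12-2018 or 12-21-2018 – what do you intend to do for December 11?
  • @enumiro, these dates come from the column headers of 10-K documents from different companies, which I am trying to scrape. So I have no control over the input date format.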
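
If a regex is still wanted, the "many ors" problem raised above is easier to manage when the pattern is assembled from small named pieces, one alternative per layout. The following is only a minimal sketch, not code from the question or the answers: the names MONTH, YEAR, DAY, NUM_MONTH, SEP, ALTERNATIVES, DATE_RE and find_dates are hypothetical, and the year range 1900 to 2099 is an assumption. It does not resolve the day/month ambiguity mentioned in the comments; it only finds the text.

```python
import re

# Building blocks (assumption: years 1900-2099, three-letter capitalized months).
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
YEAR = r"(?:19|20)\d\d"
DAY = r"(?:0?[1-9]|[12]\d|3[01])"
NUM_MONTH = r"(?:0?[1-9]|1[0-2])"
SEP = r"[/-]"

# One alternative per layout, most specific first; every one contains a year.
ALTERNATIVES = [
    # 21/12/2018, 12/21/2018, 21-12-2018 (day/month order is not checked)
    rf"{DAY}(?P<s1>{SEP}){DAY}(?P=s1){YEAR}",
    # 2018/12/21, 2018-12-21
    rf"{YEAR}(?P<s2>{SEP}){NUM_MONTH}(?P=s2){DAY}",
    # 12/2018
    rf"{NUM_MONTH}{SEP}{YEAR}",
    # 21-Jan-2018
    rf"{DAY}-{MONTH}-{YEAR}",
    # 21st Jan 2018, 21st of Jan 2018
    rf"{DAY}(?:st|nd|rd|th)?\s+(?:of\s+)?{MONTH}\.?\s+{YEAR}",
    # Jan 21,2018 and Jan 21, 2018
    rf"{MONTH}\.?\s+{DAY},\s*{YEAR}",
    # 2018 Dec. 21, 2018 Dec 21
    rf"{YEAR}\s+{MONTH}\.?\s+{DAY}",
    # Jan 2018, Jan. 2018, Jan, 2018
    rf"{MONTH}[.,]?\s+{YEAR}",
    # 2018
    YEAR,
]

DATE_RE = re.compile(r"\b(?:" + "|".join(ALTERNATIVES) + r")\b")


def find_dates(text):
    """Return every substring of text that matches one of the layouts."""
    return [m.group(0) for m in DATE_RE.finditer(text)]


print(find_dates("Revenue for Jan 21, 2018 and 2017"))
```

Because the alternatives are tried in order, the longer layouts have to come before the shorter ones (year only goes last), otherwise "2018 Dec 21" would be cut down to "2018".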

Tags: python regex


【Solution 1】:

IIUC, dateutil.parser would be a better choice than re:

import dateutil.parser as dparser

# Every sample format from the question
date_strings = ["21/12/2018","12/21/2018","2018/12/21","12/2018",
"21-12-2018","12-21-2018","2018-12-21","21-Jan-2018",
"Jan 21,2018","21st Jan 2018","21-Jan-2018","Jan 21,2018",
"21st Jan 2018","Jan 21, 2018","Jan 21, 2018","2018 Dec. 21",
"2018 Dec 21","21st of Jan 2018","21st of Jan 2018","Jan 2018",
"Jan 2018","Jan. 2018","Jan, 2018","2018"]

# fuzzy=True lets the parser skip tokens it cannot interpret
[str(dparser.parse(s, fuzzy=True)) for s in date_strings]

Output:

['2018-12-21 00:00:00',
 '2018-12-21 00:00:00',
 '2018-12-21 00:00:00',
 '2018-12-07 00:00:00',
 '2018-12-21 00:00:00',
 '2018-12-21 00:00:00',
 '2018-12-21 00:00:00',
 '2018-01-21 00:00:00',
 '2019-01-21 00:00:00',
 '2018-01-21 00:00:00',
 '2018-01-21 00:00:00',
 '2019-01-21 00:00:00',
 '2018-01-21 00:00:00',
 '2018-01-21 00:00:00',
 '2018-01-21 00:00:00',
 '2018-12-21 00:00:00',
 '2018-12-21 00:00:00',
 '2018-01-21 00:00:00',
 '2018-01-21 00:00:00',
 '2018-01-07 00:00:00',
 '2018-01-07 00:00:00',
 '2018-01-07 00:00:00',
 '2018-01-07 00:00:00',
 '2018-08-07 00:00:00']
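
The 07 days and the 08 month in the output above are not in the inputs. dateutil fills any missing field from a default datetime, which is today's date at midnight unless a `default` argument is passed. The two '2019-01-21' results for "Jan 21,2018" suggest the year was not picked up from that input and came from the default as well, so results are worth checking against the source text. A minimal sketch of pinning the default so year-only and year-month inputs give reproducible results; FIXED_DEFAULT and parse_fixed are hypothetical names, and falling back to 1 January is an assumption about what is wanted:

```python
from datetime import datetime

import dateutil.parser as dparser

# Assumption: a missing month or day should become 1, not today's value.
FIXED_DEFAULT = datetime(1900, 1, 1)


def parse_fixed(text):
    """Parse text, taking any field the text lacks from FIXED_DEFAULT."""
    return dparser.parse(text, fuzzy=True, default=FIXED_DEFAULT)


for text in ["2018", "Jan 2018", "12/2018", "21st of Jan 2018"]:
    print(text, "->", parse_fixed(text))
```

A year of 1900 in a result is then a visible hint that no year was read from the input.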

dateutil.parser can also handle a sentence that contains something date-like (though not always):

s = 'The new millennium has finally come and it is now 1st of Jan 2000.'
str(dparser.parse(s, fuzzy=True))
# '2000-01-01 00:00:00'

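【Comments】:

  • For those wondering what IIUC means: If I Understand Correctly.
  • Thanks, but each date is part of a string, and I have to find it and extract/replace it.
  • @Shijith dateutil.parser can handle that case too. Let me give an example.
  • Thanks a lot. This works. I didn't know it could be used to parse strings.
  • dateutil.parser.parse(string_with_date, fuzzy_with_tokens=True) returns a tuple: the first element is a datetime.datetime object and the second is a tuple containing the rest of the string (the fuzzy tokens). E.g. for string_with_date = 'date can contain 21st of January 2018 as a part of string ' the output of the function is (datetime.datetime(2018, 1, 21, 0, 0), ('date can contain ', ' of ', ' ', 'as a part of string '))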
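
The fuzzy_with_tokens call described in the last comment can be tried as a runnable snippet. This is a sketch using the same example string; the variable names are illustrative only. Note that the call returns a single date per string and no match positions, so a find-and-replace over text with several dates would still need something else to locate them.

```python
import dateutil.parser as dparser

text = "date can contain 21st of January 2018 as a part of string "

# With fuzzy_with_tokens=True the result is a pair:
# (parsed datetime, tuple of the text fragments the parser skipped)
parsed, skipped = dparser.parse(text, fuzzy_with_tokens=True)

print(parsed)
print(skipped)
```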