Parsing long-form dates from a string

【Question title】: Parsing long form dates from string
【Posted】: 2022-11-17 02:49:09
【Question】:

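I know there are other solutions to similar questions on Stack Overflow, but they don't work for my particular case.

I have some strings; here are a few examples.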

string_with_dates = "random non-date text, 22 May 1945 and 11 June 2004"
string2 = "random non-date text, 01/01/1999 & 11 June 2004"
string3 = "random non-date text, 01/01/1990, June 23 2010"
string4 = "01/2/2010 and 25th of July 2020"
string5 = "random non-date text, 01/02/1990"
string6 = "random non-date text, 01/02/2010 June 10 2010"

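I need a parser that can determine how many date-like objects a string contains and then parse them into a list of actual dates. I couldn't find any solution out there. This is the desired output for the first string: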


['05/22/1945','06/11/2004']

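Or as actual datetime objects. Any ideas?

I have already tried the solutions listed here, but they don't work: How to parse multiple dates from a block of text in Python (or another language)

This is what happens when I try the solution suggested in that link: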


import itertools
from dateutil import parser

jumpwords = set(parser.parserinfo.JUMP)
keywords = set(kw.lower() for kw in itertools.chain(
    parser.parserinfo.UTCZONE,
    parser.parserinfo.PERTAIN,
    (x for s in parser.parserinfo.WEEKDAYS for x in s),
    (x for s in parser.parserinfo.MONTHS for x in s),
    (x for s in parser.parserinfo.HMS for x in s),
    (x for s in parser.parserinfo.AMPM for x in s),
))

def parse_multiple(s):
    def is_valid_kw(s):
        try:  # is it a number?
            float(s)
            return True
        except ValueError:
            return s.lower() in keywords

    def _split(s):
        kw_found = False
        tokens = parser._timelex.split(s)
        for i in range(len(tokens)):
            if tokens[i] in jumpwords:
                continue 
            if not kw_found and is_valid_kw(tokens[i]):
                kw_found = True
                start = i
            elif kw_found and not is_valid_kw(tokens[i]):
                kw_found = False
                yield "".join(tokens[start:i])
        # handle date at end of input str
        if kw_found:
            yield "".join(tokens[start:])

    return [parser.parse(x) for x in _split(s)]

parse_multiple(string_with_dates)

Output:


ParserError: Unknown string format: 22 May 1945 and 11 June 2004

Another approach:


from dateutil.parser import _timelex, parser

a = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"

p = parser()
info = p.info

def timetoken(token):
  try:
    float(token)
    return True
  except ValueError:
    pass
  return any(f(token) for f in (info.jump,info.weekday,info.month,info.hms,info.ampm,info.pertain,info.utczone,info.tzoffset))

def timesplit(input_string):
  batch = []
  for token in _timelex(input_string):
    if timetoken(token):
      if info.jump(token):
        continue
      batch.append(token)
    else:
      if batch:
        yield " ".join(batch)
        batch = []
  if batch:
    yield " ".join(batch)

for item in timesplit(string_with_dates):
  print("Found:", item)
  print("Parsed:", p.parse(item))

Output:



ParserError: Unknown string format: 22 May 1945 11 June 2004

Any ideas?

【Comments】:
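Both error messages show the two dates arriving at the parser as a single chunk ("22 May 1945 and 11 June 2004", "22 May 1945 11 June 2004"). A likely cause is that both snippets skip dateutil's jump words instead of treating them as separators, and "and" appears to be one of them. One way around this is to locate each date on its own with a regular expression and only then parse it. The following is a minimal standard-library sketch, not a general solution: the names `extract_dates`, `CANDIDATE`, `FILLER` and `FORMATS` are hypothetical, the patterns cover only the shapes in the example strings, slash dates are assumed to be month/day/year, and month names are assumed to be English.

```python
import re
from datetime import datetime

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Candidate shapes taken from the example strings in the question:
#   01/02/1990  |  22 May 1945 / 25th of July 2020  |  June 23 2010
CANDIDATE = re.compile(
    r"\b\d{1,2}/\d{1,2}/\d{4}\b"
    rf"|\b\d{{1,2}}(?:st|nd|rd|th)?(?:\s+of)?\s+(?:{MONTHS})\s+\d{{4}}\b"
    rf"|\b(?:{MONTHS})\s+\d{{1,2}}(?:st|nd|rd|th)?,?\s+\d{{4}}\b"
)

# Ordinal suffixes, "of" and commas are dropped before calling strptime.
FILLER = re.compile(r"(?<=\d)(?:st|nd|rd|th)\b|\bof\s+|,")

# Assumption: slash dates are month/day/year; %B needs an English locale.
FORMATS = ("%m/%d/%Y", "%d %B %Y", "%B %d %Y")


def extract_dates(text):
    """Return every date found in text as a datetime, in order of appearance."""
    dates = []
    for match in CANDIDATE.finditer(text):
        cleaned = FILLER.sub("", match.group())
        for fmt in FORMATS:
            try:
                dates.append(datetime.strptime(cleaned, fmt))
                break
            except ValueError:
                continue
    return dates


print(extract_dates("random non-date text, 22 May 1945 and 11 June 2004"))
print(extract_dates("01/2/2010 and 25th of July 2020"))
```

Because each date is matched independently, the number of dates in a string is simply `len(extract_dates(text))`, and strings with no separator between dates (such as string6) are handled the same way as the others.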

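What exactly doesn't work about the solutions you found at that link?
For every approach in that link I get this error: "ParserError: Unknown string format: 22 May 1945 and 11 June 2004"
Can you give an example of what you tried? Also, do the strings have a consistent format between the dates, or does it vary? You have to make sure you can parse all of these scenarios.
Just updated the question to include the functions I've tried and the errors they produce.
Try using .split() to separate the two dates into individual strings, then parse each of them separately.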
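The split suggestion in the last comment could look roughly like the sketch below. It is an illustration only: `split_and_parse` and `FORMATS` are hypothetical names, the separators (",", "&", "and") and the three formats are taken from the example strings, and slash dates are assumed to be month/day/year.

```python
import re
from datetime import datetime

# Formats assumed from the question's examples; %B needs an English locale.
FORMATS = ("%d %B %Y", "%B %d %Y", "%m/%d/%Y")


def split_and_parse(text):
    """Split on ',', '&' and the word 'and', then try each format per chunk."""
    chunks = re.split(r"\s*(?:,|&|\band\b)\s*", text)
    found = []
    for chunk in chunks:
        for fmt in FORMATS:
            try:
                found.append(datetime.strptime(chunk.strip(), fmt))
                break
            except ValueError:
                continue  # not this format; chunks matching none are skipped
    return found


print(split_and_parse("random non-date text, 22 May 1945 and 11 June 2004"))
print(split_and_parse("random non-date text, 01/01/1999 & 11 June 2004"))
```

This only works when something separates the dates. A string like string6, where two dates follow each other directly, or a form like "25th of July 2020" would need extra handling.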

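Tags: python python-3.x string date parsing

【Solution 1】:

Well, apologies to anyone who spent time on this, but I was able to answer my own question. Leaving this here in case anyone else has the same problem.

This package works perfectly for this: https://pypi.org/project/datefinder/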


import datefinder

def DatesToList(x):
    # find_dates yields datetime objects lazily; collect them into a list
    return list(datefinder.find_dates(x))

dates = DatesToList(string_with_dates)

Output:


[datetime.datetime(1945, 5, 22, 0, 0), datetime.datetime(2004, 6, 11, 0, 0)]
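The question also asked for strings such as '05/22/1945'. A list of datetime objects like the one above can be converted with `strftime`. This is a small sketch; the helper name `to_us_strings` is hypothetical.

```python
from datetime import datetime


def to_us_strings(dates):
    """Format datetime objects as MM/DD/YYYY strings."""
    return [d.strftime("%m/%d/%Y") for d in dates]


parsed = [datetime(1945, 5, 22, 0, 0), datetime(2004, 6, 11, 0, 0)]
print(to_us_strings(parsed))  # ['05/22/1945', '06/11/2004']
```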

【Discussion】: