Extracting tokens where some are optional: answers

【Question title】: Extracting tokens where some are optional
【Posted】: 2010-08-02 19:54:50
【Question description】:

I need to parse time tokens out of a string, where each token is optional. Sample inputs:

tt-5d10h
tt-5d10h30m
tt-5d30m
tt-10h30m
tt-5d
tt-10h
tt-30m

What is the best way to parse these in Python into a tuple (days, hours, minutes)?

【Question comments】:

【Solution 1】:

This program returns three integers (days, hours, minutes) for each input:

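import re
samples = ['tt-5d10h', 'tt-5d10h30m', 'tt-5d30m', 'tt-10h30m', 'tt-5d', 'tt-10h', 'tt-30m',]

def parse(text):
    # Each unit is an optional non-capturing group wrapping one captured number.
    match = re.match(r'tt-(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?', text)
    # groups(0) substitutes 0 for every group that did not take part in the match.
    values = [int(x) for x in match.groups(0)]
    return values

for sample in samples:
    print(parse(sample))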

Output:

[5, 10, 0]
[5, 10, 30]
[5, 0, 30]
[0, 10, 30]
[5, 0, 0]
[0, 10, 0]
[0, 0, 30]
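
One caveat with the pattern above: every group is optional and re.match only anchors at the start, so a bare 'tt-' or an input such as 'tt-5x' still matches and yields [0, 0, 0]. A minimal sketch of a stricter variant follows. The name parse_strict and the choice to raise ValueError are assumptions made for illustration, not part of the original answer.

```python
import re

# Same pattern as above, compiled once. TOKEN_RE is an illustrative name.
TOKEN_RE = re.compile(r'tt-(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?')

def parse_strict(text):
    """Return [days, hours, minutes]; raise ValueError on malformed input."""
    # fullmatch requires the whole string to match, not just a prefix.
    match = TOKEN_RE.fullmatch(text)
    # any(...) is False when no unit at all was present (e.g. a bare 'tt-').
    if match is None or not any(match.groups()):
        raise ValueError("not a valid time token: %r" % text)
    # groups(0) substitutes 0 for each missing unit.
    return [int(x) for x in match.groups(0)]

print(parse_strict('tt-5d30m'))  # [5, 0, 30]
```

Whether malformed input should raise or fall back to zeros depends on the caller; raising is just one option.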

【Comments】:

【Solution 2】:

>>> import re
>>> pattern = re.compile(r"tt-(\d+d)?(\d+h)?(\d+m)?")
>>> results = pattern.match("tt-5d10h")
>>> days, hours, minutes = results.groups()
>>> days, hours, minutes
('5d', '10h', None)
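
Note that these groups capture the unit letter along with the digits, and a missing token comes back as None. If integers are wanted, one possible follow-up step is sketched below. The helper name to_int is hypothetical.

```python
import re

pattern = re.compile(r"tt-(\d+d)?(\d+h)?(\d+m)?")

def to_int(group):
    # Hypothetical helper: drop the trailing unit letter ('10h' -> 10);
    # a group that did not match is None and becomes 0.
    return int(group[:-1]) if group else 0

days, hours, minutes = (to_int(g) for g in pattern.match("tt-5d10h").groups())
print(days, hours, minutes)  # 5 10 0
```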

【Comments】:

【Solution 3】:

Similar to compie's answer, but with a final result that is nicer to work with:

re.match(r'tt-(?:(?P<days>\d+)d)?(?:(?P<hours>\d+)h)?(?:(?P<minutes>\d+)m)?', text).groupdict()

Example:

>>> import re
>>> s = ['tt-5d10h', 'tt-5d10h30m', 'tt-5d30m', 'tt-10h30m', 'tt-5d', 'tt-10h', 'tt-30m']
>>> for text in s:
...     print(re.match(r'tt-(?:(?P<days>\d+)d)?(?:(?P<hours>\d+)h)?(?:(?P<minutes>\d+)m)?', text).groupdict())
...

{'days': '5', 'hours': '10', 'minutes': None}
{'days': '5', 'hours': '10', 'minutes': '30'}
{'days': '5', 'hours': None, 'minutes': '30'}
{'days': None, 'hours': '10', 'minutes': '30'}
{'days': '5', 'hours': None, 'minutes': None}
{'days': None, 'hours': '10', 'minutes': None}
{'days': None, 'hours': None, 'minutes': '30'}

If you want 0 in place of the missing tokens, just use groupdict(0) instead of groupdict().
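
Keep in mind that groupdict(0) returns the integer 0 for missing tokens but still returns strings for the ones that matched. The sketch below converts everything to int and, as one possible use, builds a datetime.timedelta. The function name to_timedelta and the ValueError are assumptions for illustration. It works because the group names happen to match timedelta's keyword arguments.

```python
import re
from datetime import timedelta

TOKEN_RE = re.compile(
    r'tt-(?:(?P<days>\d+)d)?(?:(?P<hours>\d+)h)?(?:(?P<minutes>\d+)m)?'
)

def to_timedelta(text):
    match = TOKEN_RE.match(text)
    if match is None:
        raise ValueError("input does not start with 'tt-': %r" % text)
    # groupdict(0) mixes str (matched) and int 0 (missing); int() handles both.
    fields = {name: int(value) for name, value in match.groupdict(0).items()}
    # Group names 'days', 'hours', 'minutes' are valid timedelta keywords.
    return timedelta(**fields)

print(to_timedelta('tt-5d30m'))  # 5 days, 0:30:00
```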

【Comments】:

【Solution 4】:

Using str.partition:

inputstring="""tt-5d10h
tt-5d10h30m
tt-5d30m
tt-10h30m
tt-5d
tt-10h
tt-30m
"""
separators=('d','h','m')
result=[]
for text in (item.lstrip('t-') for item in inputstring.splitlines()):
    data=[]
    for sep in separators:
        d,found,text = text.partition(sep)
        if found: data.append(int(d.rstrip(sep)))
        else:
            data.append(0)
            text=d
    result.append(data)
# show input and result
for respairs in zip(inputstring.splitlines(),result): print(respairs)
""" Output:
('tt-5d10h', [5, 10, 0])
('tt-5d10h30m', [5, 10, 30])
('tt-5d30m', [5, 0, 30])
('tt-10h30m', [0, 10, 30])
('tt-5d', [5, 0, 0])
('tt-10h', [0, 10, 0])
('tt-30m', [0, 0, 30])
"""
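
The same partition idea can be wrapped in a reusable function. This is only a sketch: the name parse_partition and its parameters are hypothetical. It slices off the literal 'tt-' prefix instead of calling lstrip('t-'), because lstrip treats its argument as a set of characters rather than as a prefix.

```python
def parse_partition(text, prefix='tt-', separators=('d', 'h', 'm')):
    """Return [days, hours, minutes] using str.partition only."""
    if not text.startswith(prefix):
        raise ValueError("missing %r prefix: %r" % (prefix, text))
    rest = text[len(prefix):]  # remove the exact prefix, not a character set
    values = []
    for sep in separators:
        number, found, tail = rest.partition(sep)
        if found:
            values.append(int(number))
            rest = tail  # continue after the separator
        else:
            values.append(0)  # unit absent; leave rest unchanged
    return values

print(parse_partition('tt-5d30m'))  # [5, 0, 30]
```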

【Comments】:

【Solution 5】:

Here is a pyparsing approach to the problem:

tests = """tt-5d10h 
tt-5d10h30m 
tt-5d30m 
tt-10h30m 
tt-5d 
tt-10h 
tt-30m""".splitlines()

from pyparsing import Word,nums,Optional

integer = Word(nums).setParseAction(lambda t:int(t[0]))

timeFormat = "tt-" + (
                Optional(integer("days") + "d") +
                Optional(integer("hrs")  + "h") +
                Optional(integer("mins") + "m")
                )

def normalizeTime(tokens):
    return tuple(tokens[field] if field in tokens else 0 
                for field in "days hrs mins".split())

timeFormat.setParseAction(normalizeTime)

for test in tests:
    print("%-12s ->" % test, end=" ")
    print("%d %02d:%02d" % timeFormat.parseString(test)[0])

Prints:

tt-5d10h     -> 5 10:00
tt-5d10h30m  -> 5 10:30
tt-5d30m     -> 5 00:30
tt-10h30m    -> 0 10:30
tt-5d        -> 5 00:00
tt-10h       -> 0 10:00
tt-30m       -> 0 00:30

Or, to keep the named results:

def normalizeTime(tokens):
    for field in "days hrs mins".split():
        if field not in tokens:
            tokens[field] = 0

timeFormat.setParseAction(normalizeTime)

for test in tests:
    print("%-12s ->" % test, end=" ")
    print("%(days)d %(hrs)02d:%(mins)02d" % timeFormat.parseString(test))

【Comments】: