【Question Title】: Extracting tokens where some are optional
【Posted】: 2010-08-02 19:54:50
【Question】:

I need to parse time tokens out of a string, where some of the tokens are optional. Sample inputs:

  • tt-5d10h
  • tt-5d10h30m
  • tt-5d30m
  • tt-10h30m
  • tt-5d
  • tt-10h
  • tt-30m

What is the best way to parse these in Python into a set of (days, hours, minutes)?

【Discussion】:

    Tags: python regex string


    【Solution 1】:

    This program returns three integers (days, hours, minutes) for each input:

    import re
    samples = ['tt-5d10h', 'tt-5d10h30m', 'tt-5d30m', 'tt-10h30m', 'tt-5d', 'tt-10h', 'tt-30m',]
    
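    def parse(text):
        # Raw string so that \d is not treated as a string escape.
        match = re.match(r'tt-(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?', text)
        # groups(0) substitutes 0 for any optional group that did not match.
        values = [int(x) for x in match.groups(0)]
        return values
    
    for sample in samples:
        print(parse(sample))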
    

    Output:

    [5, 10, 0]
    [5, 10, 30]
    [5, 0, 30]
    [0, 10, 30]
    [5, 0, 0]
    [0, 10, 0]
    [0, 0, 30]
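
The `parse` function above assumes the input always matches; if the `tt-` prefix is missing, `re.match` returns `None` and the call fails with an `AttributeError`. Below is a minimal sketch of a stricter variant, assuming you want a `datetime.timedelta` back and a `ValueError` on malformed input. The names `TOKEN_RE` and `parse_timedelta` are hypothetical and not part of the original answer.

```python
import re
from datetime import timedelta

# Same pattern as the answer above, compiled once (hypothetical name).
TOKEN_RE = re.compile(r'tt-(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?')

def parse_timedelta(text):
    """Parse strings such as 'tt-5d10h30m' into a timedelta."""
    # fullmatch rejects trailing garbage such as 'tt-5x'.
    match = TOKEN_RE.fullmatch(text)
    if match is None:
        raise ValueError("not a time token: %r" % text)
    # groups(0) yields 0 for every optional group that did not participate.
    days, hours, minutes = (int(x) for x in match.groups(0))
    return timedelta(days=days, hours=hours, minutes=minutes)
```

Note that a bare `tt-` still matches, because every token is optional; it yields a zero `timedelta`.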
    

    【Comments】:

      【Solution 2】:
      >>> import re
      >>> pattern = re.compile(r"tt-(\d+d)?(\d+h)?(\d+m)?")
      >>> results = pattern.match("tt-5d10h")
      >>> days, hours, minutes = results.groups()
      >>> days, hours, minutes
      ('5d', '10h', None)
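
With this pattern the groups keep their unit letter and missing tokens come back as `None`, so an extra step is needed to get integers. A possible sketch, assuming 0 is the desired default for a missing token; the helper name `to_ints` is hypothetical:

```python
import re

pattern = re.compile(r"tt-(\d+d)?(\d+h)?(\d+m)?")

def to_ints(text):
    """Return (days, hours, minutes) as ints, or None if there is no match."""
    results = pattern.match(text)
    if results is None:
        return None
    # Each group is either None or digits followed by one unit letter,
    # so dropping the last character leaves only the digits.
    return tuple(int(g[:-1]) if g else 0 for g in results.groups())

print(to_ints("tt-5d10h"))  # (5, 10, 0)
```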
      

      【Comments】:

        【Solution 3】:

        Similar to compie's answer, but makes the final result easier to work with:

        re.match(r'tt-(?:(?P<days>\d+)d)?(?:(?P<hours>\d+)h)?(?:(?P<minutes>\d+)m)?', text).groupdict()
        

        Example:

        >>> import re
        >>> s = ['tt-5d10h', 'tt-5d10h30m', 'tt-5d30m', 'tt-10h30m', 'tt-5d', 'tt-10h', 'tt-30m']
        >>> for text in s:
        ...     print(re.match(r'tt-(?:(?P<days>\d+)d)?(?:(?P<hours>\d+)h)?(?:(?P<minutes>\d+)m)?', text).groupdict())
        ...
        
        {'hours': '10', 'minutes': None, 'days': '5'}
        {'hours': '10', 'minutes': '30', 'days': '5'}
        {'hours': None, 'minutes': '30', 'days': '5'}
        {'hours': '10', 'minutes': '30', 'days': None}
        {'hours': None, 'minutes': None, 'days': '5'}
        {'hours': '10', 'minutes': None, 'days': None}
        {'hours': None, 'minutes': '30', 'days': None}
        

        If you want 0 instead of None for the missing tokens, just use groupdict(0) instead of groupdict().
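
One caveat with `groupdict(0)`: the matched tokens are still strings while the defaults are the integer `0`, so the values end up with mixed types. A small sketch that normalizes everything to `int`, assuming that is what the caller wants; `PATTERN` and `parse_named` are hypothetical names:

```python
import re

PATTERN = re.compile(
    r'tt-(?:(?P<days>\d+)d)?(?:(?P<hours>\d+)h)?(?:(?P<minutes>\d+)m)?'
)

def parse_named(text):
    """Return a dict with integer 'days', 'hours' and 'minutes' entries."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError("not a time token: %r" % text)
    # groupdict(0) fills in 0 for missing groups; int() converts the
    # matched digit strings and leaves the integer default unchanged.
    return {name: int(value) for name, value in match.groupdict(0).items()}

print(parse_named('tt-5d30m'))
```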

        【Comments】:

          【Solution 4】:

          Using partition:

          inputstring="""tt-5d10h
          tt-5d10h30m
          tt-5d30m
          tt-10h30m
          tt-5d
          tt-10h
          tt-30m
          """
          separators=('d','h','m')
          result=[]
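          # Slice off the 'tt-' prefix; lstrip('t-') strips a set of characters, not a prefix.
          for text in (item[len('tt-'):] for item in inputstring.splitlines()):
              data=[]
              for sep in separators:
                  d,found,text = text.partition(sep)
                  if found: data.append(int(d))
                  else:
                      # Separator absent: record 0 and restore the unconsumed text.
                      data.append(0)
                      text=d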
              result.append(data)
          # show input and result
          for respairs in zip(inputstring.splitlines(),result): print(respairs)
          """ Output:
          ('tt-5d10h', [5, 10, 0])
          ('tt-5d10h30m', [5, 10, 30])
          ('tt-5d30m', [5, 0, 30])
          ('tt-10h30m', [0, 10, 30])
          ('tt-5d', [5, 0, 0])
          ('tt-10h', [0, 10, 0])
          ('tt-30m', [0, 0, 30])
          """
          

          【Comments】:

            【Solution 5】:

            Here is a pyparsing approach to the problem:

            tests = """tt-5d10h 
            tt-5d10h30m 
            tt-5d30m 
            tt-10h30m 
            tt-5d 
            tt-10h 
            tt-30m""".splitlines()
            
            from pyparsing import Word,nums,Optional
            
            integer = Word(nums).setParseAction(lambda t:int(t[0]))
            
            timeFormat = "tt-" + (
                            Optional(integer("days") + "d") +
                            Optional(integer("hrs")  + "h") +
                            Optional(integer("mins") + "m")
                            )
            
            def normalizeTime(tokens):
                return tuple(tokens[field] if field in tokens else 0 
                            for field in "days hrs mins".split())
            
            timeFormat.setParseAction(normalizeTime)
            
            for test in tests:
                print("%-12s ->" % test, end=" ")
                print("%d %02d:%02d" % timeFormat.parseString(test)[0])
            

            Prints:

            tt-5d10h     -> 5 10:00
            tt-5d10h30m  -> 5 10:30
            tt-5d30m     -> 5 00:30
            tt-10h30m    -> 0 10:30
            tt-5d        -> 5 00:00
            tt-10h       -> 0 10:00
            tt-30m       -> 0 00:30
            

            Or, to keep the named results:

            def normalizeTime(tokens):
                for field in "days hrs mins".split():
                    if field not in tokens:
                        tokens[field] = 0
            
            timeFormat.setParseAction(normalizeTime)
            
            for test in tests:
                print("%-12s ->" % test, end=" ")
                print("%(days)d %(hrs)02d:%(mins)02d" % timeFormat.parseString(test))
            

            【Comments】:
