【Question title】: Split string based on predefined character types
【Posted】: 2018-03-06 16:00:51
【Question description】:

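I have a predefined character -> type dictionary. For example, 'a' is a lowercase letter, 1 is a digit, ')' is punctuation, and so on. With the following script, I label every character in a given string: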

labels=''
for ch in list(example):
    try:
        l = character_type_dict[ch]
        print(l)
        labels = labels+l
    except KeyError:
        labels = labels+'o'
        print('o')
labels

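For example, given "1,234.45kg (in metric system)" as input, the code produces dpdddpddwllwpllwllllllwllllllp as output.

Now I would like to split the string based on these groups. The output should look like this: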

['1',',','234','.','45','kg',' ','(','in',' ','metric',' ','system',')']

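That is, it should split at the boundaries between character types. Any ideas on how to do this efficiently?

【Comments】:

  • I think labels is wrong. It treats k as w and g as l.
  • Oh, thanks for noticing. I probably need to debug the dictionary-creation step.

Tags: python string parsing split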


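【Solution 1】:

labels is wrong (in your example it is 'dpdddpddwllwpllwllllllwllllllp', but I think it should be 'dpdddpddllwpllwllllllwllllllp').

Anyway, you can abuse itertools.groupby. The key looks up the first occurrence of each character in example, which works only because a given character always receives the same label: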

from itertools import groupby

example = "1,234.45kg (in metric system)"
labels = 'dpdddpddllwpllwllllllwllllllp'

output = [''.join(group)
          for _, group in groupby(example, key=lambda ch: labels[example.index(ch)])]

print(output)
# ['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']
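
The `example.index(ch)` lookup above rescans the string for every character. A minimal sketch of an alternative, assuming `labels` holds exactly one label per character of `example`, is to pair characters with labels using `zip` and group the pairs by label:

```python
from itertools import groupby

example = "1,234.45kg (in metric system)"
labels = "dpdddpddllwpllwllllllwllllllp"  # one label per character of example

# Pair each character with its own label, then group consecutive pairs by label.
pairs = zip(example, labels)
output = ["".join(ch for ch, _ in group)
          for _, group in groupby(pairs, key=lambda pair: pair[1])]

print(output)
# ['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']
```

Because each character carries its own label, this variant would also cope with a labelling in which the same character gets different labels at different positions.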

【Discussion】:

    【Solution 2】:

    You can compute the labels more concisely (and most likely faster):

    labels = ''.join(character_type_dict.get(ch, 'o') for ch in example)
    

    Or, using a helper function:

    character_type = lambda ch: character_type_dict.get(ch, 'o')
    labels = ''.join(map(character_type, example))
    

    However, you do not need the labels to split the string; with the help of itertools.groupby you can split it directly:

    import itertools

    splits = list(''.join(g)
                  for _, g in itertools.groupby(example, key=character_type))
    

    A possibly more interesting result is a list of tuples pairing each group with its type code:

     >>> list((''.join(g), code)
     ...      for code, g in itertools.groupby(example, key=character_type))
     [('1', 'd'), (',', 'p'), ('234', 'd'), ('.', 'p'), ('45', 'd'), ('kg', 'l'),
      (' ', 'w'), ('(', 'p'), ('in', 'l'), (' ', 'w'), ('metric', 'l'), (' ', 'w'),
      ('system', 'l'), (')', 'p')]
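
The direct `groupby` split can be wrapped in a small reusable function. The following is only a sketch: the names `CHARACTER_TYPES` and `split_by_type` are hypothetical, and the type codes (with `'o'` for unknown characters) are assumed to match the ones used in this thread.

```python
import string
from itertools import groupby

# Hypothetical names; the type codes follow the ones used in this thread.
CHARACTER_TYPES = {}
for code, chars in (("w", string.whitespace), ("l", string.ascii_letters),
                    ("d", string.digits), ("p", string.punctuation)):
    for char in chars:
        CHARACTER_TYPES[char] = code


def split_by_type(text, types=CHARACTER_TYPES, default="o"):
    """Return (substring, type_code) pairs for runs of same-typed characters."""
    return [("".join(group), code)
            for code, group in groupby(text, key=lambda ch: types.get(ch, default))]


print(split_by_type("10€ fee"))
# [('10', 'd'), ('€', 'o'), (' ', 'w'), ('fee', 'l')]
```

Characters missing from the dictionary, such as the euro sign here, fall back to the default code and form their own group instead of raising KeyError.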
    

    I built character_type_dict as follows:

    import string

    character_type_dict = {}
    for code, chars in (('w', string.whitespace),
                        ('l', string.ascii_letters),
                        ('d', string.digits),
                        ('p', string.punctuation)):
      for char in chars: character_type_dict[char] = code
    

    But I could also have done it like this (as I found out later):

    import string
    from collections import ChainMap

    # getattr(string, n) fetches e.g. string.digits by name
    character_type_dict = dict(ChainMap(*({c: t for c in getattr(string, n)}
                                        for t, n in (('w', 'whitespace'),
                                                     ('d', 'digits'),
                                                     ('l', 'ascii_letters'),
                                                     ('p', 'punctuation')))))
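
As a quick sanity check, the nested-loop construction can be compared with a flatter spelling. This sketch uses a plain dict comprehension, which is an alternative not taken from the answer; the variable names are illustrative only.

```python
import string

categories = (("w", string.whitespace), ("l", string.ascii_letters),
              ("d", string.digits), ("p", string.punctuation))

# Loop version, mirroring the construction shown in this answer.
loop_dict = {}
for code, chars in categories:
    for char in chars:
        loop_dict[char] = code

# Dict-comprehension version (an alternative spelling, not from the answer).
comprehension_dict = {char: code for code, chars in categories for char in chars}

# The four categories do not overlap, so both spellings give the same mapping.
assert loop_dict == comprehension_dict
print(comprehension_dict["k"], comprehension_dict["7"], comprehension_dict[")"])
# l d p
```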
    

    【Discussion】:

    • Thanks for the very comprehensive answer.
    【Solution 3】:

    Just keep track of the type of the previous character:

    import string
    character_type = {c: "l" for c in string.ascii_letters}
    character_type.update({c: "d" for c in string.digits})
    character_type.update({c: "p" for c in string.punctuation})
    character_type.update({c: "w" for c in string.whitespace})
    
    example = "1,234.45kg (in metric system)"
    
    x = []
    prev = None
    for ch in example:
        try:
            l = character_type[ch]
            if l == prev:
                x[-1].append(ch)
            else:
                x.append([ch])
        except KeyError:
            print(ch)
        else:
            prev = l
    x = map(''.join, x)
    print(list(x))
    # ['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']
    

    【Discussion】:

      【Solution 4】:

      Another algorithmic approach. Using the dictionary's get(key, default_value) method is better than try: except:.

      import string
      
      character_type_dict = {}
      for ch in string.ascii_lowercase:
          character_type_dict[ch] = 'l'
      for ch in string.digits:
          character_type_dict[ch] = 'd'
      for ch in string.punctuation:
          character_type_dict[ch] = 'p'
      for ch in string.whitespace:
          character_type_dict[ch] = 'w'
      
      example = "1,234.45kg (in metric system)"
      
      split_list = []
      split_start = 0
      for i in range(len(example) - 1):
          if character_type_dict.get(example[i], 'o') != character_type_dict.get(example[i + 1], 'o'):
              split_list.append(example[split_start: i + 1])
              split_start = i + 1
      split_list.append(example[split_start:])
      
      print(split_list)
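
The index-based scan above returns [''] for an empty input, because the final append always runs. A minimal sketch of the same idea as a function with that case guarded (the function name `split_on_type_change` is hypothetical) could look like this:

```python
import string

# Same type codes as in the answer above; the function name is hypothetical.
character_type_dict = {}
for code, chars in (("l", string.ascii_lowercase), ("d", string.digits),
                    ("p", string.punctuation), ("w", string.whitespace)):
    for ch in chars:
        character_type_dict[ch] = code


def split_on_type_change(text):
    """Cut text wherever two neighbouring characters have different types."""
    if not text:
        return []  # avoid returning [''] for empty input
    pieces = []
    start = 0
    for i in range(1, len(text)):
        previous_type = character_type_dict.get(text[i - 1], "o")
        current_type = character_type_dict.get(text[i], "o")
        if previous_type != current_type:
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])  # the last run has no boundary after it
    return pieces


print(split_on_type_change("1,234.45kg (in metric system)"))
# ['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']
print(split_on_type_change(""))
# []
```

Note that only lowercase letters are mapped to 'l' here, as in the answer, so uppercase letters would be grouped under the fallback code 'o'.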
      

      【Discussion】:

        【Solution 5】:

        Treating this as an algorithmic puzzle:

        import string

        # dummy mapping (dict unpacking; items() + items() only works in Python 2)
        character_type_dict = {**{c: "l" for c in string.ascii_letters},
                               **{c: "d" for c in string.digits},
                               **{c: "p" for c in string.punctuation},
                               **{c: "w" for c in string.whitespace}}
        example = "1,234.45kg (in metric system)"
        # track the type of the previous character, not the character itself
        last = character_type_dict.get(example[0], 'o')
        temp = example[0]
        res = []
        for ch in example[1:]:
          # unknown characters get type 'o' instead of being dropped
          cur = character_type_dict.get(ch, 'o')
          if cur != last:
            res.append(temp)
            temp = ''
          temp += ch
          last = cur
        res.append(temp)
        print(res)
        

        Result:

        ['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']
        

        【Discussion】: