【Question title】: Split a string and save the substrings to a dict (Python)
【Posted】: 2014-03-21 19:42:54
【Question】:

I have a text file like this:

771 776 #1 556.766700(2)
538 #2 1069.652700(2)
531 #3 1074.407600(2)
81 84 89 94 111 #4 1501.062900(2)
85 91 #5 782.298900(3)
32 42 66 71 90 95 101 #6 904.016500(3)

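I want to split each line into substrings and save them in separate variables. For example, for line 1:

scans = 771 776, uid = 1, mz = 556.766700, z = 2

I am trying the following code, but I need help with the regular expression: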

f = open(filename, 'r')
par_info=[]
for rows in f:
    re.sub('\#(.+)\s(.+)\((.+)\+', scans=\g<1>, uid=\g<2>, mz = int(\g<3>),    z=int(\g<4>), rest)
    info={'sc_num':scans, 'ident':uid, 'mass':mz, 'charge':z}
    par_info.append(info)

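【Comments】:

  • What happens when you run the code?
  • I actually only tried the regex on a string and got the following error: SyntaxError: unexpected character after line continuation character
  • Note that your code is missing a lot of quotes, and the = in the info dict should be :

Tags: python regex string dictionary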


【Solution 1】:

You can use named groups:

>>> import pprint
>>> import re
>>> r = re.compile(r'(?P<scans>.*?)\s+#(?P<uid>\d+)\s+(?P<mz>\d+\.\d+)\((?P<z>\d+)\)')
>>> with open('abc1') as f:
...     par_info = [r.search(line).groupdict() for line in f]
...
>>> pprint.pprint(par_info)
[{'mz': '556.766700', 'scans': '771 776', 'uid': '1', 'z': '2'},
 {'mz': '1069.652700', 'scans': '538', 'uid': '2', 'z': '2'},
 {'mz': '1074.407600', 'scans': '531', 'uid': '3', 'z': '2'},
 {'mz': '1501.062900', 'scans': '81 84 89 94 111', 'uid': '4', 'z': '2'},
 {'mz': '782.298900', 'scans': '85 91', 'uid': '5', 'z': '3'},
 {'mz': '904.016500', 'scans': '32 42 66 71 90 95 101', 'uid': '6', 'z': '3'}]
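
The dicts above hold every field as a string, while the question asks for numeric values. Here is a minimal sketch of one way to convert them, reusing the same named-group pattern. The helper name `parse_line` and the choice to return `scans` as a list of integers are assumptions for illustration, not part of the answer.

```python
import re

# Same named-group pattern as in the answer above.
LINE_RE = re.compile(
    r'(?P<scans>.*?)\s+#(?P<uid>\d+)\s+(?P<mz>\d+\.\d+)\((?P<z>\d+)\)'
)


def parse_line(line):
    """Hypothetical helper: parse one line and convert fields to numbers."""
    fields = LINE_RE.search(line).groupdict()
    return {
        'scans': [int(s) for s in fields['scans'].split()],  # list of ints
        'uid': int(fields['uid']),
        'mz': float(fields['mz']),
        'z': int(fields['z']),
    }


record = parse_line("81 84 89 94 111 #4 1501.062900(2)")
print(record)
```

If the scan numbers are only ever displayed, keeping `scans` as the original string may be simpler.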

【Discussion】:

    【Solution 2】:
    import re

    # Capture every scan number before '#', not just the last two
    pattern = re.compile(r"(\d+(?:\s+\d+)*)\s+#(\d+)\s+([\d.]+)\s*\((\d+)\)")
    with open("Input.txt") as f:
        for line in f:
            scans, uid, mz, z = pattern.findall(line)[0]
            print(scans, uid, mz, z)


    Output

    771 776 1 556.766700 2
    538 2 1069.652700 2
    531 3 1074.407600 2
    81 84 89 94 111 4 1501.062900 2
    85 91 5 782.298900 3
    32 42 66 71 90 95 101 6 904.016500 3
    


    【Discussion】:

      【Solution 3】:

      This regex works; you can then zip the matched groups with a tuple of titles and turn them into a dict:

      In [1]: import re
      
      In [2]: a = "771 776 #1 556.766700(2)"
      
      In [3]: c = re.compile(r'([\d\s]+)\s#(\d+)\s([\d\.]+)\((\d+)\)')
      
      In [4]: titles = ('sc_num', 'ident', 'mass', 'charge')
      
      In [5]: dict(zip(titles, c.search(a).groups()))
      Out[5]: {'charge': '2', 'ident': '1', 'mass': '556.766700', 'sc_num': '771 776'}
      

      Putting it together with your code, you get this:

      import re

      c = re.compile(r'([\d\s]+)\s#(\d+)\s([\d\.]+)\((\d+)\)')
      titles = ('sc_num', 'ident', 'mass', 'charge')
      par_info = []
      with open(filename, 'r') as f:
          for row in f:
              info = dict(zip(titles, c.search(row).groups()))
              par_info.append(info)
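
One thing to watch for with this loop: `c.search(row)` returns `None` for a line that does not match (a blank line, for example), and calling `.groups()` on it raises `AttributeError`. Below is a rough sketch of the same zip-to-dict idea that skips such lines. It assumes the identifier group is written as `(\d+)` so that multi-digit ids are captured whole. The function name `parse_rows` and the `io.StringIO` stand-in for the real file are illustrative assumptions.

```python
import io
import re

# Pattern along the lines of the one above; (\d+) keeps multi-digit ids whole.
c = re.compile(r'([\d\s]+)\s#(\d+)\s([\d\.]+)\((\d+)\)')
titles = ('sc_num', 'ident', 'mass', 'charge')


def parse_rows(rows):
    """Hypothetical helper: build one dict per matching row, skip the rest."""
    par_info = []
    for row in rows:
        match = c.search(row)
        if match is None:
            # Blank or malformed line: skip it instead of raising.
            continue
        par_info.append(dict(zip(titles, match.groups())))
    return par_info


# io.StringIO stands in for the real file here (assumption for the demo).
sample = io.StringIO("771 776 #1 556.766700(2)\n\n538 #2 1069.652700(2)\n")
result = parse_rows(sample)
print(result)
```

Whether to skip bad lines silently or to log or raise on them depends on how much you trust the input file.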
      

      【Discussion】:

        【Solution 4】:

        If your data is always structured as in the example, I would use split:

        par_info = []
        with open(filename, 'r') as f:
            for line in f:
                # Everything before '#' is the scan list
                scans, other = line.split("#")
                uid, more = other.split()
                # more looks like '556.766700(2)'
                mz, z = more.split('(')
                z = z.replace(')', '')
                info = {'sc_num': scans.strip(), 'ident': uid, 'mass': mz, 'charge': z}
                par_info.append(info)
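
For checking the split-based approach on single lines without a file, here is a small sketch of the same idea wrapped in a function. It uses `str.partition` instead of `split` for the `#` and `(` separators, which is a swapped-in variation rather than what the answer does. The name `parse_line` is hypothetical.

```python
def parse_line(line):
    """Hypothetical helper: split one line of the file without using regex."""
    # Everything before '#' is the scan list; the rest is 'uid mz(z)'.
    scans, _, rest = line.partition('#')
    uid, mz_z = rest.split()
    # mz_z looks like '556.766700(2)'.
    mz, _, z = mz_z.partition('(')
    return {
        'sc_num': scans.strip(),
        'ident': uid,
        'mass': mz,
        'charge': z.rstrip(')'),
    }


info = parse_line("85 91 #5 782.298900(3)\n")
print(info)
```

Like the answer it is based on, this only works when every line follows the sample layout exactly. A line without `#` makes the `rest.split()` unpacking raise `ValueError`.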
        

        【Discussion】:
