Answers: Python-Pandas parser for text records arranged in a non-uniform number of rows

【Question title】: Python-Pandas Parser for Text Records arranged in non-uniform number of Rows-Lines
【Posted】: 2017-05-02 13:55:57
【Question description】:

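I'm very new to Python and pandas, and I want to parse a very large text file (200-500 MB) in which the information is laid out as records. Each record holds its data fields as rows/lines rather than columns, and the records have different numbers of lines (because they carry different kinds of data). Here is an example: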

....
Record Type  = Record_Name
  attribute1 = 3090 (0x01218)
  attribute2     = attribute_name3090 (type: type_name1)
  timestamp: 21:47:33.360000
  attribute4: 731001-1 (0x0b277911)
  attribute5: 50000 (0x000000c350)
  attribute6: 3934 (0x0000000f5e)
  attribute7: 857 (0x0000000359)
Record Type  = Record_Name  
  attribute1 = 3099 (0x01227)
  attribute2     = attribute_name3099 (type: type_name2)
  timestamp: 21:49:07.359000
  attribute4     = 731001-3 (0x0b277911)
  attribute8: 14 (0x0000000e)
  attribute9: 17 (0x00000011)
  attribute10: 43 (0x0000002b)
  attribute11: 40 (0x00000028)
Record Type  = Record_Name
  attribute1 = 3090 (0x01218)
  attribute2     = attribute_name3090 (type: type_name1)
  timestamp: 21:51:17.142000
  attribute4: 942101-2 (0x0b345872)
  attribute5: 2490368 (0x00260000)
  attribute6: 24 (0x00000018)
  attribute7: 25 (0x00000019)
Record Type  = Record_Name
  attribute1 = 3102 (0x01230)
  attribute2     = attribute_name3102 (type: type_name1)
  timestamp: 21:52:42.359000
  attribute4: 731001-2 (0x0b277911)
  attribute12: 0 (0x0000000000)
  attribute13: 80 (0x0000000050)
....

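For post-processing, I'd like to extract the required data into a couple of pandas DataFrames: one summarizing the records, and another filtering records of a specific type.

DataFrame 1, summary of records: create a DataFrame indexed by timestamp that shows only a few attributes of each record, all kept as strings but without the hex values inside the ():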

                  attribute1  attribute2          type         attribute4
timestamp
...
21:47:33.360000   3090        attribute_name3090  type_name1   731001-1
21:49:07.359000   3099        attribute_name3099  type_name2   731001-3
21:51:17.142000   3090        attribute_name3090  type_name1   942101-2
21:52:42.359000   3102        attribute_name3102  type_name1   731001-2
....

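DataFrame 2, filtered records: capture only the records associated with attribute1 = 3090 (type = type_name1), and create the following DataFrame indexed by timestamp, where attributes 5-7 have the hex values in () removed and are converted from strings to integers: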

                  attribute4   attribute5   attribute6   attribute7   
timestamp
...
21:47:33.360000   731001-1     50000        3934         857
21:51:17.142000   942101-2     2490368      24           25
....

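I tried opening the file and reading the lines, but that takes a lot of time. I have read about "generators" but don't know how to use them to simplify the code. Any suggestions would be greatly appreciated. Thanks in advance.

Gus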

【Comments on the question】:

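Check out this answer: stackoverflow.com/questions/13651117/…
Hi Ted, thanks for the feedback. The original file is .txt (text) rather than csv, which makes it more challenging for me. I have already loaded csv files into pandas and worked with them. In this particular case I want to (1) read the .txt, (2) parse it and put the data into two pd.DataFrames, and then (3) save the output as CSV. My problem is with (2). Regards,
Please take a look at this topic.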

Tags: python parsing pandas text

【Solution 1】:

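This code shows a simple way to parse the input lines into two dictionaries. It assumes the input is very regular. The main work happens right after each line is read: the line is split on whitespace, colons are stripped from the field names, and the equals sign is dropped, all to simplify the later processing, which is based on the first item of each line. The answer assumes the values themselves contain no colons, but the timestamp values do, so a colon should only be removed from the end of the first token.

If you run the code, you'll find that it creates two text files, each containing one dictionary per line.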

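# Fields copied into each output dictionary.
DF1_FIELDS = ['attribute1', 'attribute2', 'attribute4', 'timestamp']
DF2_FIELDS = ['attribute1', 'attribute4', 'attribute5', 'attribute6', 'attribute7', 'timestamp']

with open('DF1.txt', 'w') as DF1, open('DF2.txt', 'w') as DF2, open('bigText.txt') as bigText:
    DF1_record = {}
    DF2_record = {}
    for inputLine in bigText:
        inputLine = inputLine.split()
        if not inputLine:
            continue  # skip blank lines
        if '=' in inputLine:
            inputLine.remove('=')
        # Strip the colon from the field name only, so timestamp values keep theirs.
        kind = inputLine[0].rstrip(':')
        if kind == 'Record':
            # A new record starts: write out the previous one, then reset both dicts.
            if DF1_record:
                DF1.write(str(DF1_record) + '\n')
            if DF2_record:
                DF2.write(str(DF2_record) + '\n')
            DF1_record = {}
            DF2_record = {}
            continue
        if kind in DF1_FIELDS:
            DF1_record[kind] = inputLine[1]
            if kind == 'attribute2':
                # '(type: type_name1)' splits into '(type:' and 'type_name1)'.
                DF1_record['type'] = inputLine[3][:-1]
        if kind in DF2_FIELDS:
            # Only records with attribute1 == 3090 go into the second file.
            if DF1_record.get('attribute1') == '3090':
                DF2_record[kind] = inputLine[1]
    # Write out the last record, which no following 'Record' line flushes.
    if DF1_record:
        DF1.write(str(DF1_record) + '\n')
    if DF2_record:
        DF2.write(str(DF2_record) + '\n')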
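
Since the question asks about generators and pandas DataFrames rather than text files, the same idea can be sketched with a generator that yields one dictionary per record and feeds them straight into pandas. This is only a sketch under assumptions: the names `iter_records`, `build_frames`, `LINE_RE`, `TYPE_RE` and `SAMPLE` are made up for illustration, every record is assumed to start with a "Record Type" line and to contain a timestamp, and every record with attribute1 = 3090 is assumed to carry attributes 5-7.

```python
import io
import re

import pandas as pd

# "key = value" or "key: value"; the key is the first run of word characters.
LINE_RE = re.compile(r"^\s*(\w+)\s*[:=]\s*(.*?)\s*$")
# The "(type: type_name1)" suffix that follows attribute2.
TYPE_RE = re.compile(r"\(type:\s*([^)\s]+)\)")

# Two records copied from the question, used only for the demonstration below.
SAMPLE = """\
....
Record Type  = Record_Name
  attribute1 = 3090 (0x01218)
  attribute2     = attribute_name3090 (type: type_name1)
  timestamp: 21:47:33.360000
  attribute4: 731001-1 (0x0b277911)
  attribute5: 50000 (0x000000c350)
  attribute6: 3934 (0x0000000f5e)
  attribute7: 857 (0x0000000359)
Record Type  = Record_Name
  attribute1 = 3099 (0x01227)
  attribute2     = attribute_name3099 (type: type_name2)
  timestamp: 21:49:07.359000
  attribute4     = 731001-3 (0x0b277911)
  attribute8: 14 (0x0000000e)
"""


def iter_records(lines):
    """Yield one dict per record, consuming the input one line at a time."""
    record = None
    for line in lines:
        if line.lstrip().startswith("Record Type"):
            if record:
                yield record  # the previous record is complete
            record = {}
            continue
        match = LINE_RE.match(line)
        if match is None or record is None:
            continue  # blank or filler lines, or data before the first record
        key, value = match.groups()
        type_match = TYPE_RE.search(value)
        if type_match:
            record["type"] = type_match.group(1)
        # Keep only the text before the parenthesised hex/type suffix.
        record[key] = value.split(" (")[0]
    if record:
        yield record  # the last record has no following "Record Type" line


def build_frames(lines):
    """Return (summary, filtered) DataFrames indexed by timestamp."""
    records = pd.DataFrame(list(iter_records(lines))).set_index("timestamp")
    summary = records[["attribute1", "attribute2", "type", "attribute4"]]
    filtered = records.loc[
        records["attribute1"] == "3090",
        ["attribute4", "attribute5", "attribute6", "attribute7"],
    ]
    # Assumes attributes 5-7 are present for every 3090 record.
    int_cols = ["attribute5", "attribute6", "attribute7"]
    filtered = filtered.astype({col: int for col in int_cols})
    return summary, filtered


summary, filtered = build_frames(io.StringIO(SAMPLE))
print(summary)
print(filtered)
```

For the real file, an open file object can be passed in place of the `io.StringIO` wrapper, for example `with open('bigText.txt') as f: summary, filtered = build_frames(f)`, and each frame can then be saved with `to_csv`. The generator reads lazily, but the list of dictionaries is still held in memory before the DataFrame is built.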

【Comments】: