Parse a Generated File Python: Answers

【Question title】: Parse a Generated File Python
【Posted】: 2020-09-08 17:41:55
【Question description】:

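I am trying to parse a generated file into a list of objects.

Unfortunately, the generated files do not always have the same structure, but they do contain the same fields (along with a lot of other garbage).

For example: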

    function foo();              # Don't Care
    function maybeanotherfoo();  # Don't Care
    int maybemoregarbage;        # Don't Care

    
    product_serial = "CDE1102"; # I want this <---------------------
    unnecessary_info1 = 10;     # Don't Care
    unnecessary_info2 = "red"   # Don't Care
    product_id = 1134412;       # I want this <---------------------
    unnecessary_info3 = "88"    # Don't Care

    product_serial = "DD1232";  # I want this <---------------------
    product_id = 3345111;       # I want this <---------------------
    unnecessary_info1 = "22"    # Don't Care
    unnecessary_info2 = "panda" # Don't Care

    product_serial = "CDE1102"; # I want this <---------------------
    unnecessary_info1 = 10;     # Don't Care
    unnecessary_info2 = "red"   # Don't Care
    unnecessary_info3 = "bear"  # Don't Care
    unnecessary_info4 = 119     # Don't Care
    product_id = 1112331;       # I want this <---------------------
    unnecessary_info5 = "jj"    # Don't Care

I would like a list of objects, where each object has a serial number and an ID.

I tried the following:


import re

class Product:
    def __init__(self, id, serial):
        self.product_id = id
        self.product_serial = serial

linenum = 0
first_string = "product_serial"
second_string = "product_id"
with open('products.txt', "r") as products_file:
    for line in products_file:
        linenum += 1
        if line.find(first_string) != -1:
            product_serial = re.search('\"([^"]+)', line).group(1)
            #How do I proceed?

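Any suggestions would be much appreciated! Thanks!

【Comments】:

So what does your code do? Does it work? Are there errors? If so, what are they?
My code can find the first product_serial (CDE1102). But how do I find the product_id and then continue parsing from there?
Please re-read "on topic" and "how to ask" from the intro tour. "Show me how to solve this coding problem" is not a Stack Overflow question. You are expected to make an honest attempt, and then ask a specific question about your algorithm or technique. "Any suggestions" is too broad for Stack Overflow. There are many tutorials that show you how to read files, how to handle string data, and so on. You should be able to identify the constant strings in the input and separate the input lines.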

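Tags: python parsing

【Solution 1】:

I'm using io.StringIO() here to inline the data, but you can replace data with your products_file.

The idea is that we collect key/value pairs into current_object. Once we know we have all the data needed for a single object (both keys), we push it onto the objects list and start a new current_object.

You could use something like if line.startswith('product_serial') instead of the admittedly complex regular expression.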

import io
import re

data = io.StringIO("""
    function foo();             
    function maybeanotherfoo(); 
    int maybemoregarbage;       

    
    product_serial = "CDE1102"; 
    unnecessary_info1 = 10;     
    unnecessary_info2 = "red"   
    product_id = 1134412;       
    unnecessary_info3 = "88"    

    product_serial = "DD1232";  
    product_id = 3345111;       
    unnecessary_info1 = "22"    
    unnecessary_info2 = "panda" 

    product_serial = "CDE1102"; 
    unnecessary_info1 = 10;     
    unnecessary_info2 = "red"   
    unnecessary_info3 = "bear"  
    unnecessary_info4 = 119     
    product_id = 1112331;       
    unnecessary_info5 = "jj"    
""")

objects = []

current_object = {}
for line in data:
    line = line.strip()  # Remove leading and trailing whitespace
    # Match `product_id = 123;` or `product_serial = "ABC";` (the trailing semicolon is optional)
    m = re.match(r'^(product_id|product_serial)\s*=\s*(\d+|"(?:.+?)");?$', line)

    if m:
        key, value = m.groups()
        current_object[key] = value.strip('"')
        if len(current_object) == 2:  # Got the two keys we want, ship the object
            objects.append(current_object)
            current_object = {}

print(objects)
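
The answer mentions line.startswith(...) as an alternative to the regular expression but does not show it. Below is a minimal sketch of that variant, which also builds objects as the question asked. The Product dataclass, the parse_products function, and the WANTED_KEYS constant are hypothetical names introduced for illustration. The sketch assumes that values never contain a '#' character (anything after '#' is treated as a trailing comment) and that product_id is always an integer.

```python
from dataclasses import dataclass

# Hypothetical names for illustration; not part of the original answer.
WANTED_KEYS = ("product_serial", "product_id")


@dataclass
class Product:
    product_id: int
    product_serial: str


def parse_products(lines):
    """Pair up product_serial/product_id lines using plain string methods."""
    products = []
    current = {}
    for line in lines:
        # Assumption: '#' only ever starts a trailing comment, never appears in a value.
        line = line.split("#", 1)[0].strip()
        # str.startswith accepts a tuple of prefixes.
        if not line.startswith(WANTED_KEYS):
            continue
        key, sep, value = line.partition("=")
        key = key.strip()
        if not sep or key not in WANTED_KEYS:
            continue  # skip look-alikes such as "product_id_old = ..."
        # Drop the optional trailing ';' and the surrounding double quotes.
        current[key] = value.strip().rstrip(";").strip().strip('"')
        if len(current) == 2:  # both keys collected, emit the object
            products.append(
                Product(int(current["product_id"]), current["product_serial"])
            )
            current = {}
    return products


if __name__ == "__main__":
    sample = [
        'function foo();',
        'product_serial = "CDE1102"; # I want this',
        'unnecessary_info1 = 10;',
        'product_id = 1134412;',
        'product_serial = "DD1232";',
        'product_id = 3345111;',
    ]
    for product in parse_products(sample):
        print(product.product_serial, product.product_id)
```

Because parse_products only iterates over lines, an open file object (for example the products_file from the question) or the io.StringIO data above can be passed in directly. As in the answer's code, the order of the two keys within a block does not matter. A non-numeric product_id would raise ValueError in int().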

【Comments】: