【Question title】: Parsing an XML File to CSV without hardcoding values
【Posted】: 2021-01-27 02:15:12
【Question】:

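I'd like to know whether there is a way to parse an XML file, pick up all of its tags (or as many as possible), and put them into columns without hardcoding the tag names.

Take the eventType tag in my XML as an example. I want the first occurrence to create a column named "eventType" and put the value under that column. Every "eventType" tag parsed after that should go into the same column.

This is roughly what I am trying to make the output look like:

Here is a sample of the XML: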

<?xml version="1.0" encoding="UTF-8"?>

<faults version="1" xmlns="urn:nortel:namespaces:mcp:faults" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:nortel:namespaces:mcp:faults NortelFaultSchema.xsd ">
    <family longName="1OffMsgr" shortName="OOM"/>
    <family longName="ACTAGENT" shortName="ACAT">
        <logs>
           <log>
                <eventType>RES</eventType>
                <number>1</number>
                <severity>INFO</severity>
                <descTemplate>
                     <msg>Accounting is enabled upon this NE.</msg>
               </descTemplate>
               <note>This log is generated when setting a Session Manager's AM from &lt;none&gt; to a valid AM.</note>
               <om>On all instances of this Session Manager, the &lt;NE_Inst&gt;:&lt;AM&gt;:STD:acct OM row in the  StdRecordStream group will appear and start counting the recording units sent to the configured AM.
                   On the configured AM, the &lt;NE_inst&gt;:acct OM rows in RECSTRMCOLL group will appear and start counting the recording units received from this Session Manager's instances.
               </om>
            </log>
           <log>
                <eventType>RES</eventType>
                <number>2</number>
                <severity>ALERT</severity>
                <descTemplate>
                     <msg>Accounting is disabled upon this NE.</msg>
               </descTemplate>
               <note>This log is generated when setting a Session Manager's AM from a valid AM to &lt;none&gt;.</note>
               <action>If you do not intend for the Session Manager to produce accounting records, then no action is required.  If you do intend for the Session Manager to produce accounting records, then you should set the Session Manager's AM to a valid AM.</action>
               <om>On all instances of this Session Manager, the &lt;NE_Inst&gt;:&lt;AM&gt;:STD:acct OM row in the StdRecordStream group that matched the previous datafilled AM will disappear.
                   On the previously configured AM, the  &lt;NE_inst&gt;:acct OM rows in RECSTRMCOLL group will disappear.
               </om>
            </log>
        </logs>
    </family>
    <family longName="ACODE" shortName="AC">
        <alarms>
            <alarm>
                <eventType>ADMIN</eventType>
                <number>1</number>
                <probableCause>INFORMATION_MODIFICATION_DETECTED</probableCause>
                <descTemplate>
                    <msg>Configured data for audiocode server updated: $1</msg>
                     <param>
                         <num>1</num>
                         <description>AudioCode configuration data got updated</description>
                         <exampleValue>acgwy1</exampleValue>
                     </param>
               </descTemplate>
               <manualClearable></manualClearable>
               <correctiveAction>None. Acknowledge/Clear alarm and deploy the audiocode server if appropriate.</correctiveAction>
               <alarmName>Audiocode Server Updated</alarmName>
               <severities>
                     <severity>MINOR</severity>
               </severities>               
            </alarm>
            <alarm>
                <eventType>ADMIN</eventType>
                <number>2</number>
                <probableCause>CONFIG_OR_CUSTOMIZATION_ERROR</probableCause>
                <descTemplate>
                    <msg>Deployment for audiocode server failed: $1. Reason: $2.</msg>
                     <param>
                         <num>1</num>
                         <description>AudioCode Name</description>
                         <exampleValue>audcod</exampleValue>
                     </param>
                     <param>
                         <num>2</num>
                         <description>AudioCode Deployment failed reason</description>
                         <exampleValue>Failed to parse audiocode configuration data</exampleValue>
                     </param>
               </descTemplate>
               <manualClearable></manualClearable>
               <correctiveAction>Check the configuration of audiocode server. Acknowledge/Clear alarm and deploy the audiocode server if appropriate.</correctiveAction>
               <alarmName>Audiocode Server Deploy Failed</alarmName>
               <severities>
                     <severity>MINOR</severity> 
                     <severity>MAJOR</severity>
               </severities>               
            </alarm>
            <alarm>
                <eventType>COMM</eventType>
                <number>2</number>
                <probableCause>LOSS_OF_FRAME</probableCause>
                <descTemplate>
                    <msg>Far end LOF (a.k.a., Yellow Alarm). Trunk (DS1 Number): $1.</msg>
                     <param>
                         <num>1</num>
                         <description>Trunk Number of Trunk with configuration problem</description>
                         <exampleValue>2</exampleValue>
                     </param>
               </descTemplate>
               <clearCondition>Far end is correctly configured for proper framing.</clearCondition>
               <correctiveAction>Check that the far end is configured for the proper framing.</correctiveAction>
               <alarmName>Far end LOF</alarmName>
               <severities>
                     <severity>CRITICAL</severity>
               </severities>
               <note>This alarm indicates the Trunk Framing settings on the connected PSTN switch do not match those provisioned on the Audiocodes Mediant 2k.</note>
            </alarm>
            <alarm>
                <eventType>COMM</eventType>
                <number>3</number>
                <probableCause>LOSS_OF_FRAME</probableCause>
                <descTemplate>
                    <msg>Near end sending LOF Indication. Trunk (DS1 Number): $1.</msg>
                     <param>
                         <num>1</num>
                         <description>Trunk Number of Trunk with configuration problem</description>
                         <exampleValue>2</exampleValue>
                     </param>
               </descTemplate>
               <clearCondition>Gateway is correctly configured for proper framing.</clearCondition>
               <correctiveAction>Check that the Audiocodes gateway is configured for the proper framing.</correctiveAction>
               <alarmName>Near end sending LOF Indication</alarmName>
               <severities>
                     <severity>CRITICAL</severity>
               </severities>               
            </alarm>
        </alarms>
    </family>
</faults>

Here is the code; as you can see, my tag names are hardcoded:

from xml.etree import ElementTree
import csv
from copy import copy


tree = ElementTree.parse('FaultFamilies.xml')


sitescope_data = open('Out.csv', 'w', newline='', encoding='utf-8')
csvwriter = csv.writer(sitescope_data)

# Create all needed columns here, in order, and write them to the CSV file
col_names = ['longName', 'shortName', 'eventType', 'ProbableCause', 'Severity', 'alarmName', 'clearCondition',
             'correctiveAction', 'note', 'action', 'om']
csvwriter.writerow(col_names)



def recurse(root, props):

    # Finds every single tag in the xml file
    for child in root:
        #print(child.text)
        if child.tag == '{urn:nortel:namespaces:mcp:faults}family':
            # copy of the dictionary
            p2 = copy(props)

            # add the family's longName and shortName to the dictionary
            p2['longName'] = child.attrib.get('longName', '')
            p2['shortName'] = child.attrib.get('shortName', '')
            recurse(child, p2)
        else:
            recurse(child, props)

    # FIND ALL NEEDED ALARMS INFORMATION
    for event in root.findall('{urn:nortel:namespaces:mcp:faults}alarm'):

        event_data = [props.get('longName',''), props.get('shortName', '')]

        # Find eventType and append its text (None if the tag is missing)
        event_id = event.find('{urn:nortel:namespaces:mcp:faults}eventType')
        if event_id is not None:
            event_id = event_id.text
        event_data.append(event_id)

        # Find probableCause and append its text
        probableCause = event.find('{urn:nortel:namespaces:mcp:faults}probableCause')
        if probableCause is not None:
            probableCause = probableCause.text
        event_data.append(probableCause)

        # Find severities and append its children as one comma-separated cell.
        # Compare with None explicitly: the truth value of an Element depends
        # on whether it has children, not on whether it was found.
        severities = event.find('{urn:nortel:namespaces:mcp:faults}severities')
        if severities is not None:
            severity_data = ','.join(
                [sv.text for sv in severities.findall('{urn:nortel:namespaces:mcp:faults}severity')])
            event_data.append(severity_data)
        else:
            event_data.append("")

        # Find alarmName and append its text
        alarmName = event.find('{urn:nortel:namespaces:mcp:faults}alarmName')
        if alarmName is not None:
            alarmName = alarmName.text
        event_data.append(alarmName)

        # Find clearCondition and append its text
        clearCondition = event.find('{urn:nortel:namespaces:mcp:faults}clearCondition')
        if clearCondition is not None:
            clearCondition = clearCondition.text
        event_data.append(clearCondition)

        # Find correctiveAction and append its text
        correctiveAction = event.find('{urn:nortel:namespaces:mcp:faults}correctiveAction')
        if correctiveAction is not None:
            correctiveAction = correctiveAction.text
        event_data.append(correctiveAction)

        # Find note and append its text
        note = event.find('{urn:nortel:namespaces:mcp:faults}note')
        if note is not None:
            note = note.text
        event_data.append(note)

        # Find action and append its text
        action = event.find('{urn:nortel:namespaces:mcp:faults}action')
        if action is not None:
            action = action.text
        event_data.append(action)

        csvwriter.writerow(event_data)

    # FIND ALL LOGS INFORMATION
    for event in root.findall('{urn:nortel:namespaces:mcp:faults}log'):
        event_data = [props.get('longName', ''), props.get('shortName', '')]

        # Find eventType and append its text (None if the tag is missing)
        event_id = event.find('{urn:nortel:namespaces:mcp:faults}eventType')
        if event_id is not None:
            event_id = event_id.text
        event_data.append(event_id)

        # Find probableCause and append its text
        probableCause = event.find('{urn:nortel:namespaces:mcp:faults}probableCause')
        if probableCause is not None:
            probableCause = probableCause.text
        event_data.append(probableCause)

        # Logs carry a single <severity> tag rather than a <severities> group
        severities = event.find('{urn:nortel:namespaces:mcp:faults}severity')
        if severities is not None:
            severities = severities.text
        event_data.append(severities)

        # Find alarmName and append its text
        alarmName = event.find('{urn:nortel:namespaces:mcp:faults}alarmName')
        if alarmName is not None:
            alarmName = alarmName.text
        event_data.append(alarmName)

        # Find clearCondition and append its text
        clearCondition = event.find('{urn:nortel:namespaces:mcp:faults}clearCondition')
        if clearCondition is not None:
            clearCondition = clearCondition.text
        event_data.append(clearCondition)

        # Find correctiveAction and append its text
        correctiveAction = event.find('{urn:nortel:namespaces:mcp:faults}correctiveAction')
        if correctiveAction is not None:
            correctiveAction = correctiveAction.text
        event_data.append(correctiveAction)

        # Find note and append its text
        note = event.find('{urn:nortel:namespaces:mcp:faults}note')
        if note is not None:
            note = note.text
        event_data.append(note)

        # Find action and append its text
        action = event.find('{urn:nortel:namespaces:mcp:faults}action')
        if action is not None:
            action = action.text
        event_data.append(action)
        csvwriter.writerow(event_data)


root = tree.getroot()
recurse(root, {})  # root + empty dictionary
print("File successfully converted to CSV")
sitescope_data.close()

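When running @tdelaney's solution:

【Comments】:

  • Why did you copy-paste all of those blocks, like the one for alarmName? You could loop over the names it has to look up, right?
  • Yes, this was just a test. If hardcoding is the only way to parse it, I'll go with that. I'm trying to find a way to put all the tags into columns without hardcoding, because this XML will change over time as new tags are added.
  • Unfortunately I'm not familiar with the libraries you are using here, but I see you are recursing over the nodes. I think it would be enough to keep a set() of all the unique tags you encounter, right?
  • Ah, you've completely changed the XML since I looked at it last night and started trying to write a solution.
  • Hey @barny, that was an old question. Please see this one: stackoverflow.com/questions/64407201/…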
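
The two suggestions in the comments above (loop over the names instead of copy-pasting, and collect the unique tags while walking the tree) could be sketched roughly as follows with the standard-library ElementTree. The helper names `discover_tags` and `child_texts` and the tiny inline sample are made up for illustration; they are not from the question's code.

```python
import xml.etree.ElementTree as ET

# Namespace of the sample XML, in ElementTree's "{uri}tag" notation
NS = "{urn:nortel:namespaces:mcp:faults}"


def discover_tags(events):
    """Collect the child tag names of all events in first-seen order."""
    names = []
    for event in events:
        for child in event:
            name = child.tag.replace(NS, "", 1)  # drop the namespace prefix
            if name not in names:
                names.append(name)
    return names


def child_texts(event, names):
    """Return the text of each named child of `event`, or '' when absent."""
    values = []
    for name in names:
        child = event.find(NS + name)
        values.append((child.text or "") if child is not None else "")
    return values


# Hypothetical, cut-down sample in the same namespace as the question's XML
sample = """<faults xmlns="urn:nortel:namespaces:mcp:faults">
  <family longName="ACODE" shortName="AC"><alarms>
    <alarm><eventType>ADMIN</eventType><number>1</number></alarm>
    <alarm><eventType>COMM</eventType><note>n</note></alarm>
  </alarms></family>
</faults>"""

root = ET.fromstring(sample)
events = root.findall(f".//{NS}alarm")
names = discover_tags(events)                      # header row, not hardcoded
rows = [child_texts(event, names) for event in events]
print(names)
print(rows)
```

This needs two passes over the events (one to discover the names, one to build the rows), which is the price of knowing the full header before writing anything.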

Tags: python parsing beautifulsoup lxml elementtree


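【Solution 1】:

You can build a list of lists to represent the rows of the table. Whenever a new row is needed, build a new list with every known column defaulted to "" and append it to the bottom of the outer list. When a new column has to be added, loop over the existing inner lists and append a default "" cell to each one. Keep a map from each known column name to its index within a row. Then, as you walk through the events, you can use the tag name to look up the column index and put the value into the latest row of the table.

It looks like you want the "log" and "alarm" tags, but I wrote the element selector to grab any element that has an "eventType" child. Since "longName" and "shortName" are common to all of the events under a given <family>, there is an outer loop that grabs them and applies them to each new row of the table. I switched to xpath so that I could set up a namespace prefix and write the selectors more concisely. That is partly personal preference, but I think it makes the xpath more readable.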

import csv
import lxml.etree
from lxml.etree import QName
import operator

class ExpandingTable:
    """A 2 dimensional table where columns are expanded as new column
    types are discovered"""

    def __init__(self):
        """Create table that can expand rows and columns"""
        self.name_to_col = {}
        self.table = []
    
    def add_column(self, name):
        """Add column named `name` unless already included"""
        if name not in self.name_to_col:
            self.name_to_col[name] = len(self.name_to_col)
            for row in self.table:
                row.append('')
    
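    def add_cell(self, name, value):
        """Add value to named column in the current row"""
        if value:
            self.add_column(name)
            # XML parsers normalize line endings to "\n", so replacing "\r\n"
            # never matches; collapse every run of whitespace (embedded
            # newlines included) to a single space instead
            self.table[-1][self.name_to_col[name]] = " ".join(value.split())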
            
    def new_row(self):
        """Create a new row and make it current"""
        self.table.append([''] * len(self.name_to_col))

    def header(self):
        """Gather discovered column names into a header list"""
        idx_1 = operator.itemgetter(1)
        return [name for name, _ in sorted(self.name_to_col.items(), key=idx_1)]

    def prepend_header(self):
        """Gather discovered column names into a header and
        prepend it to the list"""
        self.table.insert(0, self.header())

def events_to_table(elem):
    """ Builds table from <family> child elements and their contained alarms and
    logs."""
    ns = {"f":"urn:nortel:namespaces:mcp:faults"}
    table = ExpandingTable()
    for family in elem.xpath("f:family", namespaces=ns):
        longName = family.get("longName")
        shortName = family.get("shortName")
        for event in family.xpath("*/*[f:eventType]", namespaces=ns):
            table.new_row()
            table.add_cell("longName", longName)
            table.add_cell("shortName", shortName)
            for cell in event:
                tag = QName(cell.tag).localname
                if tag == "severities":
                    # fold the <severity> children into one comma-separated cell
                    tag = "severity"
                    text = ",".join(severity.text for severity in cell.xpath("*"))
                else:
                    text = cell.text
                table.add_cell(tag, text)
    table.prepend_header()
    return table.table
    
def main(filename):
    doc = lxml.etree.parse(filename)
    table = events_to_table(doc.getroot())
    with open('test.csv', 'w', newline='', encoding='utf-8') as fileobj:
        csv.writer(fileobj).writerows(table)

main('test.xml')
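
For comparison, a similar result can probably be had with only the standard library by collecting each event as a dict and letting `csv.DictWriter` fill the missing cells through its `restval` argument. This is a rough sketch under a few assumptions, not a drop-in replacement for the code above: `events_to_dicts` and `local_name` are hypothetical helper names, the inline XML is a cut-down stand-in for the real file, nested elements such as `severities` are flattened to space-separated text instead of comma-separated, and a tag repeated inside one event keeps only its last value.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {"f": "urn:nortel:namespaces:mcp:faults"}


def local_name(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def events_to_dicts(root):
    """Return (fieldnames, rows) with one dict per element that has an eventType."""
    fieldnames = ["longName", "shortName"]
    rows = []
    for family in root.findall("f:family", NS):
        for event in family.findall("*/*"):
            if event.find("f:eventType", NS) is None:
                continue  # not a log/alarm style element
            row = {"longName": family.get("longName", ""),
                   "shortName": family.get("shortName", "")}
            for cell in event:
                name = local_name(cell.tag)
                if name not in fieldnames:
                    fieldnames.append(name)  # newly discovered column
                # join the text of the cell and its descendants,
                # collapsing runs of whitespace to single spaces
                row[name] = " ".join(" ".join(cell.itertext()).split())
            rows.append(row)
    return fieldnames, rows


# Hypothetical, cut-down sample in the same namespace as the question's XML
sample = """<faults xmlns="urn:nortel:namespaces:mcp:faults">
  <family longName="1OffMsgr" shortName="OOM"/>
  <family longName="ACTAGENT" shortName="ACAT"><logs>
    <log><eventType>RES</eventType><severity>INFO</severity></log>
  </logs></family>
  <family longName="ACODE" shortName="AC"><alarms>
    <alarm><eventType>ADMIN</eventType>
      <severities><severity>MINOR</severity><severity>MAJOR</severity></severities>
    </alarm>
  </alarms></family>
</faults>"""

fieldnames, rows = events_to_dicts(ET.fromstring(sample))
out = io.StringIO()  # stands in for open(..., 'w', newline='', encoding='utf-8')
writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Because the header is only known after every event has been seen, the rows are buffered in memory and written at the end, just as the list-of-lists version does.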

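【Comments】:

  • Interesting. It loads fine in LibreOffice on Linux. I don't have Excel, so I can't test it directly. The rows end with \r\n, and the text itself also contains line terminators. They should be escaped and a CSV reader should figure it out, but... I added some code to clean up the values by removing embedded newlines before adding them. Does that work better?
  • Probably my noob mistake. I didn't add newline=None when opening the file. On Windows that can lead to odd line endings such as "\r\r\n", which confuse everyone.
  • Is that after changing to open('test.csv', 'w', newline=None, encoding='utf-8')? csv.writer defaults to the Excel dialect, which should keep Excel happy. Not sure what the problem is. Maybe encoding='utf-8-sig', which adds a BOM, would help on Windows.
  • Oh, right! I thought newline=None did the same thing, but that is actually the default, which writes \r\n on Windows. Your approach is right; I'll fix the example.
  • Updated to get the names.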
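
The file-opening options discussed in these comments can be checked with a small experiment. The sketch below writes to a temporary path (an assumption made for the demo) with newline='' and encoding='utf-8-sig', then inspects the raw bytes: the csv module writes its own "\r\n" row terminators, newline='' stops Python from translating them a second time, and 'utf-8-sig' prepends the BOM mentioned above.

```python
import csv
import os
import tempfile

rows = [["eventType", "note"], ["RES", "first line\nsecond line"]]

# Temporary output path, used here only for the demonstration
path = os.path.join(tempfile.mkdtemp(), "test.csv")

# newline='' leaves the "\r\n" emitted by csv.writer untouched;
# 'utf-8-sig' writes a byte order mark at the start of the file
with open(path, "w", newline="", encoding="utf-8-sig") as fileobj:
    csv.writer(fileobj).writerows(rows)

with open(path, "rb") as fileobj:
    raw = fileobj.read()

# Reading it back with the same options recovers the embedded newline
with open(path, newline="", encoding="utf-8-sig") as fileobj:
    round_trip = list(csv.reader(fileobj))

print(raw)
print(round_trip == rows)
```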