【问题标题】:Python Iterating through a directory xml filesPython遍历目录xml文件
【发布时间】:2023-10-18 18:42:01
【问题描述】:

* 用户帮助我将 XML 文件解析为 csv 文件,这对我帮助很大。这很有帮助,我决定尝试遍历整个目录并将所有 xml 文件解析为单个 csv 文件。这是有效的代码:

import xml.etree.ElementTree as ET
import csv

extra_columns = 2

fields = [
('Id_Customer', 'Id_Customer', 1),
('Segment', 'Segment', 1),
('Nature', 'Event/Nature', 1),
('Extrainfo', 'Event/Extrainfo', 1),
('zipcode', 'Adress/zipcode', extra_columns),
('street', 'Adress/street', extra_columns),
('number', 'Adress/number', extra_columns)]

tree = ET.parse('cat.xml')
root = tree.getroot()

# Auto create the header from fields
fieldnames = []

for field, match, cols in fields:
   fieldnames.append(field)

if cols > 1:
    fieldnames.extend(["{}{}".format(field, x+2) for x in range(extra_columns)])

with open(r'customerdata.csv', 'wb') as f_customerdata:
    csv_customerdata = csv.DictWriter(f_customerdata, fieldnames=fieldnames)
    csv_customerdata.writeheader()

for node in tree.iter('Customer'):
    row = {}

    for field_name, match, cols in fields:
        if cols > 1:
            for index, el in enumerate(node.findall(match)):
                try:
                    if index:
                        row["{}{}".format(field_name, index+1)] = el.text
                    else:
                        row[field_name] = el.text

                except AttributeError as e:
                    row[field_name] = ''
        else:
            try:
                row[field_name] = node.find(match).text
            except AttributeError as e:
                row[field_name] = ''

    csv_customerdata.writerow(row)

然后我尝试引入listdir来查找所有文件名(这一步有效):

for filename in os.path.join(r'C:\docs', filename):
    if not filename.endswith('.xml'): continue
    fullname = os.path.join(r'C:\docs', filename)

但是,当我尝试集成这些步骤时,我发现我只能从目录中的第一个 xml 文件中获取数据。请参阅下面的合并代码。我试图弄清楚为什么我的 for 循环没有迭代并将每个解析的 xml 文件写入 csv 文件。

import xml.etree.ElementTree as ET
import csv
import os

extra_columns = 2

fields = [
('Id_Customer', 'Id_Customer', 1),
('Segment', 'Segment', 1),
('Nature', 'Event/Nature', 1),
('Extrainfo', 'Event/Extrainfo', 1),
('zipcode', 'Adress/zipcode', extra_columns),
('street', 'Adress/street', extra_columns),
('number', 'Adress/number', extra_columns)]

#tree = ET.parse('cat.xml')
#root = tree.getroot()

# Auto create the header from fields
fieldnames = []

with open(r'customerdata.csv', 'wb') as f_customerdata:
    csv_customerdata = csv.DictWriter(f_customerdata, fieldnames=fieldnames)
    csv_customerdata.writeheader()

    for filename in os.listdir(r'C:\docs'):
        if not filename.endswith('.xml'): continue
        fullname = os.path.join(r'C:\docs',filename)
        tree = ET.parse(fullname)
        root = tree.getroot()

        for node in tree.iter('Customer'):
            row = {}

            for field_name, match, cols in fields:
                if cols > 1:
                    for index, el in enumerate(node.findall(match)):
                        try:
                            if index:
                                row["{}{}".format(field_name, index+1)] = el.text
                            else:
                                row[field_name] = el.text

                        except AttributeError as e:
                            row[field_name] = ''
            else:
                try:
                    row[field_name] = node.find(match).text
                except AttributeError as e:
                    row[field_name] = ''

   csv_customerdata.writerow(row)

更新:

with open(r'customerdata.csv', 'wb') as f_customerdata:
    csv_customerdata = csv.DictWriter(f_customerdata, fieldnames=fieldnames)
    csv_customerdata.writeheader()

for filename in os.listdir(r'C:\docs'):
     if not filename.endswith('.xml'): continue
     fullname = os.path.join(r'C:\docs',filename)
     tree = ET.parse(fullname)
     root = tree.getroot()

     for node in tree.iter('Customer'):
         row = {}
         for field_name, match, cols in fields:
             if cols > 1:
                 for index, el in enumerate(node.findall(match)):
                     try:
                         if index:
                             row["{}{}".format(field_name, index+1)] = el.text
                         else:
                             row[field_name] = el.text

                     except AttributeError as e:
                         row[field_name] = ''
            else:
                try:
                    row[field_name] = node.find(match).text
                except AttributeError as e:
                    row[field_name] = ''

       csv_customerdata.writerow(row)

【问题讨论】:

    标签: python xml csv


    【解决方案1】:

    您正在文件循环之外写入 CSV。以下行应该在遍历文件的循环中:

    csv_customerdata.writerow(row)
    

    【讨论】:

    • 由于我已经更新了将 csv_customerdata.writerow(row) 移动到文件名 for 循环中,我现在从每个 xml 文件中获取数据。但是,如果 xml 文件中有 2 个客户,它只会拉 1 个客户,然后移动到下一个文件。如何从 xml 文件中提取所有客户,然后转到下一个客户?
    • 记住每个文件可以有不同数量的客户。文件 1 可能有 5 个客户,文件 2 可能有 1 个,文件 3 可能有 2 个,依此类推。
    • 同样的问题。将其移入循环中。
    • 对不起,当我查看代码时,我看到它在循环中,所以很明显我没有关注。您能否使用上面的代码提供一个示例,以便我将其可视化?谢谢。
    • 它不在内部循环中。您应该逐行跟踪代码,最好使用调试器,看看发生了什么。追踪后应该很明显。
    【解决方案2】:

    不要考虑单个 for 循环(除了文件循环)、if 逻辑或辅助字典/列表对象。只需运行 XSLT,这种转换语言可以将 XML 转换为其他 XML、HTML,尤其适用于您的 CSV/TXT 文件。

    但是,为了运行 XSLT 1.0 脚本,您需要 Python 的第三方模块 lxml,而不是内置的 etree。有external processors Python 可以调用,即使是内置的,例如Window 的System.Xml.Xsl 和Linux/Mac 的xsltproc

    XSLT (另存为 .xsl,另一个特殊的格式良好的 .xml 文件)

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output indent="yes" method="text"/>
      <xsl:strip-space elements="*"/>
    
      <xsl:template match="/CAT">
        <xsl:text>Id_Customer,Segment,Nature,Extrainfo,zipcode,zipcode2,zipcode3,street,street2,street3,number,number2,number3&#xa;</xsl:text>
        <xsl:apply-templates select="Customer"/>            
      </xsl:template>  
    
      <xsl:template match="Customer">    
          <xsl:value-of select="concat(Id_Customer, ', ', Segment, ', ', Event/Nature, ', ', Event/Extrainfo, ', ', 
                                       Address[1]/zipcode, ', ', Address[2]/zipcode, ', ', Address[3]/zipcode, ', ', 
                                       Address[1]/street, ', ', Address[2]/street, ', ', Address[3]/street, ', ', 
                                       Address[1]/number, ', ', Address[2]/number, ', ', Address[3]/number)"/><xsl:text>&#xa;</xsl:text>      
      </xsl:template>
    
    </xsl:stylesheet>
    

    Python (通过此过程集成您的 XML)

    import lxml.etree as et
    
    # LOAD XML AND XSL
    doc = et.parse('Input.xml')
    xsl = et.parse('XSLTScript.xsl')
    
    # TRANSFORM XSLT
    transform = et.XSLT(xsl)
    result = transform(doc)
    
    # SAVE RESULT TO FILE
    with open('Output.csv', 'w') as f:
        f.write(str(result))
    

    输入示例(从以前的问题帖子扩展)

    <?xml version="1.0"?>
    <CAT>
        <Header>...</Header>
        <Add>...</Add>
        <Customer>
            <Id_Customer>xyz1</Id_Customer>
            <Segment>abc1</Segment>
            <Event>
                <Nature>info1</Nature>
                <Extrainfo>info2</Extrainfo>
            </Event>
        </Customer>
        <Customer>
            <Id_Customer>zzwy</Id_Customer>
            <Segment>c2</Segment>
            <Address>
                <zipcode>99999</zipcode>
                <street>belaire drive</street>
                <number>5</number>
            </Address>
            <Address>
                <zipcode>88888</zipcode>
                <street>pennsylvania ave</street>
                <number>10</number>
            </Address>
            <Address>
                <zipcode>77777</zipcode>
                <street>main street</street>
                <number>15</number>
            </Address>
        </Customer>
    </CAT>
    

    输出

    Id_Customer,Segment,Nature,Extrainfo,zipcode,zipcode2,zipcode3,street,street2,street3,number,number2,number3
    xyz1, abc1, info1, info2, , , , , , , , , 
    zzwy, c2, , , 99999, 88888, 77777, belaire drive, pennsylvania ave, main street, 5, 10, 15
    

    【讨论】: