Parsing complex XML and writing to CSV (answers)

[Question title]: Parsing complex XML and writing to CSV
[Posted]: 2015-05-25 02:53:42
[Question description]:

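I am trying to parse a relatively complex (for me, anyway!) XML file. I have posted on similar topics before and learned a bit from that. However, this one is giving me trouble. An excerpt from my XML file: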

<?xml version="1.0" ?>
<record number="1" type="custID" first-time="Wed Feb  4 19:22:57 2014" last-time="Fri Feb  7 10:11:02 2015">
    <Customer name="Bob Janotior" custID="4466851">
        <type>Monthly</type>
        <max-books>5</max-books>
        <rental status="false">overdue</essid>
    </Customer>
    <book title="All The Things" type="fiction" author="Jill Taylor" pubID="7744jh566lp">
      <cover>softback</cover>
      <pub>Penguin</pub>
    </book>
    <book title="Mellow Tides of War" type="non-fiction" author="Prof. Lambert et al" pubID="7744gd556se">
      <cover>hardback</cover>
      <pub>Penguin</pub>
    </book>
</record>   
<record number="2" type="custID" first-time="Wed Apr  8 15:23:54 2012" last-time="Fri Feb  7 10:11:02 2015">
    <Customer name="Jayne Wrikcek" custID="4466787">
        <type>Monthly</type>
        <max-books>5</max-books>
        <rental status="false">overdue</essid>
    </Customer>
    <book title="Kiss Me Hardy" type="fiction" author="AR Jones" pubID="766485gf66ki">
      <cover>softback</cover>
      <pub>/Kingsoft</pub>
    </book>
    <book title="Oskar Came Again" type="fiction" author="Johnathan Huphries" pubID="a5555qwd2">
      <cover>hardback</cover>
      <pub>Lofthouse</pub>
    </book>
</record>

Previously I was using this script, which I wrote in Python 2.7:

from xml.dom.minidom import parse
import xml.dom.minidom
import csv

def writeToCSV(myLibrary):
    with open('output.csv', 'wb') as csvfile:
        writer = csv.writer(csvfile, delimiter=',',quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writerow(['title', 'author', 'author'])
        books = myLibrary.getElementsByTagName("book")
        for book in books:
            titleValue = book.getElementsByTagName("title")[0].childNodes[0].data
            authors = [] # get all the authors in a vector
            for author in book.getElementsByTagName("author"):
                authors.append(author.childNodes[0].data)
            writer.writerow([titleValue] + authors) # write to csv

doc = parse('library.xml')
myLibrary = doc.getElementsByTagName("library")[0]
# Print each book's title
writeToCSV(myLibrary)

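This script was actually written for a much simpler XML file. I am having a hard time adapting it to this XML file, which is (to me) far more complex in structure. I am slowly getting to grips with minidom and CSV writing, but it is all still new to me. This is the kind of output I want in the CSV file: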


record number,type,Customer name,CustID,type,max-books,rental status,book,title,type,author,
1,custID,Bob Janotior,4466851,Monthly,5,false,overdue,All The Things,fiction,Jill Taylor,
2,custID,Jayne Wrikcek,4466787,Monthly,5,false,overdue,Kiss Me Hardy,fiction,AR Jones,

[Question discussion]:

Tags: python xml python-2.7 csv minidom

[Solution 1]:

Here is my version of XML to CSV.

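I build a dictionary and recursively append the items of each XML record to it, in order. The code accounts for XML children that share the same tag name and renames them child, child2, child3, and so on.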

Hope this helps:

XML file (modified: added a root node <tree> and changed </essid> to </rental>):

<tree>
    <record number="1" type="custID" first-time="Wed Feb  4 19:22:57 2014" last-time="Fri Feb  7 10:11:02 2015">
        <Customer name="Bob Janotior" custID="4466851">
            <type>Monthly</type>
            <max-books>5</max-books>
            <rental status="false">overdue</rental>
        </Customer>
        <book title="All The Things" type="fiction" author="Jill Taylor" pubID="7744jh566lp">
          <cover>softback</cover>
          <pub>Penguin</pub>
        </book>
        <book title="Mellow Tides of War" type="non-fiction" author="Prof. Lambert et al" pubID="7744gd556se">
          <cover>hardback</cover>
          <pub>Penguin</pub>
        </book>
    </record>
    <record number="2" type="custID" first-time="Wed Apr  8 15:23:54 2012" last-time="Fri Feb  7 10:11:02 2015">
        <Customer name="Jayne Wrikcek" custID="4466787">
            <type>Monthly</type>
            <max-books>5</max-books>
            <rental status="false">overdue</rental>
        </Customer>
        <book title="Kiss Me Hardy" type="fiction" author="AR Jones" pubID="766485gf66ki">
          <cover>softback</cover>
          <pub>/Kingsoft</pub>
        </book>
        <book title="Oskar Came Again" type="fiction" author="Johnathan Huphries" pubID="a5555qwd2">
          <cover>hardback</cover>
          <pub>Lofthouse</pub>
        </book>
    </record>
</tree>
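
Both modifications are needed because the excerpt in the question is not well-formed XML: it has more than one top-level element, and a <rental> element is closed with </essid>. A small sketch of how to check this with ElementTree (the helper name parses and the shortened sample strings are made up for illustration):

```python
from xml.etree import ElementTree as etree

# Two top-level <record> elements: XML allows only one root element.
fragment = (
    '<record number="1"><rental status="false">overdue</rental></record>'
    '<record number="2"><rental status="false">overdue</rental></record>'
)

# Opening and closing tag names do not match.
mismatched = '<tree><rental status="false">overdue</essid></tree>'

# Same records, wrapped in a single root node.
wrapped = '<tree>' + fragment + '</tree>'


def parses(text):
    """Return True if text is well-formed XML, False otherwise."""
    try:
        etree.fromstring(text)
        return True
    except etree.ParseError:
        return False


print(parses(fragment))    # False
print(parses(mismatched))  # False
print(parses(wrapped))     # True
```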

Code:

from collections import defaultdict, OrderedDict
from xml.etree import ElementTree as etree
import csv

# Flattens an XML node into record_dict. index is the 1-based position of this
# node among the siblings that share its tag; it is appended to the key when it
# is greater than 1 (book, book2, ...).
# Limitation: keys are not qualified by their parent, so leaf nodes nested under
# repeated siblings share one key and the last value wins (the <cover> and <pub>
# of the second <book> overwrite those of the first).
def parse_node(root, record_dict, index):
    # Collect the attributes of this node
    tag_dict = OrderedDict()
    for key, value in root.attrib.items():
        if index > 1: # more than one sibling with the same tag
            tag_dict[root.tag + str(index) + ':' + key] = value
        else:
            tag_dict[root.tag + ':' + key] = value
    # Get the children of the node (list(root) replaces the deprecated getchildren())
    children = list(root)
    if children:
        # Count how often each child tag has been seen so far
        tag_counts = defaultdict(int)
        for child in children:
            tag_counts[child.tag] += 1
            # Recurse into the child, collecting its values into tag_dict
            parse_node(child, tag_dict, tag_counts[child.tag])
    else:
        # Leaf node: store the text inside the node
        if index > 1:
            tag_dict[root.tag + str(index) + ':text'] = root.text
        else:
            tag_dict[root.tag + ':text'] = root.text
    # Merge the values collected for this node into the caller's dictionary
    record_dict.update(tag_dict)
    return record_dict

# Input: an XML root node whose children are the records. Output: 'output.csv'
def writeToCSV(records_lib):
    records_list = [] # one flattened dictionary per record
    # 'wb' is the mode the csv module expects on Python 2.7
    with open('output.csv', 'wb') as csvfile:
        header = OrderedDict() # union of all keys, in first-seen order
        for record in records_lib:
            parsed_record = parse_node(record, OrderedDict(), 1)
            for x in parsed_record.keys():
                header[x] = x
            records_list.append(parsed_record)
        writer = csv.DictWriter(csvfile, fieldnames=list(header.keys()))
        writer.writeheader()
        for record in records_list:
            writer.writerow(record)


doc = etree.parse('library.xml')
root = doc.getroot()
writeToCSV(root)
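
The script above targets Python 2.7, where the csv module wants the file opened in binary mode ('wb'). On Python 3 the csv module works on text-mode files opened with newline=''. A minimal sketch of the writing step under that assumption; write_rows and the sample rows are hypothetical and not part of the script above:

```python
import csv
import io


def write_rows(rows, csvfile):
    """Write a list of dicts as CSV; the header is the union of their keys."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval fills in columns that a given row does not have
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(rows)


# Sample rows shaped like the flattened records (values shortened for illustration)
rows = [
    {'record:number': '1', 'book:title': 'All The Things',
     'book2:title': 'Mellow Tides of War'},
    {'record:number': '2', 'book:title': 'Kiss Me Hardy'},
]

# An in-memory buffer keeps the example self-contained; for a real file use
# open('output.csv', 'w', newline='') instead.
buf = io.StringIO()
write_rows(rows, buf)
print(buf.getvalue())
```

Records with fewer repeated children than others simply get empty cells in the extra columns.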

Output:

record:first-time,record:last-time,record:type,record:number,Customer:custID,Customer:name,type:text,max-books:text,rental:status,rental:text,book:title,book:pubID,book:type,book:author,cover:text,pub:text,book2:title,book2:pubID,book2:type,book2:author
Wed Feb  4 19:22:57 2014,Fri Feb  7 10:11:02 2015,custID,1,4466851,Bob Janotior,Monthly,5,false,overdue,All The Things,7744jh566lp,fiction,Jill Taylor,hardback,Penguin,Mellow Tides of War,7744gd556se,non-fiction,Prof. Lambert et al
Wed Apr  8 15:23:54 2012,Fri Feb  7 10:11:02 2015,custID,2,4466787,Jayne Wrikcek,Monthly,5,false,overdue,Kiss Me Hardy,766485gf66ki,fiction,AR Jones,hardback,Lofthouse,Oskar Came Again,a5555qwd2,fiction,Johnathan Huphries
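
Note that cover:text and pub:text in the output above hold the values of the last book of each record, and there are no separate columns for the first book's cover and publisher. The keys of leaf nodes are not qualified by their parent, so a later value overwrites an earlier one. One possible variation, sketched here and not part of the code above, is to build each key from the full path to the node. The function name flatten, the '/' separator and the shortened sample record are assumptions made for illustration:

```python
from collections import OrderedDict, defaultdict
from xml.etree import ElementTree as etree


def flatten(node, prefix, out):
    """Store attributes and leaf text of node in out under path-qualified keys."""
    for key, value in node.attrib.items():
        out[prefix + ':' + key] = value
    children = list(node)
    if not children:
        # Leaf node: keep its text
        out[prefix + ':text'] = node.text
        return out
    seen = defaultdict(int)  # how many children with each tag so far
    for child in children:
        seen[child.tag] += 1
        count = seen[child.tag]
        # Second and later siblings with the same tag get a numeric suffix
        name = child.tag if count == 1 else child.tag + str(count)
        flatten(child, prefix + '/' + name, out)
    return out


sample = """<record number="1">
  <book title="All The Things"><cover>softback</cover></book>
  <book title="Mellow Tides of War"><cover>hardback</cover></book>
</record>"""

row = flatten(etree.fromstring(sample), 'record', OrderedDict())
for key, value in row.items():
    print(key, '=', value)
```

With path-qualified keys, the cover of the first book ends up under record/book/cover:text and the cover of the second under record/book2/cover:text, at the cost of longer column names.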

Kind regards,

[Discussion]: