【Question title】: Parsing complex XML and writing to CSV
【Posted】: 2015-05-25 02:53:42
【Question】:

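I am trying to parse a relatively complex (to me, anyway!) XML file. I have posted on a similar topic before and learned a bit from it. However, this one is giving me problems. An excerpt of my XML file: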

<?xml version="1.0" ?>
<record number="1" type="custID" first-time="Wed Feb  4 19:22:57 2014" last-time="Fri Feb  7 10:11:02 2015">
    <Customer name="Bob Janotior" custID="4466851">
        <type>Monthly</type>
        <max-books>5</max-books>
        <rental status="false">overdue</essid>
    </Customer>
    <book title="All The Things" type="fiction" author="Jill Taylor" pubID="7744jh566lp">
      <cover>softback</cover>
      <pub>Penguin</pub>
    </book>
    <book title="Mellow Tides of War" type="non-fiction" author="Prof. Lambert et al" pubID="7744gd556se">
      <cover>hardback</cover>
      <pub>Penguin</pub>
    </book>
</record>   
<record number="2" type="custID" first-time="Wed Apr  8 15:23:54 2012" last-time="Fri Feb  7 10:11:02 2015">
    <Customer name="Jayne Wrikcek" custID="4466787">
        <type>Monthly</type>
        <max-books>5</max-books>
        <rental status="false">overdue</essid>
    </Customer>
    <book title="Kiss Me Hardy" type="fiction" author="AR Jones" pubID="766485gf66ki">
      <cover>softback</cover>
      <pub>/Kingsoft</pub>
    </book>
    <book title="Oskar Came Again" type="fiction" author="Johnathan Huphries" pubID="a5555qwd2">
      <cover>hardback</cover>
      <pub>Lofthouse</pub>
    </book>
</record>

Previously I was using this script, which I wrote in Python 2.7:

from xml.dom.minidom import parse
import xml.dom.minidom
import csv

def writeToCSV(myLibrary):
    with open('output.csv', 'wb') as csvfile:
        writer = csv.writer(csvfile, delimiter=',',quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writerow(['title', 'author', 'author'])
        books = myLibrary.getElementsByTagName("book")
        for book in books:
            titleValue = book.getElementsByTagName("title")[0].childNodes[0].data
            authors = [] # get all the authors in a vector
            for author in book.getElementsByTagName("author"):
                authors.append(author.childNodes[0].data)
            writer.writerow([titleValue] + authors) # write to csv

doc = parse('library.xml')
myLibrary = doc.getElementsByTagName("library")[0]
# Print each book's title
writeToCSV(myLibrary)

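This script was actually written for a much simpler XML file. I am having a hard time adapting it to this XML file, whose structure is (to me) far more complex. I am slowly getting the hang of minidom and csv writing, but it is all still new to me. This is the kind of output I would like in the CSV file: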

record number,type,Customer name,CustID,type,max-books,rental status,book,title,type,author,
1,custID,Bob Janotior,4466851,Monthly,5,false,overdue,All The Things,fiction,Jill Taylor,
2,custID,Jayne Wrikcek,4466787,Monthly,5,false,overdue,Kiss Me Hardy,fiction,AR Jones,

【Comments】:

    Tags: python xml python-2.7 csv minidom


    【Answer 1】:

    Here is my version of XML to CSV.

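    I build a dictionary and recursively append to it the ordered items of each XML record. The code accounts for XML children that share the same tag name and renames them child, child2, child3, and so on.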
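
The renaming scheme described above can be illustrated on its own. The following is a minimal sketch, not the answer's full code: the names `seen` and `row` are illustrative, and the two-book record is a cut-down stand-in for the real file.

```python
from collections import defaultdict
from xml.etree import ElementTree as etree

# Cut-down record with two children that share the tag "book"
record = etree.fromstring(
    '<record number="1">'
    '<book title="All The Things"/>'
    '<book title="Mellow Tides of War"/>'
    '</record>'
)

seen = defaultdict(int)  # how many siblings with each tag have been met so far
row = {}
for child in record:
    seen[child.tag] += 1
    # The first occurrence keeps the bare tag; later ones get a numeric suffix
    suffix = "" if seen[child.tag] == 1 else str(seen[child.tag])
    for key, value in child.attrib.items():
        row[child.tag + suffix + ":" + key] = value

print(row)
```

The first book ends up under the key `book:title` and the second under `book2:title`, so neither value overwrites the other.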

    Hope this helps:

    XML file (modified: added a root node "tree" and changed </essid> to </rental>):

    <tree>
        <record number="1" type="custID" first-time="Wed Feb  4 19:22:57 2014" last-time="Fri Feb  7 10:11:02 2015">
            <Customer name="Bob Janotior" custID="4466851">
                <type>Monthly</type>
                <max-books>5</max-books>
                <rental status="false">overdue</rental>
            </Customer>
            <book title="All The Things" type="fiction" author="Jill Taylor" pubID="7744jh566lp">
              <cover>softback</cover>
              <pub>Penguin</pub>
            </book>
            <book title="Mellow Tides of War" type="non-fiction" author="Prof. Lambert et al" pubID="7744gd556se">
              <cover>hardback</cover>
              <pub>Penguin</pub>
            </book>
        </record>
        <record number="2" type="custID" first-time="Wed Apr  8 15:23:54 2012" last-time="Fri Feb  7 10:11:02 2015">
            <Customer name="Jayne Wrikcek" custID="4466787">
                <type>Monthly</type>
                <max-books>5</max-books>
                <rental status="false">overdue</rental>
            </Customer>
            <book title="Kiss Me Hardy" type="fiction" author="AR Jones" pubID="766485gf66ki">
              <cover>softback</cover>
              <pub>/Kingsoft</pub>
            </book>
            <book title="Oskar Came Again" type="fiction" author="Johnathan Huphries" pubID="a5555qwd2">
              <cover>hardback</cover>
              <pub>Lofthouse</pub>
            </book>
        </record>
    </tree>
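
If editing the file by hand is not practical, the same two changes could be applied in code before parsing. This is only a sketch under assumptions: `load_records` is a hypothetical helper, the root name `tree` follows the modified file above, and the plain string replacement of `</essid>` assumes that tag only ever appears as the mismatched closing tag seen in the question's excerpt.

```python
import re
from xml.etree import ElementTree as etree

def load_records(raw_xml, root_tag="tree"):
    """Wrap a multi-root XML fragment in a single root element and parse it."""
    # Drop the XML declaration: it is only allowed at the very start of a document
    body = re.sub(r"^\s*<\?xml[^>]*\?>", "", raw_xml)
    # Repair the mismatched closing tag (assumption: it always closes <rental>)
    body = body.replace("</essid>", "</rental>")
    return etree.fromstring("<{0}>{1}</{0}>".format(root_tag, body))

sample = """<?xml version="1.0" ?>
<record number="1">
    <Customer name="Bob Janotior" custID="4466851">
        <rental status="false">overdue</essid>
    </Customer>
</record>
<record number="2">
    <Customer name="Jayne Wrikcek" custID="4466787">
        <rental status="false">overdue</essid>
    </Customer>
</record>
"""

root = load_records(sample)
print([record.get("number") for record in root])
```

The returned element can be passed to the CSV-writing function in place of `doc.getroot()`.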
    

    Code:

    from collections import defaultdict, OrderedDict
    from xml.etree import ElementTree as etree
    import csv
    
    # Recursively flatten one XML node into `result`.
    #   node   - the element to parse
    #   result - dictionary that collects the parsed values
    #   index  - 1-based position of this node among siblings with the same tag
    def parse_node(node, result, index):
        # Attributes of this node; repeated siblings get a numeric suffix (book2, book3, ...)
        tag_dict = OrderedDict()
        for key, value in node.attrib.items():
            if index > 1:  # not the first sibling with this tag
                tag_dict[node.tag + str(index) + ':' + key] = value
            else:
                tag_dict[node.tag + ':' + key] = value
        # list(node) replaces getchildren(), which was removed in Python 3.9
        children = list(node)
        if children:
            # Count siblings per tag so repeated tags can be told apart
            tag_dict_id = defaultdict(int)
            for child in children:
                tag_dict_id[child.tag] += 1
                # Recurse into the child, collecting its values in tag_dict
                parse_node(child, tag_dict, tag_dict_id[child.tag])
        else:
            # Leaf node: store its text
            if index > 1:
                tag_dict[node.tag + str(index) + ':text'] = node.text
            else:
                tag_dict[node.tag + ':text'] = node.text
        # Merge this node's values into the caller's dictionary
        result.update(tag_dict)
        return result
    
    # Input: an xml root node. Output: 'output.csv'
    def writeToCSV(records_lib):
        records_list = [] # contains each of the records
        # 'wb' is the Python 2 idiom; on Python 3 use open('output.csv', 'w', newline='')
        with open('output.csv', 'wb') as csvfile:
            header = OrderedDict() # dictionary with the csv header
            for record in records_lib:
                parsed_record = parse_node(record, OrderedDict(), 1)
                for x in parsed_record.keys():
                    header[x] = x
                records_list.append(parsed_record)
            writer = csv.DictWriter(csvfile, fieldnames=header.keys())
            writer.writerow(header)
            for record in records_list:
                writer.writerow(record)
    
    
    doc = etree.parse('library.xml')
    root = doc.getroot()
    writeToCSV(root)
    

    Output:

    record:first-time,record:last-time,record:type,record:number,Customer:custID,Customer:name,type:text,max-books:text,rental:status,rental:text,book:title,book:pubID,book:type,book:author,cover:text,pub:text,book2:title,book2:pubID,book2:type,book2:author
    Wed Feb  4 19:22:57 2014,Fri Feb  7 10:11:02 2015,custID,1,4466851,Bob Janotior,Monthly,5,false,overdue,All The Things,7744jh566lp,fiction,Jill Taylor,hardback,Penguin,Mellow Tides of War,7744gd556se,non-fiction,Prof. Lambert et al
    Wed Apr  8 15:23:54 2012,Fri Feb  7 10:11:02 2015,custID,2,4466787,Jayne Wrikcek,Monthly,5,false,overdue,Kiss Me Hardy,766485gf66ki,fiction,AR Jones,hardback,Lofthouse,Oskar Came Again,a5555qwd2,fiction,Johnathan Huphries
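
In the output above, `cover:text` and `pub:text` appear only once per record and hold the second book's values (`hardback` for record 1, whose first book is `softback`). This appears to happen because the keys of nested children are not qualified by their parent, so the second book's `cover` overwrites the first. One possible variation, sketched here for Python 3, prefixes every key with the path of its ancestors. The names `flatten` and `records_to_csv` and the `/` path separator are assumptions for illustration, not part of the original answer.

```python
import csv
import io
from collections import Counter
from xml.etree import ElementTree as etree

def flatten(node, prefix, row):
    """Store attributes and leaf text of `node` in `row`, keyed by full path."""
    for key, value in node.attrib.items():
        row[prefix + ":" + key] = value
    children = list(node)
    if not children:
        row[prefix + ":text"] = node.text
        return
    seen = Counter()  # siblings met so far, per tag
    for child in children:
        seen[child.tag] += 1
        suffix = "" if seen[child.tag] == 1 else str(seen[child.tag])
        flatten(child, prefix + "/" + child.tag + suffix, row)

def records_to_csv(root, out):
    """Write one CSV row per child of `root` to the file-like object `out`."""
    rows, header = [], []
    for record in root:
        row = {}
        flatten(record, record.tag, row)
        for key in row:
            if key not in header:
                header.append(key)  # keep columns in first-seen order
        rows.append(row)
    # restval fills columns that a given record does not have
    writer = csv.DictWriter(out, fieldnames=header, restval="")
    writer.writeheader()
    writer.writerows(rows)

xml_text = """<tree>
<record number="1">
  <book title="All The Things"><cover>softback</cover></book>
  <book title="Mellow Tides of War"><cover>hardback</cover></book>
</record>
</tree>"""

buffer = io.StringIO()
records_to_csv(etree.fromstring(xml_text), buffer)
print(buffer.getvalue())
```

With this key scheme the two covers land in separate columns, `record/book/cover:text` and `record/book2/cover:text`. To write to disk instead of a buffer, pass a file opened with `open('output.csv', 'w', newline='')`.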
    

    Kind regards,

    【Comments】:
