XML (and HTML) parsing in Python: answers

[Question title]: XML (and HTML) parsing in Python
[Posted]: 2016-03-01 11:17:27
[Question description]:

I recently started learning Python and am trying to parse an XML document. Consider the following XML file for reference:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
</catalog>

Here I want to retrieve the first book tag together with all of its contents, i.e.

<book id="bk101">
  <author>Gambardella, Matthew</author>
  <title>XML Developer's Guide</title>
  <genre>Computer</genre>
  <price>44.95</price>
  <publish_date>2000-10-01</publish_date>
  <description>An in-depth look at creating applications
  with XML.</description>
</book>

I come from a Scala background, where I can do this easily:

val node = scala.xml.XML.loadString(str)
val nodeSeq = node \\ "book"
nodeSeq.head.toString()

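I tried to do this with lxml and XPath, but meeting the requirement above gets complicated (recursively collecting the contents of nested elements). Is there a simple way to do this in Python? Can it also be extended to HTML?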

TIA

[Question comments]:

Have you tried minidom? For someone with a Scala or Java background it is probably the easiest package to pick up.
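The minidom suggestion can be sketched as follows. This is a minimal illustration using the standard library's xml.dom.minidom, not code from the commenter; the catalog is shortened, and the helper name first_book_xml is hypothetical.

```python
from xml.dom import minidom

# Shortened copy of the catalog from the question (assumption: same structure).
CATALOG = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
   </book>
</catalog>"""


def first_book_xml(xml_text):
    """Return the first <book> element, with all of its children, as a string."""
    doc = minidom.parseString(xml_text)
    books = doc.getElementsByTagName("book")  # all <book> elements, in document order
    return books[0].toxml()                   # serializes the element and its subtree


first_book = first_book_xml(CATALOG)
print(first_book)
```

Here toxml() plays roughly the role of toString() on the Scala node: it serializes the element itself along with everything nested inside it, so no manual recursion is needed.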

Tags: python xml xml-parsing html-parsing

[Solution 1]:

Using lxml and XPath

from lxml import etree

data = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
</catalog>"""

tree = etree.fromstring(data)
books = tree.xpath("//catalog/book")  # or tree.xpath("(//catalog/book)[1]")
# Serialize the first <book> element itself (books[0]), including all of its children
print(etree.tostring(books[0], with_tail=False).decode())

Output:

<book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
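A detail worth keeping in mind with this API: iterating over an element visits only its children, whereas serializing the element itself keeps the enclosing tag and its attributes. The sketch below shows the difference using the standard library's xml.etree.ElementTree, whose element interface lxml.etree follows; the single-book catalog and the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Single-book catalog, shortened from the question (assumption: same structure).
DATA = """<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
   </book>
</catalog>"""

book = ET.fromstring(DATA).find("book")  # first <book> child of <catalog>

# Iterating an element yields its direct children only.
child_tags = [child.tag for child in book]

# Serializing the element itself keeps the <book id="..."> wrapper.
book.tail = None  # drop the whitespace that follows </book> in the source
whole = ET.tostring(book, encoding="unicode")

print(child_tags)
print(whole)
```

So to get the whole book tag as one string, serialize the element once rather than looping over its children.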

[Discussion]:

[Solution 2]:

This XPath extracts only the first book:

//catalog/book[1]

Here is the complete code that returns the desired result:

from lxml import html

XML = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
</catalog>"""

tree = html.fromstring(XML)
first_book = tree.xpath('//catalog/book[1]')[0]
book_id = first_book.xpath('@id')[0]
author = first_book.xpath('.//author/text()')[0]
title = first_book.xpath('.//title/text()')[0]
genre = first_book.xpath('.//genre/text()')[0]
price = first_book.xpath('.//price/text()')[0]
publish_date = first_book.xpath('.//publish_date/text()')[0]
description = ' '.join(first_book.xpath('.//description/text()')[0].split())  # collapse the line break and indentation into single spaces

print("""Book Id:\t\t{}
Author:\t\t\t{}
Title:\t\t\t{}
Genre:\t\t\t{}
Price:\t\t\t{}
Publish Date:\t{}
Description:\t{}""".format(book_id, author, title, genre, price, publish_date, description))

Output:

Book Id:        bk101
Author:         Gambardella, Matthew
Title:          XML Developer's Guide
Genre:          Computer
Price:          44.95
Publish Date:   2000-10-01
Description:    An in-depth look at creating applications with XML.

If you need the same information for every book inside <catalog>, simply change //catalog/book[1] to //catalog/book and loop over the results to extract each book's field data.
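The loop over every book might look like the sketch below. It swaps in the standard library's xml.etree.ElementTree so that it runs without lxml installed; the catalog is shortened, and the helper name book_records is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the catalog from the question (assumption: same structure).
XML = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <price>44.95</price>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <price>5.95</price>
      <description>A former architect battles corporate zombies.</description>
   </book>
</catalog>"""


def book_records(xml_text):
    """Return one dict per <book>, mapping each child tag to its text."""
    root = ET.fromstring(xml_text)          # root is the <catalog> element
    records = []
    for book in root.findall("book"):       # every <book> directly under the root
        record = {"id": book.get("id")}
        for field in book:
            # Collapse line breaks and indentation inside the text to single spaces.
            record[field.tag] = " ".join((field.text or "").split())
        records.append(record)
    return records


records = book_records(XML)
for record in records:
    print(record)
```

Building each record from whatever child tags are present avoids repeating one lookup per field, at the cost of assuming that every child of a book holds plain text.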

[Discussion]: