【Question title】: XML (and HTML) parsing in Python
【Posted】: 2016-03-01 11:17:27
【Question】:

I recently started learning Python and am trying to parse an XML document. Consider the following XML file for reference:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
</catalog>

From this I want to retrieve the first book tag together with all of its contents, i.e.

<book id="bk101">
  <author>Gambardella, Matthew</author>
  <title>XML Developer's Guide</title>
  <genre>Computer</genre>
  <price>44.95</price>
  <publish_date>2000-10-01</publish_date>
  <description>An in-depth look at creating applications
  with XML.</description>
</book>

I come from a Scala background, where I can do this easily:

val node = scala.xml.XML.loadString(str)
val nodeSeq = node \\ "book"
nodeSeq.head.toString()

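I tried doing this with lxml and XPath, but meeting the requirement above gets complicated (recursively collecting the contents of nested elements). Is there a simple way to do this in Python? Can it also be extended to HTML?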

TIA

【Comments】:

  • Have you tried minidom? For someone coming from a Scala or Java background it is probably the easiest package to pick up.
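
A minimal sketch of what the minidom suggestion could look like, using only the standard library. The shortened `data` string and the variable names are assumptions made for illustration; they are not from the original comment.

```python
from xml.dom import minidom

# Shortened copy of the catalog from the question (illustrative data)
data = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
   </book>
</catalog>"""

doc = minidom.parseString(data)
# getElementsByTagName searches all descendants, much like Scala's \\ operator
first_book = doc.getElementsByTagName("book")[0]
# toxml() serializes the element together with all of its children
first_book_xml = first_book.toxml()
print(first_book_xml)
```

This mirrors the Scala snippet fairly closely: parse the string, select every `book`, take the head, and serialize it.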

Tags: python xml xml-parsing html-parsing


【Answer 1】:

Use lxml and XPath:

from lxml import etree

data = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
</catalog>"""

tree = etree.fromstring(data)
books = tree.xpath("//catalog/book")  # or books = tree.xpath("(//catalog/book)[1]")
# books[0] is the first <book>; serialize the whole element, children included
print(etree.tostring(books[0], with_tail=False).decode())

Output:

<book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>

【Comments】:

    【Answer 2】:

    Here is the XPath that extracts only the first book:

    //catalog/book[1]
    

    Here is the complete code that returns the desired result:

    from lxml import html
    
    XML = """<?xml version="1.0"?>
    <catalog>
       <book id="bk101">
          <author>Gambardella, Matthew</author>
          <title>XML Developer's Guide</title>
          <genre>Computer</genre>
          <price>44.95</price>
          <publish_date>2000-10-01</publish_date>
          <description>An in-depth look at creating applications
          with XML.</description>
       </book>
       <book id="bk102">
          <author>Ralls, Kim</author>
          <title>Midnight Rain</title>
          <genre>Fantasy</genre>
          <price>5.95</price>
          <publish_date>2000-12-16</publish_date>
          <description>A former architect battles corporate zombies,
          an evil sorceress, and her own childhood to become queen
          of the world.</description>
       </book>
    </catalog>"""
    
    tree = html.fromstring(XML)
    first_book = tree.xpath('//catalog/book[1]')[0]
    book_id = first_book.xpath('@id')[0]
    author = first_book.xpath('.//author/text()')[0]
    title = first_book.xpath('.//title/text()')[0]
    genre = first_book.xpath('.//genre/text()')[0]
    price = first_book.xpath('.//price/text()')[0]
    publish_date = first_book.xpath('.//publish_date/text()')[0]
    # Collapse the line break and indentation inside the description into single spaces
    description = ' '.join(first_book.xpath('.//description/text()')[0].split())
    
    print("""Book Id:\t\t{}
    Author:\t\t\t{}
    Title:\t\t\t{}
    Genre:\t\t\t{}
    Price:\t\t\t{}
    Publish Date:\t{}
    Description:\t{}""".format(book_id,author,title,genre,price,publish_date,description))
    

    Output:

    Book Id:        bk101
    Author:         Gambardella, Matthew
    Title:          XML Developer's Guide
    Genre:          Computer
    Price:          44.95
    Publish Date:   2000-10-01
    Description:    An in-depth look at creating applications with XML.
    

    If you need the same information from every book inside <catalog>, just change //catalog/book[1] to //catalog/book and loop over the results to extract the fields of each book.
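
A rough sketch of that loop. To keep it dependency-free it swaps in the standard library's `xml.etree.ElementTree` instead of lxml, so the lookups use `findall`/`findtext` rather than full XPath. The shortened `data` string and the `books` list of dicts are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the catalog (illustrative data)
data = """<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <price>44.95</price>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <price>5.95</price>
   </book>
</catalog>"""

root = ET.fromstring(data)  # root is the <catalog> element

books = []
for book in root.findall("book"):  # direct <book> children of <catalog>
    books.append({
        "id": book.get("id"),
        "author": book.findtext("author"),
        "title": book.findtext("title"),
        "price": float(book.findtext("price")),
    })

for record in books:
    print(record)
```

With lxml the loop body would look the same, with `tree.xpath('//catalog/book')` as the iterable.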

    【Comments】:
