Problems parsing XML with lxml: question and answers

【Question title】: Problems parsing XML with lxml
【Posted】: 2016-04-25 03:31:20
【Question description】:

I have been trying to parse an XML feed into a Pandas DataFrame, but I can't work out where I am going wrong.

import pandas as pd
import requests
import lxml.objectify
path = "http://www2.cineworld.co.uk/syndication/listings.xml"

xml = lxml.objectify.parse(path)
root = xml.getroot()

The next block of code picks out the bits I want and builds a list of dictionaries, one per show.

shows_list = []
for r in root.cinema:
    rec = {}
    rec['name'] = r.attrib['name']
    rec['info'] = r.attrib["root"] + r.attrib['url']
    listing = r.find("listing")
    for f in listing.film:
        film = rec
        film['title'] = f.attrib['title']
        film['rating'] = f.attrib['rating']
        shows = f.find("shows")
        for s in shows['show']:
            show = rec
            show['time'] = s.attrib['time']
            show['url'] = s.attrib['url']
            #print show
            shows_list.append(rec)

df = pd.DataFrame(shows_list)

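When I run the code, the film and time fields appear to be duplicated many times across the rows. However, if I enable the print statement in the code (it is commented out), the dictionaries look exactly as I would expect.

What am I doing wrong? Please also feel free to let me know if there is a more Pythonic way to do the parsing.

Edit, to clarify:

If I use the print statement to check what happens during the loop, these are the last few rows of data.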

{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729365&seats=STANDARD', 'time': '2016-02-07T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729366&seats=STANDARD', 'time': '2016-02-08T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729367&seats=STANDARD', 'time': '2016-02-09T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729368&seats=STANDARD', 'time': '2016-02-10T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729369&seats=STANDARD', 'time': '2016-02-11T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'PG', 'name': 'Cineworld Stoke-on-Trent', 'title': 'Autism Friendly Screening - Goosebumps', 'url': '/booking?performance=4782937&seats=STANDARD', 'time': '2016-02-07T11:00:00'}

And this is the end of the list: ...

{'info': 'http://cineworld.co.uk/cinemas/107/information',
  'name': 'Cineworld Stoke-on-Trent',
  'rating': 'PG',
  'time': '2016-02-07T11:00:00',
  'title': 'Autism Friendly Screening - Goosebumps',
  'url': '/booking?performance=4782937&seats=STANDARD'},
 {'info': 'http://cineworld.co.uk/cinemas/107/information',
  'name': 'Cineworld Stoke-on-Trent',
  'rating': 'PG',
  'time': '2016-02-07T11:00:00',
  'title': 'Autism Friendly Screening - Goosebumps',
  'url': '/booking?performance=4782937&seats=STANDARD'},
 {'info': 'http://cineworld.co.uk/cinemas/107/information',
  'name': 'Cineworld Stoke-on-Trent',
  'rating': 'PG',
  'time': '2016-02-07T11:00:00',
  'title': 'Autism Friendly Screening - Goosebumps',
  'url': '/booking?performance=4782937&seats=STANDARD'},
 {'info': 'http://cineworld.co.uk/cinemas/107/information',
  'name': 'Cineworld Stoke-on-Trent',
  'rating': 'PG',
  'time': '2016-02-07T11:00:00',
  'title': 'Autism Friendly Screening - Goosebumps',
  'url': '/booking?performance=4782937&seats=STANDARD'}]

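【Comments】:

print(shows_list) - maybe the same data appears several times in shows_list? Maybe it appears several times in the XML?
Use more print calls to see what is happening. You call append inside the for loop, so you may be adding the same rec with the same name but a different title, or the same title but a different time.
Your dictionary has only one title key and one time key. Didn't you intend to have multiple entries? (You overwrite the keys on every pass.)
@salparadise My thinking was that using dictionaries would mean each showtime for a particular film at a particular cinema ends up in its own dictionary.

Tags: python xml pandas lxml lxml.objectify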

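【Solution 1】:

Your code has only one object, rec, and it keeps being updated. Because film = rec and show = rec do not create new dictionaries, every entry appended to the list is the same object. Try this: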

from copy import copy
shows_list = []
for r in root.cinema:
    rec = {}
    rec['name'] = r.attrib['name']
    rec['info'] = r.attrib["root"] + r.attrib['url']
    listing = r.find("listing")
    for f in listing.film:
        film = copy(rec) # New object
        film['title'] = f.attrib['title']
        film['rating'] = f.attrib['rating']
        shows = f.find("shows")
        for s in shows['show']:
            show = copy(film) # New object, changed reference
            show['time'] = s.attrib['time']
            show['url'] = s.attrib['url']
            #print show
            shows_list.append(show) # Changed reference

df = pd.DataFrame(shows_list)

With this structure, the data in rec is copied into each film, and the data in each film is copied into each show. Finally, show (not rec) is appended to shows_list.
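
As a hedged sketch of the same idea without copy, each row can also be built as a fresh dictionary in the innermost loop. The example below uses the standard library's xml.etree.ElementTree rather than lxml.objectify so that it runs without extra packages. The inline XML is a made-up sample whose element and attribute names are only inferred from the code above, and flatten_shows is a hypothetical helper name; the real feed may differ.

```python
import xml.etree.ElementTree as ET

# Made-up sample; element and attribute names are inferred from the
# question's code and are an assumption about the real feed.
SAMPLE_XML = """
<cinemas>
  <cinema name="Cinema A" root="http://example.com" url="/cinemas/1/information">
    <listing>
      <film title="Film X" rating="PG">
        <shows>
          <show time="2016-02-07T11:00:00" url="/booking?performance=1"/>
          <show time="2016-02-07T14:00:00" url="/booking?performance=2"/>
        </shows>
      </film>
      <film title="Film Y" rating="TBC">
        <shows>
          <show time="2016-02-07T20:45:00" url="/booking?performance=3"/>
        </shows>
      </film>
    </listing>
  </cinema>
</cinemas>
"""


def flatten_shows(root):
    """Return one new dict per <show>, carrying its cinema and film fields."""
    rows = []
    for cinema in root.iter("cinema"):
        for film in cinema.iter("film"):
            for show in film.iter("show"):
                # A dict literal creates a new object on every iteration,
                # so no row can alias another.
                rows.append({
                    "name": cinema.get("name"),
                    "info": cinema.get("root") + cinema.get("url"),
                    "title": film.get("title"),
                    "rating": film.get("rating"),
                    "time": show.get("time"),
                    "url": show.get("url"),
                })
    return rows


rows = flatten_shows(ET.fromstring(SAMPLE_XML))
for row in rows:
    print(row["title"], row["time"])
```

The resulting list can be passed to pd.DataFrame(rows) in the same way as shows_list.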

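You may want to read this article to learn more about what happens on the line film = rec: you are giving the original dictionary a second name, not creating a new dictionary.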
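
To make the difference concrete, here is a small sketch of aliasing versus copying, using made-up dictionary values. Note that copy makes a shallow copy, which is enough here because the values are plain strings.

```python
from copy import copy

rec = {"name": "Cinema A"}

# Plain assignment binds a second name to the same dictionary.
alias = rec
alias["title"] = "Film X"
print(alias is rec)        # True: one object, two names
print("title" in rec)      # True: the change is visible through rec

# copy() builds a new dictionary with the same key/value pairs.
rec = {"name": "Cinema A"}
film = copy(rec)
film["title"] = "Film X"
print(film is rec)         # False: two separate objects
print("title" in rec)      # False: rec is unchanged

# Appending the same object repeatedly stores the same reference each time,
# so every list entry shows whatever was written last.
rows = []
shared = {}
for t in ["11:00", "14:00"]:
    shared["time"] = t
    rows.append(shared)
print([r["time"] for r in rows])  # ['14:00', '14:00']
```

The last loop reproduces the symptom from the question: the print inside the loop looks right at each step, but the finished list holds the same dictionary many times.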

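【Comments】:

That's great. Thank you. I had run into similar copy issues with numpy and pandas, but I assumed I could simply populate a new dictionary's values from the old dictionary's data. Apparently not.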