【Posted】: 2016-04-25 03:31:20
【Question】:
I have been trying to parse an XML feed into a Pandas DataFrame, but I can't work out where I am going wrong.
import pandas as pd
import requests
import lxml.objectify
path = "http://www2.cineworld.co.uk/syndication/listings.xml"
xml = lxml.objectify.parse(path)
root = xml.getroot()
The next piece of code picks out the parts I want and builds a list of dictionaries, one per show.
shows_list = []
for r in root.cinema:
    rec = {}
    rec['name'] = r.attrib['name']
    rec['info'] = r.attrib["root"] + r.attrib['url']
    listing = r.find("listing")
    for f in listing.film:
        film = rec
        film['title'] = f.attrib['title']
        film['rating'] = f.attrib['rating']
        shows = f.find("shows")
        for s in shows['show']:
            show = rec
            show['time'] = s.attrib['time']
            show['url'] = s.attrib['url']
            #print show
            shows_list.append(rec)
df = pd.DataFrame(shows_list)
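When I run the code, the film and time fields appear to be duplicated across many rows. However, if I enable the print statement in the loop (it is commented out above), the dictionaries look exactly as I would expect.
What am I doing wrong? Please also let me know if there is a more Pythonic way to do the parsing.
Edit, to clarify:
If I use the print statement to check what happens during the loop, these are the last few lines of output.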
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729365&seats=STANDARD', 'time': '2016-02-07T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729366&seats=STANDARD', 'time': '2016-02-08T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729367&seats=STANDARD', 'time': '2016-02-09T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729368&seats=STANDARD', 'time': '2016-02-10T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729369&seats=STANDARD', 'time': '2016-02-11T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'PG', 'name': 'Cineworld Stoke-on-Trent', 'title': 'Autism Friendly Screening - Goosebumps', 'url': '/booking?performance=4782937&seats=STANDARD', 'time': '2016-02-07T11:00:00'}
And this is the end of the list: ...
{'info': 'http://cineworld.co.uk/cinemas/107/information',
'name': 'Cineworld Stoke-on-Trent',
'rating': 'PG',
'time': '2016-02-07T11:00:00',
'title': 'Autism Friendly Screening - Goosebumps',
'url': '/booking?performance=4782937&seats=STANDARD'},
{'info': 'http://cineworld.co.uk/cinemas/107/information',
'name': 'Cineworld Stoke-on-Trent',
'rating': 'PG',
'time': '2016-02-07T11:00:00',
'title': 'Autism Friendly Screening - Goosebumps',
'url': '/booking?performance=4782937&seats=STANDARD'},
{'info': 'http://cineworld.co.uk/cinemas/107/information',
'name': 'Cineworld Stoke-on-Trent',
'rating': 'PG',
'time': '2016-02-07T11:00:00',
'title': 'Autism Friendly Screening - Goosebumps',
'url': '/booking?performance=4782937&seats=STANDARD'},
{'info': 'http://cineworld.co.uk/cinemas/107/information',
'name': 'Cineworld Stoke-on-Trent',
'rating': 'PG',
'time': '2016-02-07T11:00:00',
'title': 'Autism Friendly Screening - Goosebumps',
'url': '/booking?performance=4782937&seats=STANDARD'}]
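The difference between the printed dictionaries and the final list can be reproduced in isolation. The following is a minimal sketch, not the original code: it assumes the cause is that `show = rec` binds a second name to the same dictionary rather than copying it, and the `printed` list is a hypothetical stand-in for what the commented-out print would display at each step.

```python
# Minimal demonstration of aliasing: `show = rec` does not copy the dict.
rec = {'name': 'Cineworld Stoke-on-Trent'}
shows_list = []
printed = []  # hypothetical stand-in for the output of the print statement

for time in ['2016-02-07T11:00:00', '2016-02-08T11:00:00']:
    show = rec                  # another name for the same dict object
    show['time'] = time         # overwrites the key on the shared dict
    printed.append(str(show))   # snapshot of the dict's contents right now
    shows_list.append(rec)      # appends a reference, not a copy

# The snapshots differ, but every list element is the same object,
# so the list only ever shows the values written on the last pass.
print(printed)
print(shows_list)
print(shows_list[0] is shows_list[1])
```

Under that assumption, printing inside the loop looks correct because it captures the dictionary at that moment, while the list holds many references to one dictionary that is overwritten on each pass.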
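【Comments】:
- `print(shows_list)` - maybe the data appears multiple times in `shows_list`? Or maybe it appears multiple times in the XML?
- Use more `print` calls to see what is happening. You call `append` inside the `for` loop, so you may be adding the same `rec` with the same `name` but a different `title`, or the same `title` but a different `time`.
- Your dictionary has only one `title` key and one `time` key. Didn't you intend to have multiple entries? (You overwrite the keys on every pass.)
- @salparadise My thinking was that using a dictionary would mean each individual showing of a particular film at a particular cinema ends up in its own dictionary.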
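Following the point in the comments about keys being overwritten, one possible fix is to build a new dictionary for every show. The sketch below rests on several assumptions: it uses the standard library's `xml.etree.ElementTree` instead of `lxml.objectify` so that it runs without extra dependencies, the `SAMPLE` document and its `<cinemas>` root element are hypothetical and only mirror the element and attribute names used in the question, and `parse_shows` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the element/attribute names used in the question.
SAMPLE = """
<cinemas>
  <cinema name="Cineworld Stoke-on-Trent" root="http://cineworld.co.uk" url="/cinemas/107/information">
    <listing>
      <film title="Dad's Army" rating="TBC">
        <shows>
          <show time="2016-02-10T20:45:00" url="/booking?performance=4729368&amp;seats=STANDARD"/>
          <show time="2016-02-11T20:45:00" url="/booking?performance=4729369&amp;seats=STANDARD"/>
        </shows>
      </film>
      <film title="Autism Friendly Screening - Goosebumps" rating="PG">
        <shows>
          <show time="2016-02-07T11:00:00" url="/booking?performance=4782937&amp;seats=STANDARD"/>
        </shows>
      </film>
    </listing>
  </cinema>
</cinemas>
"""


def parse_shows(xml_text):
    """Return one new dict per <show>, combining cinema, film and show fields."""
    root = ET.fromstring(xml_text)
    rows = []
    for cinema in root.iter('cinema'):
        cinema_fields = {
            'name': cinema.get('name'),
            'info': cinema.get('root') + cinema.get('url'),
        }
        for film in cinema.iter('film'):
            film_fields = {'title': film.get('title'), 'rating': film.get('rating')}
            for show in film.iter('show'):
                # Build a fresh dict for every row instead of mutating a shared one.
                row = dict(cinema_fields)
                row.update(film_fields)
                row['time'] = show.get('time')
                row['url'] = show.get('url')
                rows.append(row)
    return rows


rows = parse_shows(SAMPLE)
for row in rows:
    print(row['title'], row['time'])
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` as in the question. In the original loop, replacing `film = rec` and `show = rec` with copies (for example `dict(rec)` and `dict(film)`) and appending `show` instead of `rec` should have the same effect, though that has not been run against the live feed.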
Tags: python xml pandas lxml lxml.objectify