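[Posted]: 2025-12-02 02:15:01
[Question]:
I created the following function to convert an XML file into a DataFrame. It works for files smaller than 1 GB, but with anything larger the RAM (13 GB on Google Colab) is exhausted and the session crashes. The same thing happens when I try it locally in Jupyter Notebook (laptop with 4 GB RAM). Is there a way to optimize the code?
Code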
# Libraries
import pandas as pd
# xml.etree.cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
import xml.etree.ElementTree as ET

# Function to convert XML file to Pandas DataFrame
def xml2df(file_path):
    # Parsing XML file and obtaining root
    tree = ET.parse(file_path)
    root = tree.getroot()
    dict_list = []
    for _, elem in ET.iterparse(file_path, events=("end",)):
        if elem.tag == "row":
            dict_list.append(elem.attrib)  # PARSE ALL ATTRIBUTES
            elem.clear()
    df = pd.DataFrame(dict_list)
    return df
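
One thing worth noting about the function above: ET.parse(file_path) builds the complete tree in memory before ET.iterparse ever runs, and dict_list then keeps every row as well. A minimal sketch of a streaming alternative follows, assuming it is acceptable to write the rows to an intermediate CSV file instead of holding them in a list. The helper name xml_rows_to_csv, its parameters, and the assumption that every row has the same attributes as the first one are all illustrative and not part of the original code.

```python
import csv
import xml.etree.ElementTree as ET


def xml_rows_to_csv(xml_path, csv_path, row_tag="row"):
    """Stream the attributes of each <row> element to a CSV file.

    Hypothetical helper. Assumes every row carries the same attribute
    names as the first row; csv.DictWriter raises ValueError otherwise.
    Returns the number of rows written.
    """
    # Ask for "start" events too, so the root element can be captured first.
    context = ET.iterparse(xml_path, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element

    writer = None
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        for event, elem in context:
            if event == "end" and elem.tag == row_tag:
                if writer is None:
                    # Column names are taken from the first row seen.
                    writer = csv.DictWriter(out, fieldnames=list(elem.attrib))
                    writer.writeheader()
                writer.writerow(elem.attrib)
                count += 1
                # Detach finished rows from the root so they can be freed.
                root.clear()
    return count
```

The resulting CSV can then be read back in pieces, for example with pandas.read_csv and its chunksize parameter, so the whole table does not have to sit in memory at once.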
Part of the XML file ('Badges.xml')
<badges>
<row Id="82946" UserId="3718" Name="Teacher" Date="2008-09-15T08:55:03.923" Class="3" TagBased="False" />
<row Id="82947" UserId="994" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82949" UserId="3893" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82950" UserId="4591" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82951" UserId="5196" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82952" UserId="2635" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82953" UserId="1113" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
I also tried the SAX code below, but I get the same RAM exhausted error.
import pandas as pd
import xml.sax

class BadgeHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.row = None
        self.row_data = []
        self.df = None

    # Called when an element starts
    def startElement(self, tag, attributes):
        if tag == 'row':
            self.row = attributes._attrs

    # Called when an element ends
    def endElement(self, tag):
        if self.row and tag == 'row':
            self.row_data.append(self.row)

    def endDocument(self):
        self.df = pd.DataFrame(self.row_data)

LOAD_FROM_FILE = True
handler = BadgeHandler()
if LOAD_FROM_FILE:
    print('loading from file')
    # 'rows.xml' is a file that contains your XML example
    xml.sax.parse('/content/Badges.xml', handler)
else:
    print('loading from string')
    xml.sax.parseString(xml_str, handler)
print(handler.df)
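
The SAX parser itself streams the file, but the handler above still appends every row to row_data, so memory grows with the file just as in the first version. A minimal sketch of a handler that hands rows off in fixed-size batches is shown below. The class name BatchingRowHandler, the on_batch callback, and the batch_size parameter are hypothetical; what the callback does with each batch (append to a file, insert into a database, and so on) is left open. It also copies attributes through the public items() method rather than the private _attrs field.

```python
import xml.sax


class BatchingRowHandler(xml.sax.ContentHandler):
    """Collect <row> attributes in batches and pass each batch to a callback."""

    def __init__(self, on_batch, batch_size=100_000):
        super().__init__()
        self.on_batch = on_batch      # hypothetical callback, called with a list of dicts
        self.batch_size = batch_size  # illustrative default, tune to available memory
        self.batch = []

    def startElement(self, tag, attributes):
        if tag == "row":
            # Copy the attributes using the public mapping interface.
            self.batch.append(dict(attributes.items()))
            if len(self.batch) >= self.batch_size:
                self._flush()

    def endDocument(self):
        # Hand over whatever is left after the last full batch.
        self._flush()

    def _flush(self):
        if self.batch:
            self.on_batch(self.batch)
            self.batch = []  # start a fresh list so the old batch can be freed
```

With this shape, at most batch_size rows are held by the handler at any time, provided the callback does not keep references to the batches it receives.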
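[Comments]:
- If you remove the attempt to create the DataFrame from dict_list, does it still crash?
- Also, please show the actual traceback/error you get.
- @AKX I don't get a traceback; the RAM is simply exhausted and the session restarts. Apart from dict_list, I don't have any other way to create the DataFrame.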
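
As a rough way to approach the first comment's question, the per-row cost of keeping one dict per row can be measured directly. The sketch below uses a row copied from the excerpt in the question. Note that sys.getsizeof reports only the size of each object itself, and the exact numbers depend on the Python version and platform, so this is an estimate rather than a measurement of the real workload.

```python
import sys

# One row, copied from the Badges.xml excerpt in the question.
row = {
    "Id": "82946",
    "UserId": "3718",
    "Name": "Teacher",
    "Date": "2008-09-15T08:55:03.923",
    "Class": "3",
    "TagBased": "False",
}

dict_bytes = sys.getsizeof(row)                              # dict container only
tuple_bytes = sys.getsizeof(tuple(row.values()))             # same values in a tuple
string_bytes = sum(sys.getsizeof(v) for v in row.values())   # the value strings

print(f"dict container : {dict_bytes} bytes per row")
print(f"tuple container: {tuple_bytes} bytes per row")
print(f"value strings  : {string_bytes} bytes per row")
```

Multiplying the per-row figures by the number of rows in the file gives a lower bound on what dict_list needs before pd.DataFrame makes its own copy of the data.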
Tags: python xml pandas dataframe dictionary