In Python:
Given an XML file (thedoc.xml) such as:
<?xml version="1.0" encoding="UTF-8"?>
<docCollection>
<doc id="123" url="http://url.org/thing?curid=123" title="Farenheit451">
Farenheit451
It was a pleasure to burn...
</doc>
<doc id="456" url="http://url.org/thing?curid=456" title="Sense and sensitivity">
Sense and sensitivity
It was sensibile to be sensitive &amp; nice...
</doc>
</docCollection>
And a script using lxml (thecode.py) such as:
from lxml import etree
import pandas
import html  # Python 3: html.unescape replaces HTMLParser.HTMLParser().unescape

inFile = "./thedoc.xml"
outFile = "./theprocdoc.csv"

# Your XML is likely to be too big to be parsed into memory in one go,
# so it is better to use the incremental parser from lxml.
# It is initialised here to trigger an "end" event each time a "doc" tag
# has been fully parsed.
ctx = etree.iterparse(inFile, events=("end",), tag=("doc",))
csvData = []

# For every parsed element in the "context"...
for event, elem in ctx:
    # ...isolate the tag's attributes and apply some formatting to its text.
    # You can remove the html.unescape call if you are not interested in
    # HTML unescaping. The body is simply split at the newline character
    # and then rejoined to omit the leading blank line and the title line.
    csvData.append({"id": elem.get("id"),
                    "url": elem.get("url"),
                    "title": elem.get("title"),
                    "body": html.unescape("".join(elem.text.split("\n")[2:]))})
    # It is important to call clear here, to release the memory occupied
    # by the element's parsed data.
    elem.clear()

# Finally, turn the list of dictionaries into a DataFrame and write out the
# CSV. pandas' to_csv is used here for convenience; the explicit column
# order keeps the CSV layout the same across pandas versions.
pandas.DataFrame(csvData).to_csv(outFile, index=False,
                                 columns=["body", "id", "title", "url"])
It will produce a CSV (theprocdoc.csv) such as:
body,id,title,url
It was a pleasure to burn...,123,Farenheit451,http://url.org/thing?curid=123
It was sensibile to be sensitive & nice...,456,Sense and sensitivity,http://url.org/thing?curid=456
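For more information (and since I cannot format links in inline comments), see the documentation for lxml.etree.iterparse, html.unescape (the Python 3 replacement for HTMLParser.unescape), and pandas.DataFrame.to_csv.
Hope this helps.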