Converting "document format" / XML to CSV: answers

【Question title】: Converting "document format" / XML to CSV
【Posted】: 2015-07-11 20:01:47
【Question description】:

I am trying to convert:

<doc id="123" url="http://url.org/thing?curid=123" title="title"> 
Title

text text text more text

</doc>

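into a CSV file (the file contains a large number of similarly formatted "documents"). If it were a regular XML file, I think I could handle it with a solution like this one, but since the snippet above is not regular XML, I am stuck.

I am trying to import the data into postgresql, and from what I have gathered, importing this information is easier if it is in CSV format (if there is another way, please let me know). What I need is to separate out the "id", "url", "title" and "text/body".

Bonus question: the first line of the text/body is the document's title. Is it possible to remove or manipulate that first line during the conversion?

Thanks!

【Question comments】: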

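This part is unclear: "the file contains a large number of similarly formatted "documents"". If it has multiple doc elements, please show us an example with at least two of them, including the wrapper. Please also show us the expected output.
A_A's answer covers pretty much everything. The file has a bunch of these doc elements with no wrapper around them. I added the necessary markup with echo '<?xml version="1.0" encoding="UTF-8"?> <docCollection>' | cat - app > temp && mv temp app && echo '</docCollection>' >> app (where app is the file name).
If there is no wrapper (that is, no single root element), then it is not XML.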
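
The wrapping step from the comment above can also be done in Python instead of the shell. The following is only a minimal sketch: the function name wrap_in_root and the output path are hypothetical, the root element name docCollection is taken from the comment, and the input is assumed to be UTF-8 text that is otherwise well-formed.

```python
import shutil


def wrap_in_root(src_path, dst_path, root="docCollection"):
    """Copy src_path to dst_path, adding an XML declaration and one root element."""
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        dst.write("<%s>\n" % root)
        # copyfileobj streams the file in chunks, so a large input is not
        # loaded into memory at once.
        shutil.copyfileobj(src, dst)
        dst.write("\n</%s>\n" % root)
```

Calling wrap_in_root("app", "app.xml") would leave the original file untouched and write the wrapped copy next to it.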

Tags: python xml postgresql csv xslt

【Solution 1】:

As far as Python is concerned:

Given an XML file (thedoc.xml) such as:

<?xml version="1.0" encoding="UTF-8"?>
<docCollection>
    <doc id="123" url="http://url.org/thing?curid=123" title="Farenheit451"> 
    Farenheit451

    It was a pleasure to burn...
    </doc>

    <doc id="456" url="http://url.org/thing?curid=456" title="Sense and sensitivity"> 
    Sense and sensitivity

    It was sensibile to be sensitive &amp; nice...
    </doc>        
</docCollection>

and a script that uses lxml (thecode.py), such as:

from lxml import etree
import pandas
import html #Python 3; on Python 2 use HTMLParser.HTMLParser().unescape instead

inFile = "./thedoc.xml"
outFile = "./theprocdoc.csv"

#It is likely that your XML might be too big to be parsed into memory,
#for this reason it is better to use the incremental parser from lxml.
#It is initialised here to trigger an "end" event after each "doc" element
#has been parsed.
ctx = etree.iterparse(inFile, events = ("end",), tag=("doc",))

csvData = []
#For every parsed element in the "context"...
for event, elem in ctx:
    #...isolate the tag's attributes and apply some formatting to its text.
    #Please note that you can remove the html.unescape call if you are not interested in HTML unescaping.
    #Please also note that the body is simply split at the newline character and then rejoined to omit the title.
    csvData.append({"id":elem.get("id"),
                    "url":elem.get("url"),
                    "title":elem.get("title"),
                    "body":html.unescape("".join(elem.text.split("\n")[2:]))})
    elem.clear() #It is important to call clear here, to release the memory occupied by the element's parsed data.

#Finally, simply turn the list of dictionaries into a DataFrame and write out the CSV. I am using pandas' to_csv here for convenience.
#The column order is pinned so that the header is the same regardless of the pandas version.
pandas.DataFrame(csvData, columns = ["body", "id", "title", "url"]).to_csv(outFile, index = False)

it will produce a CSV (theprocdoc.csv) like the following:

body,id,title,url
        It was a pleasure to burn...    ,123,Farenheit451,http://url.org/thing?curid=123
        It was sensibile to be sensitive & nice...    ,456,Sense and sensitivity,http://url.org/thing?curid=456

For more information (and since I cannot format links in inline comments), please see lxml.etree.iterparse, cgi.escape and pandas.DataFrame.to_csv.

Hope this helps.
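
For readers who would rather not install lxml and pandas, roughly the same conversion can be sketched with the standard library alone. This is an illustrative variant rather than the method used in the answer: the names xml_to_csv and body_without_title are hypothetical, the input is assumed to be well-formed XML with a single root element, and the title line is dropped only when it matches the title attribute. Unlike the script above, it strips the indentation and joins the remaining body lines with single spaces.

```python
import csv
import xml.etree.ElementTree as ET


def body_without_title(text, title):
    """Drop the first non-empty line if it repeats the title, then join the rest."""
    lines = [ln.strip() for ln in (text or "").splitlines()]
    lines = [ln for ln in lines if ln]
    if lines and lines[0] == title:
        lines = lines[1:]
    return " ".join(lines)


def xml_to_csv(in_path, out_path):
    """Stream <doc> elements from in_path and write one CSV row per element."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=["id", "url", "title", "body"])
        writer.writeheader()
        # iterparse yields each element once its closing tag has been read.
        for _event, elem in ET.iterparse(in_path, events=("end",)):
            if elem.tag != "doc":
                continue
            title = elem.get("title")
            writer.writerow({"id": elem.get("id"),
                             "url": elem.get("url"),
                             "title": title,
                             "body": body_without_title(elem.text, title)})
            elem.clear()  # discard the element's text and attributes once written
```

The parser decodes entities such as &amp; on its own, so no separate unescaping step is applied here.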

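【Comments】:

This solves the problem down to the last detail, and it even answers the bonus question perfectly. Thank you very much for the quick reply!
Thanks, glad the answer was helpful to you.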
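As it turns out, there is actually one thing I cannot get to work here: cgi.escape (I think). It fails when it hits a & or a similar special character. I can replace those by hand, but that does not cover every character that is used. So, to make it work in all cases, I think something has to be done with the "body":cgi.escape("" part. Thanks again for your help!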
Please see the revised answer: an &amp; has been added to the content, and it is unescaped to the corresponding character in the CSV.
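Thanks again for your help; unfortunately I cannot get it to work. Python still throws an error when it tries to convert < and &. For example, if I run the program on: in most cases 5 is < than 6 & 7 it raises an error. I can search for & and replace it with &amp;, but that does not work for < and so on. I am probably missing something very basic here, but I just cannot figure it out! It would work if the program simply ignored the < > & " in the body or (even better) converted them to &lt; &gt; and &amp;.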
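
The error described in the last comments is expected with a strict XML parser: a bare & or < inside the body is not well-formed XML, so the parser rejects the file before any unescaping can take place. One possible workaround, sketched below, is to skip the XML parser and treat each doc block as plain text. This is an assumption-laden illustration, not part of the original answer: the name parse_docs is hypothetical, attribute values are assumed to be double-quoted and free of >, the body is assumed never to contain a literal </doc>, and the whole input is read into memory as one string.

```python
import html
import re

# One <doc ...> ... </doc> block; the body is captured as raw text.
DOC_RE = re.compile(r"<doc\s+([^>]*)>(.*?)</doc>", re.DOTALL)
# name="value" pairs inside the opening tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_docs(raw):
    """Yield one dict per <doc> block without requiring well-formed XML."""
    for attrs, body in DOC_RE.findall(raw):
        row = {name: html.unescape(value)
               for name, value in ATTR_RE.findall(attrs)}
        lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
        # Drop the first line when it merely repeats the title attribute.
        if lines and lines[0] == row.get("title"):
            lines = lines[1:]
        # html.unescape decodes entities such as &amp; and leaves a bare
        # & or < untouched.
        row["body"] = html.unescape(" ".join(lines))
        yield row
```

Each yielded dict can then be handed to csv.DictWriter, which takes care of quoting in the output file.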