Python XML 解析和 getElementsByTagName答案

【问题标题】：Python XML Parse and getElementsByTagNamePython XML 解析和 getElementsByTagName
【发布时间】：2020-07-24 18:37:27
【问题描述】：

我试图解析以下 xml 并获取我对围绕我的业务需求感兴趣的特定标签。我想我做错了什么。不知道如何解析我需要的标签？？想利用熊猫，以便我可以进一步过滤细节。学徒所有的支持

我的 XMl 来自 URI

<couponfeed>
 <TotalMatches>1459</TotalMatches>
 <TotalPages>3</TotalPages>
 <PageNumberRequested>1</PageNumberRequested>
 <link type="TEXT">
  <categories>
   <category id="1">Apparel</category>
  </categories>
  <promotiontypes>
    <promotiontype id="11">Percentage off</promotiontype>
   </promotiontypes>
   <offerdescription>25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!</offerdescription>
   <offerstartdate>2020-07-24</offerstartdate>
   <offerenddate>2020-07-26</offerenddate>
   <clickurl>https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0</clickurl>
    <impressionpixel>https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0</impressionpixel>
    <advertiserid>3184</advertiserid>
    <advertisername>cys.com</advertisername>
    <network id="1">US Network</network>
  </link>
 <link type="TEXT">
  <categories>
   <category id="1">Apparel</category>
  </categories>
  <promotiontypes>
   <promotiontype id="11">Percentage off</promotiontype>
  </promotiontypes>
  <offerdescription>25% Off Boys' Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!</offerdescription>
  <offerstartdate>2020-07-24</offerstartdate>
  <offerenddate>2020-07-26</offerenddate>
  <clickurl>https://click.synergy.com/fs-bin/click?id=ZZvk49eM&offerid=777210.100474695&type=3&subid=0</clickurl>
  <impressionpixel>https://ad.synergy.com/fs-bin/show?id=ZZvk49NAwbids=777210.100474695&type=3&subid=0</impressionpixel>
  <advertiserid>3184</advertiserid>
  <advertisername>cys.com</advertisername>
  <network id="1">US Network</network>
 </link>

我的代码

from xml.dom import minidom
import urllib
import pandas as pd 
url = "http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage=500"
xmldoc = minidom.parse(urllib.request.urlopen(url)) 

#itemlist = xmldoc.getElementsByTagName('clickurl')


df_cols = ["promotiontype","category","offerdescription", "offerstartdate", "offerenddate", "clickurl","impressionpixel","advertisername","network"]
rows = []

for entry in xmldoc.couponfeed:
    s_promotiontype = couponfeed.get("promotiontype","")
    s_category = couponfeed.get("category","")
    s_offerdescription = couponfeed.get("offerdescription", "")
    s_offerstartdate = couponfeed.get("offerstartdate", "")
    s_offerenddate = couponfeed.get("offerenddate", "")
    s_clickurl = couponfeed.get("clickurl", "")
    s_impressionpixel = couponfeed.get("impressionpixel", "")
    s_advertisername = couponfeed.get("advertisername","")
    s_network = couponfeed.get ("network","")
       
        
    rows.append({"promotiontype":s_promotiontype, "category": s_category, "offerdescription": s_offerdescription, 
                 "offerstartdate": s_offerstartdate, "offerenddate": s_offerenddate,"clickurl": s_clickurl,"impressionpixel":s_impressionpixel,
                 "advertisername": s_advertisername,"network": s_network})

out_df = pd.DataFrame(rows, columns=df_cols)


out_df.to_csv(r"C:\\Users\rai\Downloads\\merchants_offers_share.csv", index=False)

尝试简单的方法，但我没有得到任何结果

import lxml.etree as ET 
import urllib

response = urllib.request.urlopen('http://couponfeed.synergy.com/coupon?token=xxxxxd39f4e5fe392a25538bb122b&network=1&resultsperpage=500')
xml = response.read()

root = ET.fromstring(xml)

for item in root.findall('.//item'):
    title = item.find('category').text
    print (title)

再试一次

from lxml import etree
import pandas as pd 
import urllib

    url = "http://couponfeed.synergy.com/coupon?token=xxxxxxd39f4e5fe392a25538bb122b&network=1&resultsperpage=500"
    xtree = etree.parse(urllib.request.urlopen(url)) 
    
    for value in xtree.xpath("/root/couponfeed/categories"):
        print(value.text)

【问题讨论】：

你的代码不工作怎么办？
@ReinstateMonica 所以我得到 AttributeError: 'Document' object has no attribute 'couponfeed'
你能提供工作的 xml 文档以便我测试它吗？原样，xml 有语法错误
@ReinstateMonica 可以私下分享吗？
当然，给我发一封电子邮件：forspam103 (at) gmail.com

标签： python xml pandas

【解决方案1】：

另一种方法。

from simplified_scrapy import SimplifiedDoc, utils, req
# html = req.get('http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage=500')
html = '''
<couponfeed>
 <TotalMatches>1459</TotalMatches>
 <TotalPages>3</TotalPages>
 <PageNumberRequested>1</PageNumberRequested>
 <link type="TEXT">
  <categories>
   <category id="1">Apparel</category>
  </categories>
  <promotiontypes>
    <promotiontype id="11">Percentage off</promotiontype>
   </promotiontypes>
   <offerdescription>25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!</offerdescription>
   <offerstartdate>2020-07-24</offerstartdate>
   <offerenddate>2020-07-26</offerenddate>
   <clickurl>https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0</clickurl>
    <impressionpixel>https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0</impressionpixel>
    <advertiserid>3184</advertiserid>
    <advertisername>cys.com</advertisername>
    <network id="1">US Network</network>
  </link>
 </couponfeed>
'''
doc = SimplifiedDoc(html)
df_cols = [
    "promotiontype", "category", "offerdescription", "offerstartdate",
    "offerenddate", "clickurl", "impressionpixel", "advertisername", "network"
]
rows = [df_cols]

links = doc.couponfeed.links  # Get all links
for link in links:
    row = []
    for col in df_cols:
        row.append(link.select(col).text)  # Get col text
    rows.append(row)

utils.save2csv('merchants_offers_share.csv', rows)  # Save to csv file

结果：

promotiontype,category,offerdescription,offerstartdate,offerenddate,clickurl,impressionpixel,advertisername,network
Percentage off,Apparel,25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!,2020-07-24,2020-07-26,https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0,https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0,cys.com,US Network

删除最后一个空行

import io
with io.open('merchants_offers_share.csv', "rb+") as f:
    f.seek(-1,2)
    l = f.read()
    if l == b"\n":
        f.seek(-2,2)
        f.truncate()

【讨论】：

谢谢。奇迹般有效。快速提问为什么 csv 输出会留下一个空行（即）我得到带有替代空记录的记录。如何获得没有空行的漂亮 .csv？
@sunnybabau 很高兴能为您提供帮助。最后一个空行是为了方便追加数据。如果不需要，可以按照上面的方法删除。我改变了答案。
你能否为这个stackoverflow.com/questions/63122779/…提出一个可行的解决方案@

【解决方案2】：

首先，xml 文档没有解析，因为您从源页面复制了一个原始 & 符号 &amp;，这就像 xml 中的关键字。当您的浏览器呈现 xml（或 html）时，它会将 &amp; 转换为 &amp;。

至于代码，获取数据最简单的方法是遍历df_cols，然后对每一列执行getElementsByTagName，这将返回给定列的元素列表。

from xml.dom import minidom
import pandas as pd
import urllib

limit = 500
url = f"http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage={limit}"


xmldoc = minidom.parse(urllib.request.urlopen(url))

df_cols = ["promotiontype","category","offerdescription", "offerstartdate", "offerenddate", "clickurl","impressionpixel","advertisername","network"]

# create an object for each row
rows = [{} for i in range(limit)]

nodes = xmldoc.getElementsByTagName("promotiontype")
node = nodes[0]

for row_name in df_cols:

    # get results for each row_name
    nodes = xmldoc.getElementsByTagName(row_name)
    for i, node in enumerate(nodes):
        rows[i][row_name] = node.firstChild.nodeValue


out_df = pd.DataFrame(rows, columns=df_cols)

nodes = et.getElementsByTagName("promotiontype")
node = nodes[0]

for row_name in df_cols:
    nodes = et.getElementsByTagName(row_name)
    for i, node in enumerate(nodes):
        rows[i][row_name] = node.firstChild.nodeValue


out_df = pd.DataFrame(rows, columns=df_cols)

这不是最有效的方法，但我不确定如何使用minidom。如果效率是一个问题，我建议改用lxml。

【讨论】：

在上面尝试过 LXML，不确定我在这里缺少什么——用我的尝试更新了我的帖子

【解决方案3】：

假设从 URL 解析 XML 没有问题（因为我们端没有链接），如果您在实际节点上解析，您的第一个 lxml 可以工作。具体来说，XML 文档中没有<item> 节点。

改为使用link。并考虑使用嵌套列表/字典理解将内容迁移到数据框。对于lxml，您可以换出findall 和xpath 以返回相同的结果。

df = pd.DataFrame([{item.tag: item.text if item.text.strip() != "" else item.find("*").text
                       for item in lnk.findall("*") if item is not None} 
                       for lnk in root.findall('.//link')])
                       
print(df)
#   categories  promotiontypes                                   offerdescription  ... advertiserid advertisername     network
# 0    Apparel  Percentage off  25% Off Boys Quiksilver Apparel. Shop now at M...  ...         3184        cys.com  US Network
# 1    Apparel  Percentage off  25% Off Boys' Quiksilver Apparel. Shop now at ...  ...         3184        cys.com  US Network

【讨论】：

谢谢.. 所以我得到 AttributeError: 'NoneType' object has no attribute 'strip'
您可能有没有子元素的<link> 节点。见编辑过滤掉Noneitem内部for
对不起，我收到。但与此同时，@dabingsou 提供的以下解决方案有效，这意味着 URL 没有任何问题。不过谢谢大家的支持
我喜欢你的方法，不知道这里有什么问题，你能帮忙把它解析成一个数据帧ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US?limit=200
你试过了吗？对于 ebay XML，只需将 .//link 替换为 .//item。至于其他问题，不知道为什么会收到错误，但与此代码无关。上面的这个答案适用于您发布的 XML。查看访问 URL 的 API 说明。也许你必须发送参数/标题？