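【Question Title】: Python XML Parse and getElementsByTagName
【Posted】: 2020-07-24 18:37:27
【Question】:

I am trying to parse the following XML and extract the specific tags I need for my business requirements. I think I am doing something wrong, and I am not sure how to parse out the tags I need. I would like to use pandas so that I can filter the details further. I appreciate all the support.

My XML comes from a URI: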

<couponfeed>
 <TotalMatches>1459</TotalMatches>
 <TotalPages>3</TotalPages>
 <PageNumberRequested>1</PageNumberRequested>
 <link type="TEXT">
  <categories>
   <category id="1">Apparel</category>
  </categories>
  <promotiontypes>
    <promotiontype id="11">Percentage off</promotiontype>
   </promotiontypes>
   <offerdescription>25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!</offerdescription>
   <offerstartdate>2020-07-24</offerstartdate>
   <offerenddate>2020-07-26</offerenddate>
   <clickurl>https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0</clickurl>
    <impressionpixel>https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0</impressionpixel>
    <advertiserid>3184</advertiserid>
    <advertisername>cys.com</advertisername>
    <network id="1">US Network</network>
  </link>
 <link type="TEXT">
  <categories>
   <category id="1">Apparel</category>
  </categories>
  <promotiontypes>
   <promotiontype id="11">Percentage off</promotiontype>
  </promotiontypes>
  <offerdescription>25% Off Boys' Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!</offerdescription>
  <offerstartdate>2020-07-24</offerstartdate>
  <offerenddate>2020-07-26</offerenddate>
  <clickurl>https://click.synergy.com/fs-bin/click?id=ZZvk49eM&offerid=777210.100474695&type=3&subid=0</clickurl>
  <impressionpixel>https://ad.synergy.com/fs-bin/show?id=ZZvk49NAwbids=777210.100474695&type=3&subid=0</impressionpixel>
  <advertiserid>3184</advertiserid>
  <advertisername>cys.com</advertisername>
  <network id="1">US Network</network>
 </link>

My code:

from xml.dom import minidom
import urllib
import pandas as pd 
url = "http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage=500"
xmldoc = minidom.parse(urllib.request.urlopen(url)) 

#itemlist = xmldoc.getElementsByTagName('clickurl')


df_cols = ["promotiontype","category","offerdescription", "offerstartdate", "offerenddate", "clickurl","impressionpixel","advertisername","network"]
rows = []

for entry in xmldoc.couponfeed:
    s_promotiontype = couponfeed.get("promotiontype","")
    s_category = couponfeed.get("category","")
    s_offerdescription = couponfeed.get("offerdescription", "")
    s_offerstartdate = couponfeed.get("offerstartdate", "")
    s_offerenddate = couponfeed.get("offerenddate", "")
    s_clickurl = couponfeed.get("clickurl", "")
    s_impressionpixel = couponfeed.get("impressionpixel", "")
    s_advertisername = couponfeed.get("advertisername","")
    s_network = couponfeed.get ("network","")
       
        
    rows.append({"promotiontype":s_promotiontype, "category": s_category, "offerdescription": s_offerdescription, 
                 "offerstartdate": s_offerstartdate, "offerenddate": s_offerenddate,"clickurl": s_clickurl,"impressionpixel":s_impressionpixel,
                 "advertisername": s_advertisername,"network": s_network})

out_df = pd.DataFrame(rows, columns=df_cols)


out_df.to_csv(r"C:\\Users\rai\Downloads\\merchants_offers_share.csv", index=False)

I also tried a simpler approach, but I get no results:

import lxml.etree as ET 
import urllib

response = urllib.request.urlopen('http://couponfeed.synergy.com/coupon?token=xxxxxd39f4e5fe392a25538bb122b&network=1&resultsperpage=500')
xml = response.read()

root = ET.fromstring(xml)

for item in root.findall('.//item'):
    title = item.find('category').text
    print (title)

Another attempt:

from lxml import etree
import pandas as pd 
import urllib

url = "http://couponfeed.synergy.com/coupon?token=xxxxxxd39f4e5fe392a25538bb122b&network=1&resultsperpage=500"
xtree = etree.parse(urllib.request.urlopen(url)) 

for value in xtree.xpath("/root/couponfeed/categories"):
    print(value.text)

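【Comments】:

  • In what way does your code not work?
  • @ReinstateMonica So I get AttributeError: 'Document' object has no attribute 'couponfeed'
  • Can you provide a working XML document so I can test it? As posted, the XML has syntax errors.
  • @ReinstateMonica Can I share it privately?
  • Sure, send me an email: forspam103 (at) gmail.com

Tags: python xml pandas


【Answer 1】:

Another approach.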

from simplified_scrapy import SimplifiedDoc, utils, req
# html = req.get('http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage=500')
html = '''
<couponfeed>
 <TotalMatches>1459</TotalMatches>
 <TotalPages>3</TotalPages>
 <PageNumberRequested>1</PageNumberRequested>
 <link type="TEXT">
  <categories>
   <category id="1">Apparel</category>
  </categories>
  <promotiontypes>
    <promotiontype id="11">Percentage off</promotiontype>
   </promotiontypes>
   <offerdescription>25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!</offerdescription>
   <offerstartdate>2020-07-24</offerstartdate>
   <offerenddate>2020-07-26</offerenddate>
   <clickurl>https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0</clickurl>
    <impressionpixel>https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0</impressionpixel>
    <advertiserid>3184</advertiserid>
    <advertisername>cys.com</advertisername>
    <network id="1">US Network</network>
  </link>
 </couponfeed>
'''
doc = SimplifiedDoc(html)
df_cols = [
    "promotiontype", "category", "offerdescription", "offerstartdate",
    "offerenddate", "clickurl", "impressionpixel", "advertisername", "network"
]
rows = [df_cols]

links = doc.couponfeed.links  # Get all links
for link in links:
    row = []
    for col in df_cols:
        row.append(link.select(col).text)  # Get col text
    rows.append(row)

utils.save2csv('merchants_offers_share.csv', rows)  # Save to csv file

Result:

promotiontype,category,offerdescription,offerstartdate,offerenddate,clickurl,impressionpixel,advertisername,network
Percentage off,Apparel,25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!,2020-07-24,2020-07-26,https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0,https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0,cys.com,US Network

More examples here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

To remove the trailing blank line:

import io
with io.open('merchants_offers_share.csv', "rb+") as f:
    f.seek(-1,2)
    l = f.read()
    if l == b"\n":
        f.seek(-2,2)
        f.truncate()
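The alternating blank rows raised in the comments on this answer are a common symptom on Windows when a CSV file is opened in text mode without newline="", because each "\r\n" written by the csv module is then expanded to "\r\r\n". Whether utils.save2csv behaves this way is an assumption, not something confirmed here. As a minimal sketch, the same rows can be written with the standard csv module instead; write_rows is a hypothetical helper name and the sample row is trimmed from the feed above.

```python
import csv

# Trimmed, illustrative columns and one sample record from the feed above.
df_cols = ["promotiontype", "category", "offerstartdate"]
rows = [["Percentage off", "Apparel", "2020-07-24"]]


def write_rows(path, header, records):
    """Write a header and records to a CSV file (hypothetical helper)."""
    # newline="" hands line-ending control to the csv module, which avoids
    # the doubled "\r" that shows up as blank rows on Windows.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(records)


# Example call:
# write_rows("merchants_offers_share.csv", df_cols, rows)
```

The file still ends with a single line terminator after the last record, which is the normal CSV convention.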

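【Comments】:

  • Thanks. Works like a charm. Quick question: why does the CSV output contain blank lines, i.e. I get records alternating with empty records? How can I get a clean .csv without blank lines?
  • @sunnybabau Glad I could help. The last blank line is there to make appending data easier. If you do not need it, you can remove it as shown above. I have updated the answer.
  • Could you suggest a working solution for this stackoverflow.com/questions/63122779/… @
【Answer 2】: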

First, the XML document did not parse because you copied a raw ampersand (&) from the rendered page, and & is a reserved character in XML. When your browser renders XML (or HTML), it converts &amp; to &.
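A small sketch of that point, using only the standard library: the same clickurl element is rejected when it contains a raw ampersand and accepted once the ampersand is escaped. The shortened URL and the helper name parses are illustrative assumptions.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError
from xml.sax.saxutils import escape

# Shortened click URL from the sample feed, containing a raw ampersand.
url = "https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694"


def parses(xml_text):
    """Return True if minidom accepts the text as well-formed XML."""
    try:
        minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False


raw = "<clickurl>" + url + "</clickurl>"              # raw "&" inside
escaped = "<clickurl>" + escape(url) + "</clickurl>"  # "&" written as "&amp;"

raw_ok = parses(raw)
escaped_ok = parses(escaped)

# After parsing, the text content holds the original "&" again.
element = minidom.parseString(escaped).documentElement
text = "".join(node.data for node in element.childNodes)

print(raw_ok, escaped_ok)
print(text)
```

A feed served directly by the API would normally already contain the escaped form; the raw ampersands appear when the XML is copied from a browser view.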

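As for the code, the simplest way to get the data is to iterate over df_cols and call getElementsByTagName for each column, which returns a list of the elements for that column.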

from xml.dom import minidom
import pandas as pd
import urllib.request  # a bare "import urllib" does not reliably load the request submodule

limit = 500
url = f"http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage={limit}"

xmldoc = minidom.parse(urllib.request.urlopen(url))

df_cols = ["promotiontype","category","offerdescription", "offerstartdate", "offerenddate", "clickurl","impressionpixel","advertisername","network"]

# create one dict per row (the feed returns at most `limit` records per page)
rows = [{} for i in range(limit)]

for row_name in df_cols:

    # get all elements for this column name
    nodes = xmldoc.getElementsByTagName(row_name)
    for i, node in enumerate(nodes):
        # the i-th element of each column goes into the i-th row
        rows[i][row_name] = node.firstChild.nodeValue


out_df = pd.DataFrame(rows, columns=df_cols)

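This is not the most efficient approach, but I am not sure how else to do it with minidom. If efficiency is a concern, I suggest using lxml instead.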
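One limitation of collecting values column by column is that rows drift out of alignment if some link lacks a tag. A minimal sketch of an alternative, still with minidom, builds one row per link element instead. The sample XML is a trimmed, hypothetical variant of the feed (the second link deliberately has no promotiontype), and first_text is an illustrative helper name.

```python
from xml.dom import minidom

# Trimmed, hypothetical sample; the second <link> has no <promotiontype>.
sample = """<couponfeed>
 <link type="TEXT">
  <categories><category id="1">Apparel</category></categories>
  <promotiontypes><promotiontype id="11">Percentage off</promotiontype></promotiontypes>
  <offerstartdate>2020-07-24</offerstartdate>
  <advertisername>cys.com</advertisername>
 </link>
 <link type="TEXT">
  <categories><category id="1">Apparel</category></categories>
  <offerstartdate>2020-07-24</offerstartdate>
  <advertisername>cys.com</advertisername>
 </link>
</couponfeed>"""

df_cols = ["promotiontype", "category", "offerstartdate", "advertisername"]


def first_text(parent, tag):
    """Text of the first <tag> below parent, or "" if it is absent or empty."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return ""
    return nodes[0].firstChild.nodeValue


xmldoc = minidom.parseString(sample)

# One dict per <link>, so every value stays with the record it came from.
rows = [
    {col: first_text(link, col) for col in df_cols}
    for link in xmldoc.getElementsByTagName("link")
]

for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows, columns=df_cols) exactly as in the answer's code.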

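【Comments】:

  • I tried lxml above and am not sure what I am missing here. I have updated my post with my attempts.
【Answer 3】:

Assuming there is no problem parsing the XML from the URL (we do not have the link on our end), your first lxml attempt can work if you parse on nodes that actually exist. Specifically, there is no <item> node in the XML document.

Use link instead. Also consider a nested list/dict comprehension to move the content into a data frame. With lxml, you can swap findall for xpath and get the same result.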

df = pd.DataFrame([{item.tag: item.text if item.text.strip() != "" else item.find("*").text
                       for item in lnk.findall("*") if item is not None} 
                       for lnk in root.findall('.//link')])
                       
print(df)
#   categories  promotiontypes                                   offerdescription  ... advertiserid advertisername     network
# 0    Apparel  Percentage off  25% Off Boys Quiksilver Apparel. Shop now at M...  ...         3184        cys.com  US Network
# 1    Apparel  Percentage off  25% Off Boys' Quiksilver Apparel. Shop now at ...  ...         3184        cys.com  US Network
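An element whose first child follows immediately, with no whitespace in between, has text equal to None, which is one way to hit "'NoneType' object has no attribute 'strip'". The sketch below guards against that. It uses the standard library's xml.etree.ElementTree rather than lxml (the find/findall calls used here work the same way in both), on a trimmed hypothetical sample; cell_text is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample: <categories> has no text of its own,
# and <advertiserid> is empty.
sample = """<couponfeed>
 <link type="TEXT">
  <categories><category id="1">Apparel</category></categories>
  <offerstartdate>2020-07-24</offerstartdate>
  <advertiserid></advertiserid>
 </link>
</couponfeed>"""


def cell_text(item):
    """Own text if non-blank, else the first child's text, else ""."""
    own = (item.text or "").strip()  # item.text may be None
    if own:
        return own
    child = item.find("*")           # first child element, if any
    if child is not None and child.text:
        return child.text.strip()
    return ""


root = ET.fromstring(sample)

# One dict per <link>, keyed by the tag of each direct child.
rows = [
    {item.tag: cell_text(item) for item in lnk.findall("*")}
    for lnk in root.findall(".//link")
]

print(rows)
```

Passing rows to pd.DataFrame(rows) gives the same kind of data frame as the comprehension above.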

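【Comments】:

  • Thanks. So I get AttributeError: 'NoneType' object has no attribute 'strip'
  • You may have <link> nodes without child elements. See the edit, which filters out None items in the inner for loop.
  • Sorry, I am getting an error. Meanwhile, the solution provided by @dabingsou works, which means there is nothing wrong with the URL. Thanks to everyone for the support, though.
  • I like your approach and am not sure what is wrong here. Could you help parse this into a data frame: ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US?limit=200
  • Did you try it? For the eBay XML, just replace .//link with .//item. As for the other problem, I do not know why you get the error, but it is unrelated to this code. The answer above works for the XML you posted. Check the API instructions for accessing the URL. Maybe you have to send parameters or headers?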
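If the feed does expect request headers, a minimal sketch with the standard library looks like this. The header values are placeholders, and whether this API needs any headers at all is an assumption; fetch is a hypothetical helper name. Note that the submodule is imported explicitly as urllib.request.

```python
import urllib.request

url = "http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage=500"

# Placeholder headers; the real API may need different ones or none.
req = urllib.request.Request(
    url,
    headers={"User-Agent": "Mozilla/5.0", "Accept": "application/xml"},
)


def fetch(request, timeout=30):
    """Send the request and return the raw response bytes (hypothetical helper)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read()


# Example call (needs network access and a valid token):
# xml_bytes = fetch(req)
```

The returned bytes can be passed to minidom.parseString or etree.fromstring as in the answers above.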