【Question title】: Extracting similar XML attributes with BeautifulSoup
【Posted】: 2018-01-05 06:26:39
【Question】:

Suppose I have the following XML:

<time from="2017-07-29T08:00:00" to="2017-07-29T09:00:00">
    <!-- Valid from 2017-07-29T08:00:00 to 2017-07-29T09:00:00 -->
    <symbol number="4" numberEx="4" name="Cloudy" var="04"/>
    <precipitation value="0"/>
    <!-- Valid at 2017-07-29T08:00:00 -->
    <windDirection deg="300.9" code="WNW" name="West-northwest"/>
    <windSpeed mps="1.3" name="Light air"/>
    <temperature unit="celsius" value="15"/>
    <pressure unit="hPa" value="1002.4"/>
</time>
<time from="2017-07-29T09:00:00" to="2017-07-29T10:00:00">
    <!-- Valid from 2017-07-29T09:00:00 to 2017-07-29T10:00:00 -->
    <symbol number="4" numberEx="4" name="Partly cloudy" var="04"/>
    <precipitation value="0"/>
    <!-- Valid at 2017-07-29T09:00:00 -->
    <windDirection deg="293.2" code="WNW" name="West-northwest"/>
    <windSpeed mps="0.8" name="Light air"/>
    <temperature unit="celsius" value="17"/>
    <pressure unit="hPa" value="1002.6"/>
</time>

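I want to collect the from attribute of time, the name attribute of symbol, and the value attribute of temperature, and then print them in the form time from: symbol name, temperature value, like this: 2017-07-29, 08:00:00: Cloudy, 15°

(As you can see, this XML contains several name and value attributes.)

My approach so far is quite simple: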

#!/usr/bin/env python
# coding: utf-8

import re
from BeautifulSoup import BeautifulSoup

# data is set to the above XML
soup = BeautifulSoup(data)
# collect the tags of interest into lists. can it be done wiser?
time_l = []
symb_l = []
temp_l = []
for i in soup.findAll('time'):
    i_time = str(i.get('from'))
    time_l.append(i_time)
for i in soup.findAll('symbol'):
    i_symb = str(i.get('name'))
    symb_l.append(i_symb)
for i in soup.findAll('temperature'):
    i_temp = str(i.get('value'))
    temp_l.append(i_temp)
# join the forecast lists to a dict
forc_l = []
for i, j in zip(symb_l, temp_l):
    forc_l.append([i, j])
rez = dict(zip(time_l, forc_l))
# combine and format the rezult. can this dict be printed simpler?
wew = ''
for key in sorted(rez):
    wew += re.sub("T", ", ", key) + str(rez[key])
wew = re.sub("'", "", wew)
wew = re.sub("\[", ": ", wew)
wew = re.sub("\]", "°\n", wew)
# print the rezult
print wew

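But I suppose there must be a better, smarter way? Mostly I am interested in collecting the attributes from the XML; my way of doing it seems rather clumsy to me. Also, is there a simpler way to nicely print a dict like {'a': '[b, c]'}?

Any hints or suggestions would be appreciated.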

【Comments】:

  • "Also, is there a simpler way to nicely print a dict {'a': '[b, c]'}" - try pprint
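
The pprint suggestion can be sketched as follows. The dict below is sample data shaped like the one built in the question (timestamp mapped to symbol name and temperature value); the variable names are only illustrative. pprint is mainly useful for inspecting the structure, so a plain loop is shown as well for the exact layout the question asks for.

```python
from pprint import pformat

# Sample data shaped like the dict built in the question:
# timestamp -> [symbol name, temperature value]
rez = {
    "2017-07-29T09:00:00": ["Partly cloudy", "17"],
    "2017-07-29T08:00:00": ["Cloudy", "15"],
}

# pformat returns the string that pprint.pprint would print;
# by default the keys come out sorted.
pretty = pformat(rez)
print(pretty)

# For the layout requested in the question, a loop with str.format
# avoids the chain of re.sub calls.
lines = [
    "{}: {}, {}°".format(key.replace("T", ", "), name, temp)
    for key, (name, temp) in sorted(rez.items())
]
print("\n".join(lines))
```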

Tags: python xml beautifulsoup


【Answer 1】:
from bs4 import BeautifulSoup
with open("sample.xml", "r") as f: # opening xml file
    content = f.read() # xml content stored in this variable
soup = BeautifulSoup(content, "lxml")
for values in soup.findAll("time"):
    print("{} : {}, {}°".format(values["from"], values.find("symbol")["name"], values.find("temperature")["value"]))

Output:

2017-07-29T08:00:00 : Cloudy, 15°
2017-07-29T09:00:00 : Partly cloudy, 17°
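
The output above keeps the raw timestamp, while the question asks for the form 2017-07-29, 08:00:00. A minimal sketch of one way to get there, assuming the from attribute always has the form YYYY-MM-DDTHH:MM:SS as in the sample; format_forecast is a hypothetical helper name, not part of the answer:

```python
from datetime import datetime


def format_forecast(time_from, symbol_name, temp_value):
    """Return one forecast line, e.g. '2017-07-29, 08:00:00: Cloudy, 15°'.

    Hypothetical helper for illustration; it only reformats strings.
    """
    # Parse the timestamp used in the "from" attribute of <time>.
    moment = datetime.strptime(time_from, "%Y-%m-%dT%H:%M:%S")
    # A datetime accepts strftime codes directly in the format spec.
    return "{:%Y-%m-%d, %H:%M:%S}: {}, {}°".format(moment, symbol_name, temp_value)


print(format_forecast("2017-07-29T08:00:00", "Cloudy", "15"))
```

Inside the loop of the answer it could be called with values["from"], values.find("symbol")["name"] and values.find("temperature")["value"] as the three arguments.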


    【Answer 2】:

    Alternatively, you can read the XML data with the xml.dom.minidom module. This retrieves the data you want:

    from xml.dom.minidom import parse
    doc = parse("path/to/xmlfile.xml") # parse an XML file by name
    itemlist = doc.getElementsByTagName('time')
    for items in itemlist:
        from_tag = items.getAttribute('from')
        symbol_list = items.getElementsByTagName('symbol')
        symbol_name = [d.getAttribute('name') for d in symbol_list][0]
        temperature_list = items.getElementsByTagName('temperature')
        temp_value = [d.getAttribute('value') for d in temperature_list][0]
        print("{} :  {}, {}°".format(from_tag, symbol_name, temp_value))
    

    Output:

    2017-07-29T08:00:00 :  Cloudy, 15°
    2017-07-29T09:00:00 :  Partly cloudy, 17°
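
If the XML is held in a string rather than a file, the same idea can be sketched with parseString. The fragment in the question has no single root element, which minidom requires, so this sketch wraps it in one; the wrapper name forecast and the variable names are assumptions made for illustration, and the XML is abbreviated to the elements that are actually read.

```python
from xml.dom.minidom import parseString

# Abbreviated copy of the fragment from the question.
fragment = """
<time from="2017-07-29T08:00:00" to="2017-07-29T09:00:00">
    <symbol number="4" numberEx="4" name="Cloudy" var="04"/>
    <temperature unit="celsius" value="15"/>
</time>
<time from="2017-07-29T09:00:00" to="2017-07-29T10:00:00">
    <symbol number="4" numberEx="4" name="Partly cloudy" var="04"/>
    <temperature unit="celsius" value="17"/>
</time>
"""

# minidom needs exactly one root element, so wrap the fragment.
# The element name "forecast" is an arbitrary choice for this sketch.
doc = parseString("<forecast>{}</forecast>".format(fragment))

rows = []
for item in doc.getElementsByTagName("time"):
    # Take the first <symbol> and <temperature> inside this <time>.
    symbol = item.getElementsByTagName("symbol")[0]
    temperature = item.getElementsByTagName("temperature")[0]
    rows.append((
        item.getAttribute("from"),
        symbol.getAttribute("name"),
        temperature.getAttribute("value"),
    ))

for time_from, name, value in rows:
    print("{} : {}, {}°".format(time_from, name, value))
```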
    

    Hope this helps.


      【Answer 3】:

      Here is an alternative that uses a built-in module (I am using Python 3.6.2):

      import xml.etree.ElementTree as et # this is built-in module in python3
      tree = et.parse("sample.xml")
      root = tree.getroot()
      for temp in root.iter("time"): # iterate time element in xml
          print(temp.attrib["from"], end=": ") # prints attribute of time element
          for sym in temp.iter("symbol"): # iterate symbol element within time element
              print(sym.attrib["name"], end=", ")
          for t in temp.iter("temperature"): # iterate temperature element within time element
              print(t.attrib["value"], end="°\n")
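
A possible variation on the loop above, sketched under the assumption that each time element holds at most one symbol and one temperature child: find() returns the first matching child (or None), so the inner loops are not needed and the values can be collected into a dict before printing. The root element name forecast and the variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample; the root element name "forecast" is an assumption.
xml_text = """<forecast>
<time from="2017-07-29T08:00:00" to="2017-07-29T09:00:00">
    <symbol number="4" numberEx="4" name="Cloudy" var="04"/>
    <temperature unit="celsius" value="15"/>
</time>
<time from="2017-07-29T09:00:00" to="2017-07-29T10:00:00">
    <symbol number="4" numberEx="4" name="Partly cloudy" var="04"/>
    <temperature unit="celsius" value="17"/>
</time>
</forecast>"""

root = ET.fromstring(xml_text)

forecast = {}
for time_el in root.iter("time"):
    symbol = time_el.find("symbol")            # first <symbol> child or None
    temperature = time_el.find("temperature")  # first <temperature> child or None
    if symbol is None or temperature is None:
        continue  # skip incomplete entries instead of raising
    forecast[time_el.get("from")] = (symbol.get("name"), temperature.get("value"))

for time_from in sorted(forecast):
    name, value = forecast[time_from]
    print("{}: {}, {}°".format(time_from, name, value))
```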
      

