【Question title】: Removing unwanted elements from GPX (XML) files using BeautifulSoup
【Posted】: 2021-10-23 01:21:16
【Question description】:

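I have a number of GPX (XML) files containing extraneous tags that serve no purpose. I want to remove them and then rewrite the files.

Opening and parsing the files with BeautifulSoup (v4) is trivial, but I am now trying to work out how to delete tags.

In the sample fragment below, the tags I want to remove are the entire <name>n</name> elements (where n is an integer) inside trkseg, that is, the whole element and not just its value.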

<?xml version="1.0" encoding="utf-8"?>
<gpx version="1.1">
<metadata>
<name>A Name</name>
<desc>A Description</desc>
<author>
<name>Another Name</name>
<email>emailaddr@nonexistentdomain.com</email>
</author>
<time>2018-10-27T17:58:45Z</time>
</metadata>
<trk>
<desc>
"Walk Number", "Start Date", "Start Time", "Elapsed Time", "Miles","Kilometers", "Steps", "Calories"
2,"27 Oct 2018","1:18:05 pm","4 hours15 minutes29 seconds",13.37,21.52,33436,1,212
</desc>
<trkseg>
<name>2</name>
<trkpt lat="32.01333283" lon="-28.61624884">
<ele>274.0</ele>
<time>2018-10-27T13:18:05Z</time>
</trkpt>
<name>2</name>
<trkpt lat="32.01325155" lon="-28.61617729">
<ele>260.0</ele>
<time>2018-10-27T13:18:32Z</time>
</trkpt>
<name>2</name>
<trkpt lat="32.01317277" lon="-28.6162623">
<ele>264.0</ele>
<time>2018-10-27T13:18:38Z</time>
</trkpt>
<name>2</name>
<trkpt lat="32.01308939" lon="-28.61634673">
<ele>272.0</ele>
<time>2018-10-27T13:18:46Z</time>
</trkpt>
<name>2</name>
<trkpt lat="32.01300121" lon="-28.61649587">
<ele>270.0</ele>
<time>2018-10-27T13:18:54Z</time>
</trkpt>
</trkseg>
</trk>
</gpx>

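The answers about removing tags on SO and elsewhere don't seem to match this use case, and I haven't found the BS documentation helpful (I'm sure that is my shortcoming rather than the documentation's).

(Because the files are fairly simple and consistently formatted, I could strip these tags with awk or sed, but I'd like to know how to do it in BS in case I run into something less simple in the future.)

Anyway, this is as far as I've got: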

# "gpx" is the data fragment given above
from bs4 import BeautifulSoup as BS 

gpxml = BS(gpx, 'xml')

# and I can do this to find all the unwanted tags in <trkseg>

unwanted = gpxml.trkseg.find_all('name')  # note: .name alone returns the string 'trkseg'
print(unwanted)
# [<name>2</name>, <name>2</name>, <name>2</name>, <name>2</name>, <name>2</name>]

# and I can do this to iterate the trkseg and print trkpt & name by turn

for el in gpxml.trkseg:
    print(el)

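But at this point I'm stuck.

I think I may need to use BeautifulSoup's decompose() method somehow?

I'm using BeautifulSoup because I find lxml.etree harder to understand (I'm not a programmer, either by profession or by temperament).

【Question discussion】:

    Tags: python xml beautifulsoup


    【Solution 1】:

    Using ElementTree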

    import xml.etree.ElementTree as ET
    
    xml = '''<?xml version="1.0" encoding="UTF-8"?>
    <gpx version="1.1">
       <metadata>
          <name>A Name</name>
          <desc>A Description</desc>
          <author>
             <name>Another Name</name>
             <email>emailaddr@nonexistentdomain.com</email>
          </author>
          <time>2018-10-27T17:58:45Z</time>
       </metadata>
       <trk>
          <desc>"Walk Number", "Start Date", "Start Time", "Elapsed Time", "Miles","Kilometers", "Steps", "Calories"
    2,"27 Oct 2018","1:18:05 pm","4 hours15 minutes29 seconds",13.37,21.52,33436,1,212</desc>
          <trkseg>
             <name>2</name>
             <trkpt lat="32.01333283" lon="-28.61624884">
                <ele>274.0</ele>
                <time>2018-10-27T13:18:05Z</time>
             </trkpt>
             <name>2</name>
             <trkpt lat="32.01325155" lon="-28.61617729">
                <ele>260.0</ele>
                <time>2018-10-27T13:18:32Z</time>
             </trkpt>
             <name>2</name>
             <trkpt lat="32.01317277" lon="-28.6162623">
                <ele>264.0</ele>
                <time>2018-10-27T13:18:38Z</time>
             </trkpt>
             <name>2</name>
             <trkpt lat="32.01308939" lon="-28.61634673">
                <ele>272.0</ele>
                <time>2018-10-27T13:18:46Z</time>
             </trkpt>
             <name>2</name>
             <trkpt lat="32.01300121" lon="-28.61649587">
                <ele>270.0</ele>
                <time>2018-10-27T13:18:54Z</time>
             </trkpt>
          </trkseg>
       </trk>
    </gpx>'''
    
    root = ET.fromstring(xml)
    trkseg_lst = root.findall('.//trkseg')
    for entry in trkseg_lst:
        for element in list(entry):
            if element.tag == 'name':
                entry.remove(element)
    ET.dump(root)
    

    Output

    <gpx version="1.1">
       <metadata>
          <name>A Name</name>
          <desc>A Description</desc>
          <author>
             <name>Another Name</name>
             <email>emailaddr@nonexistentdomain.com</email>
          </author>
          <time>2018-10-27T17:58:45Z</time>
       </metadata>
       <trk>
          <desc>"Walk Number", "Start Date", "Start Time", "Elapsed Time", "Miles","Kilometers", "Steps", "Calories"
    2,"27 Oct 2018","1:18:05 pm","4 hours15 minutes29 seconds",13.37,21.52,33436,1,212</desc>
          <trkseg>
             <trkpt lat="32.01333283" lon="-28.61624884">
                <ele>274.0</ele>
                <time>2018-10-27T13:18:05Z</time>
             </trkpt>
             <trkpt lat="32.01325155" lon="-28.61617729">
                <ele>260.0</ele>
                <time>2018-10-27T13:18:32Z</time>
             </trkpt>
             <trkpt lat="32.01317277" lon="-28.6162623">
                <ele>264.0</ele>
                <time>2018-10-27T13:18:38Z</time>
             </trkpt>
             <trkpt lat="32.01308939" lon="-28.61634673">
                <ele>272.0</ele>
                <time>2018-10-27T13:18:46Z</time>
             </trkpt>
             <trkpt lat="32.01300121" lon="-28.61649587">
                <ele>270.0</ele>
                <time>2018-10-27T13:18:54Z</time>
             </trkpt>
          </trkseg>
       </trk>
    </gpx>
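
The question also asks to rewrite the files, and GPX files produced by real devices usually declare a default namespace, which the sample above lacks. The following is a minimal sketch of how the same ElementTree approach might be adapted, not a definitive implementation. The namespace URI is assumed to be the GPX 1.1 one, and the helper names `local_name`, `strip_trkseg_names` and `clean_file` are hypothetical.

```python
import xml.etree.ElementTree as ET

GPX_NS = "http://www.topografix.com/GPX/1/1"  # assumption: GPX 1.1 default namespace


def local_name(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit('}', 1)[-1] if isinstance(tag, str) else ''


def strip_trkseg_names(root):
    """Remove <name> children of every <trkseg>; return how many were removed."""
    # Collect the segments first so the tree is not modified while it is walked.
    segments = [e for e in root.iter() if local_name(e.tag) == 'trkseg']
    removed = 0
    for seg in segments:
        # Copy the child list so removal does not disturb the iteration.
        for child in list(seg):
            if local_name(child.tag) == 'name':
                seg.remove(child)
                removed += 1
    return removed


def clean_file(src, dst):
    """Hypothetical helper: read src, strip the tags, write the result to dst."""
    ET.register_namespace('', GPX_NS)  # keep the default namespace unprefixed on output
    tree = ET.parse(src)
    count = strip_trkseg_names(tree.getroot())
    tree.write(dst, encoding='utf-8', xml_declaration=True)
    return count
```

Matching on the local name means the same function works whether or not the file declares a namespace. Only direct children of trkseg are removed, so the name elements under metadata and author are left alone.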
    

    【Discussion】:

      【Solution 2】:

      A BeautifulSoup solution.

      Use the decompose() method to remove the <name> tags from the <trkseg> tag one at a time.

      from bs4 import BeautifulSoup
      gpx = '''
      <?xml version="1.0" encoding="utf-8"?>
      <gpx version="1.1">
      <metadata>
      <name>A Name</name>
      <desc>A Description</desc>
      <author>
      <name>Another Name</name>
      <email>emailaddr@nonexistentdomain.com</email>
      </author>
      <time>2018-10-27T17:58:45Z</time>
      </metadata>
      <trk>
      <desc>
      "Walk Number", "Start Date", "Start Time", "Elapsed Time", "Miles","Kilometers", "Steps", "Calories"
      2,"27 Oct 2018","1:18:05 pm","4 hours15 minutes29 seconds",13.37,21.52,33436,1,212
      </desc>
      <trkseg>
      <name>2</name>
      <trkpt lat="32.01333283" lon="-28.61624884">
      <ele>274.0</ele>
      <time>2018-10-27T13:18:05Z</time>
      </trkpt>
      <name>2</name>
      <trkpt lat="32.01325155" lon="-28.61617729">
      <ele>260.0</ele>
      <time>2018-10-27T13:18:32Z</time>
      </trkpt>
      <name>2</name>
      <trkpt lat="32.01317277" lon="-28.6162623">
      <ele>264.0</ele>
      <time>2018-10-27T13:18:38Z</time>
      </trkpt>
      <name>2</name>
      <trkpt lat="32.01308939" lon="-28.61634673">
      <ele>272.0</ele>
      <time>2018-10-27T13:18:46Z</time>
      </trkpt>
      <name>2</name>
      <trkpt lat="32.01300121" lon="-28.61649587">
      <ele>270.0</ele>
      <time>2018-10-27T13:18:54Z</time>
      </trkpt>
      </trkseg>
      </trk>
      </gpx>
      '''
      
      soup = BeautifulSoup(gpx, 'xml')
      t = soup.find('trkseg')
      while t.find('name') is not None:
          t.find('name').decompose()
      
      print(soup)
      

      Output:

      <?xml version="1.0" encoding="utf-8"?><gpx version="1.1">
      <metadata>
      <name>A Name</name>
      <desc>A Description</desc>
      <author>
      <name>Another Name</name>
      <email>emailaddr@nonexistentdomain.com</email>
      </author>
      <time>2018-10-27T17:58:45Z</time>
      </metadata>
      <trk>
      <desc>
      "Walk Number", "Start Date", "Start Time", "Elapsed Time", "Miles","Kilometers", "Steps", "Calories"
      2,"27 Oct 2018","1:18:05 pm","4 hours15 minutes29 seconds",13.37,21.52,33436,1,212
      </desc>
      <trkseg>
      
      <trkpt lat="32.01333283" lon="-28.61624884">
      <ele>274.0</ele>
      <time>2018-10-27T13:18:05Z</time>
      </trkpt>
      
      <trkpt lat="32.01325155" lon="-28.61617729">
      <ele>260.0</ele>
      <time>2018-10-27T13:18:32Z</time>
      </trkpt>
      
      <trkpt lat="32.01317277" lon="-28.6162623">
      <ele>264.0</ele>
      <time>2018-10-27T13:18:38Z</time>
      </trkpt>
      
      <trkpt lat="32.01308939" lon="-28.61634673">
      <ele>272.0</ele>
      <time>2018-10-27T13:18:46Z</time>
      </trkpt>
      
      <trkpt lat="32.01300121" lon="-28.61649587">
      <ele>270.0</ele>
      <time>2018-10-27T13:18:54Z</time>
      </trkpt>
      </trkseg>
      </trk>
      </gpx>
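
The while loop above works on the first trkseg only, and find('name') searches all descendants. For files with several track segments, a variation along the following lines may be useful. It is a sketch under the assumption that lxml is installed (the 'xml' parser needs it), and the function name `remove_trkseg_names` is hypothetical.

```python
from bs4 import BeautifulSoup


def remove_trkseg_names(gpx_text):
    """Return the GPX text with the <name> children of every <trkseg> removed."""
    soup = BeautifulSoup(gpx_text, 'xml')  # the 'xml' parser requires lxml
    for seg in soup.find_all('trkseg'):
        # recursive=False limits the search to direct children of <trkseg>;
        # find_all returns a list, so removing tags inside the loop is safe.
        for tag in seg.find_all('name', recursive=False):
            tag.decompose()
    return str(soup)
```

The returned string can then be written back to a file with an ordinary open(..., 'w', encoding='utf-8') call. The name elements under metadata and author are untouched because they are not children of a trkseg.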
      

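      【Discussion】:

      • If you do soup = BeautifulSoup(gpx, 'xml') (NB 'xml', not 'lxml'), the html tags are not added (in the version I'm using).
      • OK, I didn't know that. But I see that using the xml parser adds <?xml version="1.0" encoding="utf-8"?> to the soup, so the final soup has two XML prologs.
      • The head and body tags depend on the choice of parser, lxml in your case. With xml you only get the XML header.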
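      【Solution 3】:

      To remove a tag, you should use the decompose method. Then, by applying a filter, you can select only the tags that meet your condition, here <name>2</name>.

      NB: decompose acts on the whole tree, so your original object will be modified.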

      from bs4 import BeautifulSoup as BS
      
      gpxml = '' # from above
      
      soup = BS(gpxml, 'xml')
      
      for tag in soup.find_all('name', string=True):
          if str(tag.string) == '2':
              tag.decompose()
      
      soup = soup.gpx.extract() # skip the xml-header
      print(soup)
      

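      【Discussion】:

      • As pointed out under Ram's solution, bs4 adds document "headers" by default. These are parser-dependent. For xml it is only something like <?xml version="1.0" encoding="utf-8"?>. If it is not wanted, it can be removed with, for example, the extract method.
      【Solution 4】:

      When you need to modify or transform XML files, consider XSLT, a special-purpose language designed for exactly that. For your use case specifically, run the identity transform followed by an empty template, which removes nodes from the tree according to the condition in its match pattern.

      Python can run XSLT 1.0 with the lxml package. This approach needs no loops!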

      import lxml.etree as lx
      
      # PARSE XML AND XSLT
      doc = lx.parse("Input.gpx")
      
      style = lx.fromstring(
      b'''
      <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
          <xsl:strip-space elements="*"/>
      
          <!-- IDENTITY TRANSFORM -->
          <xsl:template match="node()|@*">
           <xsl:copy>
             <xsl:apply-templates select="node()|@*"/>
           </xsl:copy>
          </xsl:template>
          
          <!-- EMPTY TEMPLATE -->
          <xsl:template match="name[text()=number(text())]"/>
          
      </xsl:stylesheet>
      ''')
      
      # INITIALIZE TRANSFORMER AND APPLY IT
      transformer = lx.XSLT(style)
      result = transformer(doc)
      

      Output

      print(result)
      
      <?xml version="1.0"?>
      <gpx version="1.1">
        <metadata>
          <name>A Name</name>
          <desc>A Description</desc>
          <author>
            <name>Another Name</name>
            <email>emailaddr@nonexistentdomain.com</email>
          </author>
          <time>2018-10-27T17:58:45Z</time>
        </metadata>
        <trk>
          <desc>
      "Walk Number", "Start Date", "Start Time", "Elapsed Time", "Miles","Kilometers", "Steps", "Calories"
      2,"27 Oct 2018","1:18:05 pm","4 hours15 minutes29 seconds",13.37,21.52,33436,1,212
      </desc>
          <trkseg>
            <trkpt lat="32.01333283" lon="-28.61624884">
              <ele>274.0</ele>
              <time>2018-10-27T13:18:05Z</time>
            </trkpt>
            <trkpt lat="32.01325155" lon="-28.61617729">
              <ele>260.0</ele>
              <time>2018-10-27T13:18:32Z</time>
            </trkpt>
            <trkpt lat="32.01317277" lon="-28.6162623">
              <ele>264.0</ele>
              <time>2018-10-27T13:18:38Z</time>
            </trkpt>
            <trkpt lat="32.01308939" lon="-28.61634673">
              <ele>272.0</ele>
              <time>2018-10-27T13:18:46Z</time>
            </trkpt>
            <trkpt lat="32.01300121" lon="-28.61649587">
              <ele>270.0</ele>
              <time>2018-10-27T13:18:54Z</time>
            </trkpt>
          </trkseg>
        </trk>
      </gpx>
      

      Save to file

      with open('Output.gpx', 'wb') as f: 
          f.write(result)
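
The empty template above matches any name element whose text is numeric, wherever it appears, so a numeric name under metadata would be dropped as well. If only the names inside trkseg should go, the match pattern can be narrowed. The following is a sketch of that variation, assuming (as in the sample) that the file declares no namespace; the names `STYLESHEET` and `strip_names` are hypothetical.

```python
import lxml.etree as lx

STYLESHEET = b'''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- identity transform: copy everything by default -->
    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <!-- empty template: drop <name> only when its parent is <trkseg> -->
    <xsl:template match="trkseg/name"/>
</xsl:stylesheet>
'''

transformer = lx.XSLT(lx.fromstring(STYLESHEET))


def strip_names(gpx_bytes):
    """Apply the stylesheet to GPX bytes and return the result as a string."""
    doc = lx.fromstring(gpx_bytes)
    return str(transformer(doc))
```

The more specific pattern trkseg/name takes priority over the identity template for those elements, so everything else is copied through unchanged.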
      

      【Discussion】:
