【Question title】: Access grandchildren with Beautiful Soup select
【Posted】: 2020-12-01 04:14:58
【Question】:

I have been struggling with this for a while.

Given the following XML file

<?xml version='1.0' encoding='UTF-8'?>
<html>
    <body>
        <feed xml:base="https:newrecipes.org"
            xmlns="http://www.w3.org/2005/Atom"
            xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
            xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
            <id>https://recipes.com</id>
            <title>Cuisine</title>
            <updated>2020-08-10T08:48:56.800Z</updated>
            <link href="Cuisine" rel="self" title="Cuisine"/>
            <entry>
                <id>https://www.cuisine.org(53198770598313985)</id>
                <category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
                <title></title>
                <updated>1970-01-01T00:00:00.000Z</updated>
                <content type="application/xml">
                    <m:properties>
                        <d:id m:type="Edm.Int64">53198770598313985</d:id>
                        <d:name m:type="Edm.String">American</d:name>
                    </m:properties>
                </content>
            </entry>
            <entry>
                <id>https://www.cuisine.org(53198770598313986)</id>
                <category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
                <title></title>
                <updated>1970-01-01T00:00:00.000Z</updated>
                <content type="application/xml">
                    <m:properties>
                        <d:id m:type="Edm.Int64">53198770598313986</d:id>
                        <d:name m:type="Edm.String">Asian</d:name>
                    </m:properties>
                </content>
            </entry>
        </feed>
      </body>
     </html>
    

Using BeautifulSoup I came up with the following solution, which uses the child combinator to get the ids from the entry tags.

from bs4 import BeautifulSoup
import re
# Make a BS object to parse the XML string (xml_string holds the file shown above).
xml_soup = BeautifulSoup(xml_string, features="lxml")

# Use the child combinator to select the <id> tags that are direct children of <entry>.
cuisine_ids_unparsed = xml_soup.select("entry > id")

# Get the ids from the Tag value using regex.
# Then return the first occurrence of the regex found.
cuisine_ids = [re.findall(r"\((.*)\)", cuisine_id.text)[0] for cuisine_id in cuisine_ids_unparsed]

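This returns all the cuisine ids found in the parentheses of the <id> tags in the file. But I would also like to access the properties inside each entry, because these contain the cuisine id and name without any parsing needed. Unfortunately, with the CSS child combinator (>) I cannot get any deeper, and I wonder whether there is a better way than iterating over the elements to extract the values. Something like: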

cuisine_ids_unparsed = xml_soup.select("entry > content > properties > id")

to retrieve all ids, and

cuisine_names_unparsed = xml_soup.select("entry > content > properties > name")

to retrieve all names.

【Comments】:

    Tags: python css xml beautifulsoup


    【Solution 1】:

    This uses some of @Andrej Kesely's suggestions, but you can use a regular expression instead of zip:

    import re
    from bs4 import BeautifulSoup

    txt = '''<?xml version='1.0' encoding='UTF-8'?>
    <html>
        <body>
            <feed xml:base="https:newrecipes.org"
                xmlns="http://www.w3.org/2005/Atom"
                xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
                xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
                <id>https://recipes.com</id>
                <title>Cuisine</title>
                <updated>2020-08-10T08:48:56.800Z</updated>
                <link href="Cuisine" rel="self" title="Cuisine"/>
                <entry>
                    <id>https://www.cuisine.org(53198770598313985)</id>
                    <category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
                    <title></title>
                    <updated>1970-01-01T00:00:00.000Z</updated>
                    <content type="application/xml">
                        <m:properties>
                            <d:id m:type="Edm.Int64">53198770598313985</d:id>
                            <d:name m:type="Edm.String">American</d:name>
                        </m:properties>
                    </content>
                </entry>
                <entry>
                    <id>https://www.cuisine.org(53198770598313986)</id>
                    <category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
                    <title></title>
                    <updated>1970-01-01T00:00:00.000Z</updated>
                    <content type="application/xml">
                        <m:properties>
                            <d:id m:type="Edm.Int64">53198770598313986</d:id>
                            <d:name m:type="Edm.String">Asian</d:name>
                        </m:properties>
                    </content>
                </entry>
            </feed>
          </body>
    </html>'''
    
    
    xml_soup = BeautifulSoup(txt, features="xml")
    
    properties_unparsed = xml_soup.select('entry > content > m|properties')
    
    for prop in properties_unparsed:
        # prop.text holds the id and the name separated by whitespace.
        # The id is a sequence of digits, the name a sequence of word characters.
        # re.search skips the leading whitespace that re.match would fail on,
        # and \s+ keeps the greedy (\d+) from giving up its last digit to (\w+).
        id_, name = re.search(r'(\d+)\s+(\w+)', prop.text).groups()
        print(id_, name)
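
As an alternative to pulling both values out of `prop.text` with a regex, the two children can be selected by name. The following is only a sketch: the feed is trimmed to the relevant elements, the `NS` dictionary and the `cuisines` list are names introduced here for illustration, and it assumes bs4 with lxml installed. Passing `namespaces=` explicitly means the result does not rely on the prefixes bs4 collected while parsing.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the feed from the question (assumption: only the
# elements needed for the example are kept).
xml = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <entry><content type="application/xml"><m:properties>
    <d:id m:type="Edm.Int64">53198770598313985</d:id>
    <d:name m:type="Edm.String">American</d:name>
  </m:properties></content></entry>
  <entry><content type="application/xml"><m:properties>
    <d:id m:type="Edm.Int64">53198770598313986</d:id>
    <d:name m:type="Edm.String">Asian</d:name>
  </m:properties></content></entry>
</feed>"""

# Prefix-to-URI mapping handed to select(); the prefixes mirror the feed.
NS = {
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}

soup = BeautifulSoup(xml, "xml")  # the "xml" feature requires lxml

cuisines = []
for props in soup.select("entry > content > m|properties", namespaces=NS):
    # Look up each child by its namespaced tag name instead of parsing text.
    id_tag = props.select_one("d|id", namespaces=NS)
    name_tag = props.select_one("d|name", namespaces=NS)
    cuisines.append((id_tag.text, name_tag.text))

print(cuisines)
```

Each tuple pairs the id and name that belong to the same `m:properties` element, so the two values cannot drift out of step the way two separately selected lists could.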
    

    【Comments】:

      【Solution 2】:

      You can use the zip() function to "bundle" the two tags together:

      import re
      from bs4 import BeautifulSoup
      
      
      txt = '''<?xml version='1.0' encoding='UTF-8'?>
      <html>
          <body>
              <feed xml:base="https:newrecipes.org"
                  xmlns="http://www.w3.org/2005/Atom"
                  xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
                  xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
                  <id>https://recipes.com</id>
                  <title>Cuisine</title>
                  <updated>2020-08-10T08:48:56.800Z</updated>
                  <link href="Cuisine" rel="self" title="Cuisine"/>
                  <entry>
                      <id>https://www.cuisine.org(53198770598313985)</id>
                      <category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
                      <title></title>
                      <updated>1970-01-01T00:00:00.000Z</updated>
                      <content type="application/xml">
                          <m:properties>
                              <d:id m:type="Edm.Int64">53198770598313985</d:id>
                              <d:name m:type="Edm.String">American</d:name>
                          </m:properties>
                      </content>
                  </entry>
                  <entry>
                      <id>https://www.cuisine.org(53198770598313986)</id>
                      <category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
                      <title></title>
                      <updated>1970-01-01T00:00:00.000Z</updated>
                      <content type="application/xml">
                          <m:properties>
                              <d:id m:type="Edm.Int64">53198770598313986</d:id>
                              <d:name m:type="Edm.String">Asian</d:name>
                          </m:properties>
                      </content>
                  </entry>
              </feed>
            </body>
      </html>'''
      
      soup = BeautifulSoup(txt, 'xml')
      
      
      for id_, name in zip(soup.select('entry > id'), soup.select('entry > content > m|properties > d|name')):
          print(re.search(r'\((.*?)\)', id_.text).group(1))
          print(name.text)
          print('-' * 80)
      

      Prints:

      53198770598313985
      American
      --------------------------------------------------------------------------------
      53198770598313986
      Asian
      --------------------------------------------------------------------------------
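
The `re.search(...).group(1)` call above raises `AttributeError` when an `<id>` contains no parentheses, as the feed-level `<id>` does. A small guard, sketched with a hypothetical helper `extract_id` (not part of the original answer), could look like this:

```python
import re


def extract_id(text):
    """Return the text inside the first pair of parentheses, or None."""
    match = re.search(r"\((.*?)\)", text)
    return match.group(1) if match else None


print(extract_id("https://www.cuisine.org(53198770598313985)"))
print(extract_id("https://recipes.com"))
```

Returning `None` lets the caller decide whether to skip the entry or report it.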
      

      【Comments】:

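      • Hi Andrej! Thanks for your reply! Which Python version are you using? Because this part: soup.select('entry > content > m|properties > d|name') returns an empty list for me :/
      • @Jack Make sure you are using the latest version of bs4 and the xml parser.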
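
If the namespaced selectors still return an empty list, one fallback that does not depend on the bs4 version is the standard library's `xml.etree.ElementTree`, which takes the prefix mapping explicitly. This swaps in a different parser from the one used in the answers. The sample is trimmed, and the `atom` prefix, the `NS` name, and the `cuisines` list are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the file from the question, including the html/body wrapper.
xml = """<html><body>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <entry><content type="application/xml"><m:properties>
    <d:id m:type="Edm.Int64">53198770598313985</d:id>
    <d:name m:type="Edm.String">American</d:name>
  </m:properties></content></entry>
  <entry><content type="application/xml"><m:properties>
    <d:id m:type="Edm.Int64">53198770598313986</d:id>
    <d:name m:type="Edm.String">Asian</d:name>
  </m:properties></content></entry>
</feed>
</body></html>"""

# ElementTree has no notion of a default prefix in paths, so the Atom
# namespace gets an explicit (arbitrary) prefix here.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}

root = ET.fromstring(xml)

# Walk entry -> content -> properties, then read both children of each match.
cuisines = [
    (props.findtext("d:id", namespaces=NS), props.findtext("d:name", namespaces=NS))
    for props in root.findall(".//atom:entry/atom:content/m:properties", NS)
]

print(cuisines)
```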