【Question title】: How to find nested HTML elements using lxml in Python?
【Posted】: 2021-04-30 17:24:54
【Question description】:

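I am trying to scrape the following HTML.

There are multiple div elements with class="review-card".

Each div always contains a script element with data-initial-state="data-always-exist", and sometimes also a script element with data-initial-state="data-may-not-exist".

I want to retrieve the data from both script elements. When the second one does not exist, I want to return a specific value, e.g. 0.

As you can see in the code below, I managed to find the "review-card" div elements. However, I cannot retrieve the script elements inside each individual div. My code always returns a list instead of a single element. What am I doing wrong?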

<html>
    <body>
        <main>
            <div class="review-list">
                <div class="review-card">
                    <article class="review">
                        <script type="application.json" data-initial-state="data-always-exist">
                        {"reviewBody":"Brilliant value","stars":5}
                        </script>
                        <section class="review__content">
                            <div class="content">
                                <script type="application.json" data-initial-state="data-may-not-exist">
                                    {"isVerified":true,"verificationSource":"invitation"}
                                </script>
                            </div>
                        </section>
                    </article>
                </div>
                <div class="review-card">
                    <article class="review">
                            <script type="application.json" data-initial-state="data-always-exist">
                                {"reviewBody":"Brilliant value","stars":5}
                            </script>
                    </article>
                </div>
                <div class="review-card">
                    <article class="review">
                        <script type="application.json" data-initial-state="data-always-exist">
                        {"reviewBody":"Great","stars":4}
                        </script>
                        <section class="review__content">
                            <div class="content">
                                <script type="application.json" data-initial-state="data-may-not-exist">
                                    {"isVerified":false,"verificationSource":"invitation"}
                                </script>
                            </div>
                        </section>
                    </article>
                </div>

            </div>
        </main>
    </body>
</html>

I tried the following:

from lxml import html
import requests

page = requests.get('http://somewebsite.com')
tree = html.fromstring(page.content)

#finds the review list
review_list = tree.xpath('//div[@class="review-list"]')

#finds all the review cards
review_cards = review_list[0].xpath('//div[contains(@class,"review-card")]')

for card in review_cards:
   
   #this part of the code does not work as intended: it returns a list instead of a single item.
   data_always_exist = card.xpath("//script[starts-with(@data-initial-state, 'data-always-exist')]")
   data_not_always_exist = card.xpath("//script[starts-with(@data-initial-state, 'data-may-not-exist')]")

【Comments】:

  • Can you use BeautifulSoup?
  • @AndrejKesely As a last resort, yes, but I would prefer an lxml solution.
  • I have added both a BeautifulSoup and an lxml version.

Tags: python web-scraping lxml


【Solution 1】:

A solution using BeautifulSoup:

import requests
from bs4 import BeautifulSoup


soup = BeautifulSoup(requests.get("http://somewebsite.com").content, "lxml")

for card in soup.select(".review-card"):
    print("data-always-exist:")
    d = card.select_one('[data-initial-state="data-always-exist"]')
    if d:
        print(d.contents[0].strip())
    print("data-may-not-exist:")
    d = card.select_one('[data-initial-state="data-may-not-exist"]')
    if d:
        print(d.contents[0].strip())

    print("-" * 80)

Prints:

data-always-exist:
{"reviewBody":"Brilliant value","stars":5}
data-may-not-exist:
{"isVerified":true,"verificationSource":"invitation"}
--------------------------------------------------------------------------------
data-always-exist:
{"reviewBody":"Brilliant value","stars":5}
data-may-not-exist:
--------------------------------------------------------------------------------
data-always-exist:
{"reviewBody":"Great","stars":4}
data-may-not-exist:
{"isVerified":false,"verificationSource":"invitation"}
--------------------------------------------------------------------------------

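The lxml version (start the XPath with a dot (.) so the search is relative to the current card; an expression that begins with // searches the whole document, no matter which element it is called on):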

# ...
tree = html.fromstring(page.content)
cards = tree.xpath('//div[contains(@class,"review-card")]')


for card in cards:

    # The leading dot makes each query relative to this card; xpath() still returns a list.
    data_always_exist = card.xpath(
        ".//script[starts-with(@data-initial-state, 'data-always-exist')]"
    )
    data_not_always_exist = card.xpath(
        ".//script[starts-with(@data-initial-state, 'data-may-not-exist')]"
    )

    print(data_always_exist)
    print(data_not_always_exist)
    print("-" * 80)

Prints:

[<Element script at 0x7fc202aadd10>]
[<Element script at 0x7fc202aade50>]
--------------------------------------------------------------------------------
[<Element script at 0x7fc202aadea0>]
[]
--------------------------------------------------------------------------------
[<Element script at 0x7fc202aade50>]
[<Element script at 0x7fc202aadea0>]
--------------------------------------------------------------------------------
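
The question also asked for a single value per card, with a fallback such as 0 when the optional script is missing. Since xpath() always returns a list, one way to get there is to take the first match or fall back to a default. The following is only a minimal sketch under a few assumptions: the names script_json, extract_reviews and SAMPLE are made up for illustration, the markup is a trimmed copy of the sample from the question, and the attribute is matched exactly rather than with starts-with().

```python
import json

from lxml import html

# Trimmed copy of the question's sample markup; the second card has no optional script.
SAMPLE = """
<div class="review-list">
  <div class="review-card">
    <script type="application.json" data-initial-state="data-always-exist">
      {"reviewBody":"Brilliant value","stars":5}
    </script>
    <div class="content">
      <script type="application.json" data-initial-state="data-may-not-exist">
        {"isVerified":true,"verificationSource":"invitation"}
      </script>
    </div>
  </div>
  <div class="review-card">
    <script type="application.json" data-initial-state="data-always-exist">
      {"reviewBody":"Brilliant value","stars":5}
    </script>
  </div>
</div>
"""


def script_json(card, state, default=0):
    """Hypothetical helper: parsed JSON of the first matching script in `card`, else `default`."""
    # ".//" keeps the search inside `card`; "$state" is an XPath variable
    # that lxml fills in from the keyword argument of the same name.
    matches = card.xpath(".//script[@data-initial-state=$state]", state=state)
    if not matches:
        return default
    return json.loads(matches[0].text)


def extract_reviews(markup):
    """Return one (always, maybe) pair per review card found in `markup`."""
    tree = html.fromstring(markup)
    cards = tree.xpath('//div[contains(@class, "review-card")]')
    return [
        (script_json(card, "data-always-exist"), script_json(card, "data-may-not-exist"))
        for card in cards
    ]


reviews = extract_reviews(SAMPLE)
for always, maybe in reviews:
    print(always, maybe)
```

The default of 0 mirrors the value mentioned in the question; None may be a safer choice if 0 could be confused with real data.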

【Discussion】:
