【Question title】: extract text from h1 and id with python beautiful soup
【Posted】: 2021-04-20 13:35:34
【Question】:

I am trying to extract the text from the HTML element with id="itemSummaryPrice", but I can't figure it out.

html = """
<div id="itemSummaryContainer" class="content">
            <div id="itemSummaryMainWrapper">
                <div id="itemSummaryImage">
                    <img src="https://img.rl.insider.gg/itemPics/large/endo.fgreen.891c.jpg" alt="Forest Green Endo">
                </div>
                <h2 id="itemSummaryTitle">Item Report</h2>
                <h2 id="itemSummaryDivider"> | </h2>
                <h2 id="itemSummaryDate">Friday, January 15, 2021, 8:38 AM EST</h2>
                <div id="itemSummaryBlankSpace"></div>
                <h1 id="itemSummaryName">
                    <span id="itemNameSpan" style="color: rgb(88, 181, 73);"><span>Forest Green</span> <span>Endo</span></span>
                </h1>
                <h1 id="itemSummaryPrice" style="color: rgb(88, 181, 73);">200 - 300</h1>
            </div>
        </div>
"""

My code:

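import requests
from bs4 import BeautifulSoup

# price_checker_url and match2 are defined elsewhere in the asker's script
price_checker_site = requests.get(price_checker_url + match2)
# BeautifulSoup expects markup (str or bytes), so pass the response body, not the Response object
price_checker_site_soup = BeautifulSoup(price_checker_site.text, 'html.parser')
price_check_item = price_checker_site_soup.find('h1', {'id': 'itemSummaryPrice'})

print(price_check_item)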

<h1 id="itemSummaryPrice"></h1>

What I want to extract:

<h1 id="itemSummaryPrice">200 - 300</h1>
OR
<h1 id="itemSummaryPrice" style="color: rgb(88, 181, 73);">200 - 300</h1>
OR
200 - 300

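【Comments】:

I can't reproduce this with the attached HTML. When I run it, it returns <h1 id="itemSummaryPrice" style="color: rgb(88, 181, 73);">200 - 300</h1>. Something else in your code must be wrong, for example the way you request the HTML. Have you checked what the returned HTML looks like?
At a guess, I'd say the element text is populated by JavaScript. Can you share your URL?
If that is the case @carlos (and it very likely is), consider using selenium instead of requests.
@JustinEzequiel rl.insider.gg/en/psn/octane/grey This is a page similar to the one I used in my post.
I checked your URL, and <h1 id="itemSummaryPrice"></h1> is what you get without JavaScript. So you either need to use Selenium, which runs the JavaScript, or find the request the JavaScript makes and replicate it with requests.

Tags: python html beautifulsoup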

【Solution 1】:

I can't comment yet, so I'm posting this as an answer. Shouldn't you call .text on price_check_item?

So the Python code would look like this.

price_checker_site = requests.get(price_checker_url + match2)
# Pass the response body (.text), not the Response object, to BeautifulSoup
price_checker_site_soup = BeautifulSoup(price_checker_site.text, 'html.parser')
price_check_item = price_checker_site_soup.find('h1', {'id': 'itemSummaryPrice'})

print(price_check_item.text)  # print(price_check_item.text.strip()) also works

I think this is the right answer. Unfortunately I can't test it right now; I'll check my code for you tonight.
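As a rough offline check of what .text can and cannot recover, the sketch below runs BeautifulSoup against the two variants of the tag quoted in the question: the one with the price filled in, and the empty one the asker actually received. The helper name price_text and the two string variables are made up for illustration, and get_text(strip=True) is used as a close relative of .text.strip().

```python
from bs4 import BeautifulSoup

# The tag as it appears once the price has been filled in
rendered_html = '<h1 id="itemSummaryPrice" style="color: rgb(88, 181, 73);">200 - 300</h1>'
# The tag as the asker's request actually returned it
unrendered_html = '<h1 id="itemSummaryPrice"></h1>'


def price_text(html):
    """Return the stripped text of #itemSummaryPrice, or None if the tag is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('h1', id='itemSummaryPrice')
    return None if tag is None else tag.get_text(strip=True)


print(repr(price_text(rendered_html)))    # '200 - 300'
print(repr(price_text(unrendered_html)))  # ''
```

If the second call is what you see against the live page, the tag was found but has no text in the downloaded HTML, so calling .text alone will not produce the price.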

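【Comments】:

【Solution 2】:

As discussed in the comments, the content you are looking for is loaded dynamically with JavaScript. So you must either use a library like Selenium to run the JS, or work out where and how the data is loaded and replicate that.

Method 1: Using Selenium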

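from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://rl.insider.gg/en/psn/octane/grey'
# Assumes geckodriver is on PATH; older Selenium versions accepted executable_path='YOUR PATH'
driver = webdriver.Firefox()  # or Chrome
driver.get(url)
# Selenium 4 deprecated and later removed find_element_by_id in favour of find_element(By.ID, ...)
price = driver.find_element(By.ID, 'itemSummaryPrice')
print(price.text)
driver.quit()

This case is simple: you just load the page and use find_element to get the data you want.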

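Method 2: Trace and replicate

If you look in your browser's debugger, you can find where and how itemSummaryPrice is set.

In particular, we find that it is set in https://rl.insider.gg/js/itemDetails.js using $('#itemSummaryPrice').text(itemData.currentPriceRange).

The next step is to find out where itemData comes from. It turns out it does not come from another file or an API call. Instead, it appears to be hard-coded in the HTML source itself (probably filled in on the server side).

If you inspect the source, you will find that itemData is just a JSON object defined on a single line inside a script tag in the page itself.

You can take two different approaches here.

Use Selenium's execute_script to extract the data. This gives you the JSON object in a ready-to-use form, which you can then index to get currentPriceRange.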

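from selenium import webdriver

url = 'https://rl.insider.gg/en/psn/octane/grey'
# Assumes geckodriver is on PATH
driver = webdriver.Firefox()  # or Chrome
driver.get(url)
# execute_script returns the page's itemData object as a Python dict
itemData = driver.execute_script('return itemData')
print(itemData['currentPriceRange'])
driver.quit()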

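Method 2.1: An alternative to Selenium

Alternatively, you can extract it in Python the traditional way: find the line, use json.loads to convert it into a usable Python object, then index that object to extract currentPriceRange. This gives you the desired output.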

import json
import re

import requests

# Download the page and split the decoded HTML into a list of lines
url = 'https://rl.insider.gg/en/psn/octane/grey'
site = requests.get(url).text.splitlines()

# Extract the line containing 'var itemData'
line = [s for s in site if re.match(r'^\s*var itemData', s)][0].strip()

# Remove the leading 'var itemData = ' and the trailing ';' from that line
# (anchored, so semicolons inside the JSON strings are left alone)
# This leaves valid JSON which can be converted from a string using json.loads
itemData = json.loads(re.sub(r'^var itemData\s*=\s*|;\s*$', '', line))

# Index the data to extract the 'currentPriceRange'
print(itemData['currentPriceRange'])

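This approach needs neither Selenium to run the JavaScript nor BeautifulSoup to parse the HTML. It does rely on itemData being initialized in a particular way. If the site's developers decide to change that, you will have to adjust the code slightly in response.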
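One way to make this a little less sensitive to layout is to let the JSON decoder find the end of the object instead of stripping text around it. The sketch below is only an illustration under assumptions: extract_js_object is a made-up helper, and sample_html is a hypothetical fragment shaped like the page described above, not a copy of the real site. It still assumes the assigned value is strict JSON; a JavaScript object literal with unquoted keys would raise json.JSONDecodeError.

```python
import json
import re


def extract_js_object(html, var_name):
    """Return the JSON value assigned via `var <var_name> = ...` in html, or None."""
    match = re.search(r'var\s+' + re.escape(var_name) + r'\s*=\s*', html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the trailing ';' and the rest of the page).
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


# Hypothetical page fragment for illustration only
sample_html = """
<script>
    var itemData = {"itemName": "Endo", "currentPriceRange": "200 - 300", "note": "a; b"};
</script>
<h1 id="itemSummaryPrice"></h1>
"""

item_data = extract_js_object(sample_html, 'itemData')
print(item_data['currentPriceRange'])  # 200 - 300
```

Against the live page you would pass requests.get(url).text as html. Because the decoder stops at the end of the object, semicolons inside string values and an object spread over several lines are both handled.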

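Which method should I use?

If all you really want is the price range and nothing else, use the first method. If you are also interested in other data, you are better off extracting the full itemData JSON from the source and working with that.

One could argue that Selenium is more reliable than parsing the HTML by hand, but in this case you are probably fine. Either way, you are assuming that some itemData is defined somewhere. If the format changes even slightly, the parsing may break. Another drawback shows up if part of the data depends on a JS function call: Selenium will execute the function, whereas manual parsing cannot evaluate it. (That is not the case here, but it could change.)

【Comments】: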