Extracting script contents with BeautifulSoup (4.9.0): answers

【Question title】: Extract script contents with BeautifulSoup (4.9.0)
【Posted】: 2020-07-22 03:46:52
【Question】:

As of version 4.9.0, BeautifulSoup4 changed[0] how the text property works, and it now ignores the contents of embedded scripts:

= 4.9.0 (20200405)
...
* Embedded CSS and Javascript is now stored in distinct Stylesheet and
  Script tags, which are ignored by methods like get_text() since most
  people don't consider this sort of content to be 'text'. This
  feature is not supported by the html5lib treebuilder. [bug=1868861]

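So it is no longer possible to use soup.find('script').text to extract "wanted text" from the HTML <script>wanted text</script>.

What is now the preferred way to extract it? I would rather not strip <script> and </script> from str(script) by hand.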

[0] - https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/CHANGELOG

【Comments】:

Tags: python html web-scraping beautifulsoup

【Solution 1】:

You can try using the script tag's contents attribute, like this:

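import requests
from bs4 import BeautifulSoup

# Placeholder URL; substitute the page you want to scrape
r = requests.get("https://www.yourwebsite.com")
soup = BeautifulSoup(r.content, "html.parser")

# Print the first child of every <script> tag that has any content
for script in soup.find_all('script'):
    if script.contents:
        print(script.contents[0])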
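
A related option is the tag's .string attribute, which returns the tag's single child string, or None when the tag is empty or has more than one child. The sketch below is only an illustration and is not part of the original answer. The helper name script_texts and the SAMPLE_HTML document are made up for the example, and it parses an inline string so it runs without network access.

```python
from bs4 import BeautifulSoup


def script_texts(html):
    """Return the text of every non-empty inline <script> tag in html.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for script in soup.find_all("script"):
        # .string is None for an empty tag, e.g. <script src="..."></script>
        if script.string is not None:
            texts.append(str(script.string))
    return texts


# Made-up sample document: one inline script and one external script
SAMPLE_HTML = (
    "<html><head>"
    "<script>wanted text</script>"
    '<script src="app.js"></script>'
    "</head><body><p>hi</p></body></html>"
)

print(script_texts(SAMPLE_HTML))
```

Wrapping the result in str() turns the string object that BeautifulSoup returns into a plain Python str, which is handy if the text is going to be passed to something like json.loads afterwards.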

【Comments】: