【Question title】: Extract script contents with BeautifulSoup (4.9.0)
【Posted】: 2020-07-22 03:46:52
【Question description】:

Starting with version 4.9.0, BeautifulSoup4 changed[0] how the text property works, and it now ignores the contents of embedded scripts:

= 4.9.0 (20200405)
...
* Embedded CSS and Javascript is now stored in distinct Stylesheet and
  Script tags, which are ignored by methods like get_text() since most
  people don't consider this sort of content to be 'text'. This
  feature is not supported by the html5lib treebuilder. [bug=1868861]

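So it is no longer possible to use soup.find('script').text to extract wanted text from the HTML <script>wanted text</script>.

What is the preferred way to extract it now? I would rather not manually strip <script></script> from str(script).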

[0] - https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/CHANGELOG

【Question comments】:

    Tags: python html web-scraping beautifulsoup


    【Solution 1】:

    You can try using the script tag's contents, like this:

    import requests
    from bs4 import BeautifulSoup
    
    # Placeholder URL; replace it with the page you want to scrape.
    r = requests.get("https://www.yourwebsite.com")
    soup = BeautifulSoup(r.content, "html.parser")
    
    for script in soup.find_all('script'):
        # Skip empty tags such as <script src="..."></script>.
        if script.contents:
            print(script.contents[0])
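
A related option that may be worth trying is the tag's .string attribute, which returns the single child string of a tag, or None when the tag is empty. The sketch below is only an illustration: the helper name inline_script_texts and the sample HTML are made up for this example and do not come from the question. It runs without a network request.

```python
from bs4 import BeautifulSoup


def inline_script_texts(html):
    """Return the text of every inline <script> in html (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for script in soup.find_all("script"):
        # .string is None for empty tags such as <script src="..."></script>.
        if script.string is not None:
            # Convert the bs4 string object to a plain Python str.
            texts.append(str(script.string))
    return texts


# Made-up sample document: one external script and one inline script.
sample = (
    "<html><head>"
    '<script src="app.js"></script>'
    "<script>wanted text</script>"
    "</head><body><p>visible</p></body></html>"
)

print(inline_script_texts(sample))  # ['wanted text']
```

Like contents[0], this avoids stripping <script></script> from str(script) by hand, and the None check covers external scripts that have no body.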
    

    【Comments】:
