【Posted】: 2020-07-22 03:46:52
【Question】:
Starting with version 4.9.0, BeautifulSoup4 changed[0] how the text property works, and it now ignores the contents of embedded scripts:
= 4.9.0 (20200405)
...
* Embedded CSS and Javascript is now stored in distinct Stylesheet and
Script tags, which are ignored by methods like get_text() since most
people don't consider this sort of content to be 'text'. This
feature is not supported by the html5lib treebuilder. [bug=1868861]
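So it is no longer possible to use soup.find('script').text to extract wanted text from the HTML <script>wanted text</script>.
What is the preferred way to extract it now? I would rather not strip <script> and </script> from str(script) by hand.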
[0] - https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/CHANGELOG
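
For illustration, here is a minimal sketch of one way to read the script body without touching get_text(). It relies on the .string attribute, which returns a tag's single child string regardless of its string subclass. The HTML snippet and the variable names are made up for this example, and html.parser is an assumed choice of parser.

```python
from bs4 import BeautifulSoup

# Hypothetical input, mirroring the snippet from the question.
html = "<html><head><script>wanted text</script></head></html>"
soup = BeautifulSoup(html, "html.parser")

script = soup.find("script")

# .string returns the tag's only child if that child is a string
# (including the Script subclass introduced in 4.9.0), or None when
# the tag is empty or has more than one child.
wanted = script.string if script is not None else None

print(wanted)
```

If the tag may be empty, check for None before using the result; wrapping it in str() gives a plain Python string rather than a bs4 string object.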
【Discussion】:
Tags: python html web-scraping beautifulsoup