【Question title】: Unable to extract the content of script tag using BeautifulSoup
【Posted】: 2020-11-14 14:20:05
【Question description】:

soup.find('script',type='application/ld+json').text returns an empty string. Why can't I extract the text?

>>> soup = BeautifulSoup(page.text,'lxml')

>>> soup.find('script',type='application/ld+json').text
''
>>> soup.find('script',type='application/ld+json')
<script type="application/ld+json">{"@context":"http://schema.org","@type":"Organization","name":"Hamilton Medical Group - Dunkeld","url":"https://www.healthdirect.gov.au/australian-health-services/23000130/hamilton-medical-group-dunkeld/services/dunkeld-3294-sterling","contactPoint":{"@type":"ContactPoint","telephone":"03 5572 2422","email":"","website":"http://www.hamiltonmedicalgroup.net.au","fax":"03 5571 1606"},"address":{"@type":"PostalAddress","streetAddress":"14 Sterling Street","addressLocality":"DUNKELD","addressRegion":"VIC","postalCode":"3294","addressCountry":"AU"}}</script>
>>> json.loads(soup.find('script',type='application/ld+json'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'json' is not defined
>>> import json
>>> json.loads(soup.find('script',type='application/ld+json'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\*******\Python38\lib\json\__init__.py", line 341, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not Tag

【Comments】:

  • What is the URL you are trying to scrape?

Tags: python json beautifulsoup


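【Solution 1】:

Use the .string attribute to get the contents of the <script> tag. Note that json.loads needs a str, not the Tag object returned by find(), which is what caused the TypeError in the question: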

import json
from bs4 import BeautifulSoup


html_text = '''<script type="application/ld+json">{"@context":"http://schema.org","@type":"Organization","name":"Hamilton Medical Group - Dunkeld","url":"https://www.healthdirect.gov.au/australian-health-services/23000130/hamilton-medical-group-dunkeld/services/dunkeld-3294-sterling","contactPoint":{"@type":"ContactPoint","telephone":"03 5572 2422","email":"","website":"http://www.hamiltonmedicalgroup.net.au","fax":"03 5571 1606"},"address":{"@type":"PostalAddress","streetAddress":"14 Sterling Street","addressLocality":"DUNKELD","addressRegion":"VIC","postalCode":"3294","addressCountry":"AU"}}</script>'''

soup = BeautifulSoup(html_text, 'html.parser')
parsed_data = json.loads(soup.find('script',type='application/ld+json').string)

# print parsed data to screen:
print(json.dumps(parsed_data, indent=4))

Prints:

{
    "@context": "http://schema.org",
    "@type": "Organization",
    "name": "Hamilton Medical Group - Dunkeld",
    "url": "https://www.healthdirect.gov.au/australian-health-services/23000130/hamilton-medical-group-dunkeld/services/dunkeld-3294-sterling",
    "contactPoint": {
        "@type": "ContactPoint",
        "telephone": "03 5572 2422",
        "email": "",
        "website": "http://www.hamiltonmedicalgroup.net.au",
        "fax": "03 5571 1606"
    },
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "14 Sterling Street",
        "addressLocality": "DUNKELD",
        "addressRegion": "VIC",
        "postalCode": "3294",
        "addressCountry": "AU"
    }
}
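
A page may carry several JSON-LD blocks, or an empty or malformed one, so the one-liner above can be wrapped defensively. The following is only a minimal sketch under those assumptions: extract_json_ld and sample_html are hypothetical names that do not appear in the original answer. It relies on .string returning None for a tag with no single string child. The empty result from .text in the question appears to depend on the installed Beautiful Soup version, which is another reason to prefer .string here.

```python
import json
from bs4 import BeautifulSoup


def extract_json_ld(html_text):
    """Return a list of parsed JSON-LD objects found in html_text.

    Hypothetical helper for illustration; not part of bs4.
    """
    soup = BeautifulSoup(html_text, "html.parser")
    results = []
    for tag in soup.find_all("script", type="application/ld+json"):
        raw = tag.string  # None when the tag is empty or has several children
        if not raw or not raw.strip():
            continue
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            # Skip blocks that are not valid JSON instead of aborting.
            continue
    return results


# Made-up sample document: one valid block, one empty block, one unrelated script.
sample_html = """
<html><head>
<script type="application/ld+json">{"@type": "Organization", "name": "Hamilton Medical Group - Dunkeld"}</script>
<script type="application/ld+json"></script>
<script type="text/javascript">var x = 1;</script>
</head></html>
"""

blocks = extract_json_ld(sample_html)
print(blocks[0]["name"])
```

Returning a list keeps the caller from having to special-case pages with zero or many blocks. Whether invalid blocks should be skipped silently or reported is a design choice that depends on the scraper.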

【Comments】:
