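【Title】: How to get an element inside <head> with BeautifulSoup
【Posted】: 2022-06-23 00:37:06
【Question】:

I have this website and I want to get the "datePublished" element, but that information sits inside a dictionary embedded in the page. Is there a way to get it with BeautifulSoup?

This is the part of the HTML I'm talking about: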

<script type="application/ld+json">{ "@context": "http://schema.org", "@type": "NewsArticle", "mainEntityOfPage": "http://cdn.ampproject.org/article-metadata.html", "headline": "Startup Ebanx demite 340 e amplia 'crise dos unicórnios'", "datePublished": "2022-06-21T14:31:57-03:00", "dateModified": "2022-06-21T18:44:45-03:00", "description": "Empresa de Curitiba cortou 20% do quadro de funcionários diante de mudanças no cenário macroeconômico", "author": { "@type": "Person", "name": "Guilherme Guerra" }, "image": { "@type": "ImageObject", "url": [ "https://img.estadao.com.br/fotos/crop/1200x1200/resources/jpg/6/4/1655828329546.jpg", "https://img.estadao.com.br/fotos/crop/1200x900/resources/jpg/6/4/1655828329546.jpg", "https://img.estadao.com.br/fotos/crop/1200x675/resources/jpg/6/4/1655828329546.jpg" ]}, "publisher": { "@type": "NewsMediaOrganization", "name": "Estadão", "foundingDate" : "1875-01-05", "ethicsPolicy" : "https://www.estadao.com.br/codigo-etica/codigo-de-etica.pdf", "missionCoveragePrioritiesPolicy" : "https://www.estadao.com.br/codigo-etica/codigo-de-etica.pdf", "diversityPolicy" : "https://www.estadao.com.br/codigo-etica/codigo-de-etica.pdf", "correctionsPolicy" : "https://www.estadao.com.br/codigo-etica/codigo-de-etica.pdf", "verificationFactCheckingPolicy" : "https://www.estadao.com.br/codigo-etica/codigo-de-etica.pdf", "unnamedSourcesPolicy" : "https://www.estadao.com.br/codigo-etica/codigo-de-etica.pdf", "sameAs":["https://twitter.com/estadao","https://www.facebook.com/estadao/","https://www.instagram.com/estadao/","https://www.youtube.com/channel/UCrtOL8bJsh-csozGS2aV77Q", "https://plus.google.com/+Estad%C3%A3o"], "logo": { "@type": "ImageObject", "url": "https://statics.estadao.com.br/s2016/portal/logos/logo-estadao-272x59.png", "width": 272, "height": 59 } }, 
"isAccessibleForFree":"False","hasPart":{"@type":"WebPageElement","isAccessibleForFree":"False","cssSelector":".pw-container"},"isPartOf":{"@type":["CreativeWork","Product"],"name":"Estad\u00e3o","productID":"estadao.com.br:dig_basic"}}</script>

This code works and retrieves that script element:

import requests
from bs4 import BeautifulSoup

link = "https://link.estadao.com.br/noticias/inovacao,startup-ebanx-demite-340-e-amplia-crise-dos-unicornios,70004097585"
soup = BeautifulSoup(requests.get(link).text, 'html.parser')
url = soup.find("script", type="application/ld+json")
print(url)  # show the matched <script> tag (a bare `url` only echoes in a REPL or notebook)


    Tags: python html beautifulsoup


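    【Solution 1】:

    The contents of that script tag are plain JSON, so take the tag's text and parse it with the json module. Try this: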

    import json
    
    import requests
    from bs4 import BeautifulSoup
    
    link = "https://link.estadao.com.br/noticias/inovacao,startup-ebanx-demite-340-e-amplia-crise-dos-unicornios,70004097585"
    # Grab the raw text of the first JSON-LD <script> block on the page.
    json_ld = (
        BeautifulSoup(requests.get(link).text, "html.parser")
        .find("script", type="application/ld+json")
        .string
    )
    # The block is plain JSON, so parse it and read the key directly.
    print(json.loads(json_ld)["datePublished"])
    

    Output:

    2022-06-21T14:31:57-03:00
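
    The snippet above assumes the first JSON-LD block on the page is the one holding the key. A page can carry several such blocks, or none, so a slightly more defensive variant may be useful. The following is only a sketch: the helper name `find_json_ld_value` and the inline `sample_html` are made up for illustration and are not part of the original answer or of the site in question.

```python
import json

from bs4 import BeautifulSoup


def find_json_ld_value(html, key):
    """Return the first value stored under `key` in any JSON-LD block, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script", type="application/ld+json"):
        raw = tag.string
        if not raw:
            continue  # empty <script> tag, nothing to parse
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        # A JSON-LD block may hold one object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and key in item:
                return item[key]
    return None


# Hypothetical page with two JSON-LD blocks; only the second has the key.
sample_html = """
<html><head>
<script type="application/ld+json">{"@type": "WebSite", "name": "Example"}</script>
<script type="application/ld+json">{"@type": "NewsArticle", "datePublished": "2022-06-21T14:31:57-03:00"}</script>
</head><body></body></html>
"""

print(find_json_ld_value(sample_html, "datePublished"))
# prints 2022-06-21T14:31:57-03:00
```

    For a live page, pass `requests.get(link).text` as `html`. The function returns None rather than raising when the tag or the key is missing, which avoids the AttributeError that `.find(...).string` would raise if no matching tag exists.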
    

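    The value printed above is an ISO 8601 string with a UTC offset. If a date object is needed rather than text, the standard library can probably handle it directly. A minimal sketch, assuming Python 3.7 or later; the variable names are illustrative:

```python
from datetime import datetime, timezone

# The string returned for "datePublished" in the output above.
published = datetime.fromisoformat("2022-06-21T14:31:57-03:00")

# The result is timezone-aware, so it can be converted to UTC safely.
published_utc = published.astimezone(timezone.utc)

print(published_utc.isoformat())
# prints 2022-06-21T17:31:57+00:00
```

    Keeping the offset-aware value makes it possible to compare or sort articles published in different time zones.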