【Question title】: Trying to extract meta data from news article
【Posted】: 2016-12-18 19:51:29
【Question】:

I am trying to extract the meta tags from a CNN article:

import httplib2
from bs4 import BeautifulSoup

http = httplib2.Http()
status, response = http.request("http://www.cnn.com/2016/08/09/health/chagas-sleeping-sickness-leishmaniasis-drug/index.html")
soup = BeautifulSoup(response)
print(soup.select('body > div.pg-right-rail-tall.pg-wrapper.pg__background__image > article > meta'))

I am trying to narrow the result down to just this output:

<meta content="health" itemprop="articleSection"><meta content="2016-08-09T12:10:24Z" itemprop="dateCreated"><meta content="2016-08-09T12:10:24Z" itemprop="datePublished"><meta content="2016-08-09T12:10:24Z" itemprop="dateModified"><meta content="http://www.cnn.com/2016/08/09/health/chagas-sleeping-sickness-leishmaniasis-drug/index.html" itemprop="url"><meta content="Meera Senthilingam, for CNN" itemprop="author"><meta content="Could one discovery take on three deadly parasites?  - CNN.com" itemprop="headline"><meta content="Three seemingly different diseases infect 20 million people each year: Chagas disease, leishmaniasis and African sleeping sickness. But one drug could be developed to fight all three." itemprop="description"><meta content="sleeping sickness, disease, drug, drug development, chagas disease, leishmaniasis, Novartis, health, Could one discovery take on three deadly parasites?  - CNN.com" itemprop="keywords"><meta content="http://i2.cdn.turner.com/cnnnext/dam/assets/150812101743-chagas-bug-large-tease.jpg" itemprop="image"><meta content="http://i2.cdn.turner.com/cnnnext/dam/assets/150812101743-chagas-bug-large-tease.jpg" itemprop="thumbnailUrl"><meta content="Could one discovery take on three deadly parasites? " itemprop="alternativeHeadline">

But for some reason, the BeautifulSoup.select() method returns about 100 times more HTML than I want. I would greatly appreciate any advice on how to fix this.

【Comments】:

    Tags: python beautifulsoup


    【Solution 1】:

    The problem is how the parser handles this page's HTML. Parsing with lxml or html5lib gives you what you want.

    soup = BeautifulSoup(response, "lxml")
    

    Or:

    soup = BeautifulSoup(response, "html5lib")
    

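    If you do not have lxml or html5lib installed, you can install html5lib with pip. Installing lxml is a little more involved depending on your OS, because it has some dependencies, but it is definitely worth installing.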
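Which parser is installed varies from machine to machine, so one hedged way to apply this advice is to try the parsers in order of preference and fall back to the built-in one. The helper name `make_soup` and the preference order are assumptions for illustration and are not part of the original answer. `FeatureNotFound` is the exception Beautiful Soup raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper: returns the soup together with the name of the
    parser that was actually used, so the caller can log or check it.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next candidate.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, used = make_soup("<p>hi</p>")
print("parsed with:", used)
```

Because `html.parser` ships with Python, the loop should always find a parser, but on a page like the one in the question the result may differ between parsers, so it is worth checking which one was used.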

    You can also simplify your selector:

    soup.select('div.pg-right-rail-tall.pg-wrapper.pg__background__image meta')
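As a follow-up sketch, the selected tags can be collapsed into a dictionary keyed by `itemprop`. The sample markup below is a trimmed, hypothetical stand-in for the CNN page that reuses three of the meta tags from the question, and `meta_to_dict` is an illustrative helper name. The sample is small and well-formed, so the built-in `html.parser` is used here to keep the snippet free of extra dependencies. For the real page, use `lxml` or `html5lib` as recommended in this answer.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed sample of the page structure from the question.
SAMPLE_HTML = """
<body>
  <div class="pg-right-rail-tall pg-wrapper pg__background__image">
    <article>
      <meta content="health" itemprop="articleSection">
      <meta content="2016-08-09T12:10:24Z" itemprop="datePublished">
      <meta content="Meera Senthilingam, for CNN" itemprop="author">
      <p>Article body goes here.</p>
    </article>
  </div>
</body>
"""

SELECTOR = "div.pg-right-rail-tall.pg-wrapper.pg__background__image meta"


def meta_to_dict(soup, selector=SELECTOR):
    """Map each matched meta tag's itemprop to its content value."""
    result = {}
    for tag in soup.select(selector):
        key = tag.get("itemprop")
        value = tag.get("content")
        # Skip meta tags that lack either attribute.
        if key is not None and value is not None:
            result[key] = value
    return result


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
meta = meta_to_dict(soup)
for name, content in meta.items():
    print(name, "=", content)
```

A dictionary is convenient when only a few fields such as `author` or `datePublished` are needed. If the same `itemprop` can appear more than once, a list of pairs would preserve the duplicates.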
    
