【Question title】: I would like to exclude bold paragraphs from the website
【Posted】: 2019-04-11 20:46:29
【Question description】:

I use the following code to scrape a website:


import requests
from bs4 import BeautifulSoup
resp = requests.get('https://www.ecb.europa.eu/press/pressconf/2018/html/ecb.is180913.en.html')
soup = BeautifulSoup(resp.content, 'html5lib')
article = soup.find('article')
paragraphs = article.find_all('p')

The output looks like this:

[<p>Based on our regular economic and monetary analyses, we decided to keep the <strong>key ECB interest rates</strong> unchanged. .... to levels that are below, but close to, 2% over the medium term.</p>,
<p><strong>Has QE been used well by the various euro area countries?</strong></p>,
 <p>By and large, yes, it's been used well in the sense that the intended effects of the QE – mind, ... It reduced dispersion in growth rates everywhere. An employment situation which is by and large improving almost everywhere, some countries more than others. </p>,
 <p>If your question is meant to say; shouldn't governments have taken advantage of the situation of such low rates to decrease budget deficits, to restore? ... is a good situation for doing that.</p>,
 <p><strong>My second question is on reinvestment. ...Have you today explicitly asked the committees to come up with proposals on reinvestments?</strong></p>,
 <p>About inflation: I said inflation is going to hover around the present level for the rest of the year and then I gave numbers for next year and 2020. ...will reach our objective over the medium term. </p>,]

I would like to exclude the bold paragraphs, that is, those that contain <p><strong> and have more than 15 words. The desired output should be:

[<p>Based on our regular economic and monetary analyses, we decided to keep the <strong>key ECB interest rates</strong> unchanged. .... to levels that are below, but close to, 2% over the medium term.</p>,
 <p>By and large, yes, it's been used well in the sense that the intended effects of the QE – mind, ... It reduced dispersion in growth rates everywhere. An employment situation which is by and large improving almost everywhere, some countries more than others. </p>,
 <p>If your question is meant to say; shouldn't governments have taken advantage of the situation of such low rates to decrease budget deficits, to restore? ... is a good situation for doing that.</p>,
 <p>About inflation: I said inflation is going to hover around the present level for the rest of the year and then I gave numbers for next year and 2020. ...will reach our objective over the medium term. </p>,]

I tried to write the code myself but could not get the desired output. I would appreciate it if you could help me.

【Question comments】:

Tags: python web-scraping beautifulsoup


【Solution 1】:

Try the extract() function:

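article = soup.find('article')

# extract() detaches a tag from the tree. Call it on each bold <p>, not on
# article.strong, which is only the first <strong> (inside a paragraph to keep).
for p in article.find_all('p'):
    if str(p).startswith('<p><strong>'):
        p.extract()
paragraphs_without_bold = article.find_all('p')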

See also this
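
As a rough, self-contained sketch of the extract() approach: the markup below is a hypothetical, shortened stand-in for the ECB page, `is_bold_paragraph` is a helper name made up for this illustration, and the built-in "html.parser" is used instead of html5lib only to avoid an extra dependency. It treats a paragraph as bold when every non-whitespace child is a <strong> tag, and ignores the 15-word condition because the desired output in the question also drops a shorter bold paragraph.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup modeled on the output shown in the question.
html = """
<article>
<p>We decided to keep the <strong>key ECB interest rates</strong> unchanged.</p>
<p><strong>Has QE been used well by the various euro area countries?</strong></p>
<p>By and large, yes, it's been used well.</p>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find("article")


def is_bold_paragraph(p):
    """Return True if every non-whitespace child of <p> is a <strong> tag."""
    children = [c for c in p.contents if str(c).strip()]
    return bool(children) and all(
        getattr(c, "name", None) == "strong" for c in children
    )


# find_all() returns a list, so removing tags while looping over it is safe.
for p in article.find_all("p"):
    if is_bold_paragraph(p):
        p.extract()  # detach the whole paragraph from the tree

paragraphs_without_bold = article.find_all("p")
for p in paragraphs_without_bold:
    print(p)
```

The first paragraph survives even though it contains a <strong> tag, because the bold text is only part of it. If the removed tags are not needed afterwards, decompose() could be used in place of extract().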

【Comments】:

  • Thanks for the recommended link, @petezurich
【Solution 2】:

Use str() to convert each bs4 object to a string such as <p><strong>......</strong></p>

# .... (setup as in the question)
paragraphs = article.find_all('p')

for p in paragraphs:
    # Keep only paragraphs that do not open with a <strong> tag
    if '<p><strong>' not in str(p):
        print(str(p))
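
The string test above relies on the serialized tag containing exactly <p><strong>, so it could miss a paragraph that carries an attribute or has whitespace before the bold text. A possible variant, sketched under assumptions, compares text instead: a paragraph counts as bold when all of its text sits inside its first <strong>. The markup and the helper name `is_fully_bold` are hypothetical and not taken from the ECB page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <p> has an attribute and leading whitespace,
# so the literal string '<p><strong>' does not occur in it.
html = """
<p>Rates stay <strong>unchanged</strong> for now.</p>
<p class="q"> <strong>My second question is on reinvestment.</strong></p>
<p>About inflation: it will hover around the present level.</p>
"""
soup = BeautifulSoup(html, "html.parser")


def is_fully_bold(p):
    """Return True if the paragraph's whole text sits inside its first <strong>."""
    strong = p.find("strong")
    if strong is None:
        return False
    return strong.get_text(strip=True) == p.get_text(strip=True)


# Build a filtered list and leave the parsed tree itself untouched.
kept = [p for p in soup.find_all("p") if not is_fully_bold(p)]
for p in kept:
    print(p)
```

Unlike extract(), this leaves the document unchanged and only filters the list, which may be preferable when the bold question paragraphs are still needed elsewhere.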

【Comments】:

  • You saved me, @ewwink. Thank you very much for your time.