Beautiful Soup article scraping

【Question title】: beautiful soup article scraping
【Posted】: 2014-03-14 17:29:45
【Question】:

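I am trying to get all of the p tags in the body of an article. Could someone explain why my code is wrong and how I can improve it? The article URL and the relevant code are below. Thanks for any insight you can offer.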

URL: http://www.france24.com/en/20140310-libya-seize-north-korea-crude-oil-tanker-rebels-port-rebels/

import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")

soup = BeautifulSoup(urllib2.urlopen(url).read())

# retrieve all of the paragraph tags
body = soup.find("div", {'class':'bd'}).get_text()
for tag in body:
    p = soup.find_all('p')
    print str(p) + '\n' + '\n'

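【Question comments】:

Tags: python-2.7 beautifulsoup

【Answer 1】:

The problem is that the page contains more than one div tag with class="bd", and find() returns only the first of them. It looks like you need the one that holds the actual article text - it is inside the article tag: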

import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")

soup = BeautifulSoup(urllib2.urlopen(url))

# retrieve all of the paragraph tags
paragraphs = soup.find('article').find("div", {'class': 'bd'}).find_all('p')
for paragraph in paragraphs:
    print paragraph.text

Prints:

Libyan government forces on Monday seized a North Korea-flagged tanker after...
...
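
To see the effect without fetching the page, here is a minimal sketch on made-up markup. The HTML below is an assumption for illustration, not the real france24.com page, and the variable names are mine. It shows that find() picks the first div with class "bd", that searching inside article first picks the intended one, and that looping over the result of get_text() (as the code in the question does) walks over single characters rather than tags. It is written for Python 3 with an explicit "html.parser" argument.

```python
from bs4 import BeautifulSoup

# Made-up markup (an assumption): two div.bd blocks, only the second
# one sits inside <article>.
HTML = """
<html><body>
  <div class="bd"><p>Sidebar teaser</p></div>
  <article>
    <div class="bd">
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </article>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find() returns only the first match in document order: the sidebar div.
first_bd = soup.find("div", {"class": "bd"})
first_bd_text = first_bd.get_text(strip=True)

# Scoping the search to <article> first selects the article body instead.
article_bd = soup.find("article").find("div", {"class": "bd"})
paragraphs = [p.get_text() for p in article_bd.find_all("p")]

# get_text() returns a string, so iterating over it yields characters.
first_items = list(first_bd_text)[:3]

print(first_bd_text)
print(paragraphs)
print(first_items)
```

In the question's loop, soup.find_all('p') is also called on the whole document on every iteration, so each pass prints every p tag on the page rather than the ones in the article.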

Hope that helps.

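【Comments】:

+1: I almost posted my own variant, but you got there first. In my case, though, I had to add encode("utf-8") to my print line. The only difference is that I used requests instead of urllib2.
@Nanashi Thanks. I usually prefer requests too, but the OP used urllib2, so I decided to keep the code close to what the OP provided.
Thanks, that works great! Out of curiosity, why do you prefer requests? I'm new to Python, so I want to learn everything I can.
lol, I made the change. It also stopped the code from breaking on quotation marks.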
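
For readers on Python 3, the requests variant mentioned in the comments might look roughly like the sketch below. This is not the code either commenter used: the function names article_paragraphs and fetch_article_paragraphs are hypothetical, the 10-second timeout is an arbitrary choice, and it assumes the page still has the article > div.bd layout described in the answer. The parsing step is kept separate from the download so it can be tried on any HTML string.

```python
from bs4 import BeautifulSoup


def article_paragraphs(html):
    """Return the text of each <p> inside <article> -> div.bd (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article")
    if article is None:
        return []
    body = article.find("div", class_="bd")
    if body is None:
        return []
    return [p.get_text(strip=True) for p in body.find_all("p")]


def fetch_article_paragraphs(url):
    """Download a page with requests and extract its article paragraphs."""
    import requests  # third-party; imported here so parsing works without it

    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()
    return article_paragraphs(response.text)


if __name__ == "__main__":
    sample = '<article><div class="bd"><p> Caf\u00e9 news </p></div></article>'
    for text in article_paragraphs(sample):
        print(text)
```

On Python 3, print accepts Unicode strings directly, so the encode("utf-8") workaround from the comments is usually unnecessary, although a console with a limited encoding can still raise UnicodeEncodeError.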