Crawl a news website and get the news content: answers

【Question title】: Crawl a news website and getting the news content
【Posted】: 2016-10-10 15:40:45
【Question description】:

I am trying to download the text from a news website. The HTML is:

<div class="pane-content">
<div class="field field-type-text field-field-noticia-bajada">
<div class="field-items">
        <div class="field-item odd">
                 <p>"My Text" target="_blank">www.injuv.cl</a></strong></p>         </div>

The output should be: My Text

I am using the following Python code:

try: 
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
html = "My URL"
parsed_html = BeautifulSoup(html)
p = parsed_html.find("div", attrs={'class':'pane-content'})
print(p)

But the output of the code is "None". Do you know what is wrong with my code?

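【Question comments】:

Even if you were parsing the HTML rather than the URL, the HTML is invalid. You cannot parse it with BeautifulSoup.
@tobltobs BeautifulSoup tries to repair broken HTML; it can parse that HTML just fine.

Tags: python beautifulsoup html-parser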

【Solution 1】:

The problem is that you are not parsing HTML; you are parsing the URL string:

html = "My URL"
parsed_html = BeautifulSoup(html)
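
To see why this prints None, here is a small sketch, assuming bs4 is installed. The address is a hypothetical stand-in for "My URL", and html.parser is just one possible parser choice:

```python
import warnings

from bs4 import BeautifulSoup

# Hypothetical address standing in for the asker's "My URL".
url = "http://www.example.com/news/article"

with warnings.catch_warnings():
    # bs4 may warn that the markup looks like a URL rather than HTML.
    warnings.simplefilter("ignore")
    soup = BeautifulSoup(url, "html.parser")

# The "document" is only the URL text, so there is no <div> to find.
result = soup.find("div", attrs={"class": "pane-content"})
print(result)
```

The parser treats the URL as plain text, so the tree holds no tags and find() has nothing to match.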

Instead, you need to fetch/retrieve/download the page source first, for example in Python 2:

from urllib2 import urlopen

html = urlopen("My URL")
parsed_html = BeautifulSoup(html)

In Python 3, it would be:

from urllib.request import urlopen

html = urlopen("My URL")
parsed_html = BeautifulSoup(html)
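
Reading the response returned by urlopen yields bytes. As a minimal sketch, assuming Python 3, the download step could be wrapped in a helper that also decodes the body. The name fetch_html, its timeout default, and the UTF-8 fallback are illustrative assumptions, not part of the original answer:

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10.0):
    """Download a page and return its body as text.

    Hypothetical helper: the name, the timeout default, and the
    UTF-8 fallback are assumptions made for illustration.
    """
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes, not str
        # Prefer the charset declared in the Content-Type header.
        charset = response.headers.get_content_charset() or "utf-8"
    # Replace undecodable bytes instead of raising an exception.
    return raw.decode(charset, errors="replace")
```

The string returned by fetch_html("My URL") could then be passed to BeautifulSoup in place of the raw response object.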

Alternatively, you could use the third-party "for humans" style requests library:

import requests

html = requests.get("My URL").content
parsed_html = BeautifulSoup(html)

Also note that you should not be using BeautifulSoup version 3 at all; it is no longer maintained. Replace:

try: 
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup

with just:

from bs4 import BeautifulSoup

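【Comments】:

【Solution 2】:

BeautifulSoup accepts a string of HTML. You need to use the URL to retrieve the HTML from the page.

Look at urllib for making HTTP requests. (Or requests for a simpler approach.) Retrieve the HTML and pass that to BeautifulSoup, like this: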

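from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen
from bs4 import BeautifulSoup

# Get the HTML
conn = urlopen("http://www.example.com")
html = conn.read()

# Give BeautifulSoup the HTML, naming the parser explicitly:
soup = BeautifulSoup(html, "html.parser")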

From here, parse it the same way you attempted before.

p = soup.find("div", attrs={'class':'pane-content'})
print(p)
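
Printing p shows the whole div element, while the question asks for only the text. A hedged sketch of that last step follows. It assumes the real page resembles the fragment in the question, and the inline HTML is a hypothetical, well-formed stand-in for it rather than the actual page:

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for the fragment shown in the question.
html = """
<div class="pane-content">
  <div class="field field-type-text field-field-noticia-bajada">
    <div class="field-items">
      <div class="field-item odd">
        <p>My Text</p>
      </div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
pane = soup.find("div", attrs={"class": "pane-content"})

if pane is None:
    # find() returns None when nothing matches, so check before using it.
    text = ""
else:
    # get_text(strip=True) drops the tags and the surrounding whitespace.
    text = pane.get_text(strip=True)

print(text)
```

On this sample input the script prints My Text. On the real site the class names and nesting may differ, so the selector would need adjusting.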

【Comments】: