【Question title】:Normalizing HTML content when requesting URL with Requests
【Posted】:2017-08-18 14:57:49
【Question】:

I'm using Requests in Python 3.6 to fetch HTML content, with the following code:

import requests
url = 'https://www.nytimes.com/2017/03/17/world/europe/trump-britain-obama-wiretap-gchq.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.content)

However, the output looks odd and is full of "\n" character sequences:

 b'<!DOCTYPE html>\n<!--[if (gt IE 9)|!(IE)]> <!--> <html lang="en" class="no-js section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" itemid="https://www.nytimes.com/2017/03/17/world/europe/trump-britain-obama-wiretap-gchq.html" itemtype="http://schema.org/NewsArticle"  itemscope xmlns:og="http://opengraphprotocol.org/schema/"> <!--<![endif]-->\n<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->\n<!--[if IE 8]> <html lang="en" class="no-js ie8 lt-ie10 lt-ie9 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> 
 <![endif]-->\n<!--[if (lt IE 8)]> <html lang="en" class="no-js lt-ie10 lt-ie9 lt-ie8 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]
 -->\n<head>\n ..."

How can I fix this so that I get complete, normally formatted HTML output?

【Comments】:

    Tags: python html python-3.x python-requests


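    【Answer 1】:

    Use response.text instead of response.content. response.content is a bytes object, and printing bytes shows its repr, in which every newline appears as the two characters "\n". As described in the Requests documentation quoted below, response.text decodes the response body into a Unicode string using the encoding information supplied by the HTTP response: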

    content

    Content of the response, in bytes.

    text

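    Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using chardet.

    The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.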
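
The last paragraph of that quote matters when the header-derived encoding is wrong. The sketch below is not Requests code: `decode_body` is a hypothetical, dependency-free helper that only approximates what reading `response.text` does, and the sample body is made up for illustration. It shows how the same bytes decode differently depending on the encoding chosen.

```python
from typing import Optional


def decode_body(raw: bytes, encoding: Optional[str], fallback: str = "utf-8") -> str:
    """Hypothetical helper: decode a response body roughly the way response.text does."""
    # Use the declared encoding when there is one. Requests would run a charset
    # detector when the encoding is None; a fixed fallback is an assumption made
    # here to keep the sketch free of third-party dependencies.
    chosen = encoding or fallback
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    return raw.decode(chosen, errors="replace")


# Made-up sample body, encoded as UTF-8.
body = "<title>café</title>".encode("utf-8")

wrong = decode_body(body, "ISO-8859-1")  # encoding taken from a misleading header
right = decode_body(body, "utf-8")       # encoding corrected by the caller

print(wrong)
print(right)
```

With Requests itself, the corresponding fix is to assign to `response.encoding` (for example `response.encoding = 'utf-8'`) before reading `response.text`.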

    Example:

    import requests
    url = 'https://www.nytimes.com/2017/03/17/world/europe/trump-britain-obama-wiretap-gchq.html'
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(url, headers=headers)
    print(response.text)
    

    Output:

    <!DOCTYPE html>
    <!--[if (gt IE 9)|!(IE)]> <!--> <html lang="en" class="no-js section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" itemid="https://www.nytimes.com/2017/03/17/world/europe/trump-britain-obama-wiretap-gchq.html" itemtype="http://schema.org/NewsArticle"  itemscope xmlns:og="http://opengraphprotocol.org/schema/"> <!--<![endif]-->
    <!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
    <!--[if IE 8]> <html lang="en" class="no-js ie8 lt-ie10 lt-ie9 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
    <!--[if (lt IE 8)]> <html lang="en" class="no-js lt-ie10 lt-ie9 lt-ie8 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
    <head>
        <title>Trump Offers No Apology for Claim on British Spying - The New York Times</title>
          <!-- etc ... -->
    </body>
    </html>
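
To see why the two outputs differ without making a network request, here is a minimal sketch that uses a short, made-up bytes literal as a stand-in for `response.content`. The variable names are illustrative only.

```python
# Stand-in for response.content: the raw bytes of a tiny HTML document.
raw = b"<!DOCTYPE html>\n<html>\n<head>\n</head>\n</html>"

# print(raw) displays the repr of the bytes object, so each newline byte
# shows up as a backslash followed by the letter n.
shown_as_bytes = repr(raw)

# Decoding produces a str whose newlines are real line breaks; this is
# the kind of value response.text returns.
decoded = raw.decode("utf-8")

print(shown_as_bytes)
print(decoded)
```

The first print emits a single line starting with `b'`, while the second prints the markup across five lines. The data is the same in both cases; only its representation changes.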
    
