【Question Title】: Having trouble maintaining order of Session headers when making a request
【Posted】: 2020-01-17 22:19:30
【Question Description】:

A forum user suggested that, to avoid detection, I need to send my headers in the same order as my browser does. I looked at the suggestions here:

Python HTTP request with controlled ordering of HTTP headers

However, despite trying those suggestions, the order keeps changing. I can't figure out what I'm doing wrong (note that the cookie ends up at the end):

import requests
import webbrowser
from bs4 import BeautifulSoup
import re
from collections import OrderedDict


BASE_URL = 'https://www.bloomberg.com/'
HEADERS = OrderedDict({
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': '',
    'Host': 'www.bloomberg.com',
    'Origin': 'https://www.bloomberg.com',
    'Referer': 'https://www.bloomberg.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:64.0) Gecko/20100101 Firefox/64.0',
})




def latest_news():
    session = requests.Session()
    session.headers = HEADERS
    # session.headers['User-Agent'] = HEADERS['User-Agent']
    # session.headers['Referer'] = HEADERS['Referer']
    # session.headers['Origin'] = HEADERS['Origin']
    # session.headers['Host'] = HEADERS['Host']

    page = session.get(BASE_URL, allow_redirects=True)
    print(page.url)
    print(page.request.headers)
    print(page.history)
    page.raise_for_status()
    soup = BeautifulSoup(page.content, 'html.parser')
    print(soup)

if __name__ == "__main__":

    latest_news()

Output:

 https://www.bloomberg.com/tosv2.html?vid=&uuid=e5737f50-3975-11ea-b7bd-97b9265w12w5&url=Lw==


#Request Headers      

{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
 'Accept-Encoding': 'gzip, deflate, br',
 'Cache-Control': 'max-age=0',
 'Connection': 'keep-alive',
 'Host': 'www.bloomberg.com',
 'Origin': 'https://www.bloomberg.com',
 'Referer': 'https://www.bloomberg.com/',
 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:64.0) Gecko/20100101 Firefox/64.0',
 'Cookie': '_pxhd=4c7cs06d7c42as40601e7d338a1084ca96e4ee91dfa42bd2368e86fec4e66bcd1:e573a66d0-397x5-11ea-b7bd-97b9265412f5'}


[<Response [307]>]

<h1 class="logo">Bloomberg</h1>

【Question Discussion】:

    Tags: python web-scraping python-requests


    【Solution 1】:

    This is a general answer I wrote because I ran into a similar problem. Your issue is likely that the web server is asking you to include those cookies in your subsequent requests. You set the Cookie header to '', so it gets discarded, and the new cookie the server sets is appended to the end of your headers.
    What happens if we just use get()?

    import requests
    import logging
    import http.client as http_client
    http_client.HTTPConnection.debuglevel = 1
    
    #init logging
    logging.basicConfig()
    logging.getLogger().setLevel(logging.DEBUG)
    requests_log = logging.getLogger("requests.packages.urllib3")
    requests_log.setLevel(logging.DEBUG)
    requests_log.propagate = True
    
    requests.get("http://google.com", allow_redirects=False)
    

    Here I've enabled logging so you can see the requests being made (the logging code is omitted from the later examples). This produces the output:

    DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): google.com:80
    send: b'GET / HTTP/1.1\r\nHost: google.com\r\nUser-Agent: python-requests/2.21.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
    ...
    
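    Those extra headers come from the library's own defaults, which you can inspect directly without any network call. A small check (using the public `requests.utils.default_headers()` helper; the exact values depend on your requests version):

```python
import requests

# The defaults every new Session starts with; anything you pass to
# get() is merged on top of these rather than replacing them.
defaults = requests.utils.default_headers()
print(list(defaults))
print(defaults["User-Agent"])  # e.g. 'python-requests/2.21.0'
```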

    As you can see, requests adds some headers on its own, even though we didn't ask it to. Now, what happens if we pass it some headers in an order we want?

    import requests

    headers = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0",
    }
    requests.get("http://google.com", headers=headers, allow_redirects=False)
    

    Here we want "user-agent" to appear at the end of our request, but the output shows otherwise:

     DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): google.com:80
    send: b'GET / HTTP/1.1\r\nHost: google.com\r\nuser-agent: Mozilla/5.0\r\naccept-encoding: gzip, deflate, br\r\naccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3\r\nConnection: keep-alive\r\naccept-language: en-US,en;q=0.9\r\nupgrade-insecure-requests: 1\r\n\r\n'
    ...
    

    "user-agent" showed up in the middle! What gives? Let's look at some of the library's source code.


    def __init__(self):
        #: A case-insensitive dictionary of headers to be sent on each
        #: :class:`Request <Request>` sent from this
        #: :class:`Session <Session>`.
        self.headers = default_headers()
        ...
    

    When we create a Session, the first thing it does is assign itself the default headers, and any further headers supplied "indirectly" by the user (through a function call) are merged into those defaults.
    This is the problem: when you merge two dictionaries (even OrderedDicts), the result keeps the key ordering of the original dictionary. We can see this in the example above, where the "user-agent" header kept the position it holds in the defaults dictionary.
    If you're interested, here is the code that performs the merge:

    def merge_setting(request_setting, session_setting, dict_class=OrderedDict):
        """Determines appropriate setting for a given request, taking into account
        the explicit setting on that request, and the setting in the session. If a
        setting is a dictionary, they will be merged together using `dict_class`
        """
    
        if session_setting is None:
            return request_setting
    
        if request_setting is None:
            return session_setting
    
        # Bypass if not a dictionary (e.g. verify)
        if not (
                isinstance(session_setting, Mapping) and
                isinstance(request_setting, Mapping)
        ):
            return request_setting
    
        merged_setting = dict_class(to_key_val_list(session_setting))
        merged_setting.update(to_key_val_list(request_setting))
    
        # Remove keys that are set to None. Extract keys first to avoid altering
        # the dictionary during iteration.
        none_keys = [k for (k, v) in merged_setting.items() if v is None]
        for key in none_keys:
            del merged_setting[key]
    
        return merged_setting
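    The ordering effect of `merged_setting.update(...)` above can be reproduced with plain OrderedDicts; the header values below are stand-ins for illustration:

```python
from collections import OrderedDict

# Stand-ins for the session's default headers and the headers passed to get().
session_setting = OrderedDict([("User-Agent", "python-requests/2.21.0"),
                               ("Accept", "*/*")])
request_setting = OrderedDict([("Accept", "text/html"),
                               ("User-Agent", "Mozilla/5.0")])

# Same merge as merge_setting(): copy the session dict, then update.
merged = OrderedDict(session_setting)
merged.update(request_setting)

# update() replaces the values but keeps each existing key's original
# position, so the session's ordering wins.
print(list(merged))          # ['User-Agent', 'Accept']
print(merged["User-Agent"])  # 'Mozilla/5.0'
```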
    


    So what's the fix?

    You have to overwrite the default headers completely. The way I can think of is to use a Session and then replace its headers dictionary directly:

    session = requests.Session()
    headers = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9",
        "cookie": "Cookie: Something",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0",
    }
    # session.cookies.add_cookie_header(session)
    session.headers = headers
    a = session.get("https://google.com/", allow_redirects=False)
    

    This produces the desired output, with no need for an OrderedDict:

    DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): google.com:443
    send: b'GET / HTTP/1.1\r\nHost: google.com\r\naccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3\r\naccept-encoding: gzip, deflate, br\r\naccept-language: en-US,en;q=0.9\r\ncookie: Cookie: Something\r\nupgrade-insecure-requests: 1\r\nuser-agent: Mozilla/5.0\r\n\r\n'
    ...
    

    The example above proves that everything stays where it should, and even if you inspect response.request.headers everything should be in order (at least it is for me).
    P.S.: I didn't bother checking whether using an OrderedDict would make a difference, but if you still run into issues, try one.
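    An alternative to assigning a plain dict (which works, but discards the CaseInsensitiveDict wrapper requests normally uses for headers) is to clear the session's default headers and update them in place. A sketch of that approach; the header values here are placeholders:

```python
import requests

session = requests.Session()
session.headers.clear()       # drop the library's default headers entirely
session.headers.update({      # insertion order of this dict is preserved
    "Host": "www.example.com",
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html",
})
print(list(session.headers))  # ['Host', 'User-Agent', 'Accept']
```

    This keeps header lookups case-insensitive while still controlling the order in which the remaining headers were inserted.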

    【Discussion】:

    • Thanks for your answer. I'll get a chance to work through it thoroughly in a few days and will get back to you.
    • OK, I think setting the cookie to empty was indeed the culprit. Thank you very much for your answer. I learned a few things (like debugging with requests).