【Question Title】: Having trouble maintaining order of Session headers when making a request
【Posted】: 2020-01-17 22:19:30
【Question Description】:

A forum user suggested that, to avoid detection, I need to send my headers in the same order as my browser does. I looked at the suggestions here:

Python HTTP request with controlled ordering of HTTP headers

However, despite trying those suggestions, the order keeps changing. I can't figure out what I'm doing wrong (note that the cookie ends up at the end):

import requests
import webbrowser
from bs4 import BeautifulSoup
import re
from collections import OrderedDict


BASE_URL = 'https://www.bloomberg.com/'
HEADERS = OrderedDict({
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': '',
    'Host': 'www.bloomberg.com',
    'Origin': 'https://www.bloomberg.com',
    'Referer': 'https://www.bloomberg.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:64.0) Gecko/20100101 Firefox/64.0',
})




def latest_news():
    session = requests.Session()
    session.headers = HEADERS
    # session.headers['User-Agent'] = HEADERS['User-Agent']
    # session.headers['Referer'] = HEADERS['Referer']
    # session.headers['Origin'] = HEADERS['Origin']
    # session.headers['Host'] = HEADERS['Host']

    page = session.get(BASE_URL, allow_redirects=True)
    print(page.url)
    print(page.request.headers)
    print(page.history)
    page.raise_for_status()
    soup = BeautifulSoup(page.content, 'html.parser')
    print(soup)

if __name__ == "__main__":

    latest_news()

Output:

 https://www.bloomberg.com/tosv2.html?vid=&uuid=e5737f50-3975-11ea-b7bd-97b9265w12w5&url=Lw==


#Request Headers      

{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
 'Accept-Encoding': 'gzip, deflate, br',
 'Cache-Control': 'max-age=0',
 'Connection': 'keep-alive',
 'Host': 'www.bloomberg.com',
 'Origin': 'https://www.bloomberg.com',
 'Referer': 'https://www.bloomberg.com/',
 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:64.0) Gecko/20100101 Firefox/64.0',
 'Cookie': '_pxhd=4c7cs06d7c42as40601e7d338a1084ca96e4ee91dfa42bd2368e86fec4e66bcd1:e573a66d0-397x5-11ea-b7bd-97b9265412f5'}


[<Response [307]>]

<h1 class="logo">Bloomberg</h1>

【Question Discussion】:

    Tags: python web-scraping python-requests


    【Solution 1】:

    This is a general answer I wrote because I ran into a similar problem. Your issue is likely that the web server is asking you to include those cookies in your subsequent requests. You set the Cookie header to '', so it gets discarded, and the new cookie the server sets is appended to the end of your headers.
    What happens if we just use get()?

    import requests
    import logging
    import http.client as http_client
    http_client.HTTPConnection.debuglevel = 1
    
    #init logging
    logging.basicConfig()
    logging.getLogger().setLevel(logging.DEBUG)
    requests_log = logging.getLogger("requests.packages.urllib3")
    requests_log.setLevel(logging.DEBUG)
    requests_log.propagate = True
    
    requests.get("http://google.com", allow_redirects=False)
    

    Here I've enabled logging so you can see the requests being made (the logging code is omitted from the later examples). This produces the output:

    DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): google.com:80
    send: b'GET / HTTP/1.1\r\nHost: google.com\r\nUser-Agent: python-requests/2.21.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
    ...
    
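    Those extra headers come from the library's own defaults, which you can inspect directly without any network call. A small check (using the public `requests.utils.default_headers()` helper; the exact values depend on your requests version):

```python
import requests

# The defaults every new Session starts with; anything you pass to
# get() is merged on top of these rather than replacing them.
defaults = requests.utils.default_headers()
print(list(defaults))
print(defaults["User-Agent"])  # e.g. 'python-requests/2.21.0'
```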

    As you can see, requests adds some headers on its own, even though we didn't ask it to. Now, what happens if we pass it some headers in an order we want?

    import requests

    headers = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0",
    }
    requests.get("http://google.com", headers=headers, allow_redirects=False)
    

    Here we want "user-agent" to appear at the end of our request, but the output shows otherwise:

     DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): google.com:80
    send: b'GET / HTTP/1.1\r\nHost: google.com\r\nuser-agent: Mozilla/5.0\r\naccept-encoding: gzip, deflate, br\r\naccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3\r\nConnection: keep-alive\r\naccept-language: en-US,en;q=0.9\r\nupgrade-insecure-requests: 1\r\n\r\n'
    ...
    

    "user-agent" showed up in the middle! What gives? Let's look at some of the library's source code.


    def __init__(self):
        #: A case-insensitive dictionary of headers to be sent on each
        #: :class:`Request <Request>` sent from this
        #: :class:`Session <Session>`.
        self.headers = default_headers()
        ...
    

    When we create a Session, the first thing it does is assign itself the default headers, and any further headers supplied "indirectly" by the user (through a function call) are merged into those defaults.
    This is the problem: when you merge two dictionaries (even OrderedDicts), the result keeps the key ordering of the original dictionary. We can see this in the example above, where the "user-agent" header kept the position it holds in the defaults dictionary.
    If you're interested, here is the code that performs the merge:

    def merge_setting(request_setting, session_setting, dict_class=OrderedDict):
        """Determines appropriate setting for a given request, taking into account
        the explicit setting on that request, and the setting in the session. If a
        setting is a dictionary, they will be merged together using `dict_class`
        """
    
        if session_setting is None:
            return request_setting
    
        if request_setting is None:
            return session_setting
    
        # Bypass if not a dictionary (e.g. verify)
        if not (
                isinstance(session_setting, Mapping) and
                isinstance(request_setting, Mapping)
        ):
            return request_setting
    
        merged_setting = dict_class(to_key_val_list(session_setting))
        merged_setting.update(to_key_val_list(request_setting))
    
        # Remove keys that are set to None. Extract keys first to avoid altering
        # the dictionary during iteration.
        none_keys = [k for (k, v) in merged_setting.items() if v is None]
        for key in none_keys:
            del merged_setting[key]
    
        return merged_setting
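    The ordering effect of `merged_setting.update(...)` above can be reproduced with plain OrderedDicts; the header values below are stand-ins for illustration:

```python
from collections import OrderedDict

# Stand-ins for the session's default headers and the headers passed to get().
session_setting = OrderedDict([("User-Agent", "python-requests/2.21.0"),
                               ("Accept", "*/*")])
request_setting = OrderedDict([("Accept", "text/html"),
                               ("User-Agent", "Mozilla/5.0")])

# Same merge as merge_setting(): copy the session dict, then update.
merged = OrderedDict(session_setting)
merged.update(request_setting)

# update() replaces the values but keeps each existing key's original
# position, so the session's ordering wins.
print(list(merged))          # ['User-Agent', 'Accept']
print(merged["User-Agent"])  # 'Mozilla/5.0'
```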
    


    So what's the fix?

    You have to overwrite the default headers completely. The way I can think of is to use a Session and then replace its headers dictionary directly:

    session = requests.Session()
    headers = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9",
        "cookie": "Cookie: Something",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0",
    }
    # session.cookies.add_cookie_header(session)
    session.headers = headers
    a = session.get("https://google.com/", allow_redirects=False)
    

    This produces the desired output, with no need for an OrderedDict:

    DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): google.com:443
    send: b'GET / HTTP/1.1\r\nHost: google.com\r\naccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3\r\naccept-encoding: gzip, deflate, br\r\naccept-language: en-US,en;q=0.9\r\ncookie: Cookie: Something\r\nupgrade-insecure-requests: 1\r\nuser-agent: Mozilla/5.0\r\n\r\n'
    ...
    

    The example above proves that everything stays where it should, and even if you inspect response.request.headers everything should be in order (at least it is for me).
    P.S.: I didn't bother checking whether using an OrderedDict would make a difference, but if you still run into issues, try one.
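    An alternative to assigning a plain dict (which works, but discards the CaseInsensitiveDict wrapper requests normally uses for headers) is to clear the session's default headers and update them in place. A sketch of that approach; the header values here are placeholders:

```python
import requests

session = requests.Session()
session.headers.clear()       # drop the library's default headers entirely
session.headers.update({      # insertion order of this dict is preserved
    "Host": "www.example.com",
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html",
})
print(list(session.headers))  # ['Host', 'User-Agent', 'Accept']
```

    This keeps header lookups case-insensitive while still controlling the order in which the remaining headers were inserted.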

    【Discussion】:

    • Thanks for your answer. I'll get a chance to work through it thoroughly in a few days and will get back to you.
    • OK, I think setting the cookie to empty was indeed the culprit. Thank you very much for your answer. I learned a few things (like debugging with requests).