POST request always returns "Disallowed Key Characters"

【Question title】: POST request always returns "Disallowed Key Characters"
【Posted】: 2016-12-28 16:25:39
【Question description】:

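I want to retrieve atmospheric particulate matter values from a table (sadly the site is not in English, so feel free to ask about anything). My first attempt, BeautifulSoup combined with a GET request sent with requests, failed: the table is filled in dynamically by Bootstrap, so a parser like BeautifulSoup cannot find values that have not been inserted yet.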

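Using Firebug I inspected every corner of the page and found that selecting a different day for the table sends a POST request (as the Referer shows, the site is http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/index/regionale/, which is where the table lives):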

POST /temi-ambientali/aria/qualita-aria/bollettini/aj_dati_bollettini HTTP/1.1
Host: www.arpat.toscana.it    
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
Accept: */*    
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With: XMLHttpRequest
Referer: http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/index/regionale/26-12-2016
Content-Length: 114
Cookie: [...]
DNT: 1
Connection: keep-alive

with the following parameters:

v_data_osservazione=26-12-2016&v_tipo_bollettino=regionale&v_zona=&csrf_test_name=b88d2517c59809a529b6f8141256e6ca

The data in the response is in JSON format.

So I started crafting my own POST request, in order to get the JSON data that fills the table directly.

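Besides the date, the parameters require a csrf_test_name: this is how I discovered that the site is protected against CSRF. To send a valid query I need a CSRF token, so I perform a GET request to the site (see the Referer of the POST request for the URL) and read the CSRF token from the cookie, like this: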

r = get(url)
csrf_token = r.cookies["csrf_cookie_name"]

In the end, with my CSRF token and my POST request ready, I send it... and although the status code is 200, I always get "Disallowed Key Characters."!

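Searching for this error, I only find posts about CodeIgniter, which (I think) is not what I need. I tried every combination of headers and parameters, but nothing changed. Before giving up on BeautifulSoup and requests and starting to learn Selenium, I would like to figure out what the problem is: Selenium is too high-level, while lower-level libraries like BeautifulSoup and requests have taught me a lot of useful things, so I would prefer to keep learning those two.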

Here is the code:

from requests import get, post
from bs4 import BeautifulSoup
import datetime
import json

url = "http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/index/regionale/" # + %d-%m-%Y
yesterday = datetime.date.today() - datetime.timedelta(1)
date_object = datetime.datetime.strptime(str(yesterday), '%Y-%m-%d')
yesterday_string = str(date_object.strftime('%d-%m-%Y'))

full_url = url + yesterday_string
print("REFERER " + full_url)

r = get(url)
csrf_token = r.cookies["csrf_cookie_name"]
print(csrf_token)

# preparing headers for POST request
headers = {
    "Host": "www.arpat.toscana.it",
    "Accept" : "*/*",
    "Accept-Language" : "en-US,en;q=0.5",
    "Accept-Encoding" : "gzip, deflate",
    "Content-Type" : "application/x-www-form-urlencoded; charset=UTF-8",
    "X-Requested-With" : "XMLHttpRequest", # XHR
    "Referer" : full_url,
    "DNT" : "1", 
    "Connection" : "keep-alive"
}

# preparing POST parameters (to be inserted in request's body)
payload_string = "v_data_osservazione="+yesterday_string+"&v_tipo_bollettino=regionale&v_zona=&csrf_test_name="+csrf_token
print(payload_string)

# data -- (optional) Dictionary, bytes, or file-like object to send in the body of the Request.

# json -- (optional) json data to send in the body of the Request.
req = post("http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/aj_dati_bollettini",
    headers = headers, json = payload_string
)

print("URL " + req.url)

print("RESPONSE:")
print('\t'+str(req.status_code))
print("\tContent-Encoding: " + req.headers["Content-Encoding"])
print("\tContent-type: " + req.headers["Content-type"])
print("\tContent-Length: " + req.headers["Content-Length"])
print('\t'+req.text)

【Comments】:

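If you want to keep learning and making the requests you need by hand, but would like help with cookie, referer and header management, I suggest you take a look at scrapy.
I would really like to know what is wrong with my code, but scrapy may be worth a try. I'm going to put pip to work now, thanks for the suggestion.
You can send the POST to httpbin.org, which sends back all the data it received; you can then compare that with the data the browser sends to the server. It helps to spot differences between the requests.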
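
The comparison suggested in the last comment can also be done offline. What follows is a minimal sketch, assuming a reasonably recent requests: it builds both variants with `requests.Request(...).prepare()` and prints what would be sent, without contacting any server. The token is a made-up placeholder, and `as_text` is a helper introduced here for illustration only.

```python
import requests

URL = "http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/aj_dati_bollettini"
TOKEN = "0123456789abcdef0123456789abcdef"  # placeholder, not a real CSRF token

FORM_HEADERS = {"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"}

payload_string = (
    "v_data_osservazione=26-12-2016&v_tipo_bollettino=regionale"
    "&v_zona=&csrf_test_name=" + TOKEN
)
payload_dict = {
    "v_data_osservazione": "26-12-2016",
    "v_tipo_bollettino": "regionale",
    "v_zona": "",
    "csrf_test_name": TOKEN,
}


def as_text(body):
    """Return the prepared body as str (some requests versions store bytes)."""
    return body.decode("utf-8") if isinstance(body, bytes) else body


# The question's approach: a ready-made string passed through json=
as_json = requests.Request(
    "POST", URL, headers=FORM_HEADERS, json=payload_string
).prepare()
# A browser-like form submission: a dict passed through data=
as_form = requests.Request("POST", URL, data=payload_dict).prepare()

json_body = as_text(as_json.body)
form_body = as_text(as_form.body)

print("json= body:", json_body)
print("data= body:", form_body)
print("data= Content-Type:", as_form.headers["Content-Type"])
```

With `json=`, the string is serialized as a JSON string literal, so the body arrives wrapped in double quotes; with `data=`, the body has the same shape as the parameter string the browser sent.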

Tags: python beautifulsoup httprequest python-requests

【Solution 1】:

This code works for me:

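I use requests.Session(), which keeps all the cookies
I use data= instead of json=
In the end I did not need any of the commented-out elements
To compare the browser's request with the one made by the code, I used the Charles web debugging proxy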

Code:

import requests
import datetime

#proxies = {
#    'http': 'http://localhost:8888',
#    'https': 'http://localhost:8888',
#}

s = requests.Session()
#s.proxies = proxies # for test only

date = datetime.datetime.today() - datetime.timedelta(days=1)
date = date.strftime('%d-%m-%Y')

# --- main page ---

url = "http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/index/regionale/"

print("REFERER:", url+date)

r = s.get(url)

# --- data ---

csrf_token = s.cookies["csrf_cookie_name"]

#headers = {
    #'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0',
    #"Host": "www.arpat.toscana.it",
    #"Accept" : "*/*",
    #"Accept-Language" : "en-US,en;q=0.5",
    #"Accept-Encoding" : "gzip, deflate",
    #"Content-Type" : "application/x-www-form-urlencoded; charset=UTF-8",
    #"X-Requested-With" : "XMLHttpRequest", # XHR
    #"Referer" : url,
    #"DNT" : "1", 
    #"Connection" : "keep-alive"
#}

payload = {
    'csrf_test_name': csrf_token,
    'v_data_osservazione': date,
    'v_tipo_bollettino': 'regionale',
    'v_zona': None,  # requests leaves out keys whose value is None, so v_zona is not sent
}

url = "http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/aj_dati_bollettini"
r = s.post(url, data=payload) #, headers=headers)

print('Status:', r.status_code)
print(r.json())
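
A likely explanation of why switching to `data=` removes the error, sketched with the standard library only. The assumptions here are not confirmed anywhere in the thread: that `json=` serializes its argument the way `json.dumps` does, that the server parses the body as form fields, and that it rejects keys outside a pattern like the one below (modeled on the key check commonly attributed to CodeIgniter; the real rule on this site is unknown). `ALLOWED_KEY` and `disallowed_keys` are names introduced for this illustration.

```python
import json
import re
from urllib.parse import parse_qsl, urlencode

TOKEN = "0123456789abcdef0123456789abcdef"  # placeholder, not a real CSRF token

# Assumed server-side rule: keys may contain only these characters.
ALLOWED_KEY = re.compile(r"^[a-z0-9:_/-]+$", re.IGNORECASE)

fields = {
    "v_data_osservazione": "26-12-2016",
    "v_tipo_bollettino": "regionale",
    "v_zona": "",
    "csrf_test_name": TOKEN,
}

# Equivalent, for these values, to the body produced by data=<dict>
form_body = urlencode(fields)
# Equivalent, under the assumption above, to the body produced by json=<str>
json_body = json.dumps(form_body)


def disallowed_keys(body):
    """Parse a body as form fields and return the keys the assumed rule rejects."""
    pairs = parse_qsl(body, keep_blank_values=True)
    return [key for key, _ in pairs if not ALLOWED_KEY.match(key)]


print(disallowed_keys(form_body))
print(disallowed_keys(json_body))
```

The JSON-serialized body begins with a double quote, so a form parser reads the first key as `"v_data_osservazione`, and a key filter of this kind would reject it. The form-encoded body yields only plain keys.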

Proxy:

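【Comments】:

It works indeed, thank you! Session and Charles aside, it seems I was passing the parameters to the POST request the wrong way.
There is a problem with the date: there is no information for 28-12-2016 yet.
You're right. I deleted my comment once I realized the data had yet to be loaded: it's 2 a.m. and I'm tired xD. That was not very polite of me, since you had already read it and answered again. Thanks again.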
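
Since the bulletin for the most recent day may not be published yet, one option is to try a few recent dates, newest first. This is only a minimal sketch: `candidate_dates` is a hypothetical helper, and the thread does not say how the server signals a missing bulletin, so that check is left to the caller.

```python
import datetime


def candidate_dates(today, days_back=3):
    """Return dd-mm-YYYY strings for the days before `today`, newest first."""
    return [
        (today - datetime.timedelta(days=offset)).strftime("%d-%m-%Y")
        for offset in range(1, days_back + 1)
    ]


# Each string can be used as the v_data_osservazione value, stopping at the
# first date for which the response actually contains data.
print(candidate_dates(datetime.date(2016, 12, 29)))
```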