【Question title】: pycurl and curl behaving differently when requesting same resource; curl correctly gives a JSON object, PycURL an HTML object
【Posted】: 2021-10-03 16:44:20
【Question】:

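ipinfo.io provides information about the website/server behind an IP address. You can look an address up by entering it on their website, or by sending them a request with the curl command-line utility, for example: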

$ curl  https://ipinfo.io/172.217.169.6

which outputs JSON:

{
  "ip": "172.217.169.68",
  "hostname": "lhr48s09-in-f4.1e100.net",
  "city": "London",
  "region": "England",
  "country": "GB",
  "loc": "51.5085,-0.1257",
  "org": "AS15169 Google LLC",
  "postal": "EC1A",
  "timezone": "Europe/London",
  "readme": "https://ipinfo.io/missingauth"
}

What I ultimately want is to do this in Python and store the result as a JSON object. I believed the following code, using PycURL, would produce the same output:

import pycurl
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, "https://ipinfo.io/172.217.169.6")
c.setopt(c.WRITEDATA, buffer)
c.perform()
c.close()

body = buffer.getvalue()
print(body.decode('iso-8859-1'))

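that is, write the same JSON string to the buffer.

However, it prints a large amount of HTML instead. I suspect PycURL is receiving the HTML of the actual web page rather than the JSON data. For example: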

<!DOCTYPE html>
<html>
<head>
    <title>
    172.217.169.6 IP Address Details
 - IPinfo.io</title>
    <meta charset="utf-8">
    <meta name="apple-itunes-app" content="app-id=917634022">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no, user-scalable=no">
    <meta name="description" content="Full IP address details for 172.217.169.6 (AS15169 Google LLC) including geolocation and map, hostname, and API details.">

    <link rel="manifest" href="/static/manifest.json">
    <link rel="icon" sizes="48x48" href="/static/deviceicons/android-icon-48x48.png">


...
    

</html>

Basically, how do I get PycURL to receive the JSON data as well?



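I tried comparing the verbose output of the two, but I cannot work out why they behave differently. The only difference I can see is the content-type field: "application/json" for curl and "text/html" for PycURL, which explains the different output. At the risk of making this post lengthy, I have included both below:

curl (command line) verbose output: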

$ curl -v https://ipinfo.io/172.217.169.6
*   Trying 34.117.59.81:443...
* TCP_NODELAY set
* Connected to ipinfo.io (34.117.59.81) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=ipinfo.io
*  start date: Jul 10 20:18:59 2021 GMT
*  expire date: Oct  8 21:18:59 2021 GMT
*  subjectAltName: host "ipinfo.io" matched cert's "ipinfo.io"
*  issuer: C=US; O=Google Trust Services LLC; CN=GTS CA 1D4
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55a887a40e10)
> GET /172.217.169.6 HTTP/2
> Host: ipinfo.io
> user-agent: curl/7.68.0
> accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 200 
< access-control-allow-origin: *
< x-frame-options: DENY
< x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
< referrer-policy: strict-origin-when-cross-origin
< content-type: application/json; charset=utf-8
< content-length: 286
< date: Tue, 27 Jul 2021 21:03:50 GMT
< x-envoy-upstream-service-time: 1
< via: 1.1 google
< alt-svc: clear
< 
{
  "ip": "172.217.169.6",
  "hostname": "lhr25s26-in-f6.1e100.net",
  "city": "London",
  "region": "England",
  "country": "GB",
  "loc": "51.5085,-0.1257",
  "org": "AS15169 Google LLC",
  "postal": "EC1A",
  "timezone": "Europe/London",
  "readme": "https://ipinfo.io/missingauth"
* Connection #0 to host ipinfo.io left intact
}

PycURL verbose output:

$ python3 ip_helper.py
*   Trying 34.117.59.81:443...
* TCP_NODELAY set
* Connected to ipinfo.io (34.117.59.81) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=ipinfo.io
*  start date: Jul 10 20:18:59 2021 GMT
*  expire date: Oct  8 21:18:59 2021 GMT
*  subjectAltName: host "ipinfo.io" matched cert's "ipinfo.io"
*  issuer: C=US; O=Google Trust Services LLC; CN=GTS CA 1D4
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x19d65c0)
> GET /172.217.169.6 HTTP/2
Host: ipinfo.io
user-agent: PycURL/7.43.0.6 libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
accept: */*

* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 200 
< access-control-allow-origin: *
< x-frame-options: DENY
< x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
< referrer-policy: strict-origin-when-cross-origin
< content-type: text/html; charset=utf-8
< content-length: 44645
< date: Tue, 27 Jul 2021 21:07:50 GMT
< x-envoy-upstream-service-time: 13
< via: 1.1 google
< alt-svc: clear
< 
* Connection #0 to host ipinfo.io left intact
<!DOCTYPE html>
<html>
<head>
    <title>
    172.217.169.6 IP Address Details
 - IPinfo.io</title>
    <meta charset="utf-8">
    <meta name="apple-itunes-app" content="app-id=917634022">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no, user-scalable=no">
    <meta name="description" content="
    
        Full IP address details for 172.217.169.6 (AS15169 Google LLC) including geolocation and map, hostname, and API details.
    
">

    <link rel="manifest" href="/static/manifest.json">
    <link rel="icon" sizes="48x48" href="/static/deviceicons/android-icon-48x48.png">


...

</html>
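
The content-type difference between the two responses can also be checked from Python rather than by reading the verbose log. The following is only a minimal sketch: the function names `content_type_from_header_lines` and `fetch_content_type` are hypothetical, and it assumes the response headers are collected through PycURL's `HEADERFUNCTION` callback, which receives each raw header line as bytes.

```python
from io import BytesIO


def content_type_from_header_lines(header_lines):
    """Return the Content-Type value found in raw header lines, or None."""
    content_type = None
    for raw in header_lines:
        # Header lines arrive as bytes; iso-8859-1 can decode any byte value.
        line = raw.decode("iso-8859-1").strip()
        name, sep, value = line.partition(":")
        # Lines without a colon (status line, blank line) are skipped.
        if sep and name.strip().lower() == "content-type":
            content_type = value.strip()
    return content_type


def fetch_content_type(url):
    """Request `url` with PycURL and report its Content-Type (needs network)."""
    import pycurl  # imported here so the parser above works without PycURL

    header_lines = []
    body = BytesIO()
    c = pycurl.Curl()
    try:
        c.setopt(c.URL, url)
        c.setopt(c.WRITEDATA, body)
        # Each response header line is appended to the list as it arrives.
        c.setopt(c.HEADERFUNCTION, header_lines.append)
        c.perform()
    finally:
        c.close()
    return content_type_from_header_lines(header_lines)
```

With the request from the question, `fetch_content_type("https://ipinfo.io/172.217.169.6")` would be expected to report the same `text/html` value that appears in the verbose log above.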

Thank you for your time.

【Comments】:

Tags: python html json curl pycurl


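【Solution 1】:

From the docs:

We try to automatically detect when someone wants to call our API rather than view our website, and then we send back the appropriate JSON response instead of HTML. We do this based on the user agent of known popular programming languages, tools, and frameworks. However, when a JSON response does not happen automatically, there are a couple of other ways to force it. One is to add /json to the URL, and the other is to set an Accept header to application/json.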

So it looks like there are three different ways to get JSON with pycurl.

  1. Append /json to your URL:
c.setopt(c.URL, "https://ipinfo.io/172.217.169.6/json")
  2. Set your Accept header so that only JSON responses are accepted:
c.setopt(c.HTTPHEADER, ["Accept: application/json"])
  3. Set your User-Agent header so the site thinks it is talking to curl rather than pycurl:
c.setopt(c.HTTPHEADER, ["User-Agent: curl"])
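
Putting this together with the original goal of storing the result as a JSON object, a minimal sketch might look like the following. It combines the /json URL with the Accept header. The helper names `json_url`, `parse_body`, and `fetch_ip_info` are hypothetical, and the body is assumed to be UTF-8, as the `charset=utf-8` response header in the question suggests.

```python
import json
from io import BytesIO


def json_url(ip):
    """Build the ipinfo.io URL with /json appended to force a JSON response."""
    return "https://ipinfo.io/{}/json".format(ip)


def parse_body(raw):
    """Decode the response bytes as UTF-8 and parse them into a Python object."""
    return json.loads(raw.decode("utf-8"))


def fetch_ip_info(ip):
    """Request details for `ip` and return them as a dict (needs network)."""
    import pycurl  # imported here so the helpers above work without PycURL

    buffer = BytesIO()
    c = pycurl.Curl()
    try:
        c.setopt(c.URL, json_url(ip))
        # Also ask for JSON explicitly through the Accept header.
        c.setopt(c.HTTPHEADER, ["Accept: application/json"])
        c.setopt(c.WRITEDATA, buffer)
        c.perform()
    finally:
        # Release the handle even if perform() raises.
        c.close()
    return parse_body(buffer.getvalue())
```

Calling `fetch_ip_info("172.217.169.6")` should then return a dict with keys such as `"ip"` and `"org"`. Note that `HTTPHEADER` takes a list, so if you want to send both an Accept and a User-Agent header, pass them together in one list rather than calling `setopt` twice, since a later call should replace the earlier list.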

【Comments】:
