Get an xhr document that loads when you visit a page: answers

【Question title】: Get an xhr document that loads when you visit a page
【Posted】: 2020-11-05 09:26:03
【Question】:

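I am trying to get the elements shown below the photo on the following site, or the equivalent on other pages:

But I can't get them from the page source. They must be loaded dynamically by a JavaScript script. In fact, they seem to come from an xhr document:

So how can I get the xhr document that is downloaded when I visit the page?

I tried: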

url = "https://www.nosetime.com/xiangshui/350870-oulong-atelier-cologne-oolang-infini.html"

r = requests.post(url, headers=headers)
data = r.json()

print(data)

But it returns:

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
<ipython-input-8-e72156ddb336> in <module>()
      2 
      3 r = requests.post(url, headers=headers)
----> 4 data = r.json()
      5 
      6 print(data)

3 frames
/usr/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
    355             obj, end = self.scan_once(s, idx)
    356         except StopIteration as err:
--> 357             raise JSONDecodeError("Expecting value", s, err.value) from None
    358         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

【Question comments】:

Tags: python-3.x ajax web-scraping xmlhttprequest

【Solution 1】:

Just add the right headers to the request and you'll have the data.

import requests


headers = {
    "referer": "https://www.nosetime.com/xiangshui/350870-oulong-atelier-cologne-oolang-infini.html",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
}
response = requests.get("https://www.nosetime.com/app/item.php?id=350870", headers=headers).json()

print(response["id"], response["isscore"], response["brandid"])

For some reason I can't paste the entire JSON output, as SO thinks it's spam... o.O. Anyway, this should give you the JSON response.

This prints:

350870 8.6 10091761
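
A side note on the error in the question: "Expecting value: line 1 column 1 (char 0)" means the body did not start with a JSON value at all, which is what happens when the server answers with HTML or an empty body (presumably the case when requesting the product page itself). A minimal sketch of a guard, assuming you want None instead of an exception; the helper name decode_json_body is made up for illustration:

```python
import json


def decode_json_body(content_type, text):
    """Return the parsed JSON body, or None when the body is not JSON."""
    # An HTML page usually announces itself as text/html; skip decoding early.
    if "html" in content_type.lower():
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Empty or non-JSON bodies raise "Expecting value", as in the question.
        return None


# Usage with requests (hypothetical variable names):
# r = requests.get(api_url, headers=headers)
# data = decode_json_body(r.headers.get("Content-Type", ""), r.text)
```

If this returns None, printing r.status_code and the first few hundred characters of r.text usually shows what the server actually sent.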

EDIT:

If you have more products, you can simply loop over the product URLs and extract what you need from the JSON. For example,

import requests

product_urls = [
    "https://www.nosetime.com/xiangshui/947895-oulong-xuecheng-atelier-cologne-orange.html",
    "https://www.nosetime.com/xiangshui/705357-pomelo-paradis.html",
    "https://www.nosetime.com/xiangshui/592260-cl-mentine-california.html",
    "https://www.nosetime.com/xiangshui/612353-oulong-atelier-cologne-trefle.html",
    "https://www.nosetime.com/xiangshui/911317-oulong-nimingmeigui-atelier-cologne.html",
]


for product_url in product_urls:
    headers = {
        "referer": product_url,
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
    }
    product_id = product_url.split("/")[-1].split("-")[0]
    response = requests.get(
        f"https://www.nosetime.com/app/item.php?id={product_id}",
        headers=headers,
    ).json()
    print(f"Product name: {response['enname']} | Rating: {response['isscore']}")

Output:

Product name: Atelier Cologne Orange Sanguine, 2010 | Rating: 8.9
Product name: Atelier Cologne Pomelo Paradis, 2015 | Rating: 8.8
Product name: Atelier Cologne Clémentine California, 2016 | Rating: 8.6
Product name: Atelier Cologne Trefle Pur, 2010 | Rating: 8.6
Product name: Atelier Cologne Rose Anonyme, 2012 | Rating: 7.7
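
If the product URLs might carry a query string or a trailing slash, the plain split("/") above can pick up the wrong piece. A slightly more defensive sketch, assuming every product URL has the same "<id>-<slug>.html" shape as the ones listed; item_request is a hypothetical helper name, and the endpoint and user-agent are the ones used in the answer:

```python
from urllib.parse import urlparse

# Endpoint and user-agent taken from the answer's code.
API_URL = "https://www.nosetime.com/app/item.php?id={}"
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
)


def item_request(product_url):
    """Return (api_url, headers) for a product page URL."""
    # Last path segment, e.g. "705357-pomelo-paradis.html"; the query string
    # is ignored because only the parsed path is used.
    slug = urlparse(product_url).path.rstrip("/").split("/")[-1]
    product_id = slug.split("-")[0]
    if not product_id.isdigit():
        raise ValueError(f"no numeric id in {product_url!r}")
    # The product page itself goes in as the referer.
    headers = {"referer": product_url, "user-agent": USER_AGENT}
    return API_URL.format(product_id), headers


# Usage (performs a network request, so it is left commented out):
# api_url, headers = item_request(product_url)
# data = requests.get(api_url, headers=headers).json()
```

Raising ValueError for a URL without a numeric id is a design choice for this sketch; skipping such URLs would work just as well in a loop.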

【Comments】:

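Wait, what? That's crazy. Thanks. Maybe because it's in Chinese, it thinks it's a virus ^^
Haha. I guess that's it! :D
Based on the URL I provided, do you know how to get the URL you are using?
Not sure what you mean exactly, but if the product URLs all look the same, just split on / and then on - to get the id number and put it here https://www.nosetime.com/app/item.php?id= . Don't forget to pass the product URL as the referer as well.
Or share a few product URLs and I'll update the answer.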