Scraping an ajax website using Python requests: answers

【Question title】: Scraping an ajax website using Python requests
【Posted】: 2018-10-10 20:57:01
【Question description】:

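I am trying to scrape a web page that loads its content after 5 seconds. I want to use the requests library. Is there a way to make requests wait?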

import requests
from bs4 import BeautifulSoup as soup
from time import sleep

link = 'https://www.off---white.com'
while True:
    try:
        r = requests.get(link, stream=False, timeout=8)
        break
    except requests.RequestException:
        # Network error or timeout: r was never assigned here, so just wait and retry
        sleep(1)

# A 404 does not raise inside requests.get, so check the status after the loop
if r.status_code == 404:
    print("Client error")
    r.raise_for_status()

page = soup(r.text, "html.parser")

products = page.find_all('article', class_='product')
titles = page.find_all('span', class_='prod-title')[0].text.strip()
images = page.find_all('img', class_="js-scroll-gallery-snap-target")

for product in products:
    print(product)

【Question comments】:

Tags: python-3.x web-scraping python-requests

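【Solution 1】:

I once answered a question like this, but the asker came up with a better answer: cfscrape. For this site, cfscrape works better than selenium. By the way, that question seems to have been closed, and I don't know why.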

import cfscrape
import requests
from bs4 import BeautifulSoup as soup

url = "https://www.off---white.com"
# Browser-like headers; the Referer points back at the site itself
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20180101 Firefox/47.0",
    "Referer": url
}
session = requests.session()
# Wrap the session so cfscrape can handle the Cloudflare anti-bot challenge
scraper = cfscrape.create_scraper(sess=session)
r = scraper.get(url, headers=headers)
page = soup(r.text, "html.parser")

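Update, April 15, 2020

Since off-white has updated its protection, cfscrape is no longer a good idea. Please try Selenium instead.

For this kind of problem I cannot give an answer that works forever. They keep updating their protection!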
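
The update above points to Selenium without showing it. The following is a minimal sketch, not the answerer's code: it assumes the `selenium` package and a local Chrome are installed, and it reuses the `article.product` selector from the question as a guess at what marks the loaded content. `wait_for_elements` and `fetch_rendered_html` are hypothetical helper names. Selenium's own `WebDriverWait` can replace the hand-written polling loop, which is spelled out here only to make the waiting explicit. Whether the site's protection lets a headless browser through is not something this sketch can promise.

```python
import time


def wait_for_elements(driver, css_selector, timeout=15.0, poll=0.5):
    """Poll the live DOM until css_selector matches, then return the rendered HTML.

    Hypothetical helper: `driver` only needs `find_elements` and `page_source`,
    which a Selenium WebDriver provides.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value of selenium's By.CSS_SELECTOR
        if driver.find_elements("css selector", css_selector):
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{css_selector!r} did not appear within {timeout}s")
        time.sleep(poll)


def fetch_rendered_html(url, css_selector, timeout=15.0):
    """Open url in headless Chrome and return the HTML once css_selector exists."""
    # Imported here so the polling helper above works without selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    # Assumption: a recent Chrome; older versions expect plain "--headless"
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return wait_for_elements(driver, css_selector, timeout)
    finally:
        # Always close the browser, even if the wait times out
        driver.quit()
```

If the page loads, the string returned by `fetch_rendered_html('https://www.off---white.com', 'article.product')` can be passed to `BeautifulSoup` in place of `r.text` from the question.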

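【Comments】:

【Solution 2】:

No. The content you receive is always the same; you have to prerender the page yourself to get its final version.

You have to use a headless browser to execute the JavaScript on the page.

Prerender.IO provides what you need. Take a look; it is easy to set up.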

const prerender = require('prerender');
const server = prerender();
server.start();
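
Once that server is running, the Python side only has to request the page through it. A minimal sketch, assuming the server listens on its default `http://localhost:3000` and exposes the `/render?url=` endpoint described in the prerender README. `prerender_url` and `fetch_prerendered` are hypothetical names, and the standard library's `urllib` is used so the block has no extra dependencies (`requests.get` on the same URL should work equally well).

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: a prerender server started locally on its default port
PRERENDER_BASE = "http://localhost:3000"


def prerender_url(target, base=PRERENDER_BASE):
    """Build the URL that asks the prerender server to render `target`."""
    # Percent-encode the whole target so it survives as a single query value
    return f"{base}/render?url={quote(target, safe='')}"


def fetch_prerendered(target, base=PRERENDER_BASE, timeout=30):
    """Return the HTML of `target` after the prerender server has run its JavaScript."""
    with urlopen(prerender_url(target, base), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

The HTML returned by `fetch_prerendered('https://www.off---white.com')` can then be parsed with `BeautifulSoup` as in the question, provided the site serves the page to the prerender server at all.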

【Comments】: