[Question Title]: Scrape web sites with unique url (python)
[Posted]: 2020-03-31 01:33:05
[Question]:

I am currently working on a web-scraping project, but I am having trouble with the site's URL, because it does not change when I move between pages.

Website: https://www.centris.ca/fr/triplex~a-vendre~montreal-mercier-hochelaga-maisonneuve?uc=1&view=Thumbnail

My goal is to scrape all the buildings listed on both pages.

The only way I have found to grab the data is to use the inspect tool and copy the wrapper of every ad.

Here is my code:

from bs4 import BeautifulSoup
import requests
import csv
import string
import glob

#Grab the soup (content)
source = requests.get("https://www.centris.ca/fr/triplex~a-vendre~montreal-mercier-hochelaga-maisonneuve?uc=1&view=Thumbnail")

soup = BeautifulSoup(source.content, 'html.parser')

#Loop through all the ads on the page
cnt = 0
for ad in soup.find_all('div', {"data-id": "templateThumbnailItem"}):
    #Only process ads that list a price (note: ad.find, not soup.find)
    if ad.find('div', {"class": "price"}):

        #Get the address
        address = ad.find('span', {"class": "address"})
        address = address.findChild().text.strip()

        #Get the district
        district = ad.find('span', {"class": "address"})
        district = district.findChildren()[1].text.strip()

        #Get the type
        typeBuilding = ad.find('span', {"class": "category"}).text
        typeBuilding = typeBuilding.strip()[0:7].strip()

        #Get the price (strip the dollar sign and the non-breaking spaces)
        price = ad.find('span', {"itemprop": "price"}).text
        price = price.replace('$', '').replace(u'\xa0', '')
        price = int(price)

        cnt = cnt + 1

        print(f'Adresse: {address}, Quartier: {district}, Type: {typeBuilding}, Prix: {price}$')
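The price-cleaning steps above can be sketched in isolation. Centris renders prices like "1 250 000 $" with \xa0 (non-breaking space) as the thousands separator; the sample strings below are assumptions, not live data:

```python
def parse_price(raw: str) -> int:
    """Strip the dollar sign and non-breaking spaces, then convert to int."""
    cleaned = raw.replace('$', '').replace('\xa0', '').strip()
    return int(cleaned)

print(parse_price('1\xa0250\xa0000\xa0$'))  # 1250000
print(parse_price('450\xa0000 $'))          # 450000
```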

Thanks for your help!

[Discussion]:

  • I'm not sure I understand: what exactly is your question?
  • I would like to know how to scrape both pages of the site. The difficulty is that the URL does not change when I click on the second page.

Tags: python web-scraping beautifulsoup


[Solution 1]:
import requests
from bs4 import BeautifulSoup
import csv


def main(url):
    with requests.Session() as req:
        # Initial GET establishes the session cookies the POST endpoint expects
        r = req.get(
            "https://www.centris.ca/fr/triplex~a-vendre~montreal-mercier-hochelaga-maisonneuve?uc=1&view=Thumbnail")
        with open("data.csv", 'w', newline="", encoding="UTF-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Address", "Quartier", "Type", "Price"])
            # Two pages of 20 listings each: startPosition 0, then 20
            for num in range(0, 40, 20):
                data = {'startPosition': num}
                # Pagination happens via an XHR POST that returns JSON
                r = req.post(url, json=data).json()
                html = r["d"]["Result"]["html"]
                soup = BeautifulSoup(html, 'html.parser')
                prices = [format(int(price.get("content")), ',d') for price in soup.findAll(
                    "span", itemprop="price")]
                block = soup.findAll("div", class_="location-container")
                ty = [ty.div.get_text(strip=True) for ty in block]
                add = [add.select_one(
                    "span.address div").text for add in block]
                quartier = [quar.select_one(
                    "span.address div:nth-child(2)").text for quar in block]
                final = zip(add, quartier, ty, prices)
                writer.writerows(final)


main("https://www.centris.ca/Mvc/Property/GetInscriptions")
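The key idea in the answer is that the site paginates via an XHR POST, and the JSON response carries the listings as an HTML fragment under `d.Result.html`, which can then be parsed like a normal page. A minimal sketch with an inline fragment standing in for the real response (the markup below is a simplified assumption built from the selectors the answer uses, not the actual Centris markup):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the fragment found at r["d"]["Result"]["html"];
# the real markup is richer but uses the same classes and itemprops.
html = '''
<div class="location-container">
  <div>Triplex</div>
  <span class="address">
    <div>1234 Rue Hypothetique</div>
    <div>Mercier/Hochelaga-Maisonneuve</div>
  </span>
</div>
<span itemprop="price" content="450000"></span>
'''

soup = BeautifulSoup(html, 'html.parser')
block = soup.find("div", class_="location-container")
print(block.div.get_text(strip=True))                           # Triplex
print(block.select_one("span.address div").text)                # 1234 Rue Hypothetique
print(block.select_one("span.address div:nth-child(2)").text)   # Mercier/Hochelaga-Maisonneuve
price = int(soup.find("span", itemprop="price")["content"])
print(format(price, ',d'))                                      # 450,000
```

Saving the scraped HTML to a local file once and developing the selectors against it avoids hammering the live site while debugging.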

Output: View Online

[Discussion]:

  • Thank you! Much appreciated!