[Question Title]: Scrape web sites with unique url (python)
[Posted]: 2020-03-31 01:33:05
[Question]:

I am currently working on a web-scraping project, but I am having trouble with the site's URL, because it does not change when I move between pages.

Website: https://www.centris.ca/fr/triplex~a-vendre~montreal-mercier-hochelaga-maisonneuve?uc=1&view=Thumbnail

My goal is to scrape all the buildings listed on both pages.

The only way I have found to grab the data is to use the inspect tool and copy the wrapper of every ad.

Here is my code:

from bs4 import BeautifulSoup
import requests
import csv
import string
import glob

#Grab the soup (content)
source = requests.get("https://www.centris.ca/fr/triplex~a-vendre~montreal-mercier-hochelaga-maisonneuve?uc=1&view=Thumbnail")

soup = BeautifulSoup(source.content, 'html.parser')

#Loop through all the ads on the page
cnt = 0
for ad in soup.find_all('div', {"data-id": "templateThumbnailItem"}):
    #Only process ads that list a price (note: ad.find, not soup.find)
    if ad.find('div', {"class": "price"}):

        #Get the address
        address = ad.find('span', {"class": "address"})
        address = address.findChild().text.strip()

        #Get the district
        district = ad.find('span', {"class": "address"})
        district = district.findChildren()[1].text.strip()

        #Get the type
        typeBuilding = ad.find('span', {"class": "category"}).text
        typeBuilding = typeBuilding.strip()[0:7].strip()

        #Get the price (strip the dollar sign and the non-breaking spaces)
        price = ad.find('span', {"itemprop": "price"}).text
        price = price.replace('$', '').replace(u'\xa0', '')
        price = int(price)

        cnt = cnt + 1

        print(f'Adresse: {address}, Quartier: {district}, Type: {typeBuilding}, Prix: {price}$')
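The price-cleaning steps above can be sketched in isolation. Centris renders prices like "1 250 000 $" with \xa0 (non-breaking space) as the thousands separator; the sample strings below are assumptions, not live data:

```python
def parse_price(raw: str) -> int:
    """Strip the dollar sign and non-breaking spaces, then convert to int."""
    cleaned = raw.replace('$', '').replace('\xa0', '').strip()
    return int(cleaned)

print(parse_price('1\xa0250\xa0000\xa0$'))  # 1250000
print(parse_price('450\xa0000 $'))          # 450000
```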

Thanks for your help!

[Discussion]:

  • I'm not sure I understand: what exactly is your question?
  • I would like to know how to scrape both pages of the site. The difficulty is that the URL does not change when I click on the second page.

Tags: python web-scraping beautifulsoup


[Solution 1]:
import requests
from bs4 import BeautifulSoup
import csv


def main(url):
    with requests.Session() as req:
        # Initial GET establishes the session cookies the POST endpoint expects
        r = req.get(
            "https://www.centris.ca/fr/triplex~a-vendre~montreal-mercier-hochelaga-maisonneuve?uc=1&view=Thumbnail")
        with open("data.csv", 'w', newline="", encoding="UTF-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Address", "Quartier", "Type", "Price"])
            # Two pages of 20 listings each: startPosition 0, then 20
            for num in range(0, 40, 20):
                data = {'startPosition': num}
                # Pagination happens via an XHR POST that returns JSON
                r = req.post(url, json=data).json()
                html = r["d"]["Result"]["html"]
                soup = BeautifulSoup(html, 'html.parser')
                prices = [format(int(price.get("content")), ',d') for price in soup.findAll(
                    "span", itemprop="price")]
                block = soup.findAll("div", class_="location-container")
                ty = [ty.div.get_text(strip=True) for ty in block]
                add = [add.select_one(
                    "span.address div").text for add in block]
                quartier = [quar.select_one(
                    "span.address div:nth-child(2)").text for quar in block]
                final = zip(add, quartier, ty, prices)
                writer.writerows(final)


main("https://www.centris.ca/Mvc/Property/GetInscriptions")
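The key idea in the answer is that the site paginates via an XHR POST, and the JSON response carries the listings as an HTML fragment under `d.Result.html`, which can then be parsed like a normal page. A minimal sketch with an inline fragment standing in for the real response (the markup below is a simplified assumption built from the selectors the answer uses, not the actual Centris markup):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the fragment found at r["d"]["Result"]["html"];
# the real markup is richer but uses the same classes and itemprops.
html = '''
<div class="location-container">
  <div>Triplex</div>
  <span class="address">
    <div>1234 Rue Hypothetique</div>
    <div>Mercier/Hochelaga-Maisonneuve</div>
  </span>
</div>
<span itemprop="price" content="450000"></span>
'''

soup = BeautifulSoup(html, 'html.parser')
block = soup.find("div", class_="location-container")
print(block.div.get_text(strip=True))                           # Triplex
print(block.select_one("span.address div").text)                # 1234 Rue Hypothetique
print(block.select_one("span.address div:nth-child(2)").text)   # Mercier/Hochelaga-Maisonneuve
price = int(soup.find("span", itemprop="price")["content"])
print(format(price, ',d'))                                      # 450,000
```

Saving the scraped HTML to a local file once and developing the selectors against it avoids hammering the live site while debugging.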

Output: View Online

[Discussion]:

  • Thank you! Much appreciated!