【Question title】: Web-scraping in IBM Watson Studio Jupyter Notebook using BeautifulSoup not working
【Posted】: 2021-02-26 06:24:55
【Question】:

I want to scrape data from this search results page in an IBM Watson Studio Jupyter Notebook:

https://www.aspc.co.uk/search/?PrimaryPropertyType=Rent&SortBy=PublishedDesc&LastUpdated=AddedAnytime&SearchTerm=&PropertyType=Residential&PriceMin=&PriceMax=&Bathrooms=&OrMoreBathrooms=true&Bedrooms=&OrMoreBedrooms=true&HasCentralHeating=false&HasGarage=false&HasDoubleGarage=false&HasGarden=false&IsNewBuild=false&IsDevelopment=false&IsParkingAvailable=false&IsPartExchangeConsidered=false&PublicRooms=&OrMorePublicRooms=true&IsHmoLicense=false&IsAllowPets=false&IsAllowSmoking=false&IsFullyFurnished=false&IsPartFurnished=false&IsUnfurnished=false&ExcludeUnderOffer=false&IncludeClosedProperties=true&ClosedDatesSearch=14&MapSearchType=EDITED&ResultView=LIST&ResultMode=NONE&AreaZoom=13&AreaCenter[lat]=57.14955426557916&AreaCenter[lng]=-2.0927401123046785&EditedZoom=13&EditedCenter[lat]=57.14955426557916&EditedCenter[lng]=-2.0927401123046785

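I have tried BeautifulSoup and also Selenium (full disclosure: I am a beginner), with many variations of the code. I have gone through dozens of questions on Stack Overflow, Medium articles and so on, but I cannot work out what I am doing wrong.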

My most recent attempt is:

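import requests
from bs4 import BeautifulSoup

# search_url holds the search results URL shown above
response = requests.get(search_url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)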

properties_containers = html_soup.find_all('div', class_ = 'information-card property-card  col  ')
print(type(properties_containers))
print(len(properties_containers))

This returns 0:

<class 'bs4.element.ResultSet'>
0

Can someone point me in the right direction? What am I doing wrong, or what am I missing?

【Comments】:

    Tags: html web-scraping beautifulsoup jupyter-notebook ibm-watson


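    【Answer 1】:

    The data you see on the page is loaded via JavaScript. BeautifulSoup cannot execute JavaScript, so the property cards are not in the HTML it parses, but you can use the requests module to load the data directly from the site's API.

    For example: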

    import json
    import requests
    
    
    url = 'https://www.aspc.co.uk/search/?PrimaryPropertyType=Rent&SortBy=PublishedDesc&LastUpdated=AddedAnytime&SearchTerm=&PropertyType=Residential&PriceMin=&PriceMax=&Bathrooms=&OrMoreBathrooms=true&Bedrooms=&OrMoreBedrooms=true&HasCentralHeating=false&HasGarage=false&HasDoubleGarage=false&HasGarden=false&IsNewBuild=false&IsDevelopment=false&IsParkingAvailable=false&IsPartExchangeConsidered=false&PublicRooms=&OrMorePublicRooms=true&IsHmoLicense=false&IsAllowPets=false&IsAllowSmoking=false&IsFullyFurnished=false&IsPartFurnished=false&IsUnfurnished=false&ExcludeUnderOffer=false&IncludeClosedProperties=true&ClosedDatesSearch=14&MapSearchType=EDITED&ResultView=LIST&ResultMode=NONE&AreaZoom=13&AreaCenter[lat]=57.14955426557916&AreaCenter[lng]=-2.0927401123046785&EditedZoom=13&EditedCenter[lat]=57.14955426557916&EditedCenter[lng]=-2.0927401123046785'
    api_url = 'https://api.aspc.co.uk/Property/GetProperties?{}&Sort=PublishedDesc&Page=1&PageSize=12'
    
    params = url.split('?')[-1]
    data = requests.get(api_url.format(params)).json()
    
    # uncomment this to see all data received from the server:
    # print(json.dumps(data, indent=4))
    
    # print some data to screen:
    for property_ in data:
        print(property_['Location']['AddressLine1'])
        print(property_['CategorisationDescription'])
        print('Bedrooms:', property_["Bedrooms"])           # <-- print number of Bedrooms
        print('Bathrooms:', property_["Bathrooms"])         # <-- print number of Bathrooms
        print('PublicRooms:', property_["PublicRooms"])     # <-- print number of PublicRooms
        # .. etc.
        print('-' * 80)
    

    Prints:

    44 Roslin Place
    Fully furnished 2 Bdrm 1st flr Flat. Hall. Lounge. Dining kitch. 2 Bdrms. Bathrm (CT band - C). Deposit 1 months rent. Parking. No pets. No smokers. Rent £550 p.m Entry by arr. Viewing contact solicitors. Landlord reg: 871287/100/26061. (EPC band - B).
    Bedrooms: 2
    Bathrooms: 1
    PublicRooms: 1
    --------------------------------------------------------------------------------
    Second Floor Left,  173 Victoria Road
    Unfurnished 1 Bdrm 2nd flr Flat. Hall. Lounge. Dining kitch. Bdrm. Bathrm (CT Band - A). Deposit 1 months rent. No pets. No smokers. Rent £375 p.m Immed entry. Viewing contact solicitors. Landlord reg: 1261711/100/09072. (EPC band - D).
    Bedrooms: 1
    Bathrooms: 1
    PublicRooms: 1
    --------------------------------------------------------------------------------
    102 Bedford Road
    Fully furnished 3 Bdrm 1st flr Flat. Hall. Lounge. Kitch. 3 Bdrms. Bathrm (CT band - B). Deposit 1 months rent. Garden. HMO License. No pets. No smokers. Rent £750 p.m Entry by arr. Viewing contact solicitors. Landlord reg: 49171/100/27130. (EPC band - D).
    Bedrooms: 3
    Bathrooms: 1
    PublicRooms: 1
    --------------------------------------------------------------------------------
    
    ... and so on.
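
The API URL in the answer ends with Page=1&PageSize=12, which suggests the results are paginated. Below is a minimal sketch of building the URL for later pages. The helper names build_api_url and iter_page_urls are hypothetical, and it is an assumption (not confirmed by the answer) that the API accepts higher Page values and returns the next batch of results.

```python
from urllib.parse import urlsplit

# Endpoint and fixed parameters copied from the answer; Page and PageSize are
# turned into placeholders (assumption: the API honours other values for them).
API_TEMPLATE = (
    "https://api.aspc.co.uk/Property/GetProperties"
    "?{query}&Sort=PublishedDesc&Page={page}&PageSize={page_size}"
)


def build_api_url(search_url, page=1, page_size=12):
    """Reuse the query string of the search page URL for the API endpoint."""
    query = urlsplit(search_url).query
    return API_TEMPLATE.format(query=query, page=page, page_size=page_size)


def iter_page_urls(search_url, max_pages, page_size=12):
    """Yield the API URL for pages 1..max_pages."""
    for page in range(1, max_pages + 1):
        yield build_api_url(search_url, page, page_size)
```

Each URL can then be passed to requests.get(...).json() exactly as in the answer. Stopping once a page comes back empty is a reasonable guess at the end condition, but that behaviour should be checked against the real responses.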
    

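    【Comments】:

    • This is very useful, but I need some of the information from the page itself: the full address (including postcode), the number of lounges and bathrooms, and the square footage. How can I extract it? Thanks
    • @Alex03 The data is not inside the page; the page loads it from a remote API. If you uncomment print(json.dumps(data, indent=4)), you will see all the data received from the server. I have updated my example.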
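
To find where a field such as the postcode lives in the JSON, it can help to list every key path in one record rather than read the full dump. The sketch below uses a hypothetical helper, iter_key_paths, on a hand-written sample record. The sample contains only fields that appear in the answer's code and output; the real response most likely has more, and their names are not known here.

```python
def iter_key_paths(value, prefix=""):
    """Yield (path, leaf_value) pairs for every leaf in nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from iter_key_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from iter_key_paths(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


# Sample record built from the fields printed in the answer (not a full API record).
sample = {
    "Location": {"AddressLine1": "44 Roslin Place"},
    "Bedrooms": 2,
    "Bathrooms": 1,
    "PublicRooms": 1,
}

for path, leaf in iter_key_paths(sample):
    print(f"{path} = {leaf!r}")
```

With real data, calling iter_key_paths(data[0]) on the first record from the answer's code would list every available field, which makes it easier to spot the ones holding the postcode or floor area, if the API provides them.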