【Question title】: Combine parsing of html inside and outside of iframe w/ BeautifulSoup
【Posted】: 2020-07-27 12:13:59
【Question】:

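I'm trying to scrape data from a real estate listings website using Selenium/BS4/Python. The script extracts the listing page links before parsing the html on each listing page. I'm using a Chrome extension called Property Data, which appears on each page as an iframe and shows data for the specific area the listing is in (e.g. the average price, yield, etc. for that zip code/postcode). It takes about 15 seconds to load the results on each listing page. See the screenshot of the extension on the right-hand side of the page: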
https://imgur.com/a/mjodyts

The iframe html from Chrome's inspector:

<div class="row one-col print-hidden"><div class="cell"><div class="module"><div class="hl-1 pad-16" style="padding-top: 0 !important;"><div class="propertydata" style="height: 410px; overflow: none;"><iframe scrolling="no" style="width: 302px; height: 410px; margin: 0; border: 0;" src="https://propertydata.co.uk/extension/1.3/51.579520/-0.235261/rightmove/78617383/399950/2"></iframe></div></div></div></div></div>

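My question is:
Once the Chrome extension iframe and its results have loaded on a listing page, how do I modify the get_html_data function in the code below so that it first parses the html of the existing page and then switches to parsing and locating elements inside the iframe?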

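import time

from bs4 import BeautifulSoup
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

rightmove_hmo_search = "https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=POSTCODE%5E1619792&maxBedrooms=4&minBedrooms=2&maxPrice=500000&radius=10.0&sortType=18&propertyTypes=&maxDaysSinceAdded=14&includeSSTC=false&mustHave=&dontShow=newHome%2CsharedOwnership%2Cretirement&furnishTypes=&keywords=stpp%2Cloft"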

#identify and extract listing links from each page (in this case first page only)
def get_house_links(url, driver, pages=1):
    house_links = []
    driver.get(url)
    for i in range(pages):
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        listings = soup.find_all("a", class_="propertyCard-moreInfoItem is-carousel")
        page_data = ['https://rightmove.co.uk' + row['href'] for row in listings]
        house_links.append(page_data)
        print(house_links)
        # next_button = soup.select('button[class="pagination-button pagination-direction pagination-direction--next"]')
        # if next_button:
        #     for page in range(0, 1):
        #         index = page * 24
        #         next_button_link = 'https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=POSTCODE%5E1619792&maxBedrooms=4&minBedrooms=2&maxPrice=500000&radius=10.0&sortType=18&' + '&index=' + str(index) + '&propertyTypes=&maxDaysSinceAdded=14&includeSSTC=false&mustHave=&dontShow=newHome%2CsharedOwnership%2Cretirement&furnishTypes=&keywords=stpp%2Cloft'
        #         driver.get(next_button_link)
        #         if page > 1:
        #             break


    return house_links


#get html data from url and return as object
def get_html_data(url, driver):
    driver.get(url)
    try:
        WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.XPATH, "//iframe[contains(@class, 'key')]")))
    except TimeoutException:
        print("page took too long to load")
    BeautifulSoup(driver.page_source, 'html.parser')
    driver.switch_to.frame(driver.find_element_by_tag_name("iframe"))
    time.sleep(3)
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    return soup
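
The function above throws away the first BeautifulSoup result, and its XPath waits for an iframe with a class containing 'key', while the iframe in the markup shown earlier has no class at all. A minimal sketch of the two-step flow being asked about could look like the following. The function name, the `div.propertydata iframe` selector (taken from the inspected markup), and the timeout values are assumptions, and the sketch only waits for the iframe element to exist, not for the extension's results to finish loading inside it.

```python
import time


def get_page_and_iframe_html(driver, url, timeout=20.0, poll=0.5):
    """Return (outer_html, iframe_html) for a listing page.

    iframe_html is None if the extension iframe never appears.
    The name and defaults are hypothetical, not part of the original script.
    """
    driver.get(url)

    # Poll until the extension's iframe is present in the outer document.
    # Selector assumed from the markup: <div class="propertydata"><iframe ...>
    deadline = time.monotonic() + timeout
    frames = []
    while time.monotonic() < deadline:
        frames = driver.find_elements("css selector", "div.propertydata iframe")
        if frames:
            break
        time.sleep(poll)

    # 1) Capture the outer document while still in the top-level context.
    outer_html = driver.page_source

    if not frames:
        return outer_html, None

    # 2) Switch into the iframe; page_source now refers to the frame's document.
    driver.switch_to.frame(frames[0])
    try:
        iframe_html = driver.page_source
    finally:
        # 3) Always switch back so later lookups target the main page again.
        driver.switch_to.default_content()

    return outer_html, iframe_html
```

Each returned string can then be handed to `BeautifulSoup(html, 'html.parser')` separately, which keeps the listing data and the extension data in two distinct soups. To wait for the extension's results rather than just the frame, a further wait on some element inside the frame would be needed after the switch; which element that is depends on the extension's markup, which is not shown here.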

【Comments】:

    Tags: html selenium iframe beautifulsoup


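    【Solution 1】:

    I don't know why you're using selenium at all. FYI, the data is already in the page source, under a script tag. So you don't even need to collect the urls and then loop over each of them.

    All of this can be done in a single call!

    I've loaded it into a JSON dict so you can access it and parse whatever you need.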

    import requests
    import re
    import json
    
    
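    def main(url):
        r = requests.get(url)
        r.raise_for_status()
        # The search results are embedded in the page as a JS assignment:
        # window.jsonModel = {...}
        match = re.search(r"window\.jsonModel = ({.+})", r.text)
        if match is None:
            raise ValueError("window.jsonModel not found in page source")
        data = json.loads(match.group(1))
        hview = json.dumps(data, indent=4)  # pretty-printed copy for inspection
        print(data.keys())
        print(hview)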
    
    
    main("https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=POSTCODE%5E1619792&maxBedrooms=4&minBedrooms=2&maxPrice=500000&radius=10.0&sortType=18&propertyTypes=&maxDaysSinceAdded=14&includeSSTC=false&mustHave=&dontShow=newHome%2CsharedOwnership%2Cretirement&furnishTypes=&keywords=stpp%2Cloft")
    

    Output:

    dict_keys(['properties', 'resultCount', 'searchParametersDescription', 'radiusOptions', 'priceOptions', 'bedroomOptions', 'addedToSiteOptions', 'mustHaveOptions', 'dontShowOptions', 'furnishOptions', 'letTypeOptions', 'sortOptions', 'applicationProperties', 'staticMapUrl', 'shortLocationDescription', 'timestamp', 'bot', 'deviceType', 'propertySchema', 'sidebarModel', 
    'seoModel', 'mapViewUrl', 'legacyUrl', 'listViewUrl', 'pageTitle', 'metaDescription', 'recentSearchModel', 'maxCardsPerPage', 'countryCode', 'countryId', 'currencyCodeOptions', 'areaSizeUnitOptions', 'sizeOptions', 'priceTypeOptions', 'showFeaturedAgent', 'showNewDrawASearch', 'commercialChannel', 'disambiguationPagePath', 'dfpModel', 'noResultsModel', 'urlPath', 'tileGeometry', 'geohashTerms', 'comscore', 'cookiePolicies', 'formattedExchangeRateDate', 'authenticated', 'location', 'searchParameters', 'featureSwitchStateForUser', 'pagination'])
    

    The data you want is under the key properties:

    For example, here is the first listing:

    "properties": [
            {
                "id": 78658603,
                "bedrooms": 2,
                "numberOfImages": 10,
                "numberOfFloorplans": 1,
                "numberOfVirtualTours": 0,
                "summary": "Presented in outstanding condition, this gorgeous 2 bedroom apartment is set within a modern development and offers a bright open-plan reception room/kitchen, excellent fixtures and a charming private Balcony.",
                "displayAddress": "Haydons Road, Wimbledon, London, SW19",
                "countryCode": "GB",
                "location": {
                    "latitude": 51.42015,
                    "longitude": -0.187532
                },
                "propertySubType": "Flat",
                "listingUpdate": {
                    "listingUpdateReason": "new",
                    "listingUpdateDate": "2020-04-03T11:08:02Z"
                },
                "premiumListing": false,
                "featuredProperty": true,
                "price": {
                    "amount": 500000,
                    "frequency": "not specified",
                    "currencyCode": "GBP",
                    "displayPrices": [
                        {
                            "displayPrice": "\u00a3500,000",
                            "displayPriceQualifier": ""
                        }
                    ]
                },
                "customer": {
                    "branchId": 15975,
                    "brandPlusLogoURI": "/company/clogo_rmchoice_5187_0012.jpeg",
                    "contactTelephone": "020 8012 6808",
                    "branchDisplayName": "Foxtons, Wimbledon",
                    "branchName": "Wimbledon",
                    "brandTradingName": "Foxtons",
                    "branchLandingPageUrl": "/estate-agents/agent/Foxtons/Wimbledon-15975.html",
                    "development": false,
                    "showReducedProperties": true,
                    "commercial": false,
                    "showOnMap": true,
                    "brandPlusLogoUrl": "https://media.rightmove.co.uk:443/dir/company/clogo_rmchoice_5187_0012_max_100x50.jpeg"
                },
                "distance": 8.0669,
                "transactionType": "buy",
                "productLabel": {
                    "productLabelText": ""
                },
                "commercial": false,
                "development": false,
                "residential": true,
                "students": false,
                "auction": false,
                "feesApply": false,
                "feesApplyText": null,
                "displaySize": "",
                "showOnMap": true,
                "propertyUrl": "/property-for-sale/property-78658603.html",
                "contactUrl": "/property-for-sale/contactBranch.html?propertyId=78658603",
                "channel": "BUY",
                "firstVisibleDate": "2020-04-03T11:02:42Z",
                "keywords": [
                    {
                        "keyword": "stpp",
                        "matched": false
                    },
                    {
                        "keyword": "loft",
                        "matched": false
                    }
                ],
                "keywordMatchType": "no_match",
                "saved": null,
                "hidden": null,
                "onlineViewingsAvailable": false,
                "propertyImages": {
                    "images": [
                        {
                            "srcUrl": "https://media.rightmove.co.uk:443/dir/crop/10:9-16:9/16k/15975/78658603/15975_1130605_IMG_02_0000_max_476x317.jpg",
                            "url": "16k/15975/78658603/15975_1130605_IMG_02_0000.jpg"
                        },
                        {
                            "srcUrl": "https://media.rightmove.co.uk:443/dir/crop/10:9-16:9/16k/15975/78658603/15975_1130605_IMG_01_0000_max_476x317.jpg",
                            "url": "16k/15975/78658603/15975_1130605_IMG_01_0000.jpg"
                        },
                        {
                            "srcUrl": "https://media.rightmove.co.uk:443/dir/crop/10:9-16:9/16k/15975/78658603/15975_1130605_IMG_03_0000_max_476x317.jpg",
                            "url": "16k/15975/78658603/15975_1130605_IMG_03_0000.jpg"
                        },
                        {
                            "srcUrl": "https://media.rightmove.co.uk:443/dir/crop/10:9-16:9/16k/15975/78658603/15975_1130605_IMG_04_0000_max_476x317.jpg",
                            "url": "16k/15975/78658603/15975_1130605_IMG_04_0000.jpg"
                        },
                        {
                            "srcUrl": "https://media.rightmove.co.uk:443/dir/crop/10:9-16:9/16k/15975/78658603/15975_1130605_IMG_05_0000_max_476x317.jpg",
                            "url": "16k/15975/78658603/15975_1130605_IMG_05_0000.jpg"
                        },
                        {
                            "srcUrl": "https://media.rightmove.co.uk:443/dir/crop/10:9-16:9/16k/15975/78658603/15975_1130605_IMG_06_0000_max_476x317.jpg",
                            "url": "16k/15975/78658603/15975_1130605_IMG_06_0000.jpg"
                        },
                        {
                            "srcUrl": "https://media.rightmove.co.uk:443/dir/crop/10:9-16:9/16k/15975/78658603/15975_1130605_IMG_07_0000_max_476x317.jpg",
                            "url": "16k/15975/78658603/15975_1130605_IMG_07_0000.jpg"
                        },
                        {
                            "srcUrl": "https://media.rightmove.co.uk:443/dir/crop/10:9-16:9/16k/15975/78658603/15975_1130605_IMG_08_0000_max_476x317.jpg",
                            "url": "16k/15975/78658603/15975_1130605_IMG_08_0000.jpg"
                        },
                        {
                            "srcUrl": "https://media.rightmove.co.uk:443/dir/crop/10:9-16:9/16k/15975/78658603/15975_1130605_IMG_09_0000_max_476x317.jpg",
                            "url": "16k/15975/78658603/15975_1130605_IMG_09_0000.jpg"
                        },
                        {
                            "srcUrl": "https://media.rightmove.co.uk:443/dir/crop/10:9-16:9/16k/15975/78658603/15975_1130605_IMG_10_0000_max_476x317.jpg",
                            "url": "16k/15975/78658603/15975_1130605_IMG_10_0000.jpg"
                        }
                    ],
                    "mainImageSrc": "https://media.rightmove.co.uk:443/dir/crop/10:9-16:9/16k/15975/78658603/15975_1130605_IMG_02_0000_max_476x317.jpg",
                    "mainMapImageSrc": "https://media.rightmove.co.uk:443/dir/crop/10:9-16:9/16k/15975/78658603/15975_1130605_IMG_02_0000_max_296x197.jpg"
                },
                "displayStatus": "",
                "formattedBranchName": " by Foxtons, Wimbledon",
                "addedOrReduced": "Added on 03/04/2020",
                "isRecent": false,
                "formattedDistance": "8.07 miles",
                "heading": "Featured Property",
                "hasBrandPlus": true,
                "propertyTypeFullDescription": "2 bedroom flat for sale"
            }
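
Once the dict is loaded, pulling a few fields out of each entry under `properties` is plain dictionary access. The sketch below is one hedged way to do it: the helper name `summarise_properties`, the choice of fields, and the `BASE_URL` prefix are assumptions made for illustration, and the sample input is a trimmed copy of the listing shown above rather than a live response.

```python
import json

# Assumed prefix: propertyUrl values in the JSON are site-relative paths.
BASE_URL = "https://www.rightmove.co.uk"


def summarise_properties(data):
    """Return one flat dict per entry in data["properties"] (hypothetical helper)."""
    rows = []
    for prop in data.get("properties", []):
        # Nested objects may be missing or null, so fall back to empty dicts.
        price = prop.get("price") or {}
        location = prop.get("location") or {}
        rows.append({
            "id": prop.get("id"),
            "address": prop.get("displayAddress"),
            "bedrooms": prop.get("bedrooms"),
            "price": price.get("amount"),
            "latitude": location.get("latitude"),
            "longitude": location.get("longitude"),
            "url": BASE_URL + prop.get("propertyUrl", ""),
        })
    return rows


# Trimmed copy of the first listing shown above, used as sample input.
sample = json.loads("""
{
    "properties": [
        {
            "id": 78658603,
            "bedrooms": 2,
            "displayAddress": "Haydons Road, Wimbledon, London, SW19",
            "location": {"latitude": 51.42015, "longitude": -0.187532},
            "price": {"amount": 500000, "currencyCode": "GBP"},
            "propertyUrl": "/property-for-sale/property-78658603.html"
        }
    ]
}
""")

rows = summarise_properties(sample)
for row in rows:
    print(row)
```

In the answer's `main`, the same helper could be called on `data` right after `json.loads`, in place of printing the whole structure.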
    

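    【Comments】:

    • Hey Ahmed, it's me, InfinityTM
    • @JoshuaVarghese I see you changed your account name
    • It's my real name. By the way, have you tried stackoverflow.com/questions/61012763/… ?
    • @JoshuaVarghese Not yet. That question has been answered many times before, and its goal is a bit questionable
    • That answer doesn't work; I tried it. Apart from my proxy failure problem, that is why I started a bounty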