【Title】: Despite Selenium WebDriver only parts of the page are being parsed
【Posted】: 2020-09-13 17:08:23
【Description】:

I am trying to extract data from a website using BeautifulSoup and Selenium, since the site has a lot of dynamic content. Even though I use Selenium to drive a real browser, it returns the same number of results as plain BeautifulSoup does. len(container) should equal 20, but it always returns 4. I am not sure what I am doing wrong or how to fix it. Here is my code:

import bs4
import requests

from bs4 import BeautifulSoup as soup
from selenium import webdriver

url = 'https://www.immowelt.at/liste/wien/wohnungen/mieten?eqid=1011&cp=1'

options = webdriver.ChromeOptions() 
options.add_experimental_option("excludeSwitches", ["enable-logging"])
options.add_argument('--headless')
options.add_argument('--blink-settings=imagesEnabled=false')
driver = webdriver.Chrome(options=options, executable_path=r'C:\Users\xxx\chromedriver')
driver.get(url)
html = driver.page_source
page_soup = soup(html, 'html.parser')

container = page_soup.findAll('div', class_='listcontent clear')
print(len(container))

【Comments】:

    Tags: python-3.x selenium-webdriver beautifulsoup selenium-chromedriver


    【Solution 1】:

    The page loads its items dynamically via JavaScript. You can use this script to load all of them (264 items):

    import requests 
    from bs4 import BeautifulSoup
    
    
    url = 'https://www.immowelt.at/liste/wien/wohnungen/mieten?eqid=1011&cp=1'
    api_url = 'https://www.immowelt.at/liste/getlistitems'
    
    offset, pagesize = 0, 4
    with requests.session() as s:
        soup = BeautifulSoup(s.get(url).content, 'html.parser')
        query = soup.select_one('#filterView')['value']
        total = int(soup.h1.text.split()[0])
    
        all_data = ''
        while offset < total:
            print('Offset {}...'.format(offset))
            data = {'query': query,
                    'offset': offset,
                    'pageSize': pagesize}
            all_data += s.post(api_url, data=data).text
    
            offset += pagesize
    
    soup = BeautifulSoup(all_data, 'html.parser')
    
    for item in soup.select('.listitem'):
        print(item.h2.get_text(strip=True))
        print(item.select_one('.price_rent').get_text(strip=True, separator=' '))
        print('-' * 80)
    
    print('Total: ', len(soup.select('.listitem')))
    

    Output:

    ...
    
    NEUES WOHNJUWEL FÜR STUDIERENDE
    Gesamtmiete 850 €
    --------------------------------------------------------------------------------
    DACHGESCHOSSLOFT FÜR STUDIERENDE
    Gesamtmiete 750 €
    --------------------------------------------------------------------------------
    NEUES WOHNJUWEL FÜR STUDIERENDE
    Gesamtmiete 750 €
    --------------------------------------------------------------------------------
    Toplage in Mauer
    Gesamtmiete 995,50 €
    --------------------------------------------------------------------------------
    Total:  264
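
    The while-loop above is a standard offset-pagination pattern: keep requesting with an increasing offset until it reaches the reported total. A minimal self-contained sketch of the same pattern, with a stubbed fetch function standing in for the real `s.post(api_url, ...)` call (the function names here are illustrative, not part of the site's API):

    ```python
    def fetch_page(offset, pagesize, total=264):
        # Stub standing in for s.post(api_url, data=...);
        # returns fake listing IDs for one page of results
        return [f"item-{i}" for i in range(offset, min(offset + pagesize, total))]

    def fetch_all(pagesize=4, total=264):
        items, offset = [], 0
        while offset < total:
            items.extend(fetch_page(offset, pagesize, total))
            offset += pagesize
        return items

    print(len(fetch_all()))  # 264
    ```

    The real script accumulates raw HTML instead of a list, but the control flow is the same: the loop terminates once `offset` reaches the item count scraped from the `<h1>` on the first page.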
    

    【Comments】:

      【Solution 2】:
      # Your code is fine; the markup on the site is just awkward.

      Inside this <div class="listcontent clear"> there are only 4 elements; the remaining 16 sit in a different div altogether.

      You can verify this with the XPath below, which returns only 4 matches:
      /html/body/div[2]/div[2]/div[5]/div[2]/div[2]/div[2]/div/div

      Use this instead and filter accordingly (note the single spacing in the class string, since BeautifulSoup matches a multi-class string against the exact attribute value):
      container = page_soup.findAll('div', class_='js-object listitem_wrap')
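
      A caveat worth adding here (an editor's note, not part of the original answer): when `class_` is given a string containing several class names, BeautifulSoup matches it against the exact, order-sensitive string of the class attribute, so any variation fails silently. A CSS selector via `select()` matches each class independently and is more robust. A minimal illustration:

      ```python
      from bs4 import BeautifulSoup

      html = """
      <div class="listcontent clear">A</div>
      <div class="js-object listitem_wrap">B</div>
      <div class="listitem_wrap js-object">C</div>
      """
      soup = BeautifulSoup(html, "html.parser")

      # Exact-string match on the class attribute: order-sensitive, matches B only
      print(len(soup.find_all("div", class_="js-object listitem_wrap")))  # 1

      # CSS selector: matches each class independently, matches B and C
      print(len(soup.select("div.js-object.listitem_wrap")))  # 2
      ```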
      

      【Comments】:
