【Title】: Despite Selenium WebDriver only parts of the page are being parsed
【Posted】: 2020-09-13 17:08:23
【Description】:

I am trying to extract data from a website using BeautifulSoup and Selenium, since the site has a lot of dynamic content. Even though I use Selenium to drive a real browser, it returns the same number of results as plain BeautifulSoup does. len(container) should equal 20, but it always returns 4. I am not sure what I am doing wrong or how to fix it. Here is my code:

import bs4
import requests

from bs4 import BeautifulSoup as soup
from selenium import webdriver

url = 'https://www.immowelt.at/liste/wien/wohnungen/mieten?eqid=1011&cp=1'

options = webdriver.ChromeOptions() 
options.add_experimental_option("excludeSwitches", ["enable-logging"])
options.add_argument('--headless')
options.add_argument('--blink-settings=imagesEnabled=false')
driver = webdriver.Chrome(options=options, executable_path=r'C:\Users\xxx\chromedriver')
driver.get(url)
html = driver.page_source
page_soup = soup(html, 'html.parser')

container = page_soup.findAll('div', class_='listcontent clear')
print(len(container))

【Comments】:

    Tags: python-3.x selenium-webdriver beautifulsoup selenium-chromedriver


    【Solution 1】:

    The page loads its items dynamically via JavaScript. You can use this script to load all of them (264 items):

    import requests 
    from bs4 import BeautifulSoup
    
    
    url = 'https://www.immowelt.at/liste/wien/wohnungen/mieten?eqid=1011&cp=1'
    api_url = 'https://www.immowelt.at/liste/getlistitems'
    
    offset, pagesize = 0, 4
    with requests.session() as s:
        soup = BeautifulSoup(s.get(url).content, 'html.parser')
        query = soup.select_one('#filterView')['value']
        total = int(soup.h1.text.split()[0])
    
        all_data = ''
        while offset < total:
            print('Offset {}...'.format(offset))
            data = {'query': query,
                    'offset': offset,
                    'pageSize': pagesize}
            all_data += s.post(api_url, data=data).text
    
            offset += pagesize
    
    soup = BeautifulSoup(all_data, 'html.parser')
    
    for item in soup.select('.listitem'):
        print(item.h2.get_text(strip=True))
        print(item.select_one('.price_rent').get_text(strip=True, separator=' '))
        print('-' * 80)
    
    print('Total: ', len(soup.select('.listitem')))
    

    Output:

    ...
    
    NEUES WOHNJUWEL FÜR STUDIERENDE
    Gesamtmiete 850 €
    --------------------------------------------------------------------------------
    DACHGESCHOSSLOFT FÜR STUDIERENDE
    Gesamtmiete 750 €
    --------------------------------------------------------------------------------
    NEUES WOHNJUWEL FÜR STUDIERENDE
    Gesamtmiete 750 €
    --------------------------------------------------------------------------------
    Toplage in Mauer
    Gesamtmiete 995,50 €
    --------------------------------------------------------------------------------
    Total:  264
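
    The while-loop above is a standard offset-pagination pattern: keep requesting with an increasing offset until it reaches the reported total. A minimal self-contained sketch of the same pattern, with a stubbed fetch function standing in for the real `s.post(api_url, ...)` call (the function names here are illustrative, not part of the site's API):

    ```python
    def fetch_page(offset, pagesize, total=264):
        # Stub standing in for s.post(api_url, data=...);
        # returns fake listing IDs for one page of results
        return [f"item-{i}" for i in range(offset, min(offset + pagesize, total))]

    def fetch_all(pagesize=4, total=264):
        items, offset = [], 0
        while offset < total:
            items.extend(fetch_page(offset, pagesize, total))
            offset += pagesize
        return items

    print(len(fetch_all()))  # 264
    ```

    The real script accumulates raw HTML instead of a list, but the control flow is the same: the loop terminates once `offset` reaches the item count scraped from the `<h1>` on the first page.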
    

    【Comments】:

      【Solution 2】:
      # Your code is fine; the markup on the site is just awkward.

      Inside this <div class="listcontent clear"> there are only 4 elements; the remaining 16 sit in a different div altogether.

      You can verify this with the XPath below, which returns only 4 matches:
      /html/body/div[2]/div[2]/div[5]/div[2]/div[2]/div[2]/div/div

      Use this instead and filter accordingly (note the single spacing in the class string, since BeautifulSoup matches a multi-class string against the exact attribute value):
      container = page_soup.findAll('div', class_='js-object listitem_wrap')
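
      A caveat worth adding here (an editor's note, not part of the original answer): when `class_` is given a string containing several class names, BeautifulSoup matches it against the exact, order-sensitive string of the class attribute, so any variation fails silently. A CSS selector via `select()` matches each class independently and is more robust. A minimal illustration:

      ```python
      from bs4 import BeautifulSoup

      html = """
      <div class="listcontent clear">A</div>
      <div class="js-object listitem_wrap">B</div>
      <div class="listitem_wrap js-object">C</div>
      """
      soup = BeautifulSoup(html, "html.parser")

      # Exact-string match on the class attribute: order-sensitive, matches B only
      print(len(soup.find_all("div", class_="js-object listitem_wrap")))  # 1

      # CSS selector: matches each class independently, matches B and C
      print(len(soup.select("div.js-object.listitem_wrap")))  # 2
      ```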
      

      【Comments】:
