【Title】: How do I scrape the price/tax history table on Zillow?
【Posted】: 2018-03-19 23:17:14
【Description】:
from bs4 import BeautifulSoup
from selenium import webdriver
#import urllib2
import time

driver = webdriver.Chrome()
driver.maximize_window()
driver.get("https://www.zillow.com/homes/recently_sold/Culver-City-CA/house,condo,apartment_duplex,townhouse_type/20432063_zpid/51617_rid/12m_days/globalrelevanceex_sort/34.048605,-118.340178,33.963223,-118.47785_rect/12_zm/")
time.sleep(3)
driver.find_element_by_class_name("collapsible-header").click()
soup = BeautifulSoup(driver.page_source,"lxml")

region = soup.find("div",{"id":"hdp-price-history"})
table = region.find('table',{'class':'zsg-table yui3-toggle-content-minimized'})
print(table)

I tried to scrape the price/tax history table on Zillow, but the result I get is None. How do I get that table?

【Comments】:

    Tags: python selenium web-scraping beautifulsoup


    【Solution 1】:

    The following uses requests and BeautifulSoup to get the data; no Selenium is needed (so it is fast).

    from bs4 import BeautifulSoup
    import requests
    import re
    
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0"}    
    r = requests.get("https://www.zillow.com/homes/recently_sold/Culver-City-CA/house,condo,apartment_duplex,townhouse_type/20432063_zpid/51617_rid/12m_days/globalrelevanceex_sort/34.048605,-118.340178,33.963223,-118.47785_rect/12_zm/", headers=headers)
    # The AjaxRender URLs are embedded (backslash-escaped) in the initial HTML
    urls = re.findall(re.escape('AjaxRender.htm?') + '(.*?)"', r.text)
    url = "https://www.zillow.com/AjaxRender.htm?{}".format(urls[4])
    r = requests.get(url, headers=headers)
    # Strip the escaping backslashes so BeautifulSoup can parse the fragment
    soup = BeautifulSoup(r.text.replace('\\', ''), "html.parser")
    data = []
    
    for tr in soup.find_all('tr'):
        data.append([td.text for td in tr.find_all('td')])
    
    for row in data[:5]:        # Show first 5 entries
        print(row)
    

    This prints the first 5 entries:

    ['06/16/17', 'Sold', '$940,000-0.9%', 'K. Miller, A. Masket', '']
    ['06/14/17', 'Price change', '$949,000-1.0%', '', '']
    ['05/08/17', 'Pending sale', '$959,000', '', '']
    ['04/17/17', 'Price change', '$959,000+1.1%', '', '']
    ['02/27/17', 'Pending sale', '$949,000', '', '']
    
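Each scraped row is just a list of raw cell strings. If typed data is needed, the date and price can be parsed out of those cells. A minimal sketch, assuming rows shaped like the output above (the helper name `parse_history_row` is my own, not from the answer):

```python
from datetime import datetime
import re

def parse_history_row(row):
    """Convert one raw Price/Tax History row into (date, event, price)."""
    date = datetime.strptime(row[0], "%m/%d/%y").date()
    event = row[1]
    # The price cell may carry a trailing percent change, e.g. "$940,000-0.9%"
    m = re.match(r"\$([\d,]+)", row[2])
    price = int(m.group(1).replace(",", "")) if m else None
    return date, event, price

rows = [
    ["06/16/17", "Sold", "$940,000-0.9%", "K. Miller, A. Masket", ""],
    ["05/08/17", "Pending sale", "$959,000", "", ""],
]
for r in rows:
    print(parse_history_row(r))
```

This keeps the scraping and the cleaning separate, so the regex only has to change if Zillow alters the cell format.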

    The required HTML is not present in the first GET; it is generated on demand when the Price / Tax History section is expanded, which triggers AJAX requests in the browser. The code searches the initial HTML for all of these request URLs and issues the same requests itself; the one at `urls[4]` returns the required section. The returned HTML needs its `\` characters stripped, after which it can be passed to BeautifulSoup to parse the table.
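The discovery step can be seen in isolation: the initial page embeds the AjaxRender URLs inside inline JavaScript, and the fragments those endpoints return arrive with backslash-escaped quotes. A self-contained sketch on a made-up snippet (the markup and `encparams` values are illustrative, not Zillow's actual page source):

```python
import re

# Illustrative stand-in for the initial page source; the real page embeds
# many such URLs inside inline JavaScript.
page = (
    '<script>fetchAjax("AjaxRender.htm?encparams=abc~123&rwebid=1");</script>'
    '<script>fetchAjax("AjaxRender.htm?encparams=def~456&rwebid=2");</script>'
)

# Same pattern as in the answer: capture everything between
# "AjaxRender.htm?" and the closing quote.
urls = re.findall(re.escape('AjaxRender.htm?') + '(.*?)"', page)
print(urls)  # ['encparams=abc~123&rwebid=1', 'encparams=def~456&rwebid=2']

# A returned fragment carries backslash-escaped quotes; stripping the
# backslashes yields HTML that BeautifulSoup can parse normally.
fragment = '<table class=\\"zsg-table\\"><tr><td>06/16/17</td></tr></table>'
print(fragment.replace('\\', ''))
```

Printing all of `urls` first, rather than hardcoding an index, is a practical way to find which request corresponds to the price history section.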

    【Discussion】:

    • You helped me a lot! Thank you so much!
    • Glad I could help! Don't forget to click the grey tick under the up/down buttons to mark the answer as accepted.
    【Solution 2】:

    You don't need BeautifulSoup for this. You can get the required table with the following code:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait as wait
    from selenium.webdriver.support import expected_conditions as EC
    
    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get("https://www.zillow.com/homes/recently_sold/Culver-City-CA/house,condo,apartment_duplex,townhouse_type/20432063_zpid/51617_rid/12m_days/globalrelevanceex_sort/34.048605,-118.340178,33.963223,-118.47785_rect/12_zm/")
    wait(driver, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, "collapsible-header"))).click()
    table = wait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div#hdp-price-history table.zsg-table.yui3-toggle-content-minimized")))
    print(table.text)
    

    The required table is generated dynamically, so you have to wait until it is present in the DOM. That is why the table cannot be found in the page source immediately after the click.
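Note that `table.text` comes back as a single newline-separated string, one line per visible row, so a little post-processing yields per-row fields. A sketch on a hypothetical sample of what the element's text might look like (the actual layout depends on Zillow's markup):

```python
# Hypothetical sample of what WebElement.text could return for the
# Price/Tax History table; the real text depends on Zillow's markup.
table_text = (
    "DATE EVENT PRICE\n"
    "06/16/17 Sold $940,000\n"
    "05/08/17 Pending sale $959,000"
)

rows = []
for line in table_text.splitlines()[1:]:   # skip the header row
    parts = line.split()
    # date is the first token, price the last; the event is everything between
    rows.append((parts[0], " ".join(parts[1:-1]), parts[-1]))

for row in rows:
    print(row)
```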

    【Discussion】:
