[Question Title]: Scraping all markers from iframe with Python and Selenium
[Posted]: 2021-07-28 13:08:26
[Question Description]:

I am trying to scrape the company names and links from the map embedded in this page: https://www.elitedynamics.co.uk/customers

The code I have so far opens the page and scrolls down until it finds the first button (each marker is a button). It then clicks the button, the info window appears and its contents are read, the window is closed, and the driver moves on to the next result. This is very flaky: the driver fails to follow the sequence of commands and keeps picking up duplicate elements. Is there a better way?

import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver_path = 'chromedriver'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--start-maximized")
driver = webdriver.Chrome(executable_path=driver_path, options=chrome_options)
driver.get("https://www.elitedynamics.co.uk/customers")

# Each map marker is rendered as a <div role="button">
property_bubble = driver.find_element_by_xpath('//div[@role="button"]')
actions = ActionChains(driver)
actions.move_to_element(property_bubble).click(property_bubble).perform()

all_properties = driver.find_elements_by_xpath('//div[@role="button"]')
names_list = []
links_list = []

for prop in all_properties:
    actions.move_to_element(prop).click(prop).perform()
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'wpgmza_infowindow_description')))
    property_name = driver.find_element_by_xpath('//div[@class="wpgmza_infowindow_description"]/h4')
    names_list.append(property_name.text)
    print(property_name.text)
    try:
        # The link is usually directly under the <h4>...
        property_link = driver.find_element_by_xpath('//div[@class="wpgmza_infowindow_description"]/h4/a')
        links_list.append(property_link.get_attribute('href'))
    except NoSuchElementException:
        try:
            # ...but is sometimes wrapped in an extra <p>
            property_link = driver.find_element_by_xpath('//div[@class="wpgmza_infowindow_description"]/h4/p/a')
            links_list.append(property_link.get_attribute('href'))
        except NoSuchElementException:
            pass
    time.sleep(2)
    driver.find_element_by_xpath('//button[@title="Close"]').click()

print(names_list)
print(links_list)
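As an aside, the nested try/except above exists only because the link sometimes sits at `h4/a` and sometimes at `h4/p/a`; a single descendant search (`//div[@class="wpgmza_infowindow_description"]//a` in Selenium) covers both layouts. A minimal sketch of the idea on static fragments, using the standard library (the two fragments are hypothetical stand-ins for the info-window markup):

```python
import xml.etree.ElementTree as ET

# Hypothetical info-window layouts: link directly under <h4>, or wrapped in <p>
case_direct = '<div><h4><a href="https://example.com/a">Park A</a></h4></div>'
case_nested = '<div><h4><p><a href="https://example.com/b">Park B</a></p></h4></div>'

def first_link(fragment):
    """Return the href of the first <a> anywhere below the root, or None."""
    root = ET.fromstring(fragment)
    a = root.find('.//a')  # descendant search: depth in the tree doesn't matter
    return a.get('href') if a is not None else None

print(first_link(case_direct))  # https://example.com/a
print(first_link(case_nested))  # https://example.com/b
```

The same descendant-axis trick in the Selenium code would collapse the two XPath attempts into one `find_element` call.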

[Question Discussion]:

    Tags: selenium web-scraping iframe


    [Solution 1]:

    You don't actually need Selenium to scrape this website, because nearly all of the required data is served as JSON from an external endpoint.

    Here is a working solution:

    import requests
    import pandas as pd

    # The wpgmza map plugin exposes its markers as JSON; this filter selects map 4
    params = {
        'filter': '{"map_id":"4","mashupIDs":[],"customFields":[]}'
    }

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
    }


    def main(url):
        with requests.Session() as req:
            req.headers.update(headers)
            elit = []
            r = req.get(url, params=params)
            for item in r.json()['markers']:
                elit.append([item['title'], item['icon']['url']])

            df = pd.DataFrame(elit, columns=["Title", "Url"])
            print(df)


    main('https://www.elitedynamics.co.uk/wp-json/wpgmza/v1/features/')
    

    Output:

                        Title                                                Url
    0              Landal Darwin Forest  //www.elitedynamics.co.uk/wp-content/uploads/2...  
    1                 Landal Sandybrook  //www.elitedynamics.co.uk/wp-content/uploads/2...  
    2             Pinewood Holiday Park  //www.elitedynamics.co.uk/wp-content/uploads/2...  
    3           Peppermint Holiday Park  //www.elitedynamics.co.uk/wp-content/uploads/2...  
    4          Riviera Bay Holiday Park  //www.elitedynamics.co.uk/wp-content/uploads/2...  
    ..                              ...                                                ...  
    250        Hedley Wood Holiday Park  //www.elitedynamics.co.uk/wp-content/uploads/2...  
    251  Ashbourne Heights Holiday Park  //www.elitedynamics.co.uk/wp-content/uploads/2...  
    252    Sand le Mere Holiday Village  //www.elitedynamics.co.uk/wp-content/uploads/2...  
    253                    Bowland Fell  //www.elitedynamics.co.uk/wp-content/uploads/2...  
    254       Silver Sands Holiday Park  //www.elitedynamics.co.uk/wp-content/uploads/2...  
    
    [255 rows x 2 columns]
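Note that the `Url` values are protocol-relative (they start with `//`). If you need absolute links, `urljoin` from the standard library can fill in the scheme; a small sketch:

```python
from urllib.parse import urljoin

def absolutize(url, scheme="https"):
    """Resolve a protocol-relative URL ("//host/path") against a bare scheme.

    urljoin treats "//host/path" as a network-path reference, so it inherits
    the scheme of the base and leaves absolute URLs untouched.
    """
    return urljoin(f"{scheme}://", url)

print(absolutize("//www.elitedynamics.co.uk/wp-content/uploads/icon.png"))
# https://www.elitedynamics.co.uk/wp-content/uploads/icon.png
```

Also worth noting: `item['icon']['url']` is the marker's icon image, not the company website the question asks for; that link, if present, would live elsewhere in each marker object (the info-window description, for instance), so it is worth inspecting the raw JSON for the field you actually need.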
    

    [Discussion]:
