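【Posted】: 2025-11-29 14:25:01
【Question】:
I am trying to scrape a web page whose content is rendered from JavaScript objects.
I am working in Python with Selenium; I use Selenium to load the content I want, namely the "View Select TV package details" text that launches a modal container.
Inside this container there are package headings, each with its channels listed beneath it. I am trying to iterate over each heading and collect the channel names under it.
Here is the webpage.
Here is my code, which will take you to the container I am trying to scrape: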
from selenium import webdriver
import pandas as pd
url = "https://www.rogers.com/consumer/tv#/packages"
#create a new Chrome session
driver = webdriver.Chrome()
driver.implicitly_wait(5)
driver.get(url)
#change the province to Ontario
province_button = driver.find_element_by_class_name("dropdown-toggle")
province_button.click() #opens the province dropdown
province_button = driver.find_element_by_link_text("Ontario")
province_button.click() #selects Ontario
#visit the TV portal page again by reloading the url
driver.get(url)
#####BEGIN SCRAPING PACKAGE INFO#####
#open Select Package window
package_button = driver.find_element_by_class_name("Package-details")
package_button.click() #opens the package details modal
package_data = driver.find_elements_by_class_name("Package-channels")
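The package_data variable returns all of the headings and channel names, but nothing indicates which strings are headings and which are channels. I know I could write some complicated regex to sort this out, but I would prefer a dynamic approach. Any suggestions are appreciated. Thanks!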
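One dynamic way to tell headings from channels, sketched here under assumptions: if each heading element carries a distinguishing class (the XPath used later in this post suggests `Package-channels--heading`), the flat list can be grouped by walking it in order and starting a new group at every heading. The function name `group_by_heading` and the sample package and channel names are hypothetical; with Selenium, each pair could plausibly come from `el.get_attribute("class")` and `el.text`.

```python
def group_by_heading(items, heading_marker="Package-channels--heading"):
    """Group (css_class, text) pairs into {heading: [channels]} in page order."""
    groups = {}
    current = None
    for css_class, text in items:
        if heading_marker in css_class:
            # A heading starts a new group.
            current = text
            groups[current] = []
        elif current is not None:
            # Anything else belongs to the most recent heading.
            groups[current].append(text)
    return groups


# Hypothetical sample standing in for the scraped elements.
items = [
    ("Package-channels--heading ng-binding", "Starter"),
    ("ng-binding", "CBC"),
    ("ng-binding", "CTV"),
    ("Package-channels--heading ng-binding", "Sports"),
    ("ng-binding", "TSN"),
]

grouped = group_by_heading(items)
print(grouped)
```

This relies only on document order and the class name, so no regex over the text is needed; whether the real page exposes the class this way is an assumption to verify in the browser's inspector.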
******EDITED*******
Per the comments below, here is the code that collects the WebElements' text into a variable instead of printing it to the console:
select_package_data = []
headingsCount = len(driver.find_elements_by_xpath("//div[@class='modal-content']//*[contains(@class,'Package-channels--heading ng-binding')]"))
for index in range(headingsCount):
    head = driver.find_element_by_xpath("//div[@class='modal-content']//*[contains(@class,'Package-channels--heading ng-binding')][index]".replace('index',str(index+1)))
    select_package_data.append(head.text)
    channelsPerheading = driver.find_elements_by_xpath("(//div[@class='modal-content']//ul[@ng-if='vm.channels'])[index]/li[not(contains(@class,'Package-channels--heading ng-binding'))]".replace('index',str(index+1)))
    temp_list=[]
    for channel in channelsPerheading:
        temp_list.append(channel.text.encode('utf-8'))
    select_package_data.insert((index+1), temp_list[:])
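A quick way to check what the append/insert pattern above produces is to run it on plain lists, without Selenium. The sketch below uses made-up package and channel names (they are not taken from the page). It suggests that from the second heading onward, `insert(index + 1, ...)` places the channel list ahead of its heading, so a dict keyed by heading may be a safer container.

```python
# Hypothetical sample data standing in for the scraped headings and channels.
samples = [
    ("Starter", ["CBC", "CTV"]),
    ("Sports", ["TSN"]),
    ("News", ["CNN"]),
]

# Same append/insert pattern as the loop above, on plain lists.
select_package_data = []
for index, (heading, channels) in enumerate(samples):
    select_package_data.append(heading)
    select_package_data.insert(index + 1, channels[:])

print(select_package_data)
# ['Starter', ['CBC', 'CTV'], ['TSN'], ['CNN'], 'Sports', 'News']

# Alternative: map each heading directly to its own channel list.
channels_by_heading = {heading: channels[:] for heading, channels in samples}
print(channels_by_heading["Sports"])
# ['TSN']
```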
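*********EDIT V2, based on comments:*********
The final code needed parentheses around the XPath expression; I believe this is because, without them, the [index] appended to the end binds only to the last step of the XPath rather than to the whole set of matches: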
#get the count of headings in the modal container
headingsCount = len(driver.find_elements_by_xpath("//div[@class='modal-content']//*[contains(@class,'Package-channels--heading ng-binding')]"))
#use this count as an iterator
for index in range(headingsCount):
    #get the current heading - we use index+1 because XPath positions are 1-based, not zero-indexed
    head = driver.find_element_by_xpath("(//div[@class='modal-content']//*[contains(@class,'Package-channels--heading ng-binding')])[index]".replace('index',str(index+1)))
    #take the heading's text to use as the dataframe row index label
    header_placeholder = head.text
    #go to the //ul tag matching the current index and find all items BUT the headings
    channelsPerheading = driver.find_elements_by_xpath("(//div[@class='modal-content']//ul[@ng-if='vm.channels'])[index]/li[not(contains(@class,'Package-channels--heading ng-binding'))]".replace('index',str(index+1)))
    temp_list=[]
    for channel in channelsPerheading: #append the channels as text to a temp list
        temp_list.append(channel.text.encode('utf-8'))
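Why the parentheses matter can be sketched with the standard-library ElementTree on made-up markup (not the real page). In XPath, `//h4[2]` means "an `h4` that is the second `h4` child of its own parent", while `(//h4)[2]` means "the second `h4` in the whole document". ElementTree supports only a subset of XPath, so the parenthesized form is emulated below by indexing the Python list of all matches; the helper `nth_match_xpath` is a hypothetical convenience for building such expressions for Selenium.

```python
import xml.etree.ElementTree as ET


def nth_match_xpath(xpath, n):
    """Hypothetical helper: XPath for the n-th (1-based) match in document order."""
    return "({})[{}]".format(xpath, n)


# Made-up markup: each <section> holds exactly one heading followed by a channel.
doc = ET.fromstring(
    "<div>"
    "<section><h4>Starter</h4><p>CBC</p></section>"
    "<section><h4>Sports</h4><p>TSN</p></section>"
    "</div>"
)

# Without parentheses the position is counted per parent:
# [1] matches the first <h4> of every section, [2] matches nothing.
first_per_parent = [h.text for h in doc.findall(".//h4[1]")]
second_per_parent = doc.findall(".//h4[2]")

# Emulating (//h4)[2]: take all matches in document order, then pick the second.
second_overall = doc.findall(".//h4")[1].text

print(first_per_parent, second_per_parent, second_overall)
print(nth_match_xpath("//h4", 2))
```

With a helper like this, the `.replace('index', ...)` substitution could be written as `nth_match_xpath(base_xpath, index + 1)`, which keeps the parentheses from being forgotten.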
【Discussion】:
Tags: python html selenium web-scraping selenium-chromedriver