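The site updates its elements with JavaScript, so BeautifulSoup alone won't work; you need browser automation.
The code below doesn't work because the elements are updated a few milliseconds after the page loads: the page first shows all brands, then updates to show only the selected brands. A plain HTTP fetch only ever sees the first state, so use automation.
Code that fails: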
    from bs4 import BeautifulSoup as soup
    import time
    from urllib.request import urlopen as uReq

    brands_name = ['New Balance']  # brands to filter on (the original snippet left this undefined)

    # Build the brand filter: %3A is ':', %2C is ',' and %20 is a space
    url_st = 'https://www.dickssportinggoods.com/f/mens-athletic-shoes?pageNumber=0&filterFacets=X_BRAND'
    for idx, br in enumerate(brands_name):
        if idx == 0:
            url_st += '%3A' + '%20'.join(br.split(' '))
        else:
            url_st += '%2C' + '%20'.join(br.split(' '))

    uClient = uReq(url_st)
    time.sleep(4)  # waiting does not help: no browser is running the page's JavaScript
    Page = uClient.read()  # static HTML, still listing all brands
    uClient.close()

    page_soup = soup(Page, "html.parser")
    for match in page_soup.find_all('div', class_='rs_product_description d-block'):
        print(match.text)
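As a side note, the hand-rolled `%3A` / `%2C` / `%20` concatenation in the loop above can probably be replaced with `urllib.parse.quote`. This is only a sketch: `build_filter_facet` and `build_url` are names I made up for illustration, and I'm assuming the site accepts any percent-encoded brand name the same way it accepts the encoded spaces.

```python
from urllib.parse import quote

BASE_URL = "https://www.dickssportinggoods.com/f/mens-athletic-shoes"


def build_filter_facet(brands):
    """Return the filterFacets query fragment for a list of brand names (hypothetical helper)."""
    facet = "filterFacets=X_BRAND"
    if brands:
        # ':' separates the facet name from its values and ',' separates the values.
        # quote(..., safe="") encodes them as %3A and %2C, and spaces as %20.
        facet += quote(":" + ",".join(brands), safe="")
    return facet


def build_url(brands, page_number=0):
    """Return the listing URL for the given brands and zero-based page number."""
    return f"{BASE_URL}?pageNumber={page_number}&{build_filter_facet(brands)}"


print(build_url(["New Balance"]))
```

For a single brand this produces the same URL as the loop; with several brands the names are joined with `%2C`, as in the `else` branch.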
Code (selenium + bs4):
    from bs4 import BeautifulSoup as soup
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    import time
    from webdriver_manager.chrome import ChromeDriverManager

    BASE_URL = "https://www.dickssportinggoods.com/f/mens-athletic-shoes"

    chrome_options = Options()
    chrome_options.add_argument("--headless")
    # Headless mode is deliberately left off; pass options=chrome_options to enable it.
    # Selenium 4 style call (older Selenium 3 code passed the driver path positionally).
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.set_window_size(1024, 600)
    driver.maximize_window()

    def print_products(page_soup):
        # Each product name sits in a div with this class
        for match in page_soup.find_all('div', class_='rs_product_description d-block'):
            print(match.text)

    brands_name = ['New Balance']
    filter_facet = 'filterFacets=X_BRAND'
    for idx, br in enumerate(brands_name):
        if idx == 0:
            filter_facet += '%3A' + '%20'.join(br.split(' '))
        else:
            filter_facet += '%2C' + '%20'.join(br.split(' '))

    driver.get(f"{BASE_URL}?pageNumber=0&{filter_facet}")
    time.sleep(4)  # give the JavaScript time to apply the brand filter
    page_soup = soup(driver.page_source, 'html.parser')

    # Close the pop-up dialog if one appeared; find_element raises when nothing matches
    try:
        driver.find_element(By.CLASS_NAME, 'close').click()
    except NoSuchElementException:
        pass

    print_products(page_soup)

    # Page links are numbered from 1, while pageNumber in the URL starts at 0
    page_num = page_soup.find_all('a', class_='rs-page-item')
    pnum = [int(pn.text) for pn in page_num if pn.text.strip().isdecimal()]
    for pn in range(1, len(pnum)):
        driver.get(f"{BASE_URL}?pageNumber={pn}&{filter_facet}")
        time.sleep(2)
        print_products(soup(driver.page_source, "html.parser"))

    driver.quit()
Output:
    New Balance Men's 410v6 Trail Running Shoes
    New Balance Men's 623v3 Training Shoes
    ...
    New Balance Men's Fresh Foam Beacon Running Shoes
    New Balance Men's Fresh Foam Cruz v2 SockFit Running Shoes
    New Balance Men's 470 Running Shoes
    New Balance Men's 996v3 Tennis Shoes
    New Balance Men's 1260 V7 Running Shoes
    New Balance Men's Fresh Foam Beacon Running Shoes
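The pagination step in the Selenium code above can be pulled out into a small pure function, which makes it easy to check without opening a browser. A rough sketch: `remaining_page_urls` is a hypothetical helper, and unlike the original loop it uses the largest page label instead of the number of labels. That is an assumption on my part, in case the pager skips some page numbers.

```python
BASE_URL = "https://www.dickssportinggoods.com/f/mens-athletic-shoes"


def remaining_page_urls(page_labels, filter_facet, base_url=BASE_URL):
    """Return URLs for every results page after the first one (pageNumber=0)."""
    # Keep only labels that are plain numbers; arrows, blanks or "..." are skipped.
    numbers = [int(label) for label in page_labels if label.strip().isdecimal()]
    # Assumption: the largest label equals the total number of pages.
    last_page = max(numbers, default=1)
    # Labels start at 1 but pageNumber starts at 0, so pages 1..last_page-1 remain.
    return [f"{base_url}?pageNumber={pn}&{filter_facet}" for pn in range(1, last_page)]


labels = ["1", "2", "3", ""]  # example texts of the 'rs-page-item' links
for url in remaining_page_urls(labels, "filterFacets=X_BRAND%3ANew%20Balance"):
    print(url)
```

In the real script the labels would come from `[pn.text for pn in page_num]`, and each returned URL would be passed to `driver.get`.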
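I commented out headless Chrome on purpose: when the page opens in a visible browser, a dialog box appears, and once it is closed you can fetch the product details. In headless automation you won't be able to do that. (I can't explain why; I'm not that strong on Selenium concepts.)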
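One more thought on the fixed `time.sleep` calls in the code above: they are a guess at how long the JavaScript update takes. Polling until a condition holds is usually more robust, and Selenium ships `WebDriverWait` for doing that against a real driver. The sketch below only shows the idea in plain Python so it runs without a browser; `wait_until` and `page_ready` are hypothetical names, not part of Selenium.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 10.0, interval: float = 0.25) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for "the page has finished updating": true from the third check onward.
checks = {"count": 0}


def page_ready() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


ready = wait_until(page_ready, timeout=2.0, interval=0.01)
timed_out = wait_until(lambda: False, timeout=0.05, interval=0.01)
print(ready, timed_out)
```

With a real driver, the condition could be something like "the product divs on the page all belong to the selected brand", checked against `driver.page_source`.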
Don't forget to install webdriver_manager:
    pip install webdriver_manager