【Question Title】: Not able to access website URL using Beautiful Soup and Python while web scraping
【Posted】: 2021-02-24 06:45:35
【Question Description】:

The link I am scraping: https://www.indusind.com/in/en/personal/cards/credit-card.html

import requests
from bs4 import BeautifulSoup

IndusInd_url = "https://www.indusind.com/in/en/personal/cards/credit-card.html"

html = requests.get(IndusInd_url)
soup = BeautifulSoup(html.content, 'lxml')

print(soup)


for x in soup.select("#display-product-cards .text-primary"):
    print(x.get_text())

With the code above I am trying to scrape the card titles, but unfortunately I get this output:

<html><body><p>This website is secured against online attacks. Your request was blocked due to suspicious behavior<br/>
<br/>
 Client IP : 124.123.170.109<br/>
<br/>
Incident Time : 2021-02-24 06:28:10 UTC <br/>
<br/>
 Incident ID : YDXx@m6g3nSFLvi5lGg4wgAAAf8<br/>
<br/>
If you feel it was a legitimate request, please contact the website owner for further investigation and remediation with a screenshot of this page.</p></body></html>

Is there any alternative method I can use to scrape the details?
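(One commonly tried first step, sketched below with only the standard library: send browser-like request headers, since some web application firewalls block the default Python `User-Agent`. Whether this actually gets past this particular site's protection is not guaranteed, and the header values shown are just typical examples, not taken from the question.)

```python
# Hedged sketch: retry the request while presenting browser-like headers.
# Uses only the standard library; success against this site's WAF is NOT
# guaranteed -- the header values below are ordinary examples.
from urllib.request import Request, urlopen

IndusInd_url = "https://www.indusind.com/in/en/personal/cards/credit-card.html"

req = Request(
    IndusInd_url,
    headers={
        # A typical desktop-browser User-Agent string; any current one will do.
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/88.0 Safari/537.36"),
        "Accept-Language": "en-US,en;q=0.9",
    },
)

# html = urlopen(req, timeout=30).read()  # then parse with BeautifulSoup as before
```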

Any help is much appreciated!!!

【Question Discussion】:

    标签: python selenium web-scraping beautifulsoup python-requests


    【Solution 1】:

    Please check this. FYI: make sure you have the correct driver (Firefox, Chrome, or whichever) in the matching version.

    from selenium import webdriver
    import requests
    from bs4 import BeautifulSoup
    import time
    
    url = 'https://www.indusind.com/in/en/personal/cards/credit-card.html'
    
    # open the chrome driver
    driver = webdriver.Chrome(executable_path='webdrivers/chromedriver.exe')
    
    # pings the specified url
    driver.get(url)
    
    # sleep time to wait for t seconds to wait for page load
    # replace 3 with any int value (int value in seconds)
    time.sleep(3)
    
    # gets the page source
    pg = driver.page_source
    
    # parse with BeautifulSoup (parser named explicitly to avoid the missing-parser warning)
    soup = BeautifulSoup(pg, 'lxml')
    
    # get the titles of the card
    for x in soup.select("#display-product-cards .text-primary"):
        print(x.get_text())
    

    Below is an image of the output (not reproduced here).
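    As a fallback, if lxml (or bs4 itself) is not installed, the same titles can be pulled out of `driver.page_source` with only the standard library. This is a rough sketch; the sample HTML below is a made-up miniature of the structure the `#display-product-cards .text-primary` selector targets, not the real page markup.

```python
# Hedged sketch: stdlib-only extraction of elements carrying the
# 'text-primary' class, as a fallback when bs4/lxml are unavailable.
from html.parser import HTMLParser

class CardTitleParser(HTMLParser):
    """Collect the text of elements whose class list contains 'text-primary'."""
    def __init__(self):
        super().__init__()
        self._depth = 0          # > 0 while inside a matching element
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self._depth or "text-primary" in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.titles.append(data.strip())

# Made-up sample resembling the page structure (an assumption, not real markup):
sample = ('<div id="display-product-cards">'
          '<a class="card-title text-primary">Legend Credit Card</a>'
          '<a class="card-title text-primary">Platinum Aura Edge</a>'
          '</div>')
p = CardTitleParser()
p.feed(sample)        # in practice: p.feed(driver.page_source)
print(p.titles)       # -> ['Legend Credit Card', 'Platinum Aura Edge']
```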

    【Discussion】:

    • @Bum Bum Bole, sometimes the page may load late because the site's server is busy or because of network issues, and the driver may not capture the full page source. To be on the safer side, you can add a sleep after driver.get(url): first import time, then after driver.get(url) add the line time.sleep(3); you can replace 3 with any int value in seconds. I will edit the code above and add it.
    【Solution 2】:

    This can also be done without BeautifulSoup.

    I defined the locator value with XPath:

    //div[@id='display-product-cards']//a[@class='card-title text-primary' and text()!='']
    

    and used the expected condition .presence_of_all_elements_located:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    driver = webdriver.Chrome(executable_path='webdrivers/chromedriver.exe')
    
    driver.get('https://www.indusind.com/in/en/personal/cards/credit-card.html')
    
    wait = WebDriverWait(driver, 20)
    elements = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@id='display-product-cards']//a[@class='card-title text-primary' and text()!='']")))
    
    for element in elements:
        print(element.get_attribute('innerHTML'))
    
    driver.quit()
    

    【Discussion】:
