【Question Title】: Not able to access website URL using Beautiful Soup and Python while web scraping
【Posted】: 2021-02-24 06:45:35
【Question Description】:

The link I am scraping: https://www.indusind.com/in/en/personal/cards/credit-card.html

import requests
from bs4 import BeautifulSoup

IndusInd_url = "https://www.indusind.com/in/en/personal/cards/credit-card.html"

html = requests.get(IndusInd_url)
soup = BeautifulSoup(html.content, 'lxml')

print(soup)


for x in soup.select("#display-product-cards .text-primary"):
    print(x.get_text())

With the code above I am trying to scrape the card titles, but unfortunately I get this output:

<html><body><p>This website is secured against online attacks. Your request was blocked due to suspicious behavior<br/>
<br/>
 Client IP : 124.123.170.109<br/>
<br/>
Incident Time : 2021-02-24 06:28:10 UTC <br/>
<br/>
 Incident ID : YDXx@m6g3nSFLvi5lGg4wgAAAf8<br/>
<br/>
If you feel it was a legitimate request, please contact the website owner for further investigation and remediation with a screenshot of this page.</p></body></html>

Is there any alternative method I can use to scrape the details?
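(One commonly tried first step, sketched below with only the standard library: send browser-like request headers, since some web application firewalls block the default Python `User-Agent`. Whether this actually gets past this particular site's protection is not guaranteed, and the header values shown are just typical examples, not taken from the question.)

```python
# Hedged sketch: retry the request while presenting browser-like headers.
# Uses only the standard library; success against this site's WAF is NOT
# guaranteed -- the header values below are ordinary examples.
from urllib.request import Request, urlopen

IndusInd_url = "https://www.indusind.com/in/en/personal/cards/credit-card.html"

req = Request(
    IndusInd_url,
    headers={
        # A typical desktop-browser User-Agent string; any current one will do.
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/88.0 Safari/537.36"),
        "Accept-Language": "en-US,en;q=0.9",
    },
)

# html = urlopen(req, timeout=30).read()  # then parse with BeautifulSoup as before
```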

Any help is much appreciated!!!

【Question Discussion】:

    标签: python selenium web-scraping beautifulsoup python-requests


    【Solution 1】:

    Please check this. FYI: make sure you have the correct driver (Firefox, Chrome, or whichever) in the matching version.

    from selenium import webdriver
    import requests
    from bs4 import BeautifulSoup
    import time
    
    url = 'https://www.indusind.com/in/en/personal/cards/credit-card.html'
    
    # open the chrome driver
    driver = webdriver.Chrome(executable_path='webdrivers/chromedriver.exe')
    
    # pings the specified url
    driver.get(url)
    
    # sleep time to wait for t seconds to wait for page load
    # replace 3 with any int value (int value in seconds)
    time.sleep(3)
    
    # gets the page source
    pg = driver.page_source
    
    # parse with BeautifulSoup (parser named explicitly to avoid the missing-parser warning)
    soup = BeautifulSoup(pg, 'lxml')
    
    # get the titles of the card
    for x in soup.select("#display-product-cards .text-primary"):
        print(x.get_text())
    

    Below is an image of the output (not reproduced here).
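    As a fallback, if lxml (or bs4 itself) is not installed, the same titles can be pulled out of `driver.page_source` with only the standard library. This is a rough sketch; the sample HTML below is a made-up miniature of the structure the `#display-product-cards .text-primary` selector targets, not the real page markup.

```python
# Hedged sketch: stdlib-only extraction of elements carrying the
# 'text-primary' class, as a fallback when bs4/lxml are unavailable.
from html.parser import HTMLParser

class CardTitleParser(HTMLParser):
    """Collect the text of elements whose class list contains 'text-primary'."""
    def __init__(self):
        super().__init__()
        self._depth = 0          # > 0 while inside a matching element
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self._depth or "text-primary" in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.titles.append(data.strip())

# Made-up sample resembling the page structure (an assumption, not real markup):
sample = ('<div id="display-product-cards">'
          '<a class="card-title text-primary">Legend Credit Card</a>'
          '<a class="card-title text-primary">Platinum Aura Edge</a>'
          '</div>')
p = CardTitleParser()
p.feed(sample)        # in practice: p.feed(driver.page_source)
print(p.titles)       # -> ['Legend Credit Card', 'Platinum Aura Edge']
```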

    【Discussion】:

    • @Bum Bum Bole, sometimes the page may load late because the site's server is busy or because of network issues, and the driver may not capture the full page source. To be on the safer side, you can add a sleep after driver.get(url): first import time, then after driver.get(url) add the line time.sleep(3); you can replace 3 with any int value in seconds. I will edit the code above and add it.
    【Solution 2】:

    This can also be done without BeautifulSoup.

    I defined the locator value with XPath:

    //div[@id='display-product-cards']//a[@class='card-title text-primary' and text()!='']
    

    and used the expected condition .presence_of_all_elements_located:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    driver = webdriver.Chrome(executable_path='webdrivers/chromedriver.exe')
    
    driver.get('https://www.indusind.com/in/en/personal/cards/credit-card.html')
    
    wait = WebDriverWait(driver, 20)
    elements = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@id='display-product-cards']//a[@class='card-title text-primary' and text()!='']")))
    
    for element in elements:
        print(element.get_attribute('innerHTML'))
    
    driver.quit()
    

    【Discussion】:
