【Question Title】: Files not being downloaded properly (Selenium)
【Posted】: 2021-11-16 08:38:34
【Question Description】:

I want to download files from a list of links.

test_list = ['https://dibbs2.bsm.dla.mil/Downloads/RFQ/8/SPE1C122Q0058.PDF', 'https://dibbs2.bsm.dla.mil/Downloads/RFQ/8/SPE2DH22Q0028.PDF',
             'https://dibbs2.bsm.dla.mil/Downloads/RFQ/9/SPE2DH22Q0029.PDF', 'https://dibbs2.bsm.dla.mil/Downloads/RFQ/3/SPE2DS22Q0023.PDF',
             'https://dibbs2.bsm.dla.mil/Downloads/RFQ/1/SPE2DS22Q0031.PDF', 'https://dibbs2.bsm.dla.mil/Downloads/RFQ/3/SPE2DS22Q0033.PDF']

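However, this script also downloads multiple copies of individual files. How can I avoid that? I only want to download the six PDF files in the list.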

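from random import randint
from time import sleep

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()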
# options.add_argument('--no-sandbox')
# # options.add_argument('--disable-dev-shm-usage')
# options.headless = True
# prefs = {"download.default_directory": zip_dir,
#          "download.directory_upgrade": True,
#          "download.manager.showWhenStarting": False,
#          "download.manager.useWindow": False,
#          "helperApps.alwaysAsk.force":False,
#          "download.manager.showAlertOnComplete": False}
# options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(ChromeDriverManager().install(),options=options)

dn = len(test_list)

for t in range(0,dn):
    URL = test_list[t]
    sleep(randint(3, 9))
    driver.get(URL)
    try:
        driver.find_element_by_id("butAgree").click()
    except:
        pass

I also tried:

for t in test_list:
    URL = t
    sleep(randint(3, 9))
    driver.get(URL)
    try:
        driver.find_element_by_id("butAgree").click()
    except:
        pass

【Question Comments】:

    Tags: python python-3.x selenium selenium-webdriver selenium-chromedriver


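    【Solution 1】:

    You can do this without Selenium; requests plus BeautifulSoup is enough. The trick is to first retrieve the validation keys from the base URL https://dibbs2.bsm.dla.mil/dodwarning.aspx, then use those keys to download the files: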

    from bs4 import BeautifulSoup
    import requests
    import time
    
    test_list = ['https://dibbs2.bsm.dla.mil/Downloads/RFQ/8/SPE1C122Q0058.PDF', 'https://dibbs2.bsm.dla.mil/Downloads/RFQ/8/SPE2DH22Q0028.PDF',
                 'https://dibbs2.bsm.dla.mil/Downloads/RFQ/9/SPE2DH22Q0029.PDF', 'https://dibbs2.bsm.dla.mil/Downloads/RFQ/3/SPE2DS22Q0023.PDF',
                 'https://dibbs2.bsm.dla.mil/Downloads/RFQ/1/SPE2DS22Q0031.PDF', 'https://dibbs2.bsm.dla.mil/Downloads/RFQ/3/SPE2DS22Q0033.PDF']
    
    s = requests.Session()
    
    def get_file(url):
        pagereq = s.get('https://dibbs2.bsm.dla.mil/dodwarning.aspx')
        soup = BeautifulSoup(pagereq.content, 'html.parser')
    
        viewstategenerator = soup.find("input", attrs = {'id': '__VIEWSTATEGENERATOR'})['value']
        viewstate = soup.find("input", attrs = {'id': '__VIEWSTATE'})['value']
        eventvalidation = soup.find("input", attrs = {'id': '__EVENTVALIDATION'})['value']
    
        headers = {
            'Origin': 'https://dibbs2.bsm.dla.mil',
            'Content-Type': 'application/x-www-form-urlencoded',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        }
    
        params = (
            ('goto', url.split('.mil', 1)[1]),
        )
    
        data = {
          '__VIEWSTATE': viewstate,
          '__VIEWSTATEGENERATOR': viewstategenerator,
          '__EVENTVALIDATION': eventvalidation,
          'butAgree': 'OK'
        }
    
        # Post through the same session so cookies set by the GET above are reused
        response = s.post('https://dibbs2.bsm.dla.mil/dodwarning.aspx', headers=headers, params=params, data=data)
    
        with open(url.rsplit('/', 1)[1], 'wb') as f:
            f.write(response.content)
            
    for i in test_list:
        get_file(i)
        time.sleep(1)
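
The two string splits in `get_file` derive the `goto` query value and the local file name from each link. As a small illustration of what they produce, here is a sketch that does the same with the standard library's `urllib.parse`; the helper name `goto_and_filename` is hypothetical and not part of the answer above.

```python
from urllib.parse import urlsplit
import posixpath


def goto_and_filename(url):
    """Return the site-relative path and the local file name for a download URL."""
    path = urlsplit(url).path  # everything after the host name
    return path, posixpath.basename(path)


url = 'https://dibbs2.bsm.dla.mil/Downloads/RFQ/8/SPE1C122Q0058.PDF'
goto, filename = goto_and_filename(url)
print(goto)      # /Downloads/RFQ/8/SPE1C122Q0058.PDF
print(filename)  # SPE1C122Q0058.PDF
```

For these links the result matches `url.split('.mil', 1)[1]` and `url.rsplit('/', 1)[1]`; parsing the URL simply avoids depending on the literal `.mil` in the host name.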
    

    【Comments】:

    • It handled the six files, but with many files it shows requests.exceptions.ConnectionError: HTTPSConnectionPool(host='dibbs2.bsm.dla.mil', port=443): Max retries exceeded with url: /dodwarning.aspx?goto=%2FDownloads%2FRFQ%2F0%2FSPE4A722Q0200.PDF (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002173433ED30>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
    • Set time.sleep to 2, 3, or higher. That many requests is probably hitting the site's rate limit.
    • What if I have to download 10,000 files? Then it shows the error.
    • It all depends on the (unknown) rate limit. What if the site limits downloads to 1,000 per IP address per day? You can add a try-except that resumes downloading after a set interval, or use proxies.
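
The try-except idea from the last comment could look roughly like the sketch below. It is only an illustration under assumptions: the name `download_with_retry`, the retry count, and the doubling delays are made up here, and the site's real rate limit is unknown. It catches `OSError` by default, which covers `requests.exceptions.ConnectionError` because the requests exceptions derive from `IOError`.

```python
import time


def download_with_retry(download, url, retries=5, base_delay=2.0,
                        exceptions=(OSError,), sleep=time.sleep):
    """Call download(url), waiting longer after each failure before retrying.

    Returns whatever download(url) returns, or re-raises the last error
    once `retries` attempts have failed.
    """
    for attempt in range(retries):
        try:
            return download(url)
        except exceptions:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)  # wait 2 s, 4 s, 8 s, ...


# Demonstration with a stand-in download function that fails twice.
calls = []
waits = []


def flaky_download(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated timeout")
    return "ok"


# Pass waits.append as `sleep` so the demo records delays instead of sleeping.
result = download_with_retry(flaky_download, "https://example.invalid/file.pdf",
                             sleep=waits.append)
print(result, waits)
```

With the answer's code, the loop body would become `download_with_retry(get_file, i)`. Retrying only helps with a short-term limit; if the site enforces a daily quota per IP address, the delays would need to be much longer.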