How to handle large-scale web scraping? Answers

[Question title]: How to handle large scale Web Scraping?
[Posted]: 2021-12-11 15:19:14
[Question description]:

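Situation:

I recently started web scraping with selenium and scrapy. I am working on a project where I have a CSV file containing 42,000 zip codes, and my job is to take each zip code, go to the site, enter the zip code, and scrape all the results.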

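Problem:

The problem is that I have to keep clicking the "load more" button until all the results are displayed, and only once that has finished can I collect the data.

This might not be a big deal, but it takes 2 minutes per zip code and I have 42,000 of them to get through, which adds up to roughly 58 days if they run one after another.

Code: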

    import scrapy
    from selenium import webdriver
    from selenium.common.exceptions import ElementClickInterceptedException, ElementNotInteractableException, ElementNotSelectableException, NoSuchElementException, StaleElementReferenceException
    from selenium.webdriver.common.keys import Keys
    from items import CareCreditItem
    from datetime import datetime
    import os
    
    
    from scrapy.crawler import CrawlerProcess
    pin_code = input("enter pin code")
    
    class CareCredit1Spider(scrapy.Spider):
        
        name = 'care_credit_1'
        start_urls = ['https://www.carecredit.com/doctor-locator/results/Any-Profession/Any-Specialty//?Sort=D&Radius=75&Page=1']
    
        def start_requests(self):
            
            directory = os.getcwd()
            options = webdriver.ChromeOptions()
            options.headless = True
    
            options.add_experimental_option("excludeSwitches", ["enable-logging"])
            path = os.path.join(directory, "Chromedriver.exe")
            driver = webdriver.Chrome(path,options=options)
    
            #URL of the website
            url = "https://www.carecredit.com/doctor-locator/results/Any-Profession/Any-Specialty/" +pin_code + "/?Sort=D&Radius=75&Page=1"
            driver.maximize_window()
            #opening link in the browser
            driver.get(url)
            driver.implicitly_wait(200)
            
            try:
                cookies = driver.find_element_by_xpath('//*[@id="onetrust-accept-btn-handler"]')
                cookies.click()
            except:
                pass
    
            i = 0
            loadMoreButtonExists = True
            while loadMoreButtonExists:
                try:
                    load_more =  driver.find_element_by_xpath('//*[@id="next-page"]')
                    load_more.click()    
                    driver.implicitly_wait(30)
                except ElementNotInteractableException:
                    loadMoreButtonExists = False
                except ElementClickInterceptedException:
                    pass
                except StaleElementReferenceException:
                    pass
                except NoSuchElementException:
                    loadMoreButtonExists = False
    
            try:
                previous_page = driver.find_element_by_xpath('//*[@id="previous-page"]')
                previous_page.click()
            except:
                pass
    
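            # Collect the detail-page link of every result, then close the browser
            # so it does not stay open for the rest of the crawl.
            results = driver.find_elements_by_class_name('dl-result-item')
            links = [result.find_element_by_tag_name('a').get_property('href') for result in results]
            driver.quit()
            for link in links:
                yield scrapy.Request(link)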
    
        def parse(self, response):
            item = CareCreditItem()
            item['Practise_name'] = response.css('h1 ::text').get()
            item['address'] = response.css('.google-maps-external ::text').get()
            item['phone_no'] = response.css('.dl-detail-phone ::text').get()
            yield item
    now = datetime.now()
    dt_string = now.strftime("%d-%m-%Y")  # '-' instead of '/' keeps the file name valid
    dt = now.strftime("%H-%M-%S")
    file_name = dt_string+"_"+dt+"zip-code"+pin_code+".csv"
    process = CrawlerProcess(settings={
        'FEED_URI' : file_name,
        'FEED_FORMAT':'csv'
    })
    process.crawl(CareCredit1Spider)
    process.start()
    print("CSV File is Ready")

items.py


    import scrapy

    class CareCreditItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        Practise_name = scrapy.Field()
        address = scrapy.Field()
        phone_no = scrapy.Field()

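Question:

My question is basically simple. Is there a way to optimize this code so that it runs faster? Or what other approaches could scrape this data without taking so much time?

[Question discussion]: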

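Are you using any other external resources? For example, you could use EC2 instances from AWS and distribute the seed list across them, possibly even spot instances, to run the scraper in parallel and scrape many zip codes at the same time.
Thanks for the reply @Kwsswart, but I don't quite understand what you mean. Could you explain a bit more, or give me some reference links so I can understand?
Basically, if you split the zip codes (the seed list) into many separate seed lists and develop a solution that runs the same spider on many different machines, each with a different seed list (AWS instances are one example), then you essentially have many machines working on different parts of the original seed list at the same time.
Oh yes, that is a good way to solve this problem, thank you very much for your help. Is there any other way to scrape at large scale, without deploying anything?
There are many ways to do this at scale, especially on AWS with Lambda, EC2, and so on. But if you only want to use one machine, you could look into multithreading (using all the processors in that PC to run the program concurrently), or consider running it sequentially (albeit slowly) in a single process. You could also try simply running it with requests, which may speed things up, but with a large number of seeds it is usually faster to develop a process that runs in parallel.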

Tags: python selenium web-scraping scrapy

[Solution 1]:

There are several ways to do this.

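1. Create a distributed system in which you run the spider on multiple machines so that the work runs in parallel.

In my opinion this is the better option, because you also end up with a scalable, dynamic solution that you can reuse many times.

There are many ways to do this in general, but they all split the seed list (the zip codes) into many separate seed lists so that separate processes each work through their own list. The downloads then run in parallel: on 2 machines the job runs roughly 2 times faster, on 10 machines roughly 10 times faster, and so on.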

To do this I would suggest looking into AWS, namely AWS Lambda, AWS EC2 Instances, or even AWS Spot Instances. These are the ones I have worked with before, and they are not very hard to use.

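2. Alternatively, if you want to run it on a single machine, you can look into Multithreading with Python, which can help you run the process in parallel on a single machine.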

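3. This is another option, especially if it is a one-off process. You could try simply running it with requests, which may speed things up, but with a large number of seeds it is usually faster to develop a process that runs in parallel.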
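
As a rough illustration of options 1 and 2 above, the sketch below splits a seed list into independent parts (one per machine or process) and runs a per-zip-code function on a thread pool. `split_seeds`, `scrape_concurrently`, and `fake_scrape` are hypothetical names introduced here and are not part of the answer. Threads are assumed to help because scraping spends most of its time waiting on the network, and `max_workers=8` is an arbitrary starting point.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def split_seeds(seeds: List[str], n_parts: int) -> List[List[str]]:
    """Split the seed list into n_parts round-robin slices of near-equal size."""
    return [seeds[i::n_parts] for i in range(n_parts)]


def scrape_concurrently(
    zipcodes: List[str],
    scrape_one: Callable[[str], list],
    max_workers: int = 8,
) -> Dict[str, list]:
    """Run scrape_one for every zip code on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input zip codes.
        results = list(pool.map(scrape_one, zipcodes))
    return dict(zip(zipcodes, results))


if __name__ == "__main__":
    def fake_scrape(zipcode: str) -> list:
        # Hypothetical stand-in for the real per-zip-code scrape.
        return [f"result for {zipcode}"]

    # Each part could be handed to a different machine or process.
    parts = split_seeds(["00601", "00602", "00603", "00604", "00605"], 2)
    print(parts)  # [['00601', '00603', '00605'], ['00602', '00604']]
    print(scrape_concurrently(parts[0], fake_scrape, max_workers=2))
```

More workers means more requests per second against the same site, so the worker count should be lowered if the site starts throttling or blocking requests.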

[Discussion]:

How would you handle rate limiting with these approaches?

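[Solution 2]:

Since the site loads its data dynamically from an API, you can retrieve the data directly from the API. This will speed things up, but I would still add a wait between requests to avoid hitting the rate limit.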

    import requests
    import time
    import pandas as pd

    zipcode = '00704'
    radius = 75
    # Endpoint the results page calls in the background; Page selects the result page.
    url = f'https://www.carecredit.com/sites/ContentServer?d=&pagename=CCGetLocatorService&Zip={zipcode}&City=&State=&Lat=&Long=&Sort=D&Radius={radius}&PracticePhone=&Profession=&location={zipcode}&Page=1'
    req = requests.get(url)
    r = req.json()
    data = r['results']

    # maxPage is the total number of pages; fetch pages 2..maxPage and append them.
    for i in range(2, r['maxPage']+1):
        url = f'https://www.carecredit.com/sites/ContentServer?d=&pagename=CCGetLocatorService&Zip={zipcode}&City=&State=&Lat=&Long=&Sort=D&Radius={radius}&PracticePhone=&Profession=&location={zipcode}&Page={i}'
        req = requests.get(url)
        r = req.json()
        data.extend(r['results'])
        time.sleep(1)  # pause between requests to stay under the rate limit

    df = pd.DataFrame(data)
    # Use '-' rather than '/' in the date so the timestamp is a valid file name.
    df.to_csv(f'{pd.Timestamp.now().strftime("%d-%m-%Y_%H-%M-%S")}zip-code{zipcode}.csv')
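
To run the same idea over a whole list of zip codes, the pagination loop can be wrapped in a function with a simple retry. The sketch below is one possible shape and not the answerer's code: `get_page` and `fetch_all_pages` are hypothetical names, the standard-library `urllib` is swapped in for `requests` so the block has no third-party dependency, and the endpoint is assumed to return valid JSON with the `results` and `maxPage` keys used above.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable

API_URL = "https://www.carecredit.com/sites/ContentServer"


def get_page(zipcode: str, page: int, radius: int = 75) -> dict:
    """Fetch one page of results, using the same query string as the answer."""
    query = urllib.parse.urlencode({
        "d": "", "pagename": "CCGetLocatorService", "Zip": zipcode,
        "City": "", "State": "", "Lat": "", "Long": "", "Sort": "D",
        "Radius": radius, "PracticePhone": "", "Profession": "",
        "location": zipcode, "Page": page,
    })
    with urllib.request.urlopen(f"{API_URL}?{query}", timeout=30) as response:
        return json.load(response)


def fetch_all_pages(
    zipcode: str,
    fetch_page: Callable[[str, int], dict] = get_page,
    delay: float = 1.0,
    retries: int = 3,
) -> list:
    """Collect the results of every page for one zip code."""

    def fetch_with_retry(page: int) -> dict:
        for attempt in range(retries):
            try:
                return fetch_page(zipcode, page)
            except (OSError, ValueError):
                # Network errors are OSError subclasses; bad JSON is a ValueError.
                if attempt == retries - 1:
                    raise
                time.sleep(delay * 2 ** attempt)  # back off before retrying
        raise ValueError("retries must be at least 1")

    first = fetch_with_retry(1)
    data = list(first["results"])
    for page in range(2, first["maxPage"] + 1):
        time.sleep(delay)  # wait between pages to stay under the rate limit
        data.extend(fetch_with_retry(page)["results"])
    return data
```

Calling `fetch_all_pages(zipcode)` for each zip code read from the CSV file and extending one list with the results gives a single data set that pandas can write out once at the end.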

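[Discussion]:

Thanks for your reply. I tried running your code but got this error: json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 6 column 2 (char 11)
@Samyakjain I made a few small changes; it should run fine now.
Thank you very much @RJ Adriaansen, after a few changes your code now runs fine. Could you briefly explain how you found this? The code is great, thanks again. I tried to find the JSON file via Inspect Element but couldn't find it there.
You can find the API link by watching the network activity while the page loads. In the JSON data, the maxPage key indicates the number of pages. I first retrieved the first page and saved the results key as data, then looped over the other pages and added the new results to data. Finally, pandas saves the collected list to CSV.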