Technical Framework
The project uses scrapy and scrapy-redis, with Redis as the scheduler, to crawl Taobao in a distributed fashion.
Tips to Avoid Getting Banned
In this example every request rotates the user-agent, cookies are disabled, and the IP is switched every 30 seconds (a non-standard IP proxy; search for that technique yourself).
settings.py
# Disable cookies (enabled by default)
COOKIES_ENABLED = False

DOWNLOADER_MIDDLEWARES = {
    'taobao_crawler.middlewares.TaobaoCrawlerDownloaderMiddleware': 543,
    'taobao_crawler.middlewares.UserAgentMiddleware': 401,
    'taobao_crawler.middlewares.ProxyMiddleware': 100,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

ITEM_PIPELINES = {
    'taobao_crawler.pipelines.TaobaoCrawlerPipeline': 300,
}

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
DUPEFILTER_CLASS = "taobao_crawler.dupefilter.MyRFPDupeFilter"  # scrapy dedup filter
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_START_URLS_AS_SET = True
REDIS_HOST = '127.0.0.1'  # distributed Redis server
REDIS_PORT = 6379
PROXY = 'http://127.0.0.1:8118'  # proxy server address and port

EXTENSIONS = {
    'scrapy.telnet.TelnetConsole': None,
}  # disables the telnet console so multiple scrapy instances can run on one server
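
Because REDIS_START_URLS_AS_SET = True, the spider reads its start URLs from a Redis set rather than a list, so the crawl is seeded by adding URLs to that set. A minimal seeding sketch with redis-py follows; the key name taobao:start_urls and the search URL are assumptions, since the actual key depends on the spider's redis_key (scrapy-redis defaults to '<spider_name>:start_urls').

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)
# SADD because REDIS_START_URLS_AS_SET = True; the default (list) mode would use LPUSH
r.sadd('taobao:start_urls', 'https://s.taobao.com/search?q=tv')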
middlewares.py
UserAgentMiddleware spoofs the user-agent and ProxyMiddleware sends requests to Taobao through an IP proxy. agents is a list of user-agent strings. The code for UserAgentMiddleware and ProxyMiddleware in middlewares.py is as follows:
import random


class UserAgentMiddleware(object):
    """
    Rotate the user-agent on every request.
    """
    def process_request(self, request, spider):
        from .user_agents import agents
        agent = random.choice(agents)
        request.headers['User-Agent'] = agent
        # spider.logger.info(agent)


class ProxyMiddleware(object):
    # single proxy
    def __init__(self, settings):
        self.proxy = settings.get('PROXY')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy
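
The user_agents.py module imported above is not shown in the post; a minimal sketch looks like the following (the specific strings are placeholders, any list of real user-agent strings works):

# user_agents.py -- a plain list the middleware picks from at random
agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0',
]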
dupefilter.py
MyRFPDupeFilter subclasses the scrapy-redis dupe filter and combines it with scrapy-splash's request fingerprinting; it lives in dupefilter.py under the spiders directory. This example only really needs scrapy-redis; scrapy_splash was dropped later on.
from scrapy_redis.dupefilter import RFPDupeFilter
from scrapy_splash.dupefilter import splash_request_fingerprint


class MyRFPDupeFilter(RFPDupeFilter):
    def request_fingerprint(self, request):
        return splash_request_fingerprint(request)
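
Since scrapy_splash ended up unused, the custom class can be dropped entirely and settings.py can point straight at the scrapy-redis filter; a sketch:

# settings.py -- without the splash dependency
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"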
Full Project Code
(Project structure diagram)
Code Examples
pipelines.py
import pymongo

from .items import TaobaoCrawlerItem


class TaobaoCrawlerPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient('127.0.0.1', 27018)
        db = client['TaobaoDatas']
        self.tv = db['tvs']  # television collection

    def process_item(self, item, spider):
        if isinstance(item, TaobaoCrawlerItem):
            try:
                self.tv.insert_one(dict(item))
            except Exception as e:
                # TODO: store the error and re-queue the request for another crawl
                spider.logger.info('error: %s' % e)
        return item
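
The TaobaoCrawlerItem referenced above is defined in items.py, which the post does not include; a minimal sketch follows, where the field names (title, price, shop, sales) are assumptions rather than the original schema:

# items.py -- field names here are assumptions, not the original schema
import scrapy


class TaobaoCrawlerItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    shop = scrapy.Field()
    sales = scrapy.Field()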
quickstart.py
from scrapy import cmdline
cmdline.execute("scrapy crawl taobao".split())
Then just run python quickstart.py to start the crawl.
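
The spider module itself is not shown in the post. With scrapy-redis, the taobao spider would typically be a RedisSpider that pulls its start URLs from the taobao:start_urls set seeded earlier; a minimal sketch, with the parse body left as an assumption:

# spiders/taobao.py -- a minimal sketch; the parse logic is an assumption
from scrapy_redis.spiders import RedisSpider


class TaobaoSpider(RedisSpider):
    name = 'taobao'
    redis_key = 'taobao:start_urls'  # matches the set seeded via SADD above

    def parse(self, response):
        # Extraction logic for Taobao search results goes here.
        pass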
I will add the project repository address when I have time.