Scraping cnblogs with a Python crawler and saving the results
Crawl all articles of a given cnblogs user, clean them up, and save them to the local disk.
First, define the crawler's module files:
- crawlers_main.py — entry point
- url_manager.py — URL manager
- download_manager.py — download module
- parser_manager.py — HTML parser (extracts the needed content from the HTML)
- output_manager.py — writes out the full HTML page content (including CSS, PNG, JS, etc.)
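The post only shows crawlers_main.py, so the other modules have to be inferred from how the entry point calls them. As one example, a minimal sketch of what url_manager.py could look like (the class name `UrlManager` and the methods `add_new_urls`, `has_new_url`, `get_new_url` are taken from the entry-point code; the two-set implementation is an assumption, not the author's actual code):

```python
class UrlManager(object):
    """Tracks URLs waiting to be crawled and URLs already handed out,
    so the same article page is never downloaded twice."""

    def __init__(self):
        self.new_urls = set()  # URLs discovered but not yet crawled
        self.old_urls = set()  # URLs already returned by get_new_url()

    def add_new_url(self, url):
        # Ignore None and anything we have already seen
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from the pending set to the seen set
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

The two-set design keeps deduplication O(1) per URL, which is what the `while self.urls.has_new_url()` loop in the entry point relies on to terminate.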
crawlers_main.py (entry point):
```python
# coding: utf-8
from com.crawlers import download_manager
from com.crawlers import output_manager
from com.crawlers import parser_manager
from com.crawlers import url_manager


class SpiderMain(object):
    def __init__(self):
        # Wire the four components together
        self.urls = url_manager.UrlManager()
        self.downloader = download_manager.DownloadManager()
        self.parser = parser_manager.ParserManager()
        self.output = output_manager.OutputManager()

    def craw(self, root_url):
        # Seed the URL queue with the article links found on the index page
        html_root = self.downloader.download(root_url)
        new_urls = self.parser.parseUrls(root_url, html_root)
        self.urls.add_new_urls(new_urls)
        count = 1
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('craw %d : %s' % (count, new_url))
                html_cont = self.downloader.download(new_url)
                new_data = self.parser.parse(new_url, html_cont)
                self.output.collect_data(new_data)
                if count == 1000:  # safety cap: stop after 1000 pages
                    break
            except Exception as e:
                print('craw failed: %s' % e)
            # Increment outside the try block so a failed page
            # still advances the counter toward the cap
            count += 1

        self.output.output_html()


if __name__ == '__main__':
    root_url = 'http://www.cnblogs.com/zhuyuliang/'
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)
```
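The entry point calls `self.downloader.download(url)` and expects an HTML string back (or something falsy on failure). A minimal sketch of what download_manager.py could look like, using only the standard library (the class and method names come from the entry-point code; the urllib-based body, the 10-second timeout, and the UTF-8 decoding are assumptions):

```python
# coding: utf-8
from urllib import request


class DownloadManager(object):
    """Fetches a page over HTTP and returns its HTML as text,
    or None when the URL is missing or the request fails."""

    def download(self, url):
        if url is None:
            return None
        try:
            with request.urlopen(url, timeout=10) as resp:
                if resp.getcode() != 200:
                    return None
                # cnblogs pages are UTF-8; replace any undecodable bytes
                return resp.read().decode('utf-8', errors='replace')
        except Exception:
            # Network errors and malformed URLs all collapse to None;
            # the caller's try/except treats that page as failed
            return None
```

Returning `None` instead of raising keeps the crawl loop simple: a bad page is logged as `craw failed` and the loop moves on to the next URL.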