Crawling cnblogs (博客园) with a Python Spider and Saving the Articles

Crawl all articles of a specified cnblogs user, clean them up, and save them locally.

First, define the crawler's module files:

  1. crawlers_main.py: execution entry point
  2. url_manager.py: URL manager
  3. download_manager.py: download module
  4. parser_manager.py: HTML parser (extracts the needed content from the HTML)
  5. output_manager.py: writes out the full HTML page with its assets (CSS, PNG, JS, etc.)
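The URL manager's job is to hand out unvisited URLs and deduplicate ones already seen. The original url_manager.py is not shown here; the following is a minimal sketch whose method names match the calls made from crawlers_main.py, while the two-set implementation is an assumption:

```python
# A minimal sketch of url_manager.py, assuming two sets suffice
# for deduplication (pending vs. already-crawled URLs).
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already handed out

    def add_new_url(self, url):
        # Ignore URLs we have already queued or crawled
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # Pop one pending URL and mark it as crawled
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Because both sets are checked before queueing, a URL that has already been fetched is never scheduled twice, which keeps the crawl loop from revisiting pages.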


crawlers_main.py (execution entry point)

# coding: utf-8
from com.crawlers import download_manager
from com.crawlers import output_manager
from com.crawlers import parser_manager
from com.crawlers import url_manager


class SpiderMain(object):
    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.downloader = download_manager.DownloadManager()
        self.parser = parser_manager.ParserManager()
        self.output = output_manager.OutputManager()

    def craw(self, root_url):
        # Download the index page and seed the URL queue with article links
        html_root = self.downloader.download(root_url)
        new_urls = self.parser.parseUrls(root_url, html_root)
        self.urls.add_new_urls(new_urls)
        count = 1
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('craw %d : %s' % (count, new_url))
                html_cont = self.downloader.download(new_url)
                new_data = self.parser.parse(new_url, html_cont)
                self.output.collect_data(new_data)
                if count == 1000:  # safety cap on the number of pages crawled
                    break
                count += 1
            except Exception as e:
                print('craw failed: %s' % e)

        self.output.output_html()


if __name__ == "__main__":
    root_url = "http://www.cnblogs.com/zhuyuliang/"
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)
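The entry point only requires the downloader to expose a single download(url) method that returns the page's HTML as text, or None on failure. As a rough sketch using the standard library's urllib (the original download_manager.py is not shown, so the error handling and encoding fallback here are assumptions):

```python
# A minimal sketch of download_manager.py, assuming urllib from the
# standard library is acceptable (the original module is not shown).
import urllib.request


class DownloadManager(object):
    def download(self, url):
        if url is None:
            return None
        try:
            response = urllib.request.urlopen(url)
        except Exception:
            return None  # network errors are treated as a failed crawl
        if response.getcode() not in (None, 200):
            return None  # non-200 responses are treated as failures
        # Decode using the charset declared by the server, else UTF-8
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')
```

Returning None instead of raising lets the try/except in SpiderMain.craw log the failure and move on to the next queued URL.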
