Python: Extracting and Analyzing Information from the cnblogs (博客园) Home Page

Preface

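A couple of days ago I wrote a blog post and published it to the cnblogs home page. Watching the view count climb bit by bit felt a little strange.

That made me curious: how many people publish posts to the home page? How many articles appear on the cnblogs home page each day? Who publishes the most? How do comment counts relate to view counts?

Once I was curious, the next question was how to find the answers.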

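1. Finding the way: the first step

Browsing cnblogs shows that the home page listing goes back at most 200 pages, so the first step is to download those 200 pages. I previously wrote a post about batch-downloading images, and a similar approach works for downloading these pages.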

from html.parser import HTMLParser
import os
import sys
import urllib.request

# The Next button on cnblogs gives the address of the next page; following it in a loop downloads all 200 pages.

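# Step 1. Parse the page to get the address of the next page.
class LinkParser(HTMLParser):
  def __init__(self,strict=False,domain=''):
    # `strict` is accepted only so older calls keep working;
    # HTMLParser itself no longer takes it (removed in Python 3.5).
    HTMLParser.__init__(self)
    self.value=''      # href of the most recent <a> tag
    self.domain=domain
    self.next=[]       # addresses of the "Next" links found so far
  def handle_starttag(self,tag,attrs):
    if tag=='a':
      for i in attrs:
        if i[0]=='href':
          self.value=i[1]
  def handle_data(self,data):
    # The text of the paging link starts with 'Next'.
    if data.startswith('Next'):
      if (self.domain!='' )and ('://' not in self.value):
        self.next.append(self.domain+self.value)
      else:
        self.next.append(self.value)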

# Step 2. Download the current page, then queue the next page found by the parser.
def getLinks(url,domain):
  doing=[url]   # pages waiting to be downloaded
  done=[]       # pages already visited
  cnt=0
  # Pages are saved in a data folder under the current directory.
  os.makedirs(os.path.join(os.getcwd(),'data'),exist_ok=True)
  while len(doing)>=1:
    x=doing.pop()
    done.append(x)
    cnt=cnt+1
    print('start:',x)
    try:
      f=urllib.request.urlopen(x,timeout=120)
      s=f.read()
      f.close()
      fx=open(os.path.join(os.getcwd(),'data','{0}.html'.format(str(cnt))),'wb')
      fx.write(s)
      fx.close()
      parser=LinkParser(domain=domain)
      parser.feed(s.decode())
      for i in parser.next:
        if i not in done:
          doing.insert(0,i)
      parser.next=[]
      print('ok:',x)
    except Exception:
      print('error:',x)
      print(sys.exc_info())
      continue
  return done

if __name__=='__main__':
  getLinks('http://www.cnblogs.com/','http://www.cnblogs.com/')
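
One detail in the parser above is worth a second look: it builds the next address by concatenating the domain and the href, which can produce a doubled slash when the domain ends with "/" and the href starts with "/". The sketch below shows one possible alternative that uses urljoin from the standard library instead. The class name NextLinkParser and the pager markup in sample are made up for illustration; the real cnblogs markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class NextLinkParser(HTMLParser):
    """Collects the address of every <a> whose text starts with 'Next'."""
    def __init__(self, base_url=''):
        super().__init__()
        self.base_url = base_url
        self.last_href = ''     # href of the most recent <a> tag
        self.next_links = []    # resolved addresses of "Next" links

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.last_href = dict(attrs).get('href') or ''

    def handle_data(self, data):
        if data.strip().startswith('Next') and self.last_href:
            # urljoin resolves relative hrefs and leaves absolute ones alone.
            self.next_links.append(urljoin(self.base_url, self.last_href))

# Hypothetical pager fragment, not copied from the real site.
sample = '<div class="pager"><a href="/sitehome/p/2">Next &gt;</a></div>'
parser = NextLinkParser(base_url='http://www.cnblogs.com/')
parser.feed(sample)
parser.close()
print(parser.next_links)
```

With urljoin, the same code handles relative and absolute hrefs, so the explicit '://' check is no longer needed.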

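2. Extracting information from the pages

The pages are downloaded; now the information has to be extracted from them.

Inspection shows that each page lists 20 records, and each record contains a title, an author, a publication time, a recommendation count, and similar fields.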
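
As a rough sketch of what one such record might look like once extracted, the fields could be held in a small named tuple. The type name PostRecord and its field names are my own choices for illustration, not something the site defines; the values are taken from the sample record shown later in this post.

```python
from datetime import datetime
from typing import NamedTuple

class PostRecord(NamedTuple):
    """One entry of the home page listing (hypothetical field names)."""
    title: str
    author: str
    published: datetime
    recommendations: int

record = PostRecord(
    title='python——常用功能之文本处理',
    author='ola2010',
    published=datetime.strptime('2013-08-18 21:27', '%Y-%m-%d %H:%M'),
    recommendations=10,
)
print(record.author, record.published.date(), record.recommendations)
```

Storing the time as a datetime rather than a string should make the per-day counts asked about in the preface easier to compute later.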

How can these be extracted?

Start with a small program to see how Python parses this data:

Data:

<html>
<head></head>
<body>
<div class="post_item">
<div class="digg">
    <div class="diggit" onclick="DiggIt(3266366,130739,1)"> 
    <span class="diggnum" id="digg_count_3266366">10</span>
    </div>
    <div class="clear"></div>    
    <div id="digg_tip_3266366" class="digg_tip"></div>
</div>      
<div class="post_item_body">
    <h3><a class="titlelnk" href="http://www.cnblogs.com/ola2010/p/3266366.html" target="_blank">python——常用功能之文本处理</a></h3>                   
    <p class="post_item_summary">
    前言在生活、工作中，python一直都是一个好帮手。在python的众多功能中，我觉得文本处理是最常用的。下面是平常使用中的一些总结。环境是python 3.30. 基础在python中，使用str对象来保存字符串。str对象的建立很简单，使用单引号或双引号或3个单引号即可。例如：s='nice' ... 
    </p>              
    <div class="post_item_foot">                    
    <a href="http://www.cnblogs.com/ola2010/" class="lightblue">ola2010</a> 
    发布于 2013-08-18 21:27 
    <span class="article_comment"><a href="http://www.cnblogs.com/ola2010/p/3266366.html#commentform" title="2013-08-20 17:45" class="gray">
        评论(4)</a></span><span class="article_view"><a href="http://www.cnblogs.com/ola2010/p/3266366.html" class="gray">阅读(1640)</a></span></div>
</div>
<div class="clear"></div>
</div>
</body>
</html>
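
In the footer of the record above, the publication time and the counts appear as plain text: 发布于 means "published on", 评论 means "comments", and 阅读 means "views". Once that text has been pulled out of the HTML, the numbers could be recovered with regular expressions. This is only a sketch under that assumption; the variable footer_text is assembled by hand from the sample and is not produced by any parser here.

```python
import re

# Text fragments copied from the footer of the sample record.
footer_text = '发布于 2013-08-18 21:27 评论(4) 阅读(1640)'

# Each pattern captures the value that follows a fixed label.
published = re.search(r'发布于\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2})', footer_text)
comments = re.search(r'评论\((\d+)\)', footer_text)
views = re.search(r'阅读\((\d+)\)', footer_text)

print(published.group(1))
print(int(comments.group(1)), int(views.group(1)))
```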
