本章学习内容:
1、网页编码还原读取
2、功能设计
stuep1:网页编码还原读取
本次抓取对象:
http://www.cuiweijuxs.com/jingpinxiaoshuo/
按照第一篇的代码来进行抓取:
# -*- coding: UTF-8 -*-
from urllib import request
if __name__ == "__main__":
chaper_url = "http://www.cuiweijuxs.com/jingpinxiaoshuo/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = request.Request(url=chaper_url, headers=headers)
response = request.urlopen(req)
html = response.read()
print(html)
打印出
b'<!doctype html>\r\n<html>\r\n<head>\r\n<title>\xbe\xab\xc6\xb7\xd0\xa1\xcb\xb5_………………
这样的内容,这个是编码格式的问题,在zipfile解压乱码的文章中已经说过了,所以需要先看下这个html网页的头部,看到编码格式是gbk
具体看http://www.cnblogs.com/yaoshen/p/8671344.html
另外一种程序检测方法是使用chardet(非原生库,需要安装),
charset = chardet.detect(html)
print(charset)
检测内容:{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
如果使用GB2312来解码是有问题的,尝试过后发现还是gbk比较有效,包含字符多一点
改写代码如下:
html = html.decode('GBK') #except: # html = html.decode('utf-8') print(html)
完整代码如下:
# -*- coding: UTF-8 -*- from urllib import request import chardet if __name__ == "__main__": chaper_url = "http://www.cuiweijuxs.com/jingpinxiaoshuo/" headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'} req = request.Request(url=chaper_url, headers=headers) response = request.urlopen(req) html = response.read() print(html) # 查看网页编码格式 charset = chardet.detect(html) print(charset) # 查看网页内容 #try: html = html.decode('GBK') #except: # html = html.decode('utf-8') print(html)