Web Scraping Study Notes

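A few thoughts on scrapers: ever since I first met Python, scraper articles have been everywhere I looked. Yet I was never interested. First, the articles that scrape data and analyze it show not only technical skill but also clear, well-argued thinking, and the quality of their writing puts me to shame, so I did not want to end up as just another script kiddie in an age when hacking is in fashion. Second, a previous company of mine needed news collection. I knew nothing about scrapers then, so I wrote a news-collecting page in PHP: enter a URL, filter by left and right tag delimiters, and pull out the news list and the articles. Once I learned that scrapers existed, I figured my PHP page already counted as the crudest possible scraper, and so I had little interest in the topic.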

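It was like middle school, when Jay Chou was wildly popular and I refused to listen. Apart from his music being hard to get hold of, it was mostly that I resisted whatever everyone else was crazy about and wanted to be different. And yet I can still remember one year's summer homework booklet with the lyrics of "Qi Li Xiang" printed at the very back: "The sparrows outside the window chatter away on the utility pole; you say this line feels a lot like summer." Now I listen to Jay Chou on the quiet, making up for the youth I missed.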

1. Hello world and the Requests library

Requests is the only Non-GMO HTTP library for Python, safe for human consumption.

Fetching the Baidu home page

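# coding:utf-8
import requests

resp = requests.get('https://www.baidu.com')
# resp.text was decoded as ISO-8859-1, so re-encode it to recover the raw
# bytes, then decode those bytes as UTF-8.
html = resp.text.encode('ISO-8859-1').decode('utf-8')
print(resp.encoding)  # the encoding Requests assumed for this response
print(html)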

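At first the Chinese text came out garbled. I assumed the page was GBK and tried converting once, which did not help, until I learned from the article "python爬虫编码彻底解决" that the encoding attribute of requests.Response reports the encoding in use. It printed ISO-8859-1, which is what Requests falls back to when a text response's Content-Type header carries no charset. With that known, there is no need to bring in chardet and the like.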
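The garbling described above can be reproduced offline. This is a minimal sketch, assuming the server sent UTF-8 bytes and the client decoded them as ISO-8859-1; the sample string is made up for illustration.

```python
# UTF-8 bytes as a server might send them (sample text is illustrative).
raw = "百度一下".encode("utf-8")

# Decoding with the wrong codec "succeeds" but yields mojibake,
# because every byte maps to some ISO-8859-1 character.
garbled = raw.decode("ISO-8859-1")

# ISO-8859-1 is a lossless byte <-> character mapping, so re-encoding
# recovers the original bytes, which can then be decoded correctly.
repaired = garbled.encode("ISO-8859-1").decode("utf-8")

print(garbled == "百度一下")
print(repaired == "百度一下")
```

The round trip only works because ISO-8859-1 assigns a character to every byte value; with a codec that rejects or replaces bytes, the original data would be lost.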

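If you get an SSL error, pass verify=False, as in resp = requests.get('https://www.baidu.com', verify=False). This skips certificate verification, which is acceptable for a throwaway experiment like this one.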

Downloading a binary file such as an image

# coding:utf-8
import requests
import os
import sys

url = 'http://docs.python-requests.org/zh_CN/latest/_static/requests-sidebar.png'
resp = requests.get(url)

img_name = os.path.basename(url)
# sys.path[0] is the directory of the script being run
img_path = os.path.join(sys.path[0], img_name)
# print(resp.content)
with open(img_path, 'wb') as f:  # 'wb': write bytes, not text
    f.write(resp.content)
    print(f"download image file : {img_path}")

This writes the binary data to a file.
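One detail worth noting: os.path.basename(url) keeps any query string in the file name. The sketch below shows one way to derive a cleaner name with the standard library. The helper name filename_from_url, its default value, and the example.com URLs are hypothetical and not part of the original script.

```python
import os
from urllib.parse import urlsplit, unquote


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of url, ignoring query and fragment."""
    path = urlsplit(url).path            # drops '?size=large', '#top', etc.
    name = os.path.basename(unquote(path))
    return name or default               # URLs ending in '/' have no file name


png = filename_from_url(
    "http://docs.python-requests.org/zh_CN/latest/_static/requests-sidebar.png")
with_query = filename_from_url("http://example.com/img/logo.png?size=large")
no_name = filename_from_url("http://example.com/")
print(png, with_query, no_name)
```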

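Difference between response.content and response.text

The two examples above used resp.text and resp.content respectively to read the response data, so what is the difference?

response.content returns the raw bytes of the response body; response.text returns the body decoded into text.

Excerpted from the article "python response.text 和response.content的区别":

response.content

- Type: bytes
- Decoding: none applied
- How to change the encoding: response.content.decode("utf-8")

response.text

- Type: str
- Decoding: Requests makes an educated guess at the text encoding based on the HTTP headers
- How to change the encoding: response.encoding="gbk"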
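The relationship between the two attributes can be illustrated without a network call. This is a sketch of the idea only, not Requests' actual implementation; as_text is a hypothetical helper and the GBK sample is invented.

```python
def as_text(content: bytes, encoding: str) -> str:
    """Decode raw body bytes with a chosen encoding (the idea behind .text)."""
    # errors="replace" keeps going on undecodable bytes instead of raising.
    return content.decode(encoding, errors="replace")


# What response.content might hold for a GBK-encoded page (illustrative data).
content = "新闻".encode("gbk")

right = as_text(content, "gbk")     # correct codec: readable text
wrong = as_text(content, "utf-8")   # wrong codec: replacement characters

print(type(content).__name__, type(right).__name__)
print(right == "新闻", "\ufffd" in wrong)
```

The bytes never change; only the codec applied to them does, which is why setting response.encoding is enough to fix garbled text.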

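Check the manual often

So what attributes does a response have besides the encoding, content, and text used above? The Quickstart page of the official Requests documentation is worth a look: short, concise, and clear.

From it I learned:

You can find out what encoding Requests is using, and change it, using the r.encoding property: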

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'

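If you change the encoding, Requests will use the new value of r.encoding whenever you access r.text.

So in the first example, encoding the Baidu home page as ISO-8859-1 and decoding it again with html = resp.text.encode('ISO-8859-1').decode('utf-8') looks redundant. Let's set encoding first and then read the content.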

resp = requests.get('https://www.baidu.com', verify=False)
# html = resp.text.encode('ISO-8859-1').decode('utf-8')
# print(resp.encoding)
# print(html)
resp.encoding = 'utf-8'
print(resp.text)

It works! That is how useful a clear manual is. Thanks to all the Requests authors ❤

2. Extracting data with the Beautiful Soup library

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')
# print(soup.prettify())
# print(soup.get_text())

# Get the title tag, its tag name, and its text
title_tag = soup.title
tag_name = soup.title.name
title = soup.title.string
print(title_tag)
print(tag_name)
print(title)

print(soup.find_all('a'))

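When no parser is specified, i.e. BeautifulSoup(html_doc), and lxml happens to be installed, Beautiful Soup warns that it is falling back to the best HTML parser available on this system, lxml. Another machine may not have lxml, so for portability pass the parser explicitly, as the code above does.

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml").

Left to itself, Beautiful Soup picks the most suitable parser for the document; if you specify a parser manually, Beautiful Soup uses that one.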
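For portability, one option is to name a parser that ships with Python itself. A minimal sketch, assuming the built-in "html.parser" is good enough for the page at hand (it is more lenient about dependencies but may treat badly broken markup differently from lxml); the document string is shortened for illustration.

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head><body></body></html>"

# Naming the parser explicitly avoids the UserWarning and makes the result
# independent of which optional parsers happen to be installed.
soup = BeautifulSoup(html_doc, "html.parser")
title = soup.title.string
print(title)
```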

Getting tags and text:

from bs4 import BeautifulSoup

lists = """
<a class="a1" href="http://www.baidu.com/">Baidu</a>, 
<a class="a2" href="http://www.163.com/">NetEase</a>, 
<a class="a3" href="http://www.sina.com/">Sina</a>
"""
b = BeautifulSoup(lists, 'lxml')
a_list = b.find_all('a')

print(a_list)
for a in a_list:
    print(a.get_text())  # get_text returns the tag's text
    print(a.get('href'))  # get returns an attribute of the tag
    print(a['href'])  # same as above
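The two attribute lookups above differ when the attribute is missing, which matters on real pages where not every link has an href. A small sketch with made-up markup, using the built-in "html.parser" so it does not depend on lxml:

```python
from bs4 import BeautifulSoup

snippet = '<a class="a1" href="http://www.baidu.com/">Baidu</a><a class="a2">no link</a>'
soup = BeautifulSoup(snippet, "html.parser")
first, second = soup.find_all("a")

present = first.get("href")         # attribute exists: its value
missing = second.get("href")        # attribute absent: None, no exception
fallback = second.get("href", "#")  # or a default of your choosing

try:
    second["href"]                  # subscript access raises instead
    raised = False
except KeyError:
    raised = True

print(present, missing, fallback, raised)
```

Using get() is the safer choice inside a loop over scraped tags; subscript access is convenient when a missing attribute should be treated as an error.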

Getting the top 5 <code> elements (backticks in Markdown) on a tutorial page:

# coding:utf-8
import requests
from bs4 import BeautifulSoup
from collections import Counter


def get_count(data):
    """ Return the 5 most frequent items and their counts """
    count = Counter(data)
    count = count.most_common(5)
    return dict(count)


resp = requests.get('https://www.yukunweb.com/2017/6/python-spider-BeautifulSoup-basic/')
content = resp.text

soup = BeautifulSoup(content, 'lxml')
# Count by the text of each <code> tag so the keys are plain strings
codes = [code.get_text() for code in soup.find_all('code')]
top5 = get_count(codes)
print(top5)

The top 5 code elements on the page and their counts:

{BeautifulSoup: 12, find_all(): 12, lxml: 9, Python: 8, find(): 6}
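The counting step relies on collections.Counter, which can be tried in isolation. A minimal sketch with an invented word list (not the data from the tutorial page):

```python
from collections import Counter

# Illustrative input; in the scraper this would be the text of each <code> tag.
words = ["find_all()", "lxml", "find_all()", "Python", "lxml", "find_all()"]

# most_common(n) returns (item, count) pairs sorted by descending count.
top2 = Counter(words).most_common(2)
as_dict = dict(top2)

print(top2)     # [('find_all()', 3), ('lxml', 2)]
print(as_dict)  # {'find_all()': 3, 'lxml': 2}
```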

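Note: the CSS selector method select() is powerful and can match almost anything. If you are not familiar with CSS selectors, Chrome has a handy feature: right-click an element -> Copy -> Copy selector.

That gave me #\31 763845315 > div.reply-doc.content > p, of which div.reply-doc.content > p alone is enough.

While working with it, keep the Beautiful Soup 4.2.0 documentation at hand. It is a bit long, but taking the time to read it pays off.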

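Hands-on: downloading Dream of the Red Chamber

I wanted to download Dream of the Red Chamber, found a resource page for it, and looked at the source:

The list sits under download_list, and the a tag under each download_title is that entry's link. soup.select('.download_list .download_title > a') collects every link, which gives the ed2k link of each entry: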

红楼梦.02.宝黛钗初会荣庆堂.mkv ed2k://|file|%E7%BA%A2%E6%A5%BC%E6%A2%A6.02.%E5%AE%9D%E9%BB%9B%E9%92%97%E5%88%9D%E4%BC%9A%E8%8D%A3%E5%BA%86%E5%A0%82.mkv|922919381|4458969531F1D4153EAB37F1E80F4AC2|/
红楼梦.23.慧紫娟情辞试忙玉.mkv ed2k://|file|%E7%BA%A2%E6%A5%BC%E6%A2%A6.23.%E6%85%A7%E7%B4%AB%E5%A8%9F%E6%83%85%E8%BE%9E%E8%AF%95%E5%BF%99%E7%8E%89.mkv|907433268|2AFA1AE36A5BC34F76CF07621B0D00F5|/
……
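Each link in the listing above packs the file name, size, and hash between | separators. The following sketch pulls those fields apart with the standard library; parse_ed2k is a hypothetical helper, and it assumes the ed2k://|file|name|size|hash|/ layout seen in these two links.

```python
from urllib.parse import unquote


def parse_ed2k(link):
    """Split an ed2k file link into (name, size_in_bytes, hash)."""
    parts = link.split("|")
    # Expected layout: ['ed2k://', 'file', name, size, hash, '/']
    if len(parts) < 5 or parts[0] != "ed2k://" or parts[1] != "file":
        raise ValueError("not an ed2k file link")
    # The name is percent-encoded UTF-8; the size is a decimal byte count.
    return unquote(parts[2]), int(parts[3]), parts[4]


link = ("ed2k://|file|%E7%BA%A2%E6%A5%BC%E6%A2%A6.02.%E5%AE%9D%E9%BB%9B%E9%92%97"
        "%E5%88%9D%E4%BC%9A%E8%8D%A3%E5%BA%86%E5%A0%82.mkv"
        "|922919381|4458969531F1D4153EAB37F1E80F4AC2|/")
name, size, file_hash = parse_ed2k(link)
print(name, size, file_hash)
```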

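Printing the links and pasting them into Thunder (Xunlei) would work too, but I found a snippet that calls Thunder to download directly: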

# coding:utf-8
""" Call Thunder (Xunlei) to download """
import subprocess
import base64
thunder_path = r'C:\Program Files (x86)\Thunder Network\Thunder\Program\Thunder.exe'


def Url2Thunder(url):
    # A Thunder link is 'thunder://' + base64('AA' + url + 'ZZ')
    url = 'AA' + url + 'ZZ'
    url = base64.b64encode(url.encode('ascii'))
    url = b'thunder://' + url
    thunder_url = url.decode()
    return thunder_url


def download_with_thunder(file_url):
    thunder_url = Url2Thunder(file_url)
    # Launch the Thunder client with the link as its argument
    subprocess.call([thunder_path, thunder_url])

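After integrating it, running the script directly brings up Thunder's download list:

Note: ed2k links do not need to be converted into a thunder:// address by Url2Thunder(url), but converting them does no harm.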
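The conversion is reversible, which makes it easy to check what a thunder:// link actually points to. A sketch of both directions, assuming the 'AA' + url + 'ZZ' wrapping used by Url2Thunder; the names url_to_thunder and thunder_to_url and the example.com URL are hypothetical.

```python
import base64

PREFIX = "thunder://"


def url_to_thunder(url):
    """Wrap url in AA...ZZ, base64-encode it, and add the thunder:// prefix."""
    wrapped = ("AA" + url + "ZZ").encode("utf-8")
    return PREFIX + base64.b64encode(wrapped).decode("ascii")


def thunder_to_url(thunder_url):
    """Undo url_to_thunder and return the original URL."""
    if not thunder_url.startswith(PREFIX):
        raise ValueError("not a thunder:// link")
    decoded = base64.b64decode(thunder_url[len(PREFIX):]).decode("utf-8")
    if not (decoded.startswith("AA") and decoded.endswith("ZZ")):
        raise ValueError("missing AA/ZZ wrapper")
    return decoded[2:-2]


original = "http://example.com/video.mkv"
encoded = url_to_thunder(original)
decoded_back = thunder_to_url(encoded)
print(encoded)
print(decoded_back)
```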

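A thin wrapper makes the call simpler, and it should also be able to cap the number of entries, like a limit, which slicing handles. For example, for my second download I wanted Better Call Saul, but the resource page lists four seasons and I only wanted the ten episodes of season 3.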

if __name__ == '__main__':
    # Dream of the Red Chamber
    hl_down = Download(file_name='hl.html', encoding='GB2312')  # on later runs, add debug=True to work from the saved file and avoid re-requesting
    hl_down.get_video(
        'https://m.2011mv.com/res/6154/',
        '.download_list .download_title > a',
        'ed2k')

    # Better Call Saul, season 3
    hl_down = Download(file_name='fxlo3.html', encoding='GB2312')
    hl_down.get_video(
        'https://m.2011mv.com/res/13969/',
        '.introtext table a',
        'href',
        10, 20)

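As you can see, it grabbed exactly the ten episodes of season 3. The resource itself turned out to be broken, so the download failed in the end, but that does not affect this write-up.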
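The Download class itself is not shown here, so the following is only a sketch of how a slice-based limit like the 10, 20 arguments above could work. The helper limit_links and the placeholder link names are assumptions, not the author's code.

```python
def limit_links(links, start=None, end=None):
    """Return links[start:end]; with no bounds, return every link."""
    # Slicing never raises on out-of-range bounds, so a too-large end
    # simply yields whatever is available.
    return links[start:end]


# Forty placeholder entries standing in for the scraped links.
links = [f"link-{n:02d}" for n in range(40)]

selected = limit_links(links, 10, 20)   # items at indexes 10 through 19
everything = limit_links(links)

print(len(selected), selected[0], selected[-1])
```

Because the end bound is exclusive, start=10 and end=20 select exactly ten entries.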

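3. Scraping cheesy pick-up lines (parsing the page with regex and with bs4)

How do you find the cheesy pick-up lines among the replies in a Douban thread? Filtering by length is not reliable, so let's try sentiment analysis with the snownlp library.

Get all comments on the current page: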

soup = BeautifulSoup(html, 'lxml')
content = soup.select('#comments div.reply-doc.content > p')

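Adding the #comments id restricts the selection to the comment list, which excludes the duplicate entries from the most-liked section.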
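The effect of the #comments prefix can be seen on a small hand-written page. The markup below is invented for illustration (including the id "popular") and only mimics the idea that a highlighted reply appears twice; the real Douban page structure may differ.

```python
from bs4 import BeautifulSoup

html = """
<div id="popular"><div class="reply-doc content"><p>liked reply</p></div></div>
<div id="comments">
  <div class="reply-doc content"><p>liked reply</p></div>
  <div class="reply-doc content"><p>another reply</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Without the id, the highlighted copy is matched as well.
everything = [p.get_text() for p in soup.select("div.reply-doc.content > p")]

# With the id, only replies inside the comment list are matched.
scoped = [p.get_text() for p in soup.select("#comments div.reply-doc.content > p")]

print(everything)
print(scoped)
```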

Result:

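But I want to end with you
Do you like me? If you do, send me a Douban mail (ˊo̴̶̤⌄o̴̶̤ˋ)
Flirt
Next poster, keep going
I have two guns: one is called "shoot", the other is called "ah". Gorgeous!
Next poster, pick it up
I'm Ling'er, so what are you?
I don't know, hahaha
You're Dingdang
Next poster, keep going
A Moments post that you liked is a donut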
