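Learning Web Scraping (Part 1)

I. Required module: requests

Introduction to requests

Requests is an HTTP library written in Python and released under the Apache2 License. It is built on top of urllib3 and is far more convenient than the standard-library urllib, which saves a great deal of work.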

Installing requests

Run pip install requests in a terminal, preferably with the pip that belongs to the Python installation you plan to use.

Common requests methods

# Send a GET request
requests.get(url='xxx')   # equivalent to requests.request(method='get', url='xxx')
# Send a POST request
requests.post(url='xxx')  # equivalent to requests.request(method='post', url='xxx')

Example: a GET request

import requests
requests.get(
    url='http://www.oldboyedu.com',
    params={'nid': 1, 'name': 'x'},  # the query string ?nid=1&name=x is appended to the URL
    headers={},
    cookies={},
)
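To see what params actually does to the URL, the request can be built without being sent. The sketch below uses requests.Request and its prepare() method, so it needs no network access; the URL is the one from the example above.

```python
import requests

# Build the request object only; nothing is sent over the network.
req = requests.Request(
    method='GET',
    url='http://www.oldboyedu.com',
    params={'nid': 1, 'name': 'x'},
)
prepared = req.prepare()

print(prepared.method)  # the HTTP method, upper-cased by requests
print(prepared.url)     # the final URL, with the query string appended
```

Printing prepared.url is a quick way to check a query string before pointing the scraper at a real site.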

Example: a POST request

import requests
requests.post(
    url='x',
    data={},
    headers={},
    cookies={},
)

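get and post accept the following parameters

1. url: the address to request;

2. params: parameters passed in the URL (the query string);

3. headers: request headers;

4. cookies: cookies;

5. data: data sent in the request body;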

The data of a POST request can be sent in two forms:

(1) As a dict:

requests.post(
    url='http://www.oldboyedu.com',
    data={
        'name': 'hahaha',
        'age': 18
    },
    headers={},
    cookies={},
)

The body is sent form-encoded, as name=hahaha&age=18.

(2) As a JSON string:

import json
requests.post(
    url='http://www.oldboyedu.com',
    data=json.dumps({
        'name': 'hahaha',
        'age': 19
    }),
    headers={},
    cookies={},
)

The body is sent as a single JSON string: '{"name": "hahaha", "age": 19}'.
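The difference between the two forms is easiest to see by inspecting the prepared request body and the Content-Type header. The sketch below is illustrative only: it prepares three requests without sending them, and adds the json= parameter as a third variant for comparison.

```python
import json
import requests

URL = 'http://www.oldboyedu.com'
payload = {'name': 'hahaha', 'age': 18}

# (1) dict passed as data: the body is form-encoded
form = requests.Request('POST', URL, data=payload).prepare()
# (2) pre-serialized string passed as data: sent as-is, no Content-Type is added
raw = requests.Request('POST', URL, data=json.dumps(payload)).prepare()
# (3) dict passed as json: requests serializes it and sets Content-Type
as_json = requests.Request('POST', URL, json=payload).prepare()

for label, p in (('form', form), ('raw', raw), ('json', as_json)):
    print(label, p.headers.get('Content-Type'), p.body)
```

If the server expects application/json, form (2) needs the header set by hand in headers, whereas json= takes care of it.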

Other request methods in requests

requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs)

# All of the methods above are built on requests.request()
requests.request()
    - method: HTTP method: post, get, delete, put, head, patch, options
    - url: the address to request
    - params: parameters passed in the URL, e.g. params={k1: v1}
    - data: parameters passed in the request body, used for POST: data={k1: v1, k2: v2} or 'k1=v1&k2=v2'
    - json: parameters passed in the request body as JSON; also sets Content-Type: application/json in the request headers
    - headers: extra request headers
    - cookies: cookies, sent in the request headers
    - files: file objects to upload: {'f1': open('s1.py', 'rb'), 'f2': ('file name on the server', open('s1.py', 'rb'))}
    - auth: authentication; puts the user name and password into the request headers
    - timeout: timeout in seconds
    - allow_redirects: whether to follow redirects, bool
    - proxies: proxies
    - stream: bool; stream the response body, used for downloading files
        ret = requests.get('http://127.0.0.1:8888/test/', stream=True)
        for chunk in ret.iter_content():
            print(chunk)
    - cert: client-side SSL certificate file for https
    - verify: verify=False skips https certificate verification

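Ways to read the response:

response.text                 # the body as str
response.content              # the body as bytes
response.encoding             # the encoding used to decode response.text; can be assigned
response.apparent_encoding    # the encoding guessed from the body content
response.cookies.get_dict()   # the cookies as a dict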
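Why encoding matters: response.content holds raw bytes, and response.text is those bytes decoded with response.encoding. When a text response does not declare a charset, requests falls back to ISO-8859-1, which garbles Chinese pages. The sketch below reproduces the effect with plain bytes, assuming a page encoded as GBK; no request is made.

```python
# Bytes as they might arrive from a GBK-encoded page (assumed for illustration).
raw = '汽车之家'.encode('gbk')

wrong = raw.decode('latin-1')  # wrong codec: decoding succeeds but the text is garbled
right = raw.decode('gbk')      # correct codec: the original text comes back

print(repr(wrong))
print(right)
```

Setting response.encoding = response.apparent_encoding before reading response.text is the requests-level version of choosing the right codec here.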

Advanced usage of requests:

# requests.Session manages cookies automatically
session = requests.Session()
session.get('https://www.baidu.com')
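A small sketch of what "manages cookies automatically" means: whatever is in the session's cookie jar is attached to every later request made through that session. The cookie name and value below are hypothetical and set by hand, standing in for a cookie a server would normally send back; the request is only prepared, not sent.

```python
import requests

session = requests.Session()
# Hypothetical cookie, standing in for one received via a Set-Cookie header.
session.cookies.set('gpsd', 'abc123')

# Prepare (but do not send) a request through the session.
prepared = session.prepare_request(
    requests.Request('GET', 'https://dig.chouti.com/all/hot/recent/1')
)
print(prepared.headers.get('Cookie'))
```

With plain requests.get and requests.post calls the same effect requires passing cookies= by hand on every call.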

II. The beautifulsoup4 module (used to parse the response)

Installing bs4:

pip3 install beautifulsoup4

A brief look at bs4 and how to use it

Import:

from bs4 import BeautifulSoup

find_all: get all matching tags

from bs4 import BeautifulSoup
# ret is a response obtained earlier, e.g. ret = requests.get(...)
soup = BeautifulSoup(ret.text, 'html.parser')  # parse the HTML string
tags = soup.find_all('a')   # returns a list
print(tags)

tags = soup.find_all('a', limit=1)
print(tags)

# newer bs4 versions prefer string= over text=
tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
print(tags)


# ####### lists #######
v = soup.find_all(name=['a', 'div'])
print(v)

v = soup.find_all(class_=['sister0', 'sister'])
print(v)

v = soup.find_all(text=['Tillie'])
print(v, type(v[0]))


v = soup.find_all(id=['link1', 'link2'])  # match several id values at once
print(v)

v = soup.find_all(href=['link1', 'link2'])
print(v)

# ####### regular expressions #######
import re
# rep = re.compile('p')   # tag names containing p
rep = re.compile('^p')    # tag names starting with p
v = soup.find_all(name=rep)
print(v)

rep = re.compile('sister.*')
v = soup.find_all(class_=rep)
print(v)

rep = re.compile('http://www.oldboy.com/static/.*')
v = soup.find_all(href=rep)
print(v)

# ####### filter function #######
def func(tag):
    return tag.has_attr('class') and tag.has_attr('id')

v = soup.find_all(name=func)
print(v)


# ## get: read a tag attribute
tag = soup.find('a')
v = tag.get('id')
print(v)
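The listing above depends on a response fetched earlier, so it cannot be run on its own. Here is a minimal self-contained sketch of the same filters against a small inline HTML sample. The sample markup and the example.com links are assumptions made up for illustration, chosen to match the sister, Lacie, Tillie, and link1 names used above.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample page for illustration.
HTML = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(HTML, 'html.parser')

all_links = soup.find_all('a')                    # every <a> tag
first_only = soup.find_all('a', limit=1)          # stop after the first match
by_id = soup.find_all(id=['link1', 'link2'])      # any of several id values
by_regex = soup.find_all(href=re.compile(r'/(lacie|tillie)$'))  # regex on an attribute
# A function filter: keep tags that have both a class and an id attribute.
with_both = soup.find_all(lambda tag: tag.has_attr('class') and tag.has_attr('id'))

print(len(all_links), len(first_only))
print([t.get('id') for t in by_id])
print([t.get_text() for t in by_regex])
print(len(with_both))
```

The p tags carry a class but no id, so the function filter returns only the three links.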

get_text: get the text inside a tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(ret.text, 'html.parser')
tag = soup.find('a')
v = tag.get_text()  # the first argument of get_text is a separator, not an attribute name
print(v)
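get and get_text are easy to mix up: get reads an attribute, get_text reads the text content. A short sketch on a made-up tag (the markup is an assumption for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a id="link1" href="/elsie">Elsie <b>the</b> first</a>', 'html.parser'
)
tag = soup.find('a')

attr_value = tag.get('id')              # attribute lookup
missing = tag.get('title')              # a missing attribute gives None, not an error
plain = tag.get_text()                  # all nested text, concatenated
joined = tag.get_text('|', strip=True)  # pieces stripped, then joined with a separator

print(attr_value, missing)
print(plain)
print(joined)
```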

find: get the first matching tag

tag = soup.find('a')
print(tag)
tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')  # same effect as the line above
print(tag)

For more methods, see

http://www.cnblogs.com/wupeiqi/articles/6283017.html

For more parameters, see the official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

III. Examples

Scraping news from Autohome (汽车之家)

import requests
from bs4 import BeautifulSoup
ret = requests.get(url='https://www.autohome.com.cn/news/')
print(ret)  # only prints the response object
print(ret.apparent_encoding)   # show which encoding the page content appears to use
# print(ret.content)   # print the body as bytes
# ret.encoding = 'gbk'    # decode as gbk
# print(ret.content)
ret.encoding = ret.apparent_encoding    # decode using the detected encoding
# print(ret.text)   # print the body as text

soup = BeautifulSoup(ret.text, 'html.parser')
# print(soup)
print(type(soup))

div = soup.find(name='div', id='auto-channel-lazyload-article')
li_list = div.find_all(name='li')
for li in li_list:
    h3 = li.find(name='h3')
    if not h3:   # skip list items that contain no h3 tag
        continue
    print(h3)
    p = li.find(name='p')
    print(p)
    a = li.find('a')
    # print(a)
    print(a.attrs)
    print(a.get('href'))

    img = li.find(name='img')
    src = img.get('src')
    print(src)
    # second request: download the image
    file_name = src.rsplit('__', maxsplit=1)[1]  # derive the file name
    ret_src = requests.get(url='https:' + src)
    with open(file_name, 'wb') as f:
        f.write(ret_src.content)
    print('=' * 30)
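The image step above assumes every src is protocol-relative (starts with //) and contains a double underscore; otherwise 'https:' + src yields a bad URL or rsplit(...)[1] raises IndexError. A slightly more defensive sketch using only the standard library follows. The helper name image_target and the sample src values are hypothetical, made up for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://www.autohome.com.cn/news/'

def image_target(src, page_url=PAGE_URL):
    """Return (absolute_url, file_name) for an <img> src value."""
    # urljoin handles protocol-relative (//host/...), root-relative and absolute src values.
    absolute = urljoin(page_url, src)
    last_part = absolute.rsplit('/', maxsplit=1)[-1]
    # Keep the part after the last '__' if there is one, else the whole last segment.
    file_name = last_part.rsplit('__', maxsplit=1)[-1]
    return absolute, file_name

# Hypothetical src values, for illustration only.
print(image_target('//www2.autoimg.cn/newsdfs/g1/M02/120x90_0_autohomecar__abc.jpg'))
print(image_target('/static/logo.png'))
```

Indexing with [-1] instead of [1] is what removes the IndexError when the separator is missing. Query strings in src are not handled in this sketch.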

Upvoting a post on Chouti (抽屉)

import requests
from bs4 import BeautifulSoup

# 1. Visit the Chouti hot list first to obtain an unauthorized cookie
ret = requests.get(
    url='https://dig.chouti.com/all/hot/recent/1',
    headers={
        'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    },
)
ret_cookie_dict = ret.cookies.get_dict()

# 2. Log in with user name and password, sending the (unauthorized) cookie along
# this request uses POST
response_login = requests.post(
    url='https://dig.chouti.com/login',
    data={
        # placeholders: fill in your own account details
        'phone': '86xxxxxxxxxxx',
        'jid': 'your_jid',
        'password': 'your_password',
        'oneMonth': '1',
    },
    headers={
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    },
    cookies=ret_cookie_dict
)
# print(response_login.text)

# 3. Upvote, using the cookie that the login has now authorized
ret = requests.post(
    url='https://dig.chouti.com/link/vote?linksId=20636346',
    headers={
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    },
    cookies=ret_cookie_dict
)
print(ret.text)


Upvoting every post on selected Chouti pages

import requests
from bs4 import BeautifulSoup

# 1. Visit the Chouti hot list first to obtain an unauthorized cookie
ret = requests.get(
    url='https://dig.chouti.com/all/hot/recent/1',
    headers={
        'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    },
)
ret_cookie_dict = ret.cookies.get_dict()

# 2. Log in with user name and password, sending the (unauthorized) cookie along
# this request uses POST
response_login = requests.post(
    url='https://dig.chouti.com/login',
    data={
        # placeholders: fill in your own account details
        'phone': '86xxxxxxxxxxx',
        'jid': 'your_jid',
        'password': 'your_password',
        'oneMonth': '1',
    },
    headers={
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    },
    cookies=ret_cookie_dict
)

for page_num in range(2, 3):  # choose which pages to upvote; range(2, 3) covers page 2 only
    response_index = requests.get(
        url='https://dig.chouti.com/all/hot/recent/%s' % page_num,  # use the loop variable, not a fixed page
        headers={
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
        },
    )
    # print(response_index.text)

    soup = BeautifulSoup(response_index.text, 'html.parser')
    div = soup.find(attrs={'id': 'content-list'})
    items = div.find_all(attrs={'class': 'item'})

    for item in items:
        tag = item.find(attrs={'class': 'part2'})
        nid = tag.get('share-linkid')
        # print(nid)

        # 3. Upvote
        ret = requests.post(
            url='https://dig.chouti.com/link/vote?linksId=%s' % nid,
            headers={
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
            },
            cookies=ret_cookie_dict
        )
        print(ret.text)
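The parsing step in the loop above can be tried offline. The sketch below pulls the share-linkid values out of a small inline HTML sample and builds the vote URLs from them, without sending anything. The sample markup is an assumption that mimics the content-list, item, and part2 structure used above, and collect_link_ids is a hypothetical helper name. Unlike the loop above, it skips items where the tag or the attribute is missing instead of failing.

```python
from bs4 import BeautifulSoup

VOTE_URL = 'https://dig.chouti.com/link/vote?linksId=%s'

def collect_link_ids(page_html):
    """Return the share-linkid value of every item found in the page."""
    soup = BeautifulSoup(page_html, 'html.parser')
    container = soup.find(attrs={'id': 'content-list'})
    if container is None:  # page layout changed, or the request was blocked
        return []
    ids = []
    for item in container.find_all(attrs={'class': 'item'}):
        tag = item.find(attrs={'class': 'part2'})
        if tag is not None and tag.get('share-linkid'):
            ids.append(tag.get('share-linkid'))
    return ids

# Made-up sample page; the middle item deliberately lacks the attribute.
SAMPLE = '''<div id="content-list">
<div class="item"><div class="part2" share-linkid="101"></div></div>
<div class="item"><div class="part2"></div></div>
<div class="item"><div class="part2" share-linkid="102"></div></div>
</div>'''

link_ids = collect_link_ids(SAMPLE)
vote_urls = [VOTE_URL % nid for nid in link_ids]
print(vote_urls)
```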


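IV. Takeaways

This round of learning web scraping left a fairly deep impression: I finally got to write real code.

Following an expert like Wu Sir (武sir) really does teach you a lot.

1. First, requests. I had used this module before, but it never felt as practical as it did this time, when it produced useful results right away, which was satisfying. The module still has many parameters that I will only remember through practice. This time I mostly used get and post, and I understand them far better than before.

2. Next, beautifulsoup4. The one sentence that stuck with me while learning it: don't reach for regular expressions; use a module someone has already written and you will be much more productive. As for how it feels to use: try it and you will see. From reading around, including other people's write-ups, the module has many more methods to learn, too many to memorize in one go. They will have to sink in through use.

3. In scraping, the approach is what matters, meaning the ability to analyze a problem. If you cannot analyze it thoroughly, you cannot solve it. Listening to the expert walk through his analysis, I kept wondering when I would reach that level.

4. Scraping Autohome was simple: the site has no anti-scraping measures, so I could fetch exactly what I wanted. Chouti does have them. It requires a cookie, and sending a cookie is still not enough: you have to visit the site once, log in while carrying the unauthorized cookie so that it becomes authorized, and only then can you scrape. Running into more kinds of anti-scraping measures and getting past them is probably the fastest way to learn.

5. Keep practicing. Listening without practicing is all input and no output, and the knowledge never becomes your own.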