qiangayz

一、What is a web crawler?

  A web crawler is a program or script that automatically fetches information from the World Wide Web according to a defined set of rules.

二、Web crawling in Python

  Two third-party packages are required: requests and BeautifulSoup4

  pip install requests

  pip install BeautifulSoup4 

  Summary of commonly used methods:

response = requests.get('URL')  # fetch the page
response.text      # response body as text (str)
response.content   # response body as bytes (e.g. for images)
response.encoding  # set the encoding used to decode response.text
response.apparent_encoding  # encoding detected from the downloaded content
response.status_code        # HTTP status code
response.cookies.get_dict() # cookies returned by the server, as a dict
requests.get('http://www.autohome.com.cn/news/', cookies={'xx': 'xxx'})
  

  The BeautifulSoup4 module

soup = BeautifulSoup('htmlstr', features='html.parser')
v1 = soup.find('div')
v1 = soup.find(id='i1')
v1 = soup.find('div', id='i1')

v2 = soup.find_all('div')
v2 = soup.find_all(id='i1')
v2 = soup.find_all('div', id='i1')
v1.text   # text content (str)
v1.attrs  # tag attributes (dict)
# v2 is a list
v2[0].attrs
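The methods above can be exercised on a small inline HTML snippet without any network access. The HTML and the ids here are made up for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div id="i1"><h3>First</h3></div>
<div id="i2"><h3>Second</h3></div>
"""

soup = BeautifulSoup(html, features='html.parser')

v1 = soup.find('div', id='i1')   # first <div> whose id is "i1"
print(v1.text.strip())           # text content of the tag
print(v1.attrs)                  # attribute dict, e.g. {'id': 'i1'}

v2 = soup.find_all('div')        # list of all <div> tags
print(len(v2))                   # 2
print(v2[1].attrs.get('id'))     # i2
```

Note that `find` returns a single Tag (or None), while `find_all` always returns a list, so indexing like `v2[0]` is needed before accessing `.attrs`.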

三、A first demo

import requests
import uuid
from bs4 import BeautifulSoup

response = requests.get(url='https://www.autohome.com.cn/news/')  # download the page
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, features='html.parser')  # build a BeautifulSoup object
target = soup.find(id='auto-channel-lazyload-article')  # locate the news section
# print(target)
li_list = target.find_all('li')
for i in li_list:
    a = i.find('a')
    if a:
        print(a.attrs.get('href'))
        txt = a.find('h3').text
        imagurl = a.find('img').attrs.get('src')
        print(imagurl)

        # the src is protocol-relative ("//..."), so prepend the scheme
        img_response = requests.get(url='https:' + imagurl)
        file_name = str(uuid.uuid4()) + '.jpg'
        with open(file_name, "wb") as f:
            f.write(img_response.content)

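The `'https:' + imagurl` concatenation in the demo works because the site serves protocol-relative image URLs (starting with `//`). A more general way to resolve such URLs is `urllib.parse.urljoin`, which also handles ordinary relative paths; the image and link paths below are made up for illustration:

```python
from urllib.parse import urljoin

page_url = 'https://www.autohome.com.cn/news/'

# protocol-relative URL ("//host/path"): urljoin borrows the scheme
img_abs = urljoin(page_url, '//www3.autoimg.cn/newsdfs/a.jpg')
print(img_abs)   # https://www3.autoimg.cn/newsdfs/a.jpg

# ordinary relative path: resolved against the page URL
link_abs = urljoin(page_url, 'detail/123.html')
print(link_abs)  # https://www.autohome.com.cn/news/detail/123.html
```

Using `urljoin` avoids breakage if the site ever switches from protocol-relative to relative or absolute `src` values.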

四、Logging in to Chouti and upvoting

'''
Chouti quirk: the authentication cookie is not the one returned by the
login (username/password) request, but the one returned by the very first
GET. The login request then carries that first cookie, which is what gets
authorized for later actions.
'''
import requests


headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
post_data = {
    'phone': '8615191481351',
    'password': '11111111',
    'oneMonth': 1
}

# step 1: plain GET -- the cookie that matters comes back here
ret1 = requests.get(
    url='https://dig.chouti.com',
    headers=headers
)
cookie1 = ret1.cookies.get_dict()
print(cookie1)

# step 2: log in, carrying the first cookie so it gets authorized
ret2 = requests.post(
    url='https://dig.chouti.com/login',
    data=post_data,
    headers=headers,
    cookies=cookie1
)
cookie2 = ret2.cookies.get_dict()
print(cookie2)

# step 3: upvote, sending only the now-authorized 'gpsd' cookie from step 1
ret3 = requests.post(
    url='https://dig.chouti.com/link/vote?linksId=21910661',
    cookies={
        'gpsd': cookie1['gpsd']
        # 'gpsd': 'f59363bb59b30fe7126b38756c6e5680'
    },
    headers=headers
)
print(ret3.text)

# cancel the upvote with the same cookie
ret = requests.post(
    url='https://dig.chouti.com/vote/cancel/vote.do',
    cookies={
        'gpsd': cookie1['gpsd']
    },
    data={'linksId': 21910661},
    headers=headers
)
print(ret.text)
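The manual cookie plumbing above can also be delegated to `requests.Session`, which stores cookies from each response and sends them automatically on later requests through the same session. A minimal offline sketch (the cookie value here is made up for illustration):

```python
import requests

s = requests.Session()

# cookies set on the session (normally by a server response) are sent
# automatically with every later request made through the same session
s.cookies.set('gpsd', 'example-cookie-value')  # illustrative value

# with a session, the flow above would roughly collapse to:
#   s.get('https://dig.chouti.com', headers=headers)
#   s.post('https://dig.chouti.com/login', data=post_data, headers=headers)
# and the 'gpsd' cookie would be carried along without get_dict() plumbing
print(s.cookies.get_dict())
```

This removes the need to extract `cookie1` and pass it back by hand, though for Chouti's quirk you would still rely on the cookie issued by the first GET.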

  

More on requests parameters: http://www.cnblogs.com/wupeiqi/articles/6283017.html
