Lately I've been brushing up on pyquery, which I learned a while back, while also reading the novel 剑来 (of course I won't admit it's mostly the latter...). After all, without a real need there's no improvement! Enough chatter, let's get to it.


Goals:

  1. Get the URL of every chapter of the novel
  2. Since each directory page lists only 40 chapters, walk through all the directory pages
  3. Fetch each chapter's text from its URL
  4. Save everything to a file

Getting the chapter URLs

[Screenshot: inspecting the directory page to find the chapter links]

The screenshot is a bit rough, so please bear with it; the point is only to see which node the chapter URLs sit under.

[Screenshot: the chapter links in the page source, under `.book_last dl dd a`]

Next, how do we walk through the directory pages? There are two options:

Option one: read the href attribute of the "next page" a tag and join it with the base URL. Option two: check whether the href values follow a pattern, and if so, generate them in a for loop. This post uses option two, because option one means extracting the href attribute first, which is a bit more hassle.
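For reference, option one would look roughly like this. This is a minimal sketch; the href value here is hypothetical, standing in for whatever the page's "next" link actually contains:

```python
from urllib.parse import urljoin

# Option one (not used below): read an a tag's href and join it
# with the current page's URL to get an absolute URL.
base = 'https://m.80txt.com/6194/'
href = 'page-3.html'  # hypothetical href value for illustration
print(urljoin(base, href))  # https://m.80txt.com/6194/page-3.html
```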

The code for this part:

import requests
from pyquery import PyQuery as pq

# disguised header so the request looks like a normal browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# walk the directory pages
def get_directory():
    """
    get the chapter urls from the directory pages
    :return: list of relative chapter urls
    """
    lis = []
    for i in range(15):
        # directory pages 2 to 16, 40 chapters per page
        url = 'https://m.80txt.com/6194/page-{}.html'.format(i + 2)
        response = requests.get(url, headers=headers)
        doc = pq(response.content.decode('utf-8'))
        items = doc('.book_last dl dd a')
        for item in items:
            lis.append(item.get('href'))
    return lis

Getting the chapter content

As before, first work out where the text lives:

[Screenshot: the chapter text inside the div with id nr1]

As the screenshot shows, the text sits under the div with id nr1, so we can grab that div's text content. And so that e-reader apps can split chapters automatically, we also need to grab the chapter title:

[Screenshot: the chapter title in the h1 with id _bqgmb_h1]

The chapter title is in the h1 with id _bqgmb_h1; just extract it directly.

The code for this part:

# get content
def get_content(url):
    """
    get the title and the text of a chapter
    :param url: relative chapter url from get_directory()
    :return: name, article
    """
    url = 'https://m.80txt.com/6194' + str(url)
    response = requests.get(url, headers=headers)
    # default to empty strings so a failed request doesn't crash
    name, article = '', ''
    if response.status_code == 200:
        doc = pq(response.content.decode('utf-8'))
        name = doc('#_bqgmb_h1').text()
        article = doc('#nr #nr1').text()
    return name, article

Saving

The code:

# save
def write_to_file(content):
    """
    append the chapter title and text to the file
    :param content: the (name, article) tuple from get_content()
    :return: None
    """
    # get_content() returns two strings, so they form a tuple
    with open('剑来_1.txt', 'a', encoding='utf-8') as f:
        for i in content:
            f.write('\n' + i)

Notice:

Because get_content() returns two values, name and article, Python automatically packs them into a tuple. write() takes a single string, not a tuple, so we have to iterate over the tuple and write each string, i.e.:

for i in content:
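A minimal standalone demonstration of this (the function and its return values here are just for illustration):

```python
# Returning two values from a function really returns one tuple.
def get_pair():
    return 'chapter title', 'chapter text'

content = get_pair()
print(type(content).__name__)  # tuple

# write() wants a string, so iterate and handle each part separately:
parts = []
for i in content:
    parts.append('\n' + i)
print(repr(''.join(parts)))  # '\nchapter title\nchapter text'
```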

The main function

Finally, a main function to tie it all together:

# main
def main():
    for url in get_directory():
        content = get_content(url)
        write_to_file(content)
        time.sleep(random.randint(1, 5))

I added time.sleep(random.randint(1, 5)) here so the requests don't come too fast and get my IP banned by the site.
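A quick sketch of the delay logic on its own (no network involved). Each pause is a random whole number of seconds, so the request rate varies instead of being a fixed, bot-like interval; random.uniform(1, 5) would give fractional pauses if you wanted finer granularity:

```python
import random

# Each pause is a random whole number of seconds in [1, 5].
delays = [random.randint(1, 5) for _ in range(5)]
print(delays)
print(all(1 <= d <= 5 for d in delays))  # True
```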

Done!

No buying novels here, just scrape them! After all, the technology itself is innocent!

Full source:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2019/3/23 10:23
# @Site    : E:\code
# @File    : jianlai.py
# @Software: PyCharm

import requests
from pyquery import PyQuery as pq
import time
import random

# disguised header so the request looks like a normal browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# walk the directory pages
def get_directory():
    """
    get the chapter urls from the directory pages
    :return: list of relative chapter urls
    """
    lis = []
    for i in range(15):
        # directory pages 2 to 16, 40 chapters per page
        url = 'https://m.80txt.com/6194/page-{}.html'.format(i + 2)
        response = requests.get(url, headers=headers)
        doc = pq(response.content.decode('utf-8'))
        items = doc('.book_last dl dd a')
        for item in items:
            lis.append(item.get('href'))
    return lis

# get content
def get_content(url):
    """
    get the title and the text of a chapter
    :param url: relative chapter url from get_directory()
    :return: name, article
    """
    url = 'https://m.80txt.com/6194' + str(url)
    response = requests.get(url, headers=headers)
    # default to empty strings so a failed request doesn't crash
    name, article = '', ''
    if response.status_code == 200:
        doc = pq(response.content.decode('utf-8'))
        name = doc('#_bqgmb_h1').text()
        article = doc('#nr #nr1').text()
    return name, article

# save
def write_to_file(content):
    """
    append the chapter title and text to the file
    :param content: the (name, article) tuple from get_content()
    :return: None
    """
    # get_content() returns two strings, so they form a tuple
    with open('剑来_1.txt', 'a', encoding='utf-8') as f:
        for i in content:
            f.write('\n' + i)

# main
def main():
    for url in get_directory():
        content = get_content(url)
        write_to_file(content)
        # random pause so the requests don't look automated
        time.sleep(random.randint(1, 5))

# run
if __name__ == '__main__':
    main()

Just for fun; if there are any mistakes, corrections from you pros are welcome! Thanks, everyone, from 筱君!
